
WO1991010954A1 - A risc vectorization system - Google Patents

A risc vectorization system

Info

Publication number
WO1991010954A1
WO1991010954A1 PCT/US1991/000439 US9100439W WO9110954A1 WO 1991010954 A1 WO1991010954 A1 WO 1991010954A1 US 9100439 W US9100439 W US 9100439W WO 9110954 A1 WO9110954 A1 WO 9110954A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
slot
scheduling
instructions
iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US1991/000439
Other languages
French (fr)
Inventor
Mark Buxbaum
Paul Hohensee
David L. Reese
David R. Wallace
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alliant Computer Systems Corp
Original Assignee
Alliant Computer Systems Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alliant Computer Systems Corp filed Critical Alliant Computer Systems Corp
Publication of WO1991010954A1 publication Critical patent/WO1991010954A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/445Exploiting fine grain parallelism, i.e. parallelism at instruction level

Definitions

  • A RISC VECTORIZATION SYSTEM Background of the Invention The invention relates to vectorizing iterative constructs to be run on a pipelined RISC processor.
  • a pipelined processor certain operations are performed by splitting them into smaller sequential subprocesses and then allocating each of the subprocesses to a corresponding piece of dedicated hardware, referred to as a stage.
  • the stages are separated by registers each of which receives the result from the previous stage and makes it available to the next stage.
  • the term pipe or pipeline refers to the sequence of stages and registers which are required to execute a particular operation. As the operation flows through its associated pipeline, it moves from one stage to the next, occupying only one stage at a time.
  • the invention is an apparatus for scheduling instructions of an iterative construct to run on a pipelined processor.
  • the apparatus includes means for creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and a scheduler for scheduling the instructions of the iterative construct into the plurality of slots so that every one of the plurality of slots has a different one of the instructions of the iterative construct scheduled therein, the scheduled instructions being drawn from more than one iteration of the iterative construct.
  • the scheduling apparatus further includes a compiler for generating an expression tree representation of the iterative construct.
  • the scheduler selects instructions for scheduling from the expression tree proceeding through the tree representation in a reverse execution order one instruction at a time starting with the last instruction.
  • the scheduler employs a top-down, left-to-right tree walk to proceed through the tree representation.
  • the scheduling apparatus also includes computational logic for determining the total number of instructions, NTot, in one iteration of the iterative construct and for determining for each one of a plurality of pipeline types the total number of instructions, Nt, within one iteration of the iterative construct that are of that type, where t is an index corresponding to the relevant pipeline type.
  • the slot creating means uses NTot to determine the number of instruction slots to create.
  • the scheduler also includes means for assigning an appropriate one of the pipeline types to each of the instruction slots and the assignment means limits the number of slots that are assigned a given pipeline type to the value of Nt determined for that pipeline type.
  • the scheduler includes logic for determining an iteration offset for each instruction assigned to one of the plurality of instruction slots.
  • the scheduler includes logic for identifying for each selected instruction a corresponding output slot during which an output from a pipeline associated with the selected instruction becomes available.
  • the scheduler further includes logic for identifying for each selected instruction the corresponding instruction slot into which that selected instruction will be scheduled based upon the output slot corresponding to that instruction.
  • the invention is a method for scheduling instructions of an iterative construct to run on a pipelined processor.
  • the method includes the steps of creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and scheduling the instructions of the iterative construct into the plurality of slots so that every one of the plurality of slots has a different one of the instructions of the iterative construct scheduled therein, the scheduled instructions being drawn from more than one iteration of the iterative construct.
  • One advantage of the invention is that it moves all fill and drain code to outside of the body of an iterative construct when it is run on a pipelined processor.
  • the invention schedules the operations of the iterative construct into their respective pipelines so that there are no gaps or wasted machine cycles during the execution of the iterative construct.
  • the invention achieves efficient register utilization by minimizing the time between when output from a pipeline becomes available and when the output is needed by a subsequent operation.
  • Fig. 1 shows the stages of a process for vectorizing a computer program
  • Fig. 2 shows an expression tree representation of an iterative construct
  • Fig. 3 is the data structure used in scheduling the operations of an expression tree for running on a RISC processor
  • Fig. 4 is a flow chart of the first section of a scheduling algorithm
  • Fig. 5 is a flow chart of the second section of the scheduling algorithm
  • Fig. 6 is an example of a simple tree expression
  • Figs. 7a-e show different development stages of a data structure used for scheduling the operations of the simple tree expression shown in Fig. 6;
  • Figs. 8a and 8b show the beginning and the end, respectively, of a schedule for executing the operations of an iterative construct represented by the expression tree shown in Fig. 6.
  • Fig. 1 shows the general stages of a process for vectorizing a computer program to run on a Reduced Instruction Set Computer (i.e., RISC processor) in which various operations are pipelined.
  • the computer program contains an iterative construct, such as a DO-loop, that includes a sequence of operations that is repeated for each iteration of the construct.
  • in the RISC processor used for the described embodiment, which may, for example, be an Intel 860, different operations are pipelined and the operations move through their respective pipelines by being pushed through by subsequent instructions introduced into that pipeline.
  • the vectorizing process exploits the direct access features of the RISC architecture to schedule operations within an iteration and across iterations of the iterative construct so as to optimally utilize the full capacity of the pipelines.
  • the process which shall be described in greater detail below, is referred to hereinafter as RISC vectorization.
  • the computer program is compiled to reduce the code into a simpler form (stage 102). Any one of a number of available or known compiler programs may be used to compile the program.
  • the compiler For each assignment statement within each iteration of the DO-loop in the program, the compiler generates a simplified expression tree representation of the code. In generating the tree expression, many compilers also perform certain transformations to eliminate cross-dependences that might exist between different iterations of the iterative construct. As an aid to visualizing the expression tree consider the following very simple computer program consisting of a single DO-loop:
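The program listing itself did not survive extraction. A hypothetical loop of the general shape described (one assignment statement per iteration, with no cross-iteration dependences, written here in Python as a stand-in for the DO-loop; the array names are illustrative only) might look like this:

```python
# Hypothetical stand-in for the elided DO-loop (illustrative only; the
# original listing is not preserved here). One assignment per iteration,
# and no iteration reads a value written by another iteration, so the
# iterations can be freely overlapped in the pipelines.
b = [2.0, 3.0, 5.0, 7.0]
c = [1.0, 4.0, 9.0, 16.0]
d = [0.5, 0.5, 0.5, 0.5]
a = [0.0] * 4

for i in range(4):              # DO i = 1, 4
    a[i] = b[i] * c[i] + d[i]   # A(i) = B(i) * C(i) + D(i)

print(a)
```

Each iteration's multiply and add can come from different iterations of the loop when scheduled into the pipelines, which is exactly the freedom the scheduler described below exploits.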
  • Expression tree 110 includes nodes 112 representing the different operators of the program and for each node 112, pointers 114 identifying which operators supply the input values for that node.
  • the processor After expression tree 110 is generated for each iteration of the construct (and assuming no dependencies exist across iterations), the processor identifies each pipeline type and for each pipeline type computes the total number Nt of operators of that type within the expression tree (stage 104). Once the processor has an Nt for each of the operator types, it then computes the total number of all operators, NTot, within the tree. That is, it computes: NTot = Σt Nt (the sum of the Nt over all pipeline types t).
  • Stage 106 employs a Top-Down Left-to-Right Tree Walk to determine the order in which the scheduling algorithm processes the nodes.
  • the scheduling algorithm begins with the top most node, i.e., the "A" node in Fig. 2.
  • a data structure 200 such as is shown in Fig. 3 is generated during stage 106 to aid in determining the proper scheduling of the nodes of the expression tree.
  • Data structure 200 includes NTot schedule slots 202, one slot for each of the nodes in the expression tree, each slot representing a different instruction cycle. Slots 202 are designated slot(1) through slot(NTot).
  • Each of slots 202 has six fields. There is a field 204(1) (labelled TYPE) for identifying the type of operator that is assigned to that slot. There is an instruction field 204(2) (labelled INST) for identifying the node that is assigned that schedule position. There are two input value fields 204(3) and 204(4) (labelled IN1 and IN2, respectively) for identifying the input values required by the node scheduled in that slot. There is an output field 204(5) (labelled OUT) for identifying the node which generates an output associated with that slot. And finally, there is an iteration number field 204(6) (labelled ITER) for identifying the iteration offset for the node scheduled in that slot.
  • the scheduling algorithm keeps track of a two component position identifying that operation's placement in data structure 200. The two components are the iteration offset (IT_OFF) for that operation and the instruction position (IP), i.e., the slot for that operation.
  • IT_OFF: iteration offset
  • IP: instruction position
  • each operation will be identified as follows: OPx[IT_OFF,IP], where OPx identifies the operation and the bracketed information identifies the scheduling position for that operation.
  • the two component position is used to ensure that two operations are executed in the correct order. More specifically, if an operation OPx in a position <a,b> produces a result used by operation OPy in position <c,d>, then <a,b> must be lexicographically before <c,d>. In other words, either OPx must be in an earlier iteration than OPy so that a precedes c in execution order; or if they are in the same iteration, then the slot for OPx must come before that for OPy so that b precedes d in execution order.
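The ordering rule above amounts to an ordinary lexicographic comparison of the two component positions, which can be sketched as a small predicate (an illustration, not the patent's code):

```python
# The lexicographic ordering constraint above, as a predicate. A position
# is an (IT_OFF, IP) pair; a producer's position must compare before its
# consumer's position.
def precedes(pos_a, pos_b):
    """True if position pos_a executes before position pos_b."""
    return pos_a < pos_b      # Python tuple comparison is lexicographic

assert precedes((0, 3), (1, 2))     # earlier iteration precedes, any slot
assert precedes((1, 2), (1, 4))     # same iteration: earlier slot precedes
assert not precedes((2, 1), (1, 5))
```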
  • Scheduling algorithm 300 has two distinct sections. The first section, which is shown in Fig. 4, determines when the output value from a particular operation pipeline is needed. And the second section, which is shown in Fig. 5, determines when the operation must be initiated in order to result in an output when it is needed. Data structure 200 is used to identify and keep track of slot assignments. Basically, the first section assigns a selected operation to the OUT field of the slot that corresponds to the time interval during which the output from the associated pipeline is required. Then, the second section counts back a number of slots equal to the length of the pipeline associated with the selected operation and assigns the operation to the INST field of the slot at that location.
  • the scheduling process begins by selecting the first operation for scheduling (step 280) . Since the selection step uses a Top-Down Left-to-Right Tree Walk, the selected operation is the top most node of the expression tree.
  • the processor calls scheduling algorithm 300 (step 290). Initially, scheduling algorithm 300 sets an offset variable IO and an index variable S equal to IT_OFF and IP, respectively, of the selected operation (step 302). For the first selected operation, both IO and S are set to zero.
  • algorithm 300 In addition to assigning the operation to the OUT field, algorithm 300 also performs several other operations during step 308. It assigns an operation type to Slot(1) by setting the TYPE field of Slot(1) to the same type as the selected operation. It decrements the variable Nt to indicate that one fewer slot of that op-type is available for assignment. And, it sets IT_OFF and IP for the selected operation equal to the current values for IO and S, respectively. After the first operation has been assigned to the OUT field of the appropriate slot, algorithm 300 branches into its second section in which it assigns the selected operation to the INST field of a different appropriate slot. The second section begins by initializing a pipe stage counter W to zero (step 310) and incrementing S (step 312).
  • algorithm 300 checks S to determine whether it is larger than NTot, which is the maximum number of slots available for operation assignments (step 314).
  • algorithm 300 may reach a point at which the value for Nt in step 318 equals zero. In that case, algorithm 300 moves to the next iteration cycle by resetting S to one and incrementing IO (step 326). After moving to the next iteration cycle, algorithm 300 branches back to step 316. Since Slot(1) was assigned an op-type during the first pass of algorithm 300 through data structure 200, the test for an op-type assignment in step 316 results in an affirmative, causing algorithm 300 to then determine whether that slot has the same op-type as the selected operation (step 328). If the op-types are the same, algorithm 300 increments W (step 322) and then compares W to Pt (step 324).
  • if the op-types are not the same, step 328 causes a branch back to step 312 where S is again incremented.
  • algorithm 300 locates a slot that has an inappropriate op-type, it skips that slot and continues the search with the next slot.
  • algorithm 300 finds a slot with the appropriate op-type and for which W equals Pt, then algorithm 300 assigns the selected operation to the INST field of that slot (step 330).
  • algorithm 300 also performs several other tasks during step 330. It sets the value in the ITER field equal to IO. It identifies in the IN1 and IN2 fields the operations which supply the input values required by the selected operation. It sets the IT_OFF and IP values of the selected operation and the operations which supply the input values equal to IO and S, respectively. And, it adds pointers to a stack from which subsequent operations are selected for slot assignment.
  • the pointers identify the operations which supply the input values for the selected operation and they are added to the stack in the order which results in the Top Down Left-to-Right Tree Walk.
  • the next operation is selected from the stack and the scheduling algorithm is again called to determine its proper scheduling (step 290) .
  • the value of IP associated with the selected operation will typically be non-zero.
  • algorithm 300 detects a non-zero value for S in step 304, it first increments S (step 305) and then compares the incremented value to NTot to determine whether the index S is still within the allowed range of data structure 200 (step 332).
  • algorithm 300 checks the TYPE field of slot(S) to determine whether it has an assigned op-type (step 334). On the other hand, if S is greater than NTot, algorithm 300 first moves to a next iteration assignment cycle by setting S equal to 1 and by incrementing IO by one to identify that iteration cycle (step 336) and then it checks whether slot(S) has an assigned op-type (step 334).
  • algorithm 300 compares the value of Nt associated with the selected operation to zero (step 338). If Nt does equal zero, indicating that no more slots can be assigned that op-type, algorithm 300 branches back to step 336 where it moves to a next iteration assignment cycle by setting S equal to 1 and by incrementing IO by one. However, if Nt does not equal zero, algorithm 300 branches to step 308 where it assigns the selected operation to the OUT field of slot(S) and performs the other previously described functions.
  • step 334 if slot(S) does have an assigned op-type, algorithm 300 determines whether the assigned op-type is the same as the op-type of the selected operation (step 340). If the assigned op-type and the op-type of the selected operation are not the same, indicating that the selected operation cannot be assigned to that slot, algorithm 300 branches back to step 305 where it increments S to continue the search for an appropriate slot. On the other hand, if the assigned op-type and the op-type of the selected operation are the same, algorithm 300 determines whether the OUT field of that slot already has an operation assigned to it (step 344).
  • step 305 algorithm 300 increments S to continue the search for an available slot. If the OUT field is not occupied, algorithm 300 assigns the selected operation to the OUT field of that slot and it sets IT_OFF and IP for the selected operation equal to the current values for IO and S, respectively (step 346) . Once the selected operation has been assigned to an OUT field of the appropriate slot, algorithm 300 branches to step 310 to schedule the selected operation relative to the other operations in the expression tree.
  • algorithm 300 locates the first available slot that has the same op-type as the selected operation and that has no previous operation assigned to its OUT field. When it finds such a slot, it assigns the selected operation to the OUT field of that slot.
  • the steps are as previously described for assigning the first selected operation with some exceptions. For later selected operations, however, the incremented value of S from step 312 may become larger than NTot.
  • algorithm 300 moves to the next iteration cycle by branching to step 326 after step 314. That is, S is set equal to 1 and IO is incremented by one. Thus, the search for the appropriate slot is carried into the next iteration cycle.
  • Algorithm 300 always fills data structure 200 with exactly the same number of instructions as are in the tree expression, namely, NTot. Therefore, there are no gaps in any of the pipelines once steady state operation is achieved and no operations are required during steady state operation for just flushing a pipe.
  • the sequence of operations appearing in the INST fields of slot(NTot) through slot(1) determines the order in which those operations are to be scheduled in the RISC processor. That is, since the operations are scheduled in reverse order (i.e., later operations before earlier operations), the proper execution of the resulting sequence of operations proceeds from the highest numbered slot to the lowest numbered slot. And the particular iteration in which an operation is assigned is determined by the value in the ITER field for that slot (or equivalently, the final IT_OFF value stored for that operation).
  • the lifetime of a result from a particular pipeline is the time difference between the slot in which the operation is assigned to an OUT field and the slot in which it is used as an input value. Since the first section of algorithm 300 finds the first empty slot, these lifetimes are minimized.
  • Tree 400 involves five operations, namely, OP1, OP2, OP3, OP4, and OP5. In this example, the operations are assumed to be of two types, either Type 1 or Type 2.
  • the values computed for NTot, N1, and N2 are 5, 3, and 2, respectively.
  • the two component positions, i.e., <IT_OFF,IP>, of all operations were set to <0,0>.
  • the top most node, i.e., OP1
  • algorithm 300 also assigns Type 1 to slot(1) by inserting a one in the TYPE field of slot(1). After assigning the op-type to the slot, algorithm 300 decrements N1 to 2, indicating that only two more slots may be assigned that op-type.
  • algorithm 300 uses one of those assignments to designate slot(2) to be a Type 1 slot and then decrements N1 to one (step 320).
  • counter W is incremented to one and then compared to P1 to determine whether the separation between slot(1) and slot(2) is equivalent to the length of the associated pipeline (steps 322 and 324). Since W is less than the associated pipeline length, algorithm 300 increments S to 3 (step 312) and then examines slot(3).
  • slot(3) is assigned Type 1 as its op-type.
  • the incremented value of W equals 2 (i.e., the associated pipeline length)
  • OP1 is assigned to the INST field of slot(3) (step 330).
  • it sets the value in the ITER field equal to 0 (since IO = 0). It identifies in the IN1 and IN2 fields the operations which supply the input values required by OP1, namely OP2 and OP3.
  • the next operation to be scheduled is selected from the top of stack 600. It is OP2.
  • Algorithm 300 looks for the first available slot that is before the slot in which the output from OP2 is needed and it assigns OP2 to the OUT field of that slot.
  • the values for IO and S are set equal to the values that were stored for IT_OFF and IP of OP2, namely, <0,3>.
  • algorithm 300 increments S to 4 and then compares the incremented value to NTot (step 332). Since S is still within the valid range, algorithm 300 then checks whether slot(4) has an op-type assigned to it (step 334).
  • algorithm 300 Detecting that slot(5) is a valid slot and it has no op-type assignment, algorithm 300 then assigns it to be a Type 2 slot, decrements N2 to zero (indicating that no more Type 2 slots may be created) and increments W (steps 320 and 322). Since W is less than P2 (i.e., the length of the pipeline associated with OP2), OP2 cannot be assigned to slot(5) and the search must continue onto the next slot. Thus, S is incremented to 6 (step 312). In step 314, algorithm 300 detects that S is greater than NTot, so it resets S to one and increments IO by one to indicate that it has moved into another iteration cycle to find the appropriate slot.
  • algorithm 300 After determining that slot(1), slot(2) and slot(3) are Type 1 slots, algorithm 300 ends up at slot(4) which is a Type 2 slot.
  • algorithm 300 assigns OP2 to the INST field of slot(5).
  • it sets the value in the ITER field equal to the current value for IO (i.e., 1).
  • It identifies in the IN1 and IN2 fields the operations which supply the input values required by OP2, namely OP4 and OP5.
  • it adds pointers to OP4 and OP5 to stack 600.
  • algorithm 300 Upon selecting OP4, algorithm 300 again sets the values for IO and S based upon the stored two component position for OP4, namely, <1,5> (step 302) and then increments S (step 305). Since S is now greater than NTot, algorithm 300 moves into the next iteration cycle to find the appropriate slot assignment for OP4 (i.e., it resets S to one and increments IO to a value of 2). Algorithm 300 searches up through data structure 500 one slot at a time, until it finds the first Type 1 slot that has no operation assigned to its OUT field.
  • slot(2) When it finds that slot, which in this case is slot(2), it assigns OP4 to the OUT field of that slot and updates the two component position for OP4 to <2,2>. Then, algorithm 300 determines when OP4 must enter its associated pipeline in order to produce an output in slot(2). After setting W to zero (step 310), algorithm 300 searches up through the slots of data structure 500 while counting with counter W the number of slots having the same op-type as OP4. The search continues until W equals P1. Thus, at slot(3), which is a Type 1 slot, W is incremented to one. At slot(4) and slot(5), both of which are Type 2 slots, W is not incremented.
  • W is incremented to a value which equals P1, so algorithm 300 assigns OP4 to slot(1).
  • algorithm 300 also sets the value in the ITER field equal to the current value for IO (i.e., 3) and it identifies in the IN1 and IN2 fields the operations which supply the input values required by OP4. In this instance, since OP4 does not use input values from other operators, no operators are identified in the IN1 and IN2 fields.
  • Algorithm 300 also updates the stored two component position of OP4 to equal <3,1>.
  • the entries in data structure 500 are as shown in Fig. 7c; and the next operation available for scheduling is OP5.
  • Algorithm 300 searches up through data structure 500 one slot at a time until it finds the first Type 2 slot that has no operation assigned to its OUT field. When it finds that slot, which in this case is slot(5), it assigns OP5 to the OUT field of that slot and updates the two component position for OP5 to <2,5>.
  • algorithm 300 determines when OP5 must enter its associated pipeline in order to produce an output in slot(5).
  • algorithm 300 again searches up through the slots of data structure 500 while counting with counter W the number of slots having the same op-type as OP5.
  • W equals P2
  • Algorithm 300 sets the value in the ITER field of slot(4) equal to the current value for IO (i.e., 4) and it identifies in the IN1 and IN2 fields the operations which supply the input values required by OP5. Since OP5 does not use input values supplied by other operators, no operators are identified in the IN1 and IN2 fields. Algorithm 300 also updates the stored two component position of OP5 to equal <4,4>.
  • OP3 is first assigned to the OUT field of slot(3) at which point its stored position is updated to equal <1,3>. Then, OP3 is assigned to the INST field of slot(2) and its final stored position becomes <2,2>.
  • the entries in data structure 500 at the completion of the scheduling process are as shown in Fig. 7e.
  • the resulting scheduling order as determined by algorithm 300 is OP2{1}, OP5{4}, OP1{0}, OP3{2} and OP4{3}, where the number in braces {} is the iteration offset.
  • Figs. 8a and 8b illustrate the beginning and the end, respectively, of a schedule for running the iterative construct illustrated by tree 400 shown in Fig. 6.
  • the schedule is for 101 iterations of the construct.
  • the numbers along the horizontal axis identify the iteration cycle of the RISC processor.
  • the numbers along the vertical axis identify the instruction cycle within an iteration cycle. Thus, each iteration cycle consists of 5 instruction cycles.
  • the convention is as follows.
  • the entry above the slash (/) identifies the operation which is begun during that time interval and the entry below the slash identifies the output which becomes available from its associated pipeline during that time interval.
  • the operations are designated as OPx[i], where x identifies the operation and i identifies the iteration number.
  • FILL and DRAIN refer to fill code and drain code.
  • Fill code is used at the beginning of running to fill any gaps that may exist until steady state operation is achieved.
  • drain code is used to push out the results at the end of the process.
  • Both fill and drain code can be other operations which either precede or follow the iterative construct or they can be NOP operations which produce no result other than to clear the pipelines.
  • a complete tree representation of the iterative construct will also include a network of nodes descending from OP3 representing the sub-tree.
  • the operations of the complete iterative construct can be scheduled by applying the scheduling algorithm to the complete tree as described above.
  • the scheduling algorithm can be applied to the main tree and the sub-tree separately and the two schedules are later combined by taking into account the slot position and iteration offset associated with OP3. That is, all of the scheduled operations of the sub-tree can be incorporated into the other schedule by adjusting their two component positions based on the slot position and iteration offset associated with OP3.
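The worked example above can be reproduced in code. The sketch below is a reconstruction of algorithm 300 from the flow-chart description, not the patent's own implementation; the pipeline lengths P1 = 2 and P2 = 3 are assumptions inferred from the walkthrough, and the Nt-exhausted wraparound of steps 326/336/338 is folded into a single linear search.

```python
class Op:
    def __init__(self, name, optype, inputs=()):
        self.name, self.optype, self.inputs = name, optype, inputs
        self.it_off, self.ip = 0, 0      # two component position <IT_OFF, IP>

class Slot:
    def __init__(self):                  # the TYPE/INST/OUT/ITER fields
        self.type = self.inst = self.out = self.iter = None

def schedule(top, pipe_len, n_avail, n_tot):
    """Fill n_tot slots using the two-section algorithm sketched above."""
    slots = [Slot() for _ in range(n_tot + 1)]   # 1-indexed: slots[1..n_tot]
    stack, first = [top], True
    while stack:
        op = stack.pop()                 # top-down, left-to-right tree walk
        t, io, s = op.optype, op.it_off, op.ip
        # First section (Fig. 4): place op in the OUT field of the first
        # usable slot of its type.
        if first:
            s, first = 1, False
            slots[1].type = t            # step 308: type slot(1)
            n_avail[t] -= 1
        else:
            s += 1
            while True:
                if s > n_tot:
                    s, io = 1, io + 1    # carry search into next iteration
                sl = slots[s]
                if sl.type is None and n_avail[t] > 0:
                    sl.type = t          # claim an untyped slot
                    n_avail[t] -= 1
                    break
                if sl.type == t and sl.out is None:
                    break                # first free slot of op's type
                s += 1
        slots[s].out = op
        op.it_off, op.ip = io, s
        # Second section (Fig. 5): advance P_t slots of op's type (wrapping
        # into later iteration offsets) to find op's INST slot; since
        # execution runs from slot(n_tot) down to slot(1), this "counts
        # back" in execution order.
        w = 0
        while True:
            s += 1
            if s > n_tot:
                s, io = 1, io + 1
            sl = slots[s]
            if sl.type is None and n_avail[t] > 0:
                sl.type = t              # steps 318/320: type the slot
                n_avail[t] -= 1
            if sl.type == t:
                w += 1
                if w == pipe_len[t]:     # step 324: pipe length reached
                    break
        sl.inst, sl.iter = op, io
        op.it_off, op.ip = io, s
        for child in reversed(op.inputs):    # push so the left input pops first
            child.it_off, child.ip = io, s
            stack.append(child)
    return slots

# Tree 400 of Fig. 6: OP1(OP2(OP4, OP5), OP3); OP1, OP3, OP4 are Type 1.
op4, op5, op3 = Op("OP4", 1), Op("OP5", 2), Op("OP3", 1)
op2 = Op("OP2", 2, (op4, op5))
op1 = Op("OP1", 1, (op2, op3))

slots = schedule(op1, pipe_len={1: 2, 2: 3}, n_avail={1: 3, 2: 2}, n_tot=5)
order = [(slots[s].inst.name, slots[s].iter) for s in range(5, 0, -1)]
print(order)  # [('OP2', 1), ('OP5', 4), ('OP1', 0), ('OP3', 2), ('OP4', 3)]
```

Reading the INST fields from slot(5) down to slot(1) yields the schedule OP2{1}, OP5{4}, OP1{0}, OP3{2}, OP4{3} given in the example above.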

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

An apparatus for scheduling instructions of an iterative construct (110) to run on a pipelined processor, including logic for creating a plurality of instruction slots (202) equal in number to at least the number of instructions in one iteration of the iterative construct; and a scheduler (300) for scheduling the instructions of the iterative construct (110) into the plurality of slots (202) so that every one of the plurality of slots has a different one of the instructions of the iterative construct (110) scheduled therein, the scheduled instructions being drawn from more than one iteration of the iterative construct (110).

Description

A RISC VECTORIZATION SYSTEM Background of the Invention The invention relates to vectorizing iterative constructs to be run on a pipelined RISC processor. In a pipelined processor, certain operations are performed by splitting them into smaller sequential subprocesses and then allocating each of the subprocesses to a corresponding piece of dedicated hardware, referred to as a stage. Typically, the stages are separated by registers each of which receives the result from the previous stage and makes it available to the next stage. The term pipe or pipeline refers to the sequence of stages and registers which are required to execute a particular operation. As the operation flows through its associated pipeline, it moves from one stage to the next, occupying only one stage at a time. Thus, if an operation has k subprocesses and each stage requires t seconds to execute its corresponding subprocess, the result of the operation becomes available kt seconds after the operation was introduced into the pipeline. Since an operation only occupies one stage at a time as it passes through the pipeline, other operations may be introduced into that same pipeline. In the above example, k different operations may be in the pipeline at one time, each operation occupying a different one of the k stages. Full utilization of a pipeline implies that all stages are executing subprocesses of different operations and thus, there are no wasted machine cycles. Conversely, if there is a gap in the pipeline, i.e., at least one stage is not processing an operation, machine cycles are being wasted.
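The k-stage timing described above can be illustrated with a minimal model (a sketch, not from the patent) of a pipe fed one operation per cycle: each result emerges k cycles after its operation enters, and once the pipe is full a new result emerges every cycle.

```python
# Minimal model of a k-stage pipeline fed one operation per cycle.
# An operation entering at cycle c delivers its result at cycle c + k
# (i.e., k*t seconds later when each stage takes t seconds).
def completion_cycles(ops, k):
    """Map each operation to the cycle its result becomes available."""
    return {op: cycle + k for cycle, op in enumerate(ops)}

done = completion_cycles(["x", "y", "z"], k=3)
assert done == {"x": 3, "y": 4, "z": 5}   # one result per cycle once full
```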
The times at which operations may be introduced into their respective pipelines are constrained by dependencies that exist between operations. A dependency exists, for example, if one operation requires as one of its input values, the output of an earlier operation. In that case, the operation cannot begin until the earlier operation has passed through its pipeline. Such constraints greatly add to the difficulty in achieving full utilization of pipelines in a pipelined processor.
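The dependency constraint can be stated the same way (again an illustrative sketch): a consumer may not enter its pipeline before the producer's result has emerged, and any delay beyond that point wastes machine cycles unless independent work fills the gap.

```python
# If OP_Y consumes OP_X's result, OP_Y may not enter its pipeline before
# OP_X's result has emerged from OP_X's pipe.
def earliest_start(producer_start, producer_pipe_len):
    """First cycle at which the dependent operation may be introduced."""
    return producer_start + producer_pipe_len

def wasted_cycles(producer_start, producer_pipe_len, consumer_start):
    """Gap cycles if no independent work fills the pipe in between."""
    return max(0, consumer_start - earliest_start(producer_start,
                                                  producer_pipe_len))

assert earliest_start(0, 3) == 3      # 3-stage producer started at cycle 0
assert wasted_cycles(0, 3, 5) == 2    # consumer delayed to cycle 5: 2 idle
```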
Summary of the Invention In general, in one aspect, the invention is an apparatus for scheduling instructions of an iterative construct to run on a pipelined processor. The apparatus includes means for creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and a scheduler for scheduling the instructions of the iterative construct into the plurality of slots so that every one of the plurality of slots has a different one of the instructions of the iterative construct scheduled therein, the scheduled instructions being drawn from more than one iteration of the iterative construct.
Preferred embodiments of the invention include the following features. The scheduling apparatus further includes a compiler for generating an expression tree representation of the iterative construct. The scheduler selects instructions for scheduling from the expression tree proceeding through the tree representation in a reverse execution order one instruction at a time starting with the last instruction. The scheduler employs a top-down, left-to-right tree walk to proceed through the tree representation. The scheduling apparatus also includes computational logic for determining the total number of instructions, NTot, in one iteration of the iterative construct and for determining for each one of a plurality of pipeline types the total number of instructions, Nt, within one iteration of the iterative construct that are of that type, where t is an index corresponding to the relevant pipeline type. The slot creating means uses NTot to determine the number of instruction slots to create. The scheduler also includes means for assigning an appropriate one of the pipeline types to each of the instruction slots and the assignment means limits the number of slots that are assigned a given pipeline type to the value of Nt determined for that pipeline type.
Also in preferred embodiments, the scheduler includes logic for determining an iteration offset for each instruction assigned to one of the plurality of instruction slots. In addition, the scheduler includes logic for identifying for each selected instruction a corresponding output slot during which an output from a pipeline associated with the selected instruction becomes available. The scheduler further includes logic for identifying for each selected instruction the corresponding instruction slot into which that selected instruction will be scheduled based upon the output slot corresponding to that instruction.
In general, in another aspect, the invention is a method for scheduling instructions of an iterative construct to run on a pipelined processor. The method includes the steps of creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and scheduling the instructions of the iterative construct into the plurality of slots so that every one of the plurality of slots has a different one of the instructions of the iterative construct scheduled therein, the scheduled instructions being drawn from more than one iteration of the iterative construct. One advantage of the invention is that it moves all fill and drain code to outside of the body of an iterative construct when it is run on a pipelined processor. The invention schedules the operations of the iterative construct into their respective pipelines so that there are no gaps or wasted machine cycles during the execution of the iterative construct. In addition, the invention achieves efficient register utilization by minimizing the time between when output from a pipeline becomes available and when the output is needed by a subsequent operation.
Other advantages and features will become apparent from the following description of the preferred embodiment and from the claims.
Description of the Preferred Embodiment
Fig. 1 shows the stages of a process for vectorizing a computer program;
Fig. 2 shows an expression tree representation of an iterative construct; Fig. 3 is the data structure used in scheduling the operations of an expression tree for running on a RISC processor;
Fig. 4 is a flow chart of the first section of a scheduling algorithm; Fig. 5 is a flow chart of the second section of the scheduling algorithm;
Fig. 6 is an example of a simple tree expression;
Figs. 7a-e show different development stages of a data structure used for scheduling the operations of the simple tree expression shown in Fig. 6; and
Figs. 8a and 8b show the beginning and the end, respectively, of a schedule for executing the operations of an iterative construct represented by the expression tree shown in Fig. 6. Fig. 1 shows the general stages of a process for vectorizing a computer program to run on a Reduced Instruction Set Computer (i.e., RISC processor) in which various operations are pipelined. The computer program contains an iterative construct, such as a DO-loop, that includes a sequence of operations that is repeated for each iteration of the construct. In the RISC processor used for the described embodiment, which may, for example, be an Intel 860, different operations are pipelined and the operations move through their respective pipelines by being pushed through by subsequent instructions introduced into that pipeline. In addition, the user has direct access to the different pipelines within the processor. The vectorizing process exploits the direct access features of the RISC architecture to schedule operations within an iteration and across iterations of the iterative construct so as to optimally utilize the full capacity of the pipelines. The process, which shall be described in greater detail below, is referred to hereinafter as RISC vectorization. During the first stage of the RISC vectorization process, the computer program is compiled to reduce the code into a simpler form (stage 102). Any one of a number of available or known compiler programs may be used to compile the program. In general, for each assignment statement within each iteration of the DO-loop in the program, the compiler generates a simplified expression tree representation of the code. In generating the tree expression, many compilers also perform certain transformations to eliminate cross-dependences that might exist between different iterations of the iterative construct. As an aid to visualizing the expression tree consider the following very simple computer program consisting of a single DO-loop:
DO I = 1 TO 1000
    E(I) = A(I)*[B(I)-C(I)] + D(I)
This equation involves 5 different operators, namely, +, *, -, load and store. From this program, the compiler may generate an expression tree 110 such as is shown in Fig. 2. Expression tree 110 includes nodes 112 representing the different operators of the program and for each node 112, pointers 114 identifying which operators supply the input values for that node.
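The expression tree of Fig. 2 may be modeled with a simple node structure, sketched below in hypothetical Python (the class and variable names are illustrative, not part of the patent): each node records its operator and pointers to the nodes that supply its input values.

```python
# Hypothetical sketch of the expression tree of Fig. 2 for
# E(I) = A(I)*[B(I)-C(I)] + D(I).  Each node holds an operator and
# pointers to the nodes supplying its inputs.

class Node:
    def __init__(self, op, inputs=()):
        self.op = op                 # 'store', '+', '*', '-', or 'load'
        self.inputs = list(inputs)   # nodes supplying this node's input values

# Leaves are loads of the array elements; the root stores the result.
load_a, load_b, load_c, load_d = (Node('load') for _ in range(4))
sub = Node('-', [load_b, load_c])    # B(I) - C(I)
mul = Node('*', [load_a, sub])       # A(I) * [B(I) - C(I)]
add = Node('+', [mul, load_d])       # ... + D(I)
root = Node('store', [add])          # E(I) = ...
```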
After expression tree 110 is generated for each iteration of the construct (and assuming no dependencies exist across iterations), the processor identifies each pipeline type and for each pipeline type computes the total number Nt of operators of that type within the expression tree (stage 104). Once the processor has an Nt for each of the operator types, it then computes the total number of all operators, NTot, within the tree. That is, it computes:
NTot = ∑t Nt,
where the summation is over all operator types, t. Then, for each node of the tree, the processor executes a scheduling algorithm to identify a scheduling slot for that node (stage 106). The assigned scheduling slots for the nodes determine the relative order in which the nodes are to be scheduled for execution on the processor when the program is finally executed. Stage 106 employs a Top-Down Left-to-Right Tree Walk to determine the order in which the scheduling algorithm processes the nodes. In other words, the scheduling algorithm begins with the top most node, i.e., the "A" node in Fig. 2, which represents the last operator that would be executed if the tree were run on a serial processor, and then progresses down the tree first to the left until the bottom is reached and then to the right. After all nodes have been assigned scheduling slots, the program is run on the RISC processor with the nodes scheduled for execution according to the schedule derived during stage 106.
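The per-type counts of stage 104 and the visiting order of stage 106 can be sketched together as follows (a hypothetical rendering; the tuple-based tree encoding and function names are assumptions, not the patent's implementation): a stack-driven walk that starts at the top most node and proceeds top-down, left-to-right, while tallying Nt for each operator type and NTot overall.

```python
from collections import Counter

# Hypothetical sketch: Top-Down Left-to-Right Tree Walk (stage 106)
# combined with the per-type operator counts (stage 104).  A tree node
# is encoded here as an (operator, children) pair.

def walk_top_down_left_to_right(node):
    """Yield operators starting at the top most node, descending leftward
    first and then rightward -- i.e., reverse execution order."""
    stack = [node]
    while stack:
        op, children = stack.pop()
        yield op
        # push the right child first so the left child is visited next
        for child in reversed(children):
            stack.append(child)

def count_types(node):
    """Return (NTot, Nt) where Nt maps each operator type to its count."""
    n_t = Counter(walk_top_down_left_to_right(node))
    return sum(n_t.values()), n_t
```

For the tree of Fig. 2 (a store fed by +, *, -, and four loads) this yields NTot = 8 with Nt counts of four loads and one each of store, +, *, and -.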
A data structure 200 such as is shown in Fig. 3 is generated during stage 106 to aid in determining the proper scheduling of the nodes of the expression tree. Data structure 200 includes NTot schedule slots 202, one slot for each of the nodes in the expression tree, each slot representing a different instruction cycle. Slots 202 are designated slot(1) through slot(NTot).
Each of slots 202 has six fields. There is a field 204(1) (labelled TYPE) for identifying the type of operator that is assigned to that slot. There is an instruction field 204(2) (labelled INST) for identifying the node that is assigned that schedule position. There are two input value fields 204(3) and 204(4) (labelled IN1 and IN2, respectively) for identifying the input values required by the node scheduled in that slot. There is an output field 204(5) (labelled OUT) for identifying the node which generates an output associated with that slot. And finally, there is an iteration number field 204(6) (labelled ITER) for identifying the iteration offset for the node scheduled in that slot.
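The six-field slot record and the slot(1) through slot(NTot) arrangement may be rendered as follows (a hypothetical sketch; the field names follow Fig. 3, while the class and function names are illustrative).

```python
# Hypothetical sketch of one schedule slot of data structure 200 (Fig. 3).

class ScheduleSlot:
    def __init__(self):
        self.TYPE = None   # op-type assigned to this slot
        self.INST = None   # operation initiated during this instruction cycle
        self.IN1 = None    # operation supplying the first input value
        self.IN2 = None    # operation supplying the second input value
        self.OUT = None    # operation whose pipeline output appears here
        self.ITER = None   # iteration offset of the operation in INST

def make_schedule(n_tot):
    """Create slot(1) .. slot(NTot) as a 1-indexed mapping."""
    return {s: ScheduleSlot() for s in range(1, n_tot + 1)}
```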
For each operation, the scheduling algorithm keeps track of a two-component position identifying that operation's placement in data structure 200. The two components are the iteration offset (IT_OFF) for that operation and the instruction position (IP), i.e., the slot for that operation. Thus, for the following description, each operation will be identified as follows: OPx[IT_OFF,IP], where OPx identifies the operation and the bracketed information identifies the scheduling position for that operation.
When there is a data flow dependence from one operation to another, the two-component position is used to ensure that the two operations are executed in the correct order. More specifically, if an operation OPx in a position <a,b> produces a result used by operation OPy in position <c,d>, then <a,b> must be lexicographically before <c,d>. In other words, either OPx must be in an earlier iteration than OPy so that a precedes c in execution order; or if they are in the same iteration, then the slot for OPx must come before that for OPy so that b precedes d in execution order. Scheduling algorithm 300 shown in Figs. 4 and 5 determines the appropriate slot assignment in data structure 200 for each of the nodes of the expression tree. Scheduling algorithm 300 has two distinct sections. The first section, which is shown in Fig. 4, determines when the output value from a particular operation pipeline is needed. And the second section, which is shown in Fig. 5, determines when the operation must be initiated in order to result in an output when it is needed. Data structure 200 is used to identify and keep track of slot assignments. Basically, the first section assigns a selected operation to the OUT field of the slot that corresponds to the time interval during which the output from the associated pipeline is required. Then, the second section counts back a number of slots equal to the length of the pipeline associated with the selected operation and assigns the operation to the INST field of the slot at that location. This process is repeated for each operation until all have been assigned slot positions. At the time that the scheduling process begins, the two-component positions of all of the operations in the expression tree are equal to zero. Then, as shown in Fig. 4, the scheduling process begins by selecting the first operation for scheduling (step 280).
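The ordering constraint just stated is an ordinary lexicographic comparison of the two-component positions, as the following sketch shows (hypothetical; the function names are illustrative).

```python
# Hypothetical sketch of the data-flow ordering constraint: if OPx at
# position <a,b> produces a result used by OPy at <c,d>, then <a,b>
# must be lexicographically before <c,d> -- an earlier iteration, or
# the same iteration with an earlier slot.

def precedes(pos_x, pos_y):
    """True if position pos_x = (iteration, slot) executes before pos_y."""
    return pos_x < pos_y      # Python tuples compare lexicographically

def dependence_satisfied(producer_pos, consumer_pos):
    """True if the producer's position is lexicographically before
    the consumer's, so the result exists when it is consumed."""
    return precedes(producer_pos, consumer_pos)
```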
Since the selection step uses a Top-Down Left-to-Right Tree Walk, the selected operation is the top most node of the expression tree. After selecting the first operation, the processor calls scheduling algorithm 300 (step 290). Initially, scheduling algorithm 300 sets an offset variable IO and an index variable S equal to IT_OFF and IP, respectively, of the selected operation (step 302). For the first selected operation, both IT_OFF and S are set to zero. Next, algorithm 300 checks S to determine whether it equals zero (step 304). If it does equal zero, as it does for the first operation, then algorithm 300 increments S by one to identify the slot to which the operation will be assigned. Next, algorithm 300 assigns the first operation to the OUT field of Slot(S=1) (step 308).
In addition to assigning the operation to the OUT field, algorithm 300 also performs several other operations during step 308. It assigns an operation type to Slot(1) by setting the TYPE field of Slot(1) to the same type as the selected operation. It decrements the variable Nt to indicate that one fewer slot of that op-type is available for assignment. And, it sets IT_OFF and IP for the selected operation equal to the current values for IO and S, respectively. After the first operation has been assigned to the OUT field of the appropriate slot, algorithm 300 branches into its second section in which it assigns the selected operation to the INST field of a different appropriate slot. The second section begins by initializing a pipe stage counter W to zero (step 310) and incrementing S (step 312). Then, algorithm 300 checks S to determine whether it is larger than NTot, which is the maximum number of slots available for operation assignments (step 314). For the first operation, S (which equals two at this point) is not greater than NTot, so algorithm 300 then checks the TYPE field of Slot(S=2) to determine whether the slot has an assigned op-type (step 316). If no op-type is assigned (as would be the case for this pass through the algorithm), algorithm 300 then compares the Nt for the selected operation to zero to determine whether additional assignments of the op-type may be made (step 318).
If Nt does not equal zero, algorithm 300 assigns an operation type to Slot(S=2) by setting the TYPE field of Slot(2) to the same type as the selected operation and it decrements the variable Nt to indicate that one fewer slot of that op-type is available for assignment (step 320). Afterwards, algorithm 300 increments W (step 322) and then compares W to Pt, the length of the pipeline for the op-type (step 324). If W is less than Pt, algorithm 300 branches back to step 312 where it increments S and repeats the above-described sequence of steps until it finds a slot of the same op-type as the selected operation and for which W equals Pt. During subsequent passes through the above-described sequence of steps, algorithm 300 may reach a point at which the value for Nt in step 318 equals zero. In that case, algorithm 300 moves to the next iteration cycle by resetting S to one and incrementing IO (step 326). After moving to the next iteration cycle, algorithm 300 branches back to step 316. Since Slot(1) was assigned an op-type during the first pass of algorithm 300 through data structure 200, the test for an op-type assignment in step 316 results in an affirmative causing algorithm 300 to then determine whether that slot has the same op-type as the selected operation (step 328). If the op-types are the same, algorithm 300 increments W (step 322) and then compares W to Pt (step 324). If during subsequent passes through this phase of algorithm 300, a slot is found which has been assigned a different op-type than that of the selected operation, step 328 causes a branch back to step 312 where S is again incremented. In other words, when algorithm 300 locates a slot that has an inappropriate op-type, it skips that slot and continues the search with the next slot.
Once algorithm 300 finds a slot with the appropriate op-type and for which W equals Pt, then algorithm 300 assigns the selected operation to the INST field of that slot (step 330). In addition to making the INST field assignment, algorithm 300 also performs several other tasks during step 330. It sets the value in the ITER field equal to IO. It identifies in the IN1 and IN2 fields the operations which supply the input values required by the selected operation. It sets the IT_OFF and IP values of the selected operation and the operations which supply the input values equal to IO and S, respectively. And, it adds pointers to a stack from which subsequent operations are selected for slot assignment. The pointers identify the operations which supply the input values for the selected operation and they are added to the stack in the order which results in the Top-Down Left-to-Right Tree Walk. After the first operation has been scheduled, the next operation is selected from the stack and the scheduling algorithm is again called to determine its proper scheduling (step 290). On this pass through algorithm 300, the value of IP associated with the selected operation will typically be non-zero. Thus, when algorithm 300 detects a non-zero value for S in step 304, it first increments S (step 305) and then compares the incremented value to NTot to determine whether the index S is still within the allowed range of data structure 200 (step 332). If S is not greater than NTot, algorithm 300 checks the TYPE field of slot(S) to determine whether it has an assigned op-type (step 334). On the other hand, if S is greater than NTot, algorithm 300 first moves to a next iteration assignment cycle by setting S equal to 1 and by incrementing IO by one to identify that iteration cycle (step 336) and then it checks whether slot(S) has an assigned op-type (step 334).
If slot(S) does not have an assigned op-type, indicating that an operation assignment has not been made to any field of that slot, algorithm 300 compares the value of Nt associated with the selected operation to zero (step 338). If Nt does equal zero, indicating that no more slots can be assigned that op-type, algorithm 300 branches back to step 336 where it moves to a next iteration assignment cycle by setting S equal to 1 and by incrementing IO by one. However, if Nt does not equal zero, algorithm 300 branches to step 308 where it assigns the selected operation to the OUT field of slot(S) and performs the other previously described functions.
In step 334, if slot(S) does have an assigned op-type, algorithm 300 determines whether the assigned op-type is the same as the op-type of the selected operation (step 340). If the assigned op-type and the op-type of the selected operation are not the same, indicating that the selected operation cannot be assigned to that slot, algorithm 300 branches back to step 305 where it increments S to continue the search for an appropriate slot. On the other hand, if the assigned op-type and the op-type of the selected operation are the same, algorithm 300 determines whether the OUT field of that slot already has an operation assigned to it (step 344). If the OUT field is occupied, algorithm 300 branches to step 305 where it increments S to continue the search for an available slot. If the OUT field is not occupied, algorithm 300 assigns the selected operation to the OUT field of that slot and it sets IT_OFF and IP for the selected operation equal to the current values for IO and S, respectively (step 346). Once the selected operation has been assigned to an OUT field of the appropriate slot, algorithm 300 branches to step 310 to schedule the selected operation relative to the other operations in the expression tree.
In summary, during the first phase of the scheduling process, algorithm 300 locates the first available slot that has the same op-type as the selected operation and that has no previous operation assigned to its OUT field. When it finds such a slot, it assigns the selected operation to the OUT field of that slot. During the second phase of scheduling algorithm 300, the steps are as previously described for assigning the first selected operation with some exceptions. For later selected operations, however, the incremented value of S from step 312 may become larger than NTot. When that happens, algorithm 300 moves to the next iteration cycle by branching to step 326 after step 314. That is, S is set equal to 1 and IO is incremented by one. Thus, the search for the appropriate slot is carried into the next iteration cycle. The assignment process during this second phase moves back through data structure 200 one slot at a time, counting (with index W) the number of slots having an op-type which is the same as the op-type of the selected operation for which an assignment is being sought. Due to the nature of the assignment rules as described above, when algorithm 300 identifies the slot representing the Pt-th slot having the same op-type (i.e., the slot for which W=Pt), the INST field for that slot will be unoccupied. Thus, the selected operation is assigned to its INST field. In essence, the second phase of scheduling algorithm 300 places the selected instruction precisely at the slot location which will cause an output to materialize during the slot in which that operation was assigned to an OUT field. Algorithm 300 always fills data structure 200 with exactly the same number of instructions as are in the tree expression, namely, NTot. Therefore, there are no gaps in any of the pipelines once steady state operation is achieved and no operations are required during steady state operation for just flushing a pipe.
The sequence of operations appearing in the INST fields of slot(NTot) through slot(1) determines the order in which those operations are to be scheduled in the RISC processor. That is, since the operations are scheduled in reverse order (i.e., later operations before earlier operations), the proper execution of the resulting sequence of operations proceeds from the highest numbered slot to the lowest numbered slot. And the particular iteration in which an operation is assigned is determined by the value in the ITER field for that slot (or equivalently, the final IT_OFF value stored for that operation).
The lifetime of a result from a particular pipeline (i.e., the time it must be stored in a register) is the time difference between the slot in which the operation is assigned to an OUT field and the slot in which it is used as an input value. Since the first section of algorithm 300 finds the first empty slot, these lifetimes are minimized. The just-described procedure is further illustrated by using a simple tree 400 such as is shown in Fig. 6. Tree 400 involves five operations, namely, OP1, OP2, OP3, OP4, and OP5. In this example, the operations are assumed to be of two types, either Type 1 or Type 2. Operations OP1, OP3, and OP4 are Type 1 and have an associated pipeline length of 2 (PT1=2). And operations OP2 and OP5 are Type 2 and have an associated pipeline length of 3 (PT2=3). The values computed for NTot, N1, and N2 are 5, 3, and 2, respectively. Before beginning the scheduling process, the two-component positions, i.e., <IT_OFF,IP>, of all operations were set to <0,0>.
At the start of the scheduling process, the top most node, i.e., OP1, is selected and scheduling algorithm 300 is executed to schedule that operator in the appropriate slots of a data structure 500 (as shown in Figs. 7a-e). Since the two-component position associated with OP1 equals <0,0>, the values of IO and S are both set equal to zero in step 302. Since S=0, when algorithm 300 executes step 304, it branches to step 306 where it increments S (S=1) and then assigns OP1 to the OUT field of slot(1) (step 308). In other words, the output for OP1 from the associated pipeline is scheduled to materialize during the time interval associated with slot(1). Since OP1 is a Type 1 op-type, algorithm 300 also assigns Type 1 to slot(1) by inserting a one in the TYPE field of slot(1). After assigning the op-type to the slot, algorithm 300 decrements N1 to 2, indicating that only two more slots may be assigned that op-type.
In the second phase, algorithm 300 determines when OP1 must enter its associated pipeline in order to produce an output in slot(1). After setting W to 0 and incrementing S to 2, algorithm 300 makes sure that it is dealing with a valid slot number by comparing S to NTot = 5 (step 314). Since S is less than 5, algorithm 300 then checks whether slot(2) has been assigned an op-type (step 316). Upon determining that slot(2) has no assigned op-type, algorithm 300 determines whether more Type 1 assignments may be made (i.e., does N1=0? step 318).
There being two remaining Type 1 assignments, algorithm 300 uses one of those assignments to designate slot(2) to be a Type 1 slot and then decrements N1 to one (step 320). Following the op-type assignment, counter W is incremented to one and then compared to PT1 to determine whether the separation between slot(1) and slot(2) is equivalent to the length of the associated pipeline (steps 322 and 324). Since W is less than the associated pipeline length, algorithm 300 increments S to 3 (step 312) and then examines slot(3).
During this next pass through the second section of algorithm 300, the same sequence of steps is repeated and slot(3) is assigned Type 1 as its op-type. This time, however, the incremented value of W equals 2 (i.e., the associated pipeline length), so OP1 is assigned to the INST field of slot(3) (step 330). In addition, it sets the value in the ITER field equal to 0 (since IO=0). It identifies in the IN1 and IN2 fields the operations which supply the input values required by OP1, namely OP2 and OP3. It sets the IT_OFF and IP values of OP1, OP2, and OP3 equal to the current values of IO and S, respectively (IO=0 and S=3). And, it adds pointers to OP2 and OP3 to a stack 600 from which subsequent operations are selected for slot assignment.
Consequently, at the conclusion of executing scheduling algorithm 300 for OP1, the entries in data structure 500 are as shown in Fig. 7a.
The next operation to be scheduled is selected from the top of stack 600. It is OP2. Algorithm 300 looks for the first available slot that is before the slot in which the output from OP2 is needed and it assigns OP2 to the OUT field of that slot. In this pass through algorithm 300, the values for IO and S are set equal to the values that were stored for IT_OFF and IP of OP2, namely, <0,3>. After detecting that S is not equal to zero (step 304), algorithm 300 increments S to 4 and then compares the incremented value to NTot (step 332). Since S is still within the valid range, algorithm 300 then checks whether slot(4) has an op-type assigned to it (step 334). After determining that no such assignment has been made, algorithm 300 checks N2 to determine whether the slot may be assigned a Type 2 op-type. Since no Type 2 op-type assignments have yet been made (i.e., N2 still equals 2), algorithm 300 designates slot(4) as a Type 2 slot and it assigns OP2 to its OUT field. Algorithm 300 also decrements N2 to one and updates the two-component position for OP2 so that it is <0,4>. Next, algorithm 300 determines when OP2 must enter its associated pipeline in order to produce an output in slot(4). After setting W to 0 and incrementing S to 5, algorithm 300 again makes sure that it is dealing with a valid slot number by comparing S to NTot = 5 (step 314). Detecting that slot(5) is a valid slot and it has no op-type assignment, algorithm 300 then assigns it to be a Type 2 slot, decrements N2 to zero (indicating that no more Type 2 slots may be created) and increments W (steps 320 and 322). Since W is less than PT2 (i.e., the length of the pipeline associated with OP2), OP2 cannot be assigned to slot(5) and the search must continue onto the next slot. Thus, S is incremented to 6 (step 312).
In step 314, algorithm 300 detects that S is greater than NTot, so it resets S to one and increments IO by one to indicate that it has moved into another iteration cycle to find the appropriate slot. After determining that slot(1), slot(2) and slot(3) are Type 1 slots, algorithm 300 ends up at slot(4) which is a Type 2 slot. Upon detecting that the op-type for slot(4) is the same as the op-type for OP2, algorithm 300 increments W (so that W=2) and then compares W to PT2 (steps 322 and 324). Since W is still less than PT2, OP2 cannot be assigned to slot(4) and the search continues onto the next slot (i.e., S is incremented in step 312). Since slot(5) is a Type 2 slot, algorithm 300 again increments W and compares it to PT2. This time, however, W equals PT2, so algorithm 300 assigns OP2 to the INST field of slot(5). In addition, it sets the value in the ITER field equal to the current value for IO (i.e., 1). It identifies in the IN1 and IN2 fields the operations which supply the input values required by OP2, namely OP4 and OP5. It sets the IT_OFF and IP values of OP2, OP4, and OP5 equal to the current values of IO and S, respectively (IO=1 and S=5). And, it adds pointers to OP4 and OP5 to stack 600.
At the conclusion of executing scheduling algorithm 300 for OP2, the entries in data structure 500 are as shown in Fig. 7b. So, the next operation available for scheduling is OP4.
Upon selecting OP4, algorithm 300 again sets the values for IO and S based upon the stored two-component position for OP4, namely, <1,5> (step 302) and then increments S (step 305). Since S is now greater than NTot, algorithm 300 moves into the next iteration cycle to find the appropriate slot assignment for OP4 (i.e., it resets S to one and increments IO to a value of 2). Algorithm 300 searches up through data structure 500 one slot at a time, until it finds the first Type 1 slot that has no operation assigned to its OUT field. When it finds that slot, which in this case is slot(2), it assigns OP4 to the OUT field of that slot and updates the two-component position for OP4 to <2,2>. Then, algorithm 300 determines when OP4 must enter its associated pipeline in order to produce an output in slot(2). After setting W to zero (step 310), algorithm 300 searches up through the slots of data structure 500 while counting with counter W the number of slots having the same op-type as OP4. The search continues until W equals PT1. Thus, at slot(3), which is a Type 1 slot, W is incremented to one. At slot(4) and slot(5), both of which are Type 2 slots, W is not incremented. After slot(5), the search moves into the next iteration cycle (IO=3) and starts over again at slot(1). At slot(1) of this iteration cycle W is incremented to a value which equals PT1, so algorithm 300 assigns OP4 to slot(1). As before, algorithm 300 also sets the value in the ITER field equal to the current value for IO (i.e., 3) and it identifies in the IN1 and IN2 fields the operations which supply the input values required by OP4. In this instance, since OP4 does not use input values from other operators, no operators are identified in the IN1 and IN2 fields. Algorithm 300 also updates the stored two-component position of OP4 to equal <3,1>.
At the conclusion of executing scheduling algorithm 300 for OP4, the entries in data structure 500 are as shown in Fig. 7c; and the next operation available for scheduling is OP5. Upon selecting OP5, algorithm 300 again sets the values for IO and S based upon the stored two-component position for OP5, namely, <1,5> (step 302) and then increments S (step 305). Since S is now greater than NTot, algorithm 300 moves into the next iteration cycle to find the appropriate slot assignment for OP5 (i.e., it resets S to one and increments IO to 2). Algorithm 300 searches up through data structure 500 one slot at a time until it finds the first Type 2 slot that has no operation assigned to its OUT field. When it finds that slot, which in this case is slot(5), it assigns OP5 to the OUT field of that slot and updates the two-component position for OP5 to <2,5>.
Next, algorithm 300 determines when OP5 must enter its associated pipeline in order to produce an output in slot(5). After setting W to zero (step 310), algorithm 300 again searches up through the slots of data structure 500 while counting with counter W the number of slots having the same op-type as OP5. Each time the search goes over the top of data structure 500 (i.e., S=6), the search moves into the next iteration cycle. When W equals PT2, algorithm 300 assigns OP5 to the INST field of that slot. In this case, W reaches 3 at slot(4) of the fourth iteration cycle (i.e., IO=4). Algorithm 300 sets the value in the ITER field of slot(4) equal to the current value for IO (i.e., 4) and it identifies in the IN1 and IN2 fields the operations which supply the input values required by OP5. Since OP5 does not use input values supplied by other operators, no operators are identified in the IN1 and IN2 fields. Algorithm 300 also updates the stored two component position of OP5 to equal <4,4>.
At the conclusion of executing scheduling algorithm 300 for OP5, the entries in data structure 500 are as shown in Fig. 7d; and the next operation available for scheduling is OP3.
Using the above-described procedures, it should be readily apparent that OP3 is first assigned to the OUT field of slot(3), at which point its stored position is updated to equal <1,3>. Then, OP3 is assigned to the INST field of slot(2) and its final stored position becomes <2,2>. Thus, the entries in data structure 500 at the completion of the scheduling process are as shown in Fig. 7e. The resulting scheduling order as determined by algorithm 300 is OP2{1}, OP5{4}, OP1{0}, OP3{2} and OP4{3}, where the number in brackets {} is the iteration offset.
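For reference, the entries that this walkthrough pins down for data structure 500 can be captured in a small record type. This is an illustrative Python rendering; the field spellings are assumed, and the INST/ITER entries for OP1 and OP2, which the text does not spell out, are omitted.

```python
# Illustrative record for one entry of "data structure 500"; field names
# follow the text (INST, ITER, IN1, IN2, OUT); Python spellings assumed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Slot:
    slot_type: int                 # pipeline type that issues from this slot
    inst: Optional[str] = None     # INST: operation begun in this slot
    iter_: Optional[int] = None    # ITER: iteration cycle (IO) of issue
    in1: Optional[str] = None      # IN1: operation supplying the first input
    in2: Optional[str] = None      # IN2: operation supplying the second input
    out: Optional[str] = None      # OUT: operation whose result appears here

# Entries the walkthrough states explicitly (Figs. 7c-7e); the slots of
# OP1 and OP2 are not given in this excerpt and are left unfilled.
slots = {
    1: Slot(1, inst="OP4", iter_=3),
    2: Slot(1, inst="OP3", iter_=2, out="OP4"),
    3: Slot(1, out="OP3"),
    4: Slot(2, inst="OP5", iter_=4),
    5: Slot(2, out="OP5"),
}
```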
Figs. 8a and 8b illustrate the beginning and the end, respectively, of a schedule for running the iterative construct illustrated by tree 400 shown in Fig. 6. The schedule is for 101 iterations of the construct. The numbers along the horizontal axis identify the iteration cycle of the RISC processor. The vertical axis identifies the instruction cycle within an iteration cycle. Thus, each iteration cycle consists of 5 instruction cycles. Within the charts, the convention is as follows. The entry above the slash (/) identifies the operation which is begun during that time interval and the entry below the slash identifies the output which becomes available from its associated pipeline during that time interval. The operations are designated as OPx[i], where x identifies the operation and i identifies the iteration number. FILL and DRAIN refer to fill code and drain code. Fill code is used at the beginning of the run to fill any gaps that may exist until steady state operation is achieved, and drain code is used to push out the results at the end of the process. Both fill and drain code can be other operations which either precede or follow the iterative construct, or they can be NOP operations which produce no result other than to clear the pipelines.
As shown in Fig. 8a, during the first iteration cycle (i.e., iteration cycle number 0) only the 0th iteration of OP5 (i.e., OP5[0]) is begun. OP5[0] is begun during the second instruction cycle of that iteration cycle, and for each successive iteration cycle thereafter, until all iterations of OP5 have been run, another succeeding iteration of OP5 is begun during this same instruction cycle. All other instruction cycles within the first iteration cycle receive fill code. In iteration cycle number 1, OP5[1] is begun during the second instruction cycle, as previously stated, and OP4[0] is begun during the fifth instruction cycle. All other instruction cycles receive fill code. In iteration cycle number 2, the output values corresponding to OP5[0] and to OP4[0] materialize out of their associated pipelines during instruction cycles 1 and 4, respectively. In addition, OP5[2] is begun during the second instruction cycle, OP3[0] is begun during the fourth instruction cycle and OP4[1] is begun during the fifth instruction cycle. Instruction cycles 1 and 3 receive fill code.
In iteration cycle number 3, the output values corresponding to OP5[1], OP3[0], and OP4[1] materialize out of their associated pipelines during instruction cycles 1, 3 and 4, respectively. In addition, operations OP2[0], OP5[3], OP3[1] and OP4[2] are begun during instruction cycles 1, 2, 4, and 5, respectively. Instruction cycle 3 receives fill code. Note that the input values for OP2[0], namely, outputs from the pipelines associated with OP4[0] and OP5[0], were made available during the previous iteration cycle.
Finally, in iteration cycle number 4, the output values corresponding to OP5[2], OP2[0], OP3[1], OP4[2] and OP1[0] materialize out of their associated pipelines during instruction cycles 1, 2, 3, 4, and 5, respectively. In addition, operations OP2[1], OP5[4], OP1[0], OP3[2] and OP4[3] are begun during instruction cycles 1, 2, 3, 4, and 5, respectively. Note that both input values for OP1[0], namely, outputs from the pipelines associated with OP2[0] and OP3[0], were made available prior to instruction cycle 3, as required. In particular, the output of OP2[0] was made available during the immediately preceding instruction cycle and the output of OP3[0] was made available during the immediately preceding iteration cycle.
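The issue pattern of Figs. 8a and 8b can be reproduced from the scheduling order and iteration offsets alone. The sketch below assumes that an operation with offset {k} begins its iteration i during iteration cycle i + 4 - k, a rule inferred from the cycles described above rather than stated in the text.

```python
# Sketch reproducing the issue pattern of Figs. 8a-8b from the scheduling
# order OP2{1}, OP5{4}, OP1{0}, OP3{2}, OP4{3}; the cycle arithmetic is
# an assumption inferred from the worked example.

SCHEDULE = [("OP2", 1), ("OP5", 4), ("OP1", 0), ("OP3", 2), ("OP4", 3)]
N_ITER = 101                              # the schedule covers 101 iterations
MAX_OFFSET = max(k for _, k in SCHEDULE)  # 4

def issued(cycle):
    """Operations begun during one iteration cycle, one entry per
    instruction cycle; FILL before an op's first iteration, DRAIN after
    its last."""
    row = []
    for op, k in SCHEDULE:
        i = cycle - (MAX_OFFSET - k)      # iteration number for this slot
        if 0 <= i < N_ITER:
            row.append(f"{op}[{i}]")
        else:
            row.append("FILL" if i < 0 else "DRAIN")
    return row
```

Under this rule, iteration cycle 0 begins only OP5[0] (in the second instruction cycle), cycle 4 is the first with no fill code, and cycle 104 is the last in which any instruction (OP1[100]) is begun, matching Figs. 8a and 8b.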
Note that during iteration cycle 4, there are no gaps in any instruction cycles requiring the use of fill code. That is, the associated pipelines are fully utilized throughout the rest of the processing of the iterative construct through iteration cycle 100, as shown in Fig. 8b. After iteration 100, the different pipelines empty out at different times until they are all emptied of pending operations from the iterative construct by the end of iteration cycle 104. At that point, processing for the iterative construct is complete.

The above-described procedure easily accommodates iterative constructs with multiple assignment statements. For example, OP3 in Fig. 6 may also be defined by another assignment statement that is represented by a different tree, i.e., a sub-tree. In that case, a complete tree representation of the iterative construct will also include a network of nodes descending down from OP3 representing the sub-tree. The operations of the complete iterative construct can be scheduled by applying the scheduling algorithm to the complete tree as described above. Or the scheduling algorithm can be applied to the main tree and the sub-tree separately, and the two schedules are later combined by taking into account the slot position and iteration offset associated with OP3. That is, all of the scheduled operations of the sub-tree can be incorporated into the other schedule by adjusting their two component positions based on the slot position and iteration offset associated with OP3.
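As a minimal sketch of that combining step: the exact adjustment rule is not given in the text, so the function below assumes the sub-tree's iteration cycles are simply re-based on the iteration component of the joining node's (OP3's) position, with slot numbers left unchanged.

```python
# Hypothetical position adjustment for merging a separately scheduled
# sub-tree; the re-basing rule is an assumption, not the patent's method.

def adjust(position, root_position):
    """Shift a sub-tree two component position <IO, S> by the iteration
    component of the position of the node joining it to the main tree."""
    io, s = position
    root_io, _root_s = root_position
    return (io + root_io - 1, s)   # re-base iteration cycles on the root's
```

For example, re-basing a hypothetical sub-tree position <2,4> on a joining-node position <2,2> gives <3,4>.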
Other embodiments are within the following claims.
Claims

What is claimed is:
1. An apparatus for scheduling instructions of an iterative construct to run on a pipelined processor, the apparatus comprising:
means for creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and
a scheduler for scheduling the instructions of the iterative construct into said plurality of slots so that every one of said plurality of slots has a different one of the instructions of the iterative construct scheduled therein, said scheduled instructions being drawn from more than one iteration of the iterative construct.
2. The scheduling apparatus of claim 1 further comprising a compiler for generating an expression tree representation of the iterative construct, and wherein said scheduler uses said tree representation to select instructions for scheduling.
3. The scheduling apparatus of claim 2 wherein the scheduler proceeds through the tree representation in a reverse execution order one instruction at a time starting with the last instruction.
4. The scheduling apparatus of claim 3 wherein the scheduler employs a top-down, left-to-right tree walk to proceed through the tree representation.
5. The scheduling apparatus of claim 1 further comprising computational logic for determining the total number of instructions, NTot, in one iteration of the iterative construct, and wherein said slot creating means uses NTot to determine the number of instruction slots to create.
6. The scheduling apparatus of claim 5 wherein said computational logic is also for determining for each one of a plurality of pipeline types the total number of instructions, Nt, within one iteration of the iterative construct that are of that type, where t is an index corresponding to the relevant pipeline type.
7. The scheduling apparatus of claim 6 wherein said scheduler comprises means for assigning an appropriate one of said pipeline types to each of said instruction slots and said assignment means limits the number of slots that are assigned a given pipeline type to the value of Nt determined for that pipeline type.
8. The scheduling apparatus of claim 1 wherein said scheduler comprises logic for determining an iteration offset for each instruction assigned to one of said plurality of instruction slots.
9. The scheduling apparatus of claim 1 wherein said scheduler comprises logic for identifying for each selected instruction a corresponding output slot during which an output from a pipeline associated with the selected instruction becomes available.
10. The scheduling apparatus of claim 9 wherein said scheduler further comprises logic for identifying for each selected instruction the corresponding instruction slot into which that selected instruction will be scheduled based upon the output slot corresponding to that instruction.
11. The scheduling apparatus of claim 1 wherein the iterative construct is available in the form of an expression tree representation of the iterative construct, and wherein said scheduler uses said tree representation to select instructions for scheduling.
12. The scheduling apparatus of claim 11 wherein the scheduler proceeds through the tree representation in a reverse execution order one instruction at a time starting with the last instruction.
13. The scheduling apparatus of claim 12 wherein the scheduler employs a top-down, left-to-right tree walk to proceed through the tree representation.
14. The scheduling apparatus of claim 1 wherein the number of instruction slots that are created equals the number of instructions in one iteration of the iterative construct.
15. A method for scheduling instructions of an iterative construct to run on a pipelined processor, the method comprising:
creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and
scheduling the instructions of the iterative construct into said plurality of slots so that every one of said plurality of slots has a different one of the instructions of the iterative construct scheduled therein, said scheduled instructions being drawn from more than one iteration of the iterative construct.
16. The method of claim 15 further comprising generating an expression tree representation of the iterative construct, and wherein said scheduling step uses said tree representation to select instructions for scheduling.
17. The method of claim 15 further comprising determining the total number of instructions, NTot, in one iteration of the iterative construct, and wherein said slot creating step uses NTot to determine the number of instruction slots to create.
18. The method of claim 15 further comprising determining for each one of a plurality of pipeline types the total number of instructions, Nt, within one iteration of the iterative construct that are of that type, where t is an index corresponding to the relevant pipeline type.
19. The method of claim 15 further comprising determining an iteration offset for each instruction assigned to one of said plurality of instruction slots.
20. The method of claim 15 wherein the scheduling step comprises identifying for each selected instruction a corresponding output slot during which an output from a pipeline associated with the selected instruction becomes available.
21. The method of claim 20 wherein said scheduling step further comprises identifying for each selected instruction the corresponding instruction slot into which that selected instruction will be scheduled based upon the output slot corresponding to that instruction.
PCT/US1991/000439 1990-01-19 1991-01-18 A risc vectorization system Ceased WO1991010954A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US46765190A 1990-01-19 1990-01-19
US467,651 1990-01-19

Publications (1)

Publication Number Publication Date
WO1991010954A1 (en)

Family

ID=23856563

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/000439 Ceased WO1991010954A1 (en) 1990-01-19 1991-01-18 A risc vectorization system

Country Status (1)

Country Link
WO (1) WO1991010954A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4782444A (en) * 1985-12-17 1988-11-01 International Business Machine Corporation Compilation using two-colored pebbling register allocation method such that spill code amount is invariant with basic block's textual ordering
US4794518A (en) * 1979-07-28 1988-12-27 Fujitsu Limited Pipeline control system for an execution section of a pipeline computer with multiple selectable control registers in an address control stage
US4858115A (en) * 1985-07-31 1989-08-15 Unisys Corporation Loop control mechanism for scientific processor
US4961141A (en) * 1988-12-16 1990-10-02 International Business Machines Corporation Generating efficient code for a computer with dissimilar register spaces
US4965724A (en) * 1987-03-05 1990-10-23 Oki Electric Industry Co., Ltd. Compiler system using reordering of microoperations to eliminate interlocked instructions for pipelined processing of assembler source program
US4965882A (en) * 1987-10-01 1990-10-23 Digital Equipment Corporation Method for operating a parallel processing system and related apparatus
US4967343A (en) * 1983-05-18 1990-10-30 International Business Machines Corp. Pipelined parallel vector processor including parallel configured element processors for processing vector elements in parallel fashion
US4980824A (en) * 1986-10-29 1990-12-25 United Technologies Corporation Event driven executive
US4982361A (en) * 1984-10-26 1991-01-01 Hitachi, Ltd. Multiple loop parallel pipelined logic simulation system
US4992934A (en) * 1986-12-15 1991-02-12 United Technologies Corporation Reduced instruction set computing apparatus and methods



Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

NENP Non-entry into the national phase

Ref country code: CA