WO1991010954A1

WO1991010954A1 - A risc vectorization system

Info

Publication number: WO1991010954A1
Application number: PCT/US1991/000439
Authority: WO
Inventors: Mark Buxbaum; Paul Hohensee; David L. Reese; David R. Wallace
Original assignee: Alliant Computer Systems Corp
Current assignee: Alliant Computer Systems Corp
Priority date: 1990-01-19
Filing date: 1991-01-18
Publication date: 1991-07-25
Anticipated expiration: 1992-07-19

Abstract

An apparatus for scheduling instructions of an iterative construct (110) to run on a pipelined processor, including logic for creating a plurality of instruction slots (202) equal in number to at least the number of instructions in one iteration of the iterative construct; and a scheduler (300) for scheduling the instructions of the iterative construct (110) into the plurality of slots (300) so that every one of the plurality of slots has a different one of the instructions of the iterative construct (110) scheduled therein, the scheduled instructions being drawn from more than one iteration of the iterative construct (110).

Description

A RISC VECTORIZATION SYSTEM Background of the Invention The invention relates to vectorizing iterative constructs to be run on a pipelined RISC processor. In a pipelined processor, certain operations are performed by splitting them into smaller sequential subprocesses and then allocating each of the subprocesses to a corresponding piece of dedicated hardware, referred to as a stage. Typically, the stages are separated by registers each of which receives the result from the previous stage and makes it available to the next stage. The term pipe or pipeline refers to the sequence of stages and registers which are required to execute a particular operation. As the operation flows through its associated pipeline, it moves from one stage to the next, occupying only one stage at a time. Thus, if an operation has k subprocesses and each stage requires t seconds to execute its corresponding subprocess, the result of the operation becomes available kt seconds after the operation was introduced into the pipeline. Since an operation only occupies one stage at a time as it passes through the pipeline, other operations may be introduced into that same pipeline. In the above example, k different operations may be in the pipeline at one time, each operation occupying a different one of the k stages. Full utilization of a pipeline implies that all stages are executing subprocesses of different operations and thus, there are no wasted machine cycles. Conversely, if there is a gap in the pipeline, i.e., at least one stage is not processing an operation, machine cycles are being wasted.

The times at which operations may be introduced into their respective pipelines are constrained by dependencies that exist between operations. A dependency exists, for example, if one operation requires as one of its input values, the output of an earlier operation. In that case, the operation cannot begin until the earlier operation has passed through its pipeline. Such constraints greatly add to the difficulty in achieving full utilization of pipelines in a pipelined processor.

Summary of the Invention In general, in one aspect, the invention is an apparatus for scheduling instructions of an iterative construct to run on a pipelined processor. The apparatus includes means for creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and a scheduler for scheduling the instructions of the iterative construct into the plurality of slots so that every one of the plurality of slots has a different one of the instructions of the iterative construct scheduled therein, the scheduled instructions being drawn from more than one iteration of the iterative construct.

Preferred embodiments of the invention include the following features. The scheduling apparatus further includes a compiler for generating an expression tree representation of the iterative construct. The scheduler selects instructions for scheduling from the expression tree proceeding through the tree representation in a reverse execution order one instruction at a time starting with the last instruction. The scheduler employs a top-down, left-to-right tree walk to proceed through the tree representation. The scheduling apparatus also includes computational logic for determining the total number of instructions, N_Tot, in one iteration of the iterative construct and for determining for each one of a plurality of pipeline types the total number of instructions, N_t, within one iteration of the iterative construct that are of that type, where t is an index corresponding to the relevant pipeline type. The slot creating means uses N_Tot to determine the number of instruction slots to create. The scheduler also includes means for assigning an appropriate one of the pipeline types to each of the instruction slots and the assignment means limits the number of slots that are assigned a given pipeline type to the value of N_t determined for that pipeline type.

Also in preferred embodiments, the scheduler includes logic for determining an iteration offset for each instruction assigned to one of the plurality of instruction slots. In addition, the scheduler includes logic for identifying for each selected instruction a corresponding output slot during which an output. from a pipeline associated with the selected instruction becomes available. The scheduler further includes logic for identifying for each selected instruction the corresponding instruction slot into which that selected instruction will be scheduled based upon the output slot corresponding to that instruction.

In general, in another aspect, the invention is a method for scheduling instructions of an iterative construct to run on a pipelined processor. The method includes the steps of creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and scheduling the instructions of the iterative construct into the plurality of slots so that every one of the plurality of slots has a different one of the instructions of the iterative construct scheduled therein, the scheduled instructions being drawn from more than one iteration of the iterative construct. One advantage of the invention is that it moves all fill and drain code to outside of the body of an iterative construct when it is run on a pipelined processor. The invention schedules the operations of the iterative construct into their respective pipelines so that there are no gaps or wasted machine cycles during the execution of the iterative construct. In addition, the invention achieves efficient register utilization by minimizing the time between when output from a pipeline becomes available and when the output is needed by a subsequent operation.

Other advantages and features will become apparent from the following description of the preferred embodiment and from the claims.

Description of the Preferred Embodiment

Fig. 1 shows the stages of a process for vectorizing a computer program;

Fig. 2 shows an expression tree representation of an iterative construct; Fig. 3 is the data structure used in scheduling the operations of an expression tree for running on a RISC processor;

Figs. 4 is a flow chart of the first section of a scheduling algorithm; Fig. 5 is a flow chart of the second section of the scheduling algorithm;

Fig. 6 is an example of a simple tree expression;

Figs. 7a-e show different development stages of a data structure used for scheduling the operations of the simple tree expression shown in Fig. 6; and

Figs. 8a and 8b show the beginning and the end, respectively, of a schedule for executing the operations of an iterative construct represented by the expression tree shown in Fig. 6. Fig. 1 shows the general stages of a process for vectorizing a computer program to run on a Reduced Instruction Set Computer (i.e., RISC processor) in which various operations are pipelined. The computer program contains an iterative construct, such as a DO-loop, that includes a sequence of operations that is repeated for each iteration of the construct. In the RISC processor used for the described embodiment, which may, for example, be an Intel 860, different operations are pipelined and the operations move through their respective pipelines by being pushed through by subsequent instructions introduced into that pipeline. In addition, the user has direct access to the different pipelines within the processor. The vectorizing process exploits the direct access features of the RISC architecture to schedule operations within an iteration and across iterations of the iterative construct so as to optimally utilize the full capacity of the pipelines. The process, which shall be described in greater detail below, is referred to hereinafter as RISC vectorization. During the first stage of the RISC vectorization process, the computer program is compiled to reduce the code into a simpler form (stage 102). Any one of a number of available or known compiler programs may be used to compile the program. In general, for each assignment statement within each iteration of the DO- loop in the program, the compiler generates a simplified expression tree representation of the code. In generating the tree expression, many compilers also perform certain transformations to eliminate cross- dependences that might exist between different iterations of the iterative construct. As an aid to visualizing the expression tree consider the following very simple computer program consisting of a single DO-loop:

DO i = 1 TO 1000 E(I) = A(I)*[B(I)-C(I)] + D(I)

This equation involves 5 different operators, namely, +, *, -, load and store. From this program, the compiler may generate an expression tree 110 such as is shown in Fig. 2. Expression tree 110 includes nodes 112 representing the different operators of the program and for each node 112, pointers 114 identifying which operators supply the input values for that node.

After expression tree 110 is generated for each iteration of the construct (and assuming no dependencies exist across iterations) , the processor identifies each pipeline type and for each pipeline type computes the total number N_t of operators of that type within the expression tree (stage 104) . Once the processor has an N_t for each of the operator types, it then computes the total number of all operators, N^^., within the tree. That is, it computes:

^NTot - ∑^Nt'

where the summation is over all operator types, t. Then, for each node of the tree, the processor executes a scheduling algorithm to identify a scheduling slot for that node (stage 106) . The assigned scheduling slots for the nodes determines the relative order in which the nodes are to be scheduled for execution on the processor when the program is finally executed. Stage 106 employs a Top-Down Left- o-Right Tree Walk to determine the order in which the scheduling algorithm processes the nodes. In other words, the scheduling algorithm begins with the top most node, i.e., the "A" node in Fig. 2, which represents the last operator that would be executed if the tree were run on a serial processor, and then progresses down the tree first to the left until the bottom is reached and then to the right. After all nodes have been assigned scheduling slots, the program is run on the RISC processor with the nodes scheduled for execution according to the schedule derived during stage 106.

A data structure 200 such as is shown in Fig. 3 is generated during stage 106 to aid in determining the proper scheduling of the nodes of the expression tree. Data structure 200 includes N_Tot schedule slots 202, one slot for each of the nodes in the expression tree, each slot representing a different instruction cycle. Slots 102 are designated slot(l) through slot(N_χot) .

Each of slots 102 has six fields. There is a field 204(1) (labelled TYPE) for identifying the type of operator that is assigned to that slot. There is an instruction field 204(2) (labelled INST) for identifying the node that is assigned that schedule position. There are two input value fields 204(3) and 240(4) (labelled INI and IN2, respectively) for identifying the input values required by the node scheduled in that slot. There is an output field 204(5) (labelled OUT) for identifying the node which generates an output associated with that slot. And finally, there is an iteration number field 204(6) (labelled ITER) for identifying the iteration offset for the node scheduled in that slot. -

For each operation, the scheduling algorithm keeps track of a two component position identifying that operations placement in data structure 200. Those two component positions are the iteration offset (IT_0FF) for that operation and the instruction position (IP), i.e., the slot for that operation. Thus, for the following description, each operation will be identified as follows: 0P_χ[IT_0FF,IP] , where 0P_χ identifies the operation and the bracketed information identi ies the scheduling position for that operation.

When there is a data flow dependence from one operation to another, the two component position is used to ensure that the two operations are executed in the correct order. More specifically, if an operation 0P_χ in a position <a,b> produces a result used by operation OP in position <c,d>, then <a,b> must be lexicographically before <c,d>. In other words, either OP_χ must be in an earlier iteration than OP so that a precedes c in execution order; or if they are in the same iteration, then the slot for OP ^χ__ must come before that for OP_ Y_ so that b precedes d in execution order. Scheduling algorithm 300 shown in Figs. 4 and 5 determines the appropriate slot assignment in data structure 200 for each of the nodes of the expression tree. Scheduling algorithm 300 has two distinct sections. The first section, which is shown in Fig. 4, determines when the output value from a particular operation pipeline is needed. And the second section, which is shown in Fig. 5, determines when the operation must be initiated in order to result in an output when it is needed. Data structure 200 is used to identify and keep track of slot assignments. Basically, the first section assigns a selected operation to the OUT field of the slot that corresponds to the time interval during which the output from the associated pipeline is required. Then, the second section counts back a number of slots equal to the length of the pipeline associated with the selected operation and assigns the operation to the INST field of the slot at that location. This process is repeated for each operation until all have been assigned slot positions. At the time that the scheduling process begins, the two component positions of all of the operations in the expression tree are equal to zero. Then, as shown in Fig. 4, the scheduling process begins by selecting the first operation for scheduling (step 280) . Since the selection step uses a Top-Down Left-to-Right Tree Walk, the selected operation is the top most node of the expression tree. After selecting the first operation, the processor calls scheduling algorithm 300 (step 290) . Initially, scheduling algorithm 300 sets an offset variable IO and an index variable S equal to IT_0FF and IP, respectively, of the selected operation (step 302) . For the first selected operation, both IT_OFF and S are set to zero. Next, algorithm 300 checks S to determine whether it equals zero (step 304) . If it does equal zero, as it does for the first operation, then algorithm 300 increments S by one to identify the slot to which the operation will be assigned. Next, algorithm 300 assigns the first operation to the OUT field of Slot(S=l) (step 308).

In addition to assigning the operation to the OUT field, algorithm 300 also performs several other operations during step 308. It assigns an operation type to Slot(l) by setting the TYPE field of Slot(l) to the same type as the selected operation. It decrements the variable N_t to indicate that one fewer slot of that op- type is available for assignment. And, it sets IT_OFF and IP for the selected operation equal to the current values for IO and S, respectively. After the first operation has been assigned to the OUT field of the appropriate slot, algorithm 300 branches into its second section in which it assigns the selected operation to the INST field of a different appropriate slot. The second section begins by initializing a pipe stage counter W to zero (step 310) and incrementing S (step 312) . Then, algorithm 300 checks S to determine whether it is larger than N_Tot, which is the maximum number of slots available for operation assignments (step 314) . For the first operation, S (which equals two at this point) is not greater than N_Tot, so algorithm 300 then checks the TYPE field of Slot(S=2) to determine whether the slot has an assigned op-type (step 316) . If no op-type is assigned (as would be the case for this pass through the algorithm) , algorithm 300 then compares the N_t for the selected operation to zero to determine whether additional assignments of the op-type may be made (step 318) .

If N_t does not equal zero, algorithm 300 assigns an operation type to Slot(S=2) by setting the TYPE field of Slot(2) to the same type as the selected operation and it decrements the variable N_t to indicate that one fewer slot of that op-type is available for assignment (step 320) . Afterwards, algorithm 300 increments W (step 322) and then compares W to P_t, the length of the pipeline for the op-type (step 324) . If W is less than P_t, algorithm 300 branches back to step 312 where it increments S and repeats the above-described sequence of steps until it finds a slot of the same op-type as the selected operation and for which W equals P_t. During subsequent passes through the above- described sequence of steps, algorithm 300 may reach a point at which the value for N_t in step 318 equals zero. In that case, algorithm 300 moves to the next iteration cycle by resetting S to one and incrementing 10 (step 326). After moving to the next iteration cycle,^" algorithm 300 branches back to step 316. Since Slot(l) was assigned an op-type during the first pass of algorithm 300 through data structure 200, the test for an op-type assignment in step 316 results in an affirmative causing algorithm 300 to then determine whether that slot has the same op-type as the selected operation (step 328) . If the op-types are the same, algorithm 300 increments W (step 322) and then compares W to P_t (step

324) . If during subsequent passes through this phase of algorithm 300, a slot is found which has been assigned a different op-type than that of the selected operation, step 328 causes a branch back to step 312 where S is again incremented. In other words, when algorithm 300 locates a slot that has an inappropriate op-type, it skips that slot and continues the search with the next slot.

Once algorithm 300 finds a slot with the appropriate op-type and for which W equals P_t, then algorithm 300 assigns the selected operation to the INST field of that slot (step 330) . In addition to making the INST field assignment, algorithm 300 also performs several other tasks during step 330. It sets the value in the ITER field equal to IO. It identifies in the INI and IN2 fields the operations which supply the input values required by the selected operation. It sets the IT_OFF and IP values of the selected operation and the operations which supply the input values equal to 10 and S, respectively. And, it adds pointers to a stack from which subsequent operations are selected for slot assignment. The pointers identify the operations which supply the input values for the selected operation and they are added to the stack in the order which results in the Top Down Left-to-Right Tree Walk. After the first operation has been scheduled, the next operation is selected from the stack and the scheduling algorithm is again called to determine its proper scheduling (step 290) . On this pass through algorithm 300, the value of IP associated with the selected operation will typically be non-zero. Thus, when algorithm 300 detects a non-zero value for S in step 304, it first increments S (step 305) and then compares the incremented value to N_Tot to determine whether the index S is still within the allowed range of data structure 200 (step 332) . If S is not greater than N_Tot, algorithm 300 checks the TYPE field of slot(S) to determine whether it has an assigned op-type (step 334) . On the other hand, if S is greater than N_Tot, algorithm 300 first moves to a next iteration assignment cycle by setting S equal to 1 and by incrementing 10 by one to identify that iteration cycle (step 336) and then it checks whether slot(S) has an assigned op-type (step 334) .

If slot(S) does not have an assigned op-type, indicating that an operation assignment has not been made to any field of that slot, algorithm 300 compares the^' value of N associated with the selected operation to zero (step 338) . If N_t does equal zero, indicating that no more slots can be assigned that op-type, algorithm 300 branches back to step 336 where it moves to a next iteration assignment cycle by setting S equal to 1 and by incrementing IO by one. However, if N_t does not equal zero, algorithm 300 branches to step 308 where it assigns the selected operation to the OUT field of slot(S) and performs the other previously described functions.

In step 334, if slot(S) does have an assigned op- type, algorithm 300 determines whether the assigned op- type is the same as the op-type of the selected operation (step 340) . If the assigned op-type and the op-type of the selected operation are not the same, indicating that the selected operation cannot be assigned to that slot, algorithm 300 branches back to step 305 where it increments S to continue the search for an appropriate slot. On the other hand, if the assigned op-type and the op-type of the selected operation are the same, algorithm 300 determines whether the OUT field of that slot already has an operation assigned to it (step 344) . If the OUT field is occupied, algorithm 300 branches to step 305 where it increments S to continue the search for an available slot. If the OUT field is not occupied, algorithm 300 assigns the selected operation to the OUT field of that slot and it sets IT_OFF and IP for the selected operation equal to the current values for IO and S, respectively (step 346) . Once the selected operation has been assigned to an OUT field of the appropriate slot, algorithm 300 branches to step 310 to schedule the selected operation relative to the other operations in the expression tree.

In summary, during the first phase of the scheduling process, algorithm 300 locates the first available slot that has the same op-type as the selected operation and that has no previous operation assigned to its OUT field. When it finds such a slot, it assigns the selected operation to the OUT field of that slot. During the second phase of scheduling algorithm 300, the steps are as previously described for assigning the first selected operation with some exceptions. For later selected operations, however, the incremented value of S from step 312 may become larger than N-^. When that happens, algorithm 300 moves to the next iteration cycle by branching to step 326 after step 314. That is, S is set equal to 1 and 10 is incremented by one. Thus, the search for the appropriate slot is carried into the next iteration cycle. The assignment process during this second phase moves back through data structure 200 one slot at a time, counting (with index W) the number of slots having an op- type which is the same as the op-type of the selected operation for which an assignment is being sought. Due to the nature of the assignment rules as described above, when algorithm 300 identifies the slot representing the P_t-th slot having the same op-type (i.e., the slot for which W=P_t) , the INST field for that slot will be unoccupied. Thus, the selected operation is assigned to its INST field. In essence, the second phase of scheduling algorithm 300 places the selected instruction precisely at the slot location which will cause an output to materialize during the slot in which that operation was assigned to an OUT field. Algorithm 300 always fills data structure 200 with exactly the same number of instructions as are in the tree expression, namely, N^^.. Therefore, there are no gaps in any of the pipelines once steady state operation is achieved and no operations are required during steady state operation for just flushing a pipe.

The sequence of operations appearing in the INST fields of slot(N_Tot) through slot(l) , determines the order in which those operations are to be scheduled in the RISC processor. That is, since the operations are scheduled in reverse order (i.e., later operations before earlier operations) , the proper execution of the resulting sequence of operations proceeds from the highest numbered slot to the lowest numbered slot. And the particular iteration in which an operation is assigned is determined by the value in ITER field for that slot (or equivalently, the final IT_0FF value stored for that operation) .

The lifetime of a result from a particular pipeline (i.e., the time it must be stored in a register) is the time difference between the slot in which the operation is assigned to an OUT field and the slot in which it is used as an input value. Since the first section of algorithm 300 finds the first empty slot, these lifetimes are minimized. The just-described procedure is further illustrated by using a simple tree 400 such as is shown in Fig. 6. Tree 400 involves five operations, namely, OP,, OP-, OP₃, OP₄, and OP₅. In this example, the operations are assumed to be of two types, either Type 1 or Type 2. Operations OP_j,, OP₃, and 0P₄ are Type 1 and have an associated pipeline length of 2 (P_τl=2) . And operations 0P₂ and OP₅ are Type 2 and have an associated pipeline length of 3 (P_χ2=3) . The values computed for N-_Eot- ^Nι ^{and N} ₂ ^{are 5}* ³* ^and » respectively. Before beginning the scheduling process, the two component positions, i.e., <IT_OFF,IP>, of all operations were set to <0,0>.

At the start of the scheduling process, the top most node, i.e., OP_χ, is selected and scheduling algorithm 300 is executed to schedule that operator in the appropriate slots of a data structure 500 (as shown in Figs. 7a-e) . Since the two component position associated with OP_χ equals <0,0>, the values of 10 and S are both set equal to zero in step 302. Since S=0, when algorithm 300 executes step 304, it branches to step 306 where it increments S (S=l) and then assigns OP_χ to the OUT field of slot(l) (step 308) . In other words, the output for OP_j from the associated pipeline is scheduled to materialize during the time interval associated with slot(l). Since OP_λ is a Type 1 op-type, algorithm 300 also assigns Type 1 to slot(l) by inserting a one in the TYPE field of slot(l) . After assigning the op-type to the slot, algorithm decrements N_λ to 2, indicating that only two more slots may be assigned that op-type.

In the second phase, algorithm 300 determines when OP_χ must enter its associated pipeline in order to produce an output in slot(l) . After setting W to 0 and incrementing S to 2, algorithm 300 makes sure that it is dealing with a valid slot number by comparing S to N_Tot = 5 (step 314) . Since S is less than 5, algorithm 300 then checks whether slot(2) has been assigned an op-type (step 316) . Upon determining that slot(2) has no assigned op- type, algorithm 300 determines whether more Type 1 assignments may be made (i.e., does N_χ=0? step 318).

There being two remaining Type 1 assignments, algorithm 300 uses one of those assignments to designate slo (2) to be a Type 1 slot and then decrements N_χ to one (step 320) . Following the op-type assignment, counter W is incremented to one and then compared to P_τl to determine whether the separation between slot(l) and slot(2) is equivalent to the length of the associated pipeline (steps 322 and 324) . Since W is less than the associated pipeline length, algorithm 300 increments S to 3 (step 312) and then examines slot(3) .

During this next pass through the second section of algorithm 300, the same sequence of steps is repeated and slot(3) is assigned Type 1 as its op-type. This time, however, the incremented value of W equals 2 (i.e., the associated pipeline length) , so OP_χ is assigned to the INST field of slot(3) (step 330). In addition, it sets the value in the ITER field equal to 0 (since IO=0) . It identifies in the INI and IN2 fields the operations which supply the input values required by 0P_χ, namely OP₂ and 0P₃. It sets the IT_0FF and IP values of 0P_χ, 0P₂, and OP₃ equal to the current values of IO and S, respectively (10=0 and S=3) . And, it adds pointers to 0P₂ and 0P₃ to a stack 600 from which subsequent operations are selected for slot assignment.

Consequently, at the conclusion of executing scheduling algorithm 300 for 0P_{χ l} the entries in data structure 500 are as shown in Fig. 7a.

The next operation to be scheduled is selected from the top of stack 600. It is 0P₂. Algorithm 300 looks for the first available slot that is before the slot in which the output from 0P₂ is needed and it assigns OP₂ to the OUT field of that slot. In this pass through algorithm 300, the values for IO and S are set equal to the stored values that were stored for IT_0FF and IP of 0P₂, namely, <0,3>. After detecting that S is not equal to zero (step 302), algorithm 300 increments S to 4 and then compares the incremented value to N_Tot (step 332) . Since s still is within the valid range, algorithm 300 then checks whether slot(4) has an op-type assigned to it (step 334) . After determining that no such assignment has been made, algorithm 300 checks N₂ to determine whether the slot may be assigned a Type 2 op- type. Since no Type 2 op-type assignments have yet been made (i.e., N₂ still equals 2), algorithm 300 designates slot(4) as a Type 2 slot and it assigns 0P₂ to its OUT field. Algorithm 300 also decrements N₂ to one and updates the two component position for 0P₂ so that it is <0,4>. Next, algorithm 300 determines when 0P₂ must enter its associated pipeline in order to produce an output in slot(4) . After setting W to 0 and incrementing S to 5, algorithm 300 again makes sure that it is dealing with a valid slot number by comparing S to N_Tot = 5 (step 314). Detecting that slot(5) is a valid slot and it has no op- type assignment, algorithm 300 then assigns it to be a Type 2 slot, decrements N₂ to zero (indicating that no more Type 2 slots may be created) and increments W (step 320 and 322). Since W is less than P_τ2 (i.e., the length of the pipeline associated with 0P₂) , 0P₂ cannot be assigned to slot(5) and the search must continue onto the next slot. Thus, S is incremented to 6 (step 312) . In step 314, algorithm 300 detects that S is greater than N_Iot, so it resets S to one and increments 10 by one to indicate that it has moved into another iteration cycle to find the appropriate slot. After determining that slot(l) , slot(2) and slot(3) are Type 1 slots, algorithm 300 ends up at slot(4) which is a Type 2 slot. Upon detecting that the op-type for slot(4) is the same as the op-type for 0P₂, algorithm 300 increments W (so that W=2) and then compares W to P_τ2 (steps 322 and 324) . Since W is still less than P_τ2, OP₂ cannot be assigned to slot(4) and the search continues onto the next slot (i.e., S is incremented in step 312). Since slot(5) is a Type 2 slot, algorithm 300 again increments W and compares it to P_τ2. This time, however, W equals P_τ2, so algorithm 300 assigns OP₂ to the INST field of slot(5) . In addition, it sets the value in the ITER field equal to the current value for IO (i.e., 1) . ^" It identifies in the INI and IN2 fields the operations which supply the input values required by 0P₂, namely 0P₄ and OP₅. It sets the IT_0FF and IP values of OP₂, 0P₄, and OP₅ equal to the current values of 10 and S, respectively (10=1 and S=5) . And, it adds pointers to OP₄ and 0P₅ to stack 600.

At the conclusion of executing scheduling algorithm 300 for 0P₂, the entries in data structure 500 are as shown in Fig. 7b. So, the next operation available for scheduling is 0P₄.

Upon selecting 0P₄, algorithm 300 again sets the values for 10 and S based upon the stored two component position for OP₄, namely, <1,5> (step 302) and then increments S (step 305) . Since S is now greater than N_iot- algorithm 300 moves into the next iteration cycle to find the appropriate slot assignment for 0P₄ (i.e., it resets S to one and increments 10 to a value of 2) . Algorithm 300 searches up through data structure 500 one slot at a time, until it finds the first Type 1 slot that has no operation assigned to its OUT field. When it finds that slot, which in this case is slot(4) , it assigns OP₄ to the OUT field of that slot and updates the two component position for OP₄ to <2,2>. Then, algorithm 300 determines when OP₄ must enter its associated pipeline in order to produce an output in slot(2). After setting W to zero (step 310), algorithm 300 searches up through the slots of data structure 500 while counting with counter W the number of slots having the same op-type as 0P₄. The search continues until W equals P_τl. Thus, at slot(3), which is a Type 1 slot, W is incremented to one. At slot(4) and slot(5) , both of which are Type 2 slots, W is not incremented. After slot(5), the search moves into the next iteration cycle (10=3) and starts over again at slot(l) . At slot(l) of this iteration cycle W is incremented to a value which equals P_τl, so algorithm 300 assigns OP₄ to slot(l). As before, algorithm 300 also sets the value in the ITER field equal to the current value for IO (i.e., 3) and it identifies in the INI and IN2 fields the operations which supply the input values required by OP₂. In this instance, since OP₄ does not use input values from other operators, no operators are identified in the INI and IN2 fields. Algorithm 300 also updates the stored two component position of OP₄ to equal <3,1>.

At the conclusion of executing scheduling algorithm 300 for OP₄, the entries in data structure 500 are as shown in Fig. 7c; and the next operation available for scheduling is OP₅. Upon selecting 0P₅, algorithm 300 again sets the values for IO and S based upon the stored two component position for OP₅, namely, <1,5> (step 302) and then increments S (step 305) . Since S is now greater than N_Tot, algorithm 300 moves into the next iteration cycle to find the appropriate slot assignment for 0P₄ (i.e., it resets S to one and increments IO=2) . Algorithm 300 searches up through data structure 500 one slot at a time until it finds the first Type 2 slot that has no operation assigned to its OUT field. When it finds that slot, which in this case is slot(5) , it assigns 0P₅ to the OUT field of that slot and updates the two component position for OP₅ to <2,5>.

Next, algorithm 300 determines when 0P₅ must enter its associated pipeline in order to produce an output in slot(5) . After setting W to zero (step 310) , algorithm 300 again searches up through the slots of data structure 500 while counting with counter W the number of slots having the same op-type as 0P_g. Each time the search goes over the top of data structure 500 (i.e., S=6) , the search moves into the next iteration cycle. When W equals P_τ2, algorithm 300 assigns OP₅ to the INST field of that slot. In this case, W reaches 3 at slot(4) of the fourth iteration cycle (i.e., 10=4). Algorithm 300 sets the value in the ITER field of slot(4) equal to the current value for 10 (i.e., 4) and it identifies in the INI and IN2 fields the operations which supply the input values required by 0P₅. Since 0P₅ does not use input values supplied by other operators, no operators are identified in the INI and IN2 fields. Algorithm 300 also updates the stored two component position of OP₅ to equal <4,4>.

At the conclusion of executing scheduling algorithm 300 for 0P₅, the entries in data structure 500 are as shown in Fig. 7d; and the next operation available for scheduling is 0P₃.

Using the above-described procedures it should be readily apparent that 0P₃ is first assigned to the OUT field of slot(3) at which point its stored position is updated to equal <1,3>. Then, OP₃ is assigned to the INST field of slot(2) and its final stored position becomes <2,2>. Thus, the entries in data structure 500 at the completion of the scheduling process are as shown in Fig. 7e. The resulting scheduling order as determined by algorithm 300 is 0P₂{1}, OP₅{4}, OP₁{0}, OP₃{2} and OP₄{3}, where the number in brackets {} is the iteration offset.

Figs. 8a and 8b illustrate the beginning and the end, respectively, of a schedule for running the iterative construct illustrated by tree 400 shown in Fig. 6. The schedule is for 101 iterations of the construct. The numbers along the horizontal axis identify the iteration cycle of the RISC processor. The vertical axis identify the instruction cycle within an iteration cycle. Thus, each iteration cycle consists of 5 instruction cycles. Within the charts, the convention is as follows. The entry above the slash (/) identifies the operation which is begun during that time interval and the entry below the slash identifies the output which becomes available form its associated pipeline during that time interval. The operations are designated as 0P_χ[i], where x identifies the operation and i identifies the iteration number. FILL and DRAIN refers to fill code and drain code. Fill code is used at the beginning of running to fill any gaps that may exist until steady state operation is achieved. And drain code is used to push out the results at the end of the process. Both fill and drain code can be other operations which either precede or follow the iterative construct or they can be NOP operations which produce no result other than to clear the pipelines.

As shown in Fig. 8a, during the first iteration cycle (i.e., iteration cycle number 0) only the 0^th iteration of 0P₅ (i.e., OP₅[0]) is begun. OP₅[0] is begun during the second instruction cycle of that iteration cycle. And for each successive iteration cycle thereafter until all iterations of OP₅ have been run, another succeeding iteration of OP₅ is begun during this same instruction cycle. All other instruction cycles within the first iteration cycle receive fill code. In iteration cycle number 1, OP_g[l] is begun during the second instruction cycle, as previously stated, and OP₄[0] is begun during the fifth instruction cycle. All other instruction cycles receive fill code. In iteration cycle number 2, the output values corresponding to OP_S[0] and to OP₄[0] materialize out of their associated pipelines during instruction cycles 1 and 4, respectively. In addition, OP₅[2] is begun during the second instruction cycle, OP₃[0] is begun during the fourth instruction cycle and 0P₄[1] is begun during the fifth instruction cycle. Instruction cycles 1 and 3 receive fill code.

In iteration cycle number 3, the output values corresponding to OP₅[l], OP₃[0], and 0P₄[1] materialize out of their associated pipelines during instruction cycles 1, 3 and 4, respectively. In addition, operations OP₂[0], OP₅[3], OP₃[l] and OP₄[2] are begun during instruction cycles 1, 2, 4, and 5, respectively. Instruction cycle 3 receives fill code. Note that the input values for OP₂[0], namely, outputs from the pipelines associated with OP₄[0] and OP₅[0], were made available during the previous iteration cycle.

Finally, in iteration cycle number^*4, the output values corresponding to 0P₅[2], OP₂[0], 0P₃[1], OP₄[2] and OP^O] materialize out of their associated pipelines during instruction cycles 1, 2, 3, 4, and 5, respectively. In addition, operations 0P₂[l], 0P_[4], ^op _χ[°]_/ ^op ₃[ ] and OP₄[3] are begun during instruction cycles 1, 2, 3, 4, and 5, respectively. Note that both input values for OP_j O], namely, outputs from the pipelines associated with OP₂[0] and OP₃[0], were made available prior to instruction cycle 3, as required. In particular, the output of OP₂[0] was made available during the immediately preceding instruction cycle and the output of OP₃[0] was made available during the immediately preceding iteration cycle.

Note that during iteration cycle 4, there are no gaps in any instruction cycles requiring the use of fill code. That is, the associated pipelines are fully utilized until throughout the rest of the processing of the iterative construct through iteration cycle 100, as shown if Fig. 8b. After iteration 100, the different pipelines empty out at different time until they are all emptied of pending operations from the iterative construct by the end of iteration cycle 104. At that point, processing for the iterative construct is complete. The above described procedure easily accommodates iterative constructs with multiple assignment statements. For example, OP₃ in Fig. 6 may also be defined by another assignment statement that is represented by a different tree, i.e., a sub-tree. In that case, a complete tree representation of the iterative construct will also include a network of nodes descending down from OP₃ representing the sub-tree. The operations of the complete iterative construct can be scheduled by applying the scheduling algorithm to the complete tree as described above. Or the scheduling algorithm can be applied to the main tree and the sub-tree separately and the two schedules are later combined by taking into account the slot position and iteration offset associated with OP₃. That is, all of the scheduled operations of the sub-tree can be incorporated into the other schedule by adjusting their two component positions based on the the slot position and iteration offset associated with

OP,

Other embodiments are within the following claims.

What is claimed is:

Claims

1. An apparatus for scheduling instructions of an iterative construct to run on a pipelined processor, the apparatus comprising: means for creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and a scheduler for scheduling the instructions of the iterative construct into said plurality of slots so that every one of said plurality of slots has a different one of the instructions of the iterative construct scheduled therein, said scheduled instructions being drawn from more than one iteration of the iterative construct.

2. The scheduling apparatus of claim 1 further comprising a compiler for generating an expression tree representation of the iterative construct, and wherein said scheduler uses said tree representation to select instructions for scheduling.

3. The scheduling apparatus of claim 2 wherein the scheduler proceeds through the tree representation in a reverse execution order one instruction at a time starting with the last instruction.

4. The scheduling apparatus of claim 3 wherein the scheduler employs a top-down, left-to-right tree walk to proceed through the tree representation.

5. The scheduling apparatus of claim 1 further comprising computational logic for determining the total number of instructions, N_Tot, in one iteration of the iterative construct, and wherein said slot creating means uses N_Tot to determine the number of instruction slots to create.

6. The scheduling apparatus of claim 5 wherein said computational logic is also for determining for each one of a plurality of pipeline types the total number of instructions, N_t, within one iteration of the iterative construct that are of that type, where t is an index corresponding to the relevant pipeline type.

7. The scheduling apparatus of claim 6 wherein said scheduler comprises means for assigning an appropriate one of said pipeline types to each of said instruction slots and said assignment means limits the number of slots that are assigned a given pipeline type to the value of N_t determined for that pipeline type.

8. The scheduling apparatus of claim 1 wherein said scheduler comprises logic for determining an iteration offset for each instruction assigned to one of said plurality of instruction slots.

9. The scheduling apparatus of claim 1 wherein said scheduler comprises logic for identifying for each selected instruction a corresponding output slot during which an output from a pipeline associated with the selected instruction becomes available.

10. The scheduling apparatus of claim 9 wherein said scheduler further comprises logic for identifying for each selected instruction the corresponding instruction slot into which that selected instruction will be scheduled based upon the output slot corresponding to that instruction.

11. The scheduling apparatus of claim 1 wherein the iterative construct is available in the form of an expression tree representation of the iterative construct, and wherein said scheduler uses said tree representation to select instructions for scheduling.

12. The scheduling apparatus of claim 11 wherein the scheduler proceeds through the tree representation in a reverse execution order one instruction at a time starting with the last instruction.

13. The scheduling apparatus of claim 12 wherein the scheduler employs a top-down, left-to-right tree walk to proceed through the tree representation.

14. The scheduling apparatus of claim 1 wherein the number of instruction slots that are created equals the number of instructions in one iteration of the iterative construct; and

15. A method for scheduling instructions of an iterative construct to run on a pipelined processor, the method comprising: creating a plurality of instruction slots equal in number to at least the number of instructions in one iteration of the iterative construct; and scheduling the instructions of the iterative construct into said plurality of slots so that every one of said plurality of slots has a different one of the instructions of the iterative construct scheduled therein, said scheduled instructions being drawn from more than one iteration of the iterative construct.

16. The method of claim 15 further comprising generating an expression tree representation of the iterative construct, and wherein said scheduling step uses said tree representation to select instructions for scheduling.

17. The method of claim 15 further comprising determining the total number of instructions, N_Tot, in one iteration of the iterative construct, and wherein said slot creating step uses N_Tot to determine the number of instruction slots to create.

18. The method of claim 15 further comprising determining for each one of a plurality of pipeline types the total number of instructions, N_t, within one iteration of the iterative construct that are of that type, where t is an index corresponding to the relevant pipeline type.

19. The method of claim 15 further comprising determining an iteration offset for each instruction assigned to one of said plurality of instruction slots.

20. The method of claim 15 wherein the scheduling step comprises identifying for each selected instruction a corresponding output slot during which an output from a pipeline associated with the selected instruction becomes available.

21. The method of claim 20 wherein said scheduling step further comprises identifying for each selected instruction the corresponding instruction slot into which that selected instruction will be scheduled based upon the output slot corresponding to that instruction.