US20170147347A1 - Method of synchronizing independent functional unit - Google Patents
Method of synchronizing independent functional unit
- Publication number
- US20170147347A1 (application US 14/950,452)
- Authority
- US
- United States
- Prior art keywords
- program
- functional processing
- processing unit
- stream
- instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30076—Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
- G06F9/30087—Synchronisation or serialisation instructions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0875—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30076—Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
- G06F9/3009—Thread control instructions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/32—Address formation of the next instruction, e.g. by incrementing the instruction counter
- G06F9/321—Program or instruction counter, e.g. incrementing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/45—Caching of specific data in cache memory
- G06F2212/452—Instruction code
Definitions
- This disclosure relates to parallel processing and particularly to synchronization of multiple functional units.
- a fast synchronization method is needed.
- One known method of parallel usage of multiple functional units is decoupled access/execute architecture (DAE arch), in which two independent units communicate through two queues and synchronization is achieved by the same queuing mechanism.
- DAE arch decoupled access/execute architecture
- Modern arch modern out-of-order architecture
- VLIW architecture in which all functional units proceed in lock-step.
- the system includes a first functional processing unit, a first program counter and a first program instruction buffer used by the first functional processing unit.
- the system includes a second functional processing unit, a second program counter and a second program instruction buffer used by the second functional processing unit.
- the first functional processing unit is in communication with the second functional processing unit and configured to synchronize the issuance of the first stream of program instructions to the second stream of program instructions, and the second functional processing unit is in communication with the first functional processing unit and configured to synchronize the issuance of the second stream of program instructions to the first stream of program instructions.
- each functional processing unit places a limit on the program counter of the other functional unit.
- At least one of the first and second program instruction buffers includes ‘set limit’ instructions inserted in the respective first and second program instruction streams.
- the first and second program instruction buffers include at least one pair of wait-go instructions in which one instruction of the wait-go pair is inserted in the first program instruction stream and the other instruction of the wait-go pair is inserted in the second program instruction stream.
- the first program instruction stream includes at least one ‘wait’ instruction and a matching ‘go’ instruction is included in the second instruction stream.
- the first and second program instruction buffers include at least one pair of instructions inserted with wait-go bits, in which one instruction of the pair has one of the wait-go bits and is inserted in the first program instruction stream, and the other instruction of the pair has the other of the wait-go bits and is inserted in the second program instruction stream.
- each instruction comes with attributes, such as additional bit fields, that indicate ‘wait’ or ‘go’. Instructions in the first program instruction stream may have ‘wait’ attributes and instructions in the second program instruction stream have matching ‘go’ attributes.
- the disclosure is directed to a method for synchronizing parallel processing in a system having a first functional processing unit, a first program counter and a first program instruction buffer used by the first functional processing unit, and a second functional processing unit, a second program counter and a second program instruction buffer used by the second functional processing unit.
- the method includes synchronizing at least one of the issuance of the first stream of program instructions to the second stream of program instructions, through communication from the first functional processing unit to the second functional processing unit, and the issuance of the second stream of program instructions to the first stream of program instructions, through communication from the second functional processing unit to the first functional processing unit.
- the disclosure is directed to a non-transitory article of manufacture tangibly embodying computer readable instructions, which when implemented, cause a computer to perform the steps of a method for synchronizing parallel processing system having a first functional processing unit, a first program counter and a first program instruction buffer used by the first functional processing unit, and a second functional processing unit, a second program counter and a second program instruction buffer used by the second functional processing unit.
- a first synchronization setting unit is in communication with the first and second functional processing units and a second synchronization setting unit is in communication with the first and second functional processing units.
- the method includes synchronizing at least one of the issuance of the first stream of program instructions to the second stream of program instructions, through communication from the first functional processing unit to the second functional processing unit, and the issuance of the second stream of program instructions to the first stream of program instructions, through communication from the second functional processing unit to the first functional processing unit.
- FIG. 1A is a block diagram of one embodiment of a counter limit implementation.
- FIG. 1B is a schematic diagram of the instruction streams of two functional units of the embodiment of FIG. 1A.
- FIG. 2A is a block diagram of one embodiment of a wait-go instruction pair implementation.
- FIG. 2B is a schematic diagram of the instruction streams of two functional units in the embodiment of FIG. 2A.
- FIG. 3A is a block diagram of one embodiment of a wait-go bit pair implementation.
- FIG. 3B is a schematic diagram of the instruction streams of two functional units in the embodiment of FIG. 3A.
- FIG. 4 is a block diagram of an exemplary computing system suitable for implementation of this invention.
- one embodiment of this disclosure includes a system for synchronizing parallel processing of a plurality of functional processing units.
- the system 10 includes a first functional processing unit APU 12 having a first program counter 14 .
- the first program counter 14 is configured to control timing of program instructions issued to the first functional processing unit APU 12 by advancement of the first program counter 14.
- a first program instruction buffer 16 is used by the first functional processing unit 12.
- the first program counter 14 is configured to point to a current instruction in the first program instruction buffer 16, which is read and issued by the first functional processing unit APU 12.
- a second functional processing unit LD 18 has a second program counter 20 .
- the second program counter 20 is configured to control timing of program instructions issued to the second functional processing unit LD 18 by advancement of the second program counter 20 .
- a second program instruction buffer 22 is used by the second functional processing unit 18 .
- the second program counter 20 is configured to point to a current instruction in the second program instruction buffer 22 which is read and issued by the second functional processing unit LD 18 .
- the first functional processing unit APU 12 is in communication with the second functional processing unit LD 18 to control the issuance of program instructions of the second functional processing unit LD 18 .
- the second functional processing unit LD 18 is in communication with the first functional processing unit APU 12 to control the issuance of program instructions of the first functional processing unit APU 12 .
- synchronization between the program instructions of the functional processing units is provided by placing a limit on program counter advancement.
- ‘set limit’ instructions 30 are inserted in instruction buffer 16 to set a limit 26 on the advancement of program counter 20 of LD unit 18 .
- set limit instructions 28 are inserted in instruction buffer 22 to set a limit 24 on the advancement of program counter 14 of APU 12 .
- the limit instruction 30 in the program instruction buffer 16 limits the advancement of the program counter 20 to synchronize it with the advancement of the program counter 14.
- the limit instruction 28 in the program instruction buffer 22 limits the advancement of the program counter 14 to synchronize it with the advancement of the program counter 20.
- instruction stream 32 is for APU 12 and instruction stream 34 is for LD 18 .
- instructions 3 , 5 , 7 of stream 34 depend on instructions 2 , 4 , 6 of stream 32 .
- Set limit instructions 30a, 30b and 30c are inserted into instruction stream 32. Initially, APU 12 sets the limit 26 for LD 18 to 2. Set limit instruction 30a causes APU 12 to set the limit of LD unit 18 to 4 after executing instruction 2 of stream 32.
- If LD 18 stream 34 reaches 2 before the APU 12 stream 32 passes 2, LD 18 stream 34 waits at instruction 2 until the APU 12 changes the limit of LD 18 to 4.
- Similarly, set limit instruction 30b causes APU 12 to set the limit of LD unit 18 to 6 after executing instruction 4 of stream 32. If LD 18 stream 34 reaches 4 before the APU 12 stream 32 passes 4, LD 18 stream 34 waits at instruction 4 until the APU 12 changes the limit of LD 18 to 6.
- synchronization between the program instructions of the functional processing units is provided by inserting one or more wait-go instruction pairs in the instruction streams.
- communication in the direction from APU 12 to LD 18 is through an APU-LD counter 36 .
- Communication in the direction from LD 18 to APU 12 is through an LD-APU counter 38 .
- One or more wait-go instruction pairs 40, 42 are inserted in program instruction buffers 16 and 22, respectively, and one or more wait-go instruction pairs 44, 46 are inserted in program instruction buffers 22 and 16, respectively.
- the first program instruction stream includes at least one ‘wait’ instruction, and a matching ‘go’ instruction is included in the second instruction stream.
- As shown in one exemplary embodiment in FIG. 2B, instruction stream 48 is for APU 12 and instruction stream 50 is for LD 18. Instructions 3, 5, 7 of stream 50 depend on instructions 2, 4, 6 of stream 48. Initially, counters 14, 20, 36 and 38 are set to zero. Wait instructions 44a, 44b, 44c are inserted before instructions 3, 5, 7 of stream 50. Go instructions 46a, 46b, 46c are inserted after instructions 2, 4, 6 of stream 48. If the go instruction 46a of wait-go pair 44a-46a is reached first, the APU-LD counter 36 is incremented.
- If the wait instruction 44a of the wait-go pair 44a-46a is reached first, LD 18 checks the APU-LD counter 36 and, if it is zero, the LD 18 stream 50 waits until the APU-LD counter 36 is incremented by APU 12. If the APU-LD counter 36 is nonzero when checked, the APU-LD counter 36 is decremented and stream 50 proceeds.
- synchronization between the program instructions of the functional processing units is provided by inserting one or more wait-go bits to the instruction streams.
- each instruction comes with attributes, such as additional bit fields, that indicate ‘wait’ or ‘go’.
- Instructions in the first program instruction stream may have ‘wait’ attributes and instructions in the second program instruction stream have matching ‘go’ attributes.
- Instructions 3, 5, 7 in LD 18 stream 62 depend on instructions 2, 4, 6 of APU 12 stream 60. Initially, all counters are set to zero. Go bits 56a, 56b, 56c are injected into instructions 2, 4, 6 of stream 60. Wait bits 58a, 58b, 58c are injected into stream 62.
- When an instruction of stream 60 with an injected go bit is reached, such as instruction 2 with go bit 56a, APU-LD counter 36 is incremented and stream 60 proceeds to instruction 3. If instruction 3 of stream 62 with injected wait bit 58a is reached, LD unit 18 checks APU-LD counter 36. If the counter 36 is zero, LD unit 18 waits until APU 12 increments counter 36. If counter 36 is nonzero, counter 36 is decremented and stream 62 proceeds.
- This invention achieves parallel usage of multiple functional units while being more flexible than VLIW arch in that functional units are not in lockstep.
- the invention is also more flexible than Modern arch in that each FU is independent, having its own program counter, and it is a more lightweight mechanism than DAE arch and Modern arch. Unlike DAE arch, the register file can be shared among FUs and, unlike Modern arch, complex register renaming is not needed.
- a plurality of counters can be used for each direction, that is, APU-LD and LD-APU.
- the wait-go pair has a bit vector specifying the counters.
- FIG. 4 shows an exemplary computer system 100 , which is applicable to implement embodiments of the present invention.
- computer system 100 can include: CPU (Central Processing Unit) 101, RAM (Random Access Memory) 102, ROM (Read Only Memory) 103, System Bus 104, Hard Drive Controller 105, Keyboard Controller 106, Serial Interface Controller 107, Parallel Interface Controller 108, Display Controller 109, Hard Drive 110, Keyboard 111, Serial Peripheral Equipment 112, Parallel Peripheral Equipment 113 and Display 114.
- CPU Central Processing Unit
- RAM Random Access Memory
- ROM Read Only Memory
- CPU 101, RAM 102, ROM 103, Hard Drive Controller 105, Keyboard Controller 106, Serial Interface Controller 107, Parallel Interface Controller 108 and Display Controller 109 are coupled to System Bus 104.
- Hard Drive 110 is coupled to Hard Drive Controller 105 .
- Keyboard 111 is coupled to Keyboard Controller 106 .
- Serial Peripheral Equipment 112 is coupled to Serial Interface Controller 107 .
- Parallel Peripheral Equipment 113 is coupled to Parallel Interface Controller 108 .
- Display 114 is coupled to Display Controller 109 . It should be understood that the structure as shown in FIG. 4 is only for exemplary purposes rather than any limitation to the present invention. In some cases, some devices can be added to or removed from computer system 100 based on specific situations.
- aspects of the present invention can be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that can all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, aspects of the present invention can take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- the computer readable medium can be a computer readable signal medium or a computer readable storage medium.
- a computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above.
- a computer readable storage medium can include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- a computer readable storage medium can be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- a computer readable signal medium can include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal can take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
- a computer readable signal medium can be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium can be transmitted using any appropriate medium, including, but not limited to, wireless, wire line, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention can be written in any combination of one or more programming languages, including an object oriented programming language, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider).
- LAN local area network
- WAN wide area network
- Internet Service Provider for example, AT&T, MCI, Sprint, EarthLink, MSN, GTE, etc.
- These computer program instructions can also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture, including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented method such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
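The counter-limit embodiment of FIG. 1B described in the definitions above can be sketched as a small software simulation. This is only an illustrative model, not the disclosed hardware: the names `Unit`, `step` and `run`, the seven-instruction stream length, the new limit of 8 written after instruction 6, and the choice to leave the APU unrestricted are all assumptions made for the example. The limit is read here as the highest instruction number a unit may issue.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    name: str
    n_instr: int   # instructions are numbered 1..n_instr (assumed length)
    limit: int     # highest instruction number this unit may issue
    # {own instruction number: new limit written to the *other* unit}
    set_limit_after: dict = field(default_factory=dict)
    pc: int = 1    # program counter: next instruction to issue

    def done(self) -> bool:
        return self.pc > self.n_instr


def step(unit: Unit, other: Unit, trace: list) -> bool:
    """Issue one instruction on `unit` if its limit allows; return True if issued."""
    if unit.done() or unit.pc > unit.limit:
        return False  # finished, or stalled at the limit
    issued = unit.pc
    trace.append((unit.name, issued))
    unit.pc += 1
    # A 'set limit' placed right after `issued` raises the other unit's limit.
    if issued in unit.set_limit_after:
        other.limit = unit.set_limit_after[issued]
    return True


def run(schedule):
    """Replay `schedule` (a list of unit names), then alternate until both finish."""
    # Limits 4 and 6 follow the text; 8 after instruction 6 is an assumption.
    apu = Unit("APU", 7, limit=7, set_limit_after={2: 4, 4: 6, 6: 8})
    ld = Unit("LD", 7, limit=2)  # initial limit of 2, as in the text
    pairs = {"APU": (apu, ld), "LD": (ld, apu)}
    trace = []
    for name in schedule:
        step(*pairs[name], trace)
    while not (apu.done() and ld.done()):
        a = step(apu, ld, trace)
        b = step(ld, apu, trace)
        if not (a or b):
            raise RuntimeError("both units stalled")
    return trace


if __name__ == "__main__":
    # LD is given a head start but cannot get past instruction 2 on its own.
    print(run(["LD"] * 5))
```

Whatever schedule is replayed, LD instruction k+1 should appear in the trace only after APU instruction k for k = 2, 4, 6, which is the dependency the set limit instructions are meant to enforce.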
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Multi Processors (AREA)
Abstract
Description
- This invention was made with Government support under Contract No.: B599858 awarded by Department of Energy. The Government has certain rights in this invention.
- This disclosure relates to parallel processing and particularly to synchronization of multiple functional units.
- In many parallel processing systems, there are multiple functional units working independently but sharing a register file. In such a system, a fast synchronization method is needed. One known method of parallel usage of multiple functional units is decoupled access/execute architecture (DAE arch), in which two independent units communicate through two queues and synchronization is achieved by the same queuing mechanism. Another known method is modern out-of-order architecture (Modern arch), in which several functional units work in parallel but are driven by a single program counter. In this method, dependency is enforced by complex register renaming and an interlocking pipeline is used. A third known method is VLIW architecture, in which all functional units proceed in lock-step.
- This disclosure is directed to a system for synchronizing parallel processing of a plurality of functional processing units. In one embodiment, the system includes a first functional processing unit, a first program counter and a first program instruction buffer used by the first functional processing unit. The system includes a second functional processing unit, a second program counter and a second program instruction buffer used by the second functional processing unit. The first functional processing unit is in communication with the second functional processing unit and configured to synchronize the issuance of the first stream of program instructions to the second stream of program instructions, and the second functional processing unit is in communication with the first functional processing unit and configured to synchronize the issuance of the second stream of program instructions to the first stream of program instructions.
- In one embodiment, each functional processing unit places a limit on the program counter of the other functional unit. At least one of the first and second program instruction buffers includes ‘set limit’ instructions inserted in the respective first and second program instruction streams.
- In one embodiment, the first and second program instruction buffers include at least one pair of wait-go instructions in which one instruction of the wait-go pair is inserted in the first program instruction stream and the other instruction of the wait-go pair is inserted in the second program instruction stream. In one example, the first program instruction stream includes at least one ‘wait’ instruction and a matching ‘go’ instruction is included in the second instruction stream.
- In one embodiment, the first and second program instruction buffers include at least one pair of instructions inserted with wait-go bits, in which one instruction of the pair has one of the wait-go bits and is inserted in the first program instruction stream, and the other instruction of the pair has the other of the wait-go bits and is inserted in the second program instruction stream. In one example, each instruction comes with attributes, such as additional bit fields, that indicate ‘wait’ or ‘go’. Instructions in the first program instruction stream may have ‘wait’ attributes and instructions in the second program instruction stream have matching ‘go’ attributes.
- In one embodiment, the disclosure is directed to a method for synchronizing parallel processing in a system having a first functional processing unit, a first program counter and a first program instruction buffer used by the first functional processing unit, and a second functional processing unit, a second program counter and a second program instruction buffer used by the second functional processing unit. The method includes synchronizing at least one of the issuance of the first stream of program instructions to the second stream of program instructions, through communication from the first functional processing unit to the second functional processing unit, and the issuance of the second stream of program instructions to the first stream of program instructions, through communication from the second functional processing unit to the first functional processing unit.
- In one embodiment, the disclosure is directed to a non-transitory article of manufacture tangibly embodying computer readable instructions which, when implemented, cause a computer to perform the steps of a method for synchronizing a parallel processing system having a first functional processing unit, a first program counter and a first program instruction buffer used by the first functional processing unit, and a second functional processing unit, a second program counter and a second program instruction buffer used by the second functional processing unit. A first synchronization setting unit is in communication with the first and second functional processing units and a second synchronization setting unit is in communication with the first and second functional processing units. The method includes synchronizing at least one of the issuance of the first stream of program instructions to the second stream of program instructions, through communication from the first functional processing unit to the second functional processing unit, and the issuance of the second stream of program instructions to the first stream of program instructions, through communication from the second functional processing unit to the first functional processing unit.
- These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings, in which:
-
FIG. 1A is a block diagram of one embodiment of a counter limit implementation. -
FIG. 1B is a schematic diagram of the instructions streams of two functional units of the embodiment ofFIG. 1A . -
FIG. 2A is a block diagram of one embodiment of a wait-go instruction pair implementation. -
FIG. 2B is a schematic diagram of the instructions streams of two functional units in the embodiment ofFIG. 2A . -
FIG. 3A is a block diagram of one embodiment of a wait-go bit pair implementation. -
FIG. 3B is a schematic diagram of the instructions streams of two functional units in the embodiment ofFIG. 3A . -
FIG. 4 is a block diagram of an exemplary computing system suitable for implementation of this invention. - As shown in
FIG. 1 , one embodiment of this disclosure includes a system for synchronizing parallel processing of a plurality of functional processing units. Thesystem 10 includes a first functional processing unit APU 12 having afirst program counter 14. The firstprogram counter unit 14 is configured to control timing of program instructions issued to the first functional processing unit APU 12 by advancement of thefirst program counter 14. A firstprogram instruction buffer 16 is used by the firstfunctional processing unit 12. Thefirst program counter 14 is configured to point current instruction in the firstprogram instruction buffer 16 which is read and issued by the first functional processing unit APU 12. - A second functional processing unit LD 18 has a
second program counter 20. Thesecond program counter 20 is configured to control timing of program instructions issued to the second functional processing unit LD 18 by advancement of thesecond program counter 20. A secondprogram instruction buffer 22 is used by the secondfunctional processing unit 18. Thesecond program counter 20 is configured to point to a current instruction in the secondprogram instruction buffer 22 which is read and issued by the second functional processing unit LD 18. - The first functional processing unit APU 12 is in communication with the second functional
processing unit LD 18 to control the issuance of program instructions of the second functionalprocessing unit LD 18. The second functionalprocessing unit LD 18 is in communication with the first functional processing unit APU 12 to control the issuance of program instructions of the first functionalprocessing unit APU 12. - In the embodiment of
FIG. 1A , synchronization between the program instructions of the functional processing units is provided by placing a limit on program counter advancement. In this embodiment, ‘set limit’instructions 30 are inserted ininstruction buffer 16 to set alimit 26 on the advancement ofprogram counter 20 ofLD unit 18. Likewise, setlimit instructions 28 are inserted ininstruction buffer 22 to set alimit 24 on the advancement ofprogram counter 14 of APU 12. Thelimit instruction 30 in theprogram instruction buffer 16 limits the advancement of theprogram counter 20 to synchronize with advancement with theprogram counter 14. Thelimit instruction 28 in theprogram instruction buffer 22 limits the advancement of theprogram counter 14 to synchronize with advancement with theprogram counter 20. - The program counters 14 and 20 constantly check the instruction stream for when the limit is reached and when the program counter determines that the instruction limit has been reached, the corresponding functional processing unit temporarily stops receiving instructions until the limit is changed. As shown in one exemplary embodiment in
FIG. 1B, instruction stream 32 is for APU 12 and instruction stream 34 is for LD 18. In this example, instructions 3, 5, 7 of stream 34 depend on instructions 2, 4, 6 of stream 32. Set limit instructions 30a, 30b and 30c are inserted into instruction stream 32. Initially, APU 12 sets the limit 26 for LD 18 to 2. Set limit instruction 30a causes APU 12 to set the limit of LD unit 18 to 4 after executing instruction 2 of stream 32. If LD 18 stream 34 reaches 2 before the APU 12 stream 32 passes 2, LD 18 stream 34 waits at instruction 2 until the APU 12 changes the limit of LD 18 to 4. Similarly, set limit instruction 30b causes APU 12 to set the limit of LD unit 18 to 6 after executing instruction 4 of stream 32. If LD 18 stream 34 reaches 4 before the APU 12 stream 32 passes 4, LD 18 stream 34 waits at instruction 4 until the APU 12 changes the limit of LD 18 to 6. - In one embodiment, synchronization between the program instructions of the functional processing units is provided by inserting one or more wait-go instruction pairs in the instruction streams. As shown in
FIG. 2A, communication in the direction from APU 12 to LD 18 is through an APU-LD counter 36. Communication in the direction from LD 18 to APU 12 is through an LD-APU counter 38. One or more wait-go instruction pairs 40, 42 are inserted in program instruction buffers 16 and 22, respectively, and one or more wait-go instruction pairs 44, 46 are inserted in program instruction buffers 22 and 16, respectively. In this embodiment, the first program instruction stream includes at least one ‘wait’ instruction, and a matching ‘go’ instruction is included in the second instruction stream. As shown in one exemplary embodiment in FIG. 2B, instruction stream 48 is for APU 12 and instruction stream 50 is for LD 18. Instructions 3, 5, 7 of stream 50 depend on instructions 2, 4, 6 of stream 48. Initially, counters 14, 20, 36 and 38 are set to zero. Wait instructions 44a, 44b, 44c are inserted before instructions 3, 5, 7 of stream 50. Go instructions 46a, 46b, 46c are inserted after instructions 2, 4, 6 of stream 48. If the go instruction 46a of wait-go pair 44a-46a is reached first, the APU-LD counter 36 is incremented. If the wait instruction 44a of the wait-go pair 44a-46a is reached first, LD 18 checks the APU-LD counter 36 and, if it is zero, the LD 18 stream 50 waits until the APU-LD counter 36 is incremented by APU 12. If the APU-LD counter 36 has already been incremented (is nonzero) when checked, the APU-LD counter 36 is decremented and stream 50 proceeds. - In one embodiment, synchronization between the program instructions of the functional processing units is provided by adding one or more wait-go bits to the instruction streams. In this embodiment, each instruction comes with attributes, such as additional bit fields, that indicate ‘wait’ or ‘go’. Instructions in the first program instruction stream may have ‘wait’ attributes and instructions in the second program instruction stream have matching ‘go’ attributes.
Instructions 3, 5, 7 in LD 18 stream 62 depend on instructions 2, 4, 6 of APU 12 stream 60. Initially, all counters are set to zero. Go bits 56a, 56b, 56c are injected into instructions 2, 4, 6 of stream 60. Wait bits 58a, 58b, 58c are injected into stream 62. When instruction 2 with injected go bit 56a is reached, APU-LD counter 36 is incremented and stream 60 proceeds to instruction 3. When instruction 3 with injected wait bit 58a is reached, the LD 18 unit checks APU-LD counter 36. If the counter 36 is zero, LD unit 18 waits until APU 12 increments counter 36. If counter 36 has been incremented, APU 12 decrements counter 36 and stream 62 proceeds. - This invention achieves parallel usage of multiple functional units while being more flexible than a VLIW architecture, in that the functional units are not in lockstep. The invention is also more flexible than a Modern architecture, in that each functional unit (FU) is independent by having its own program counter, and it is a more lightweight mechanism than either a DAE architecture or a Modern architecture. Unlike a DAE architecture, the register file can be shared among FUs and, unlike a Modern architecture, complex register renaming is not needed.
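The counter-based wait-go handshake described above can be illustrated with a small deterministic simulation. This is only a minimal sketch under stated assumptions, not the hardware of the embodiments: the function name `simulate_wait_go`, the step-based scheduler, the `apu_speed` and `ld_speed` parameters and the trace format are all hypothetical. Only the single APU-LD counter, the dependency of LD instructions 3, 5, 7 on APU instructions 2, 4, 6, and the increment, check and decrement behavior are taken from the text. The sketch does not model which unit performs the decrement.

```python
def simulate_wait_go(apu_speed=1, ld_speed=1, n=7):
    """Simulate two instruction streams synchronized by one APU-LD counter.

    Hypothetical model: each unit has its own program counter and issues up
    to `*_speed` instructions per simulation step. A 'go' follows APU
    instructions 2, 4, 6; a 'wait' precedes LD instructions 3, 5, 7.
    Returns the ordered list of (unit, instruction_number) issue events.
    """
    if apu_speed < 1 or ld_speed < 1:
        raise ValueError("each unit must issue at least one instruction per step")

    go_after = {2, 4, 6}      # APU instructions that signal 'go'
    wait_before = {3, 5, 7}   # LD instructions that must 'wait'
    apu_ld_counter = 0        # communication counter, initially zero
    apu_pc, ld_pc = 1, 1      # independent program counters
    trace = []

    while apu_pc <= n or ld_pc <= n:
        # APU side: issue, and increment the counter on 'go'.
        for _ in range(apu_speed):
            if apu_pc <= n:
                trace.append(("APU", apu_pc))
                if apu_pc in go_after:
                    apu_ld_counter += 1
                apu_pc += 1
        # LD side: on 'wait', stall while the counter is zero;
        # otherwise decrement the counter and proceed.
        for _ in range(ld_speed):
            if ld_pc <= n:
                if ld_pc in wait_before:
                    if apu_ld_counter == 0:
                        break  # stall until a later step
                    apu_ld_counter -= 1
                trace.append(("LD", ld_pc))
                ld_pc += 1
    return trace
```

Calling `simulate_wait_go(apu_speed=1, ld_speed=3)` models an LD unit that runs ahead of the APU: the LD stream stalls at each wait point until the matching go has incremented the counter, so every dependent LD instruction is issued after the APU instruction it depends on, while the two program counters otherwise advance independently.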
- If the dependency between the first program instruction stream and the second program instruction stream cannot be determined at compile time, a plurality of counters for each direction, that is, APU-LD and LD-APU, can be used. In such a case, each wait-go pair has a bit vector specifying the counters it uses.
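One way to picture the multi-counter variant is a bank of counters for one direction, addressed by a bit vector. The following is a hedged sketch only: the class name `CounterBank`, its methods, and in particular the rule that a wait proceeds only when all selected counters are nonzero are assumptions made for illustration; the text states only that several counters per direction can be used and that the wait-go pair specifies them with a bit vector.

```python
class CounterBank:
    """Hypothetical bank of counters for one direction (e.g. APU-LD).

    Bit i of a mask selects counter i.
    """

    def __init__(self, num_counters):
        self.counters = [0] * num_counters  # all counters start at zero

    def _selected(self, mask):
        # Reject masks that address counters outside the bank.
        if mask < 0 or mask >> len(self.counters):
            raise ValueError("mask selects a counter that does not exist")
        return [i for i in range(len(self.counters)) if (mask >> i) & 1]

    def go(self, mask):
        """'go': increment every counter selected by the bit vector."""
        for i in self._selected(mask):
            self.counters[i] += 1

    def try_wait(self, mask):
        """'wait': return True and decrement if the stream may proceed.

        Assumption: the waiting stream proceeds only when every selected
        counter is nonzero; otherwise nothing changes and it must retry.
        """
        selected = self._selected(mask)
        if any(self.counters[i] == 0 for i in selected):
            return False
        for i in selected:
            self.counters[i] -= 1
        return True
```

For example, a wait tagged with mask `0b0101` stays blocked after a go on mask `0b0001` alone and proceeds only once a go on mask `0b0100` has also arrived. With a single counter and mask `0b1`, the bank reduces to the single-counter wait-go scheme.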
-
FIG. 4 shows an exemplary computer system 100, which is applicable to implement embodiments of the present invention. As shown in FIG. 4, computer system 100 can include: CPU (Central Processing Unit) 101, RAM (Random Access Memory) 102, ROM (Read Only Memory) 103, System Bus 104, Hard Drive Controller 105, Keyboard Controller 106, Serial Interface Controller 107, Parallel Interface Controller 108, Display Controller 109, Hard Drive 110, Keyboard 111, Serial Peripheral Equipment 112, Parallel Peripheral Equipment 113 and Display 114. Among the above devices, CPU 101, RAM 102, ROM 103, Hard Drive Controller 105, Keyboard Controller 106, Serial Interface Controller 107, Parallel Interface Controller 108 and Display Controller 109 are coupled to System Bus 104. Hard Drive 110 is coupled to Hard Drive Controller 105. Keyboard 111 is coupled to Keyboard Controller 106. Serial Peripheral Equipment 112 is coupled to Serial Interface Controller 107. Parallel Peripheral Equipment 113 is coupled to Parallel Interface Controller 108. Display 114 is coupled to Display Controller 109. It should be understood that the structure as shown in FIG. 4 is only for exemplary purposes rather than any limitation to the present invention. In some cases, some devices can be added to or removed from computer system 100 based on specific situations. - As will be appreciated by one skilled in the art, aspects of the present invention can be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that can all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, aspects of the present invention can take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- Any combination of one or more computer readable medium(s) can be utilized. The computer readable medium can be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of the computer readable storage medium can include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the context of this invention, a computer readable storage medium can be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- A computer readable signal medium can include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal can take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium can be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium can be transmitted using any appropriate medium, including, but not limited to, wireless, wire line, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention can be written in any combination of one or more programming languages, including an object oriented programming language, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider).
- Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer program instructions can also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture, including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented method such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- While the invention has been particularly shown and described with respect to illustrative and preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and details may be made therein without departing from the spirit and scope of the invention, which should be limited only by the scope of the appended claims.
Claims (20)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/950,452 US9652235B1 (en) | 2015-11-24 | 2015-11-24 | Method of synchronizing independent functional unit |
| US15/237,026 US9569215B1 (en) | 2015-11-24 | 2016-08-15 | Method of synchronizing independent functional unit |
| US15/401,204 US9916163B2 (en) | 2015-11-24 | 2017-01-09 | Method of synchronizing independent functional unit |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/950,452 US9652235B1 (en) | 2015-11-24 | 2015-11-24 | Method of synchronizing independent functional unit |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/237,026 Continuation US9569215B1 (en) | 2015-11-24 | 2016-08-15 | Method of synchronizing independent functional unit |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US9652235B1 US9652235B1 (en) | 2017-05-16 |
| US20170147347A1 (en) | 2017-05-25 |
Family
ID=57964482
Family Applications (3)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US14/950,452 Expired - Fee Related US9652235B1 (en) | 2015-11-24 | 2015-11-24 | Method of synchronizing independent functional unit |
| US15/237,026 Expired - Fee Related US9569215B1 (en) | 2015-11-24 | 2016-08-15 | Method of synchronizing independent functional unit |
| US15/401,204 Expired - Fee Related US9916163B2 (en) | 2015-11-24 | 2017-01-09 | Method of synchronizing independent functional unit |
Country Status (1)
| Country | Link |
|---|---|
| US (3) | US9652235B1 (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230409336A1 (en) * | 2022-06-17 | 2023-12-21 | Advanced Micro Devices, Inc. | VLIW Dynamic Communication |
| CN120631311B (en) * | 2025-08-12 | 2025-10-10 | 中国人民解放军国防科技大学 | Dual-decoupling parallel random number generator and random computing system |
Family Cites Families (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6317820B1 (en) * | 1998-06-05 | 2001-11-13 | Texas Instruments Incorporated | Dual-mode VLIW architecture providing a software-controlled varying mix of instruction-level and task-level parallelism |
| US6212621B1 (en) | 1998-06-24 | 2001-04-03 | Advanced Micro Devices Inc | Method and system using tagged instructions to allow out-of-program-order instruction decoding |
| US6792525B2 (en) | 2000-04-19 | 2004-09-14 | Hewlett-Packard Development Company, L.P. | Input replicator for interrupts in a simultaneous and redundantly threaded processor |
| US7421693B1 (en) * | 2002-04-04 | 2008-09-02 | Applied Micro Circuits Corporation | Logic for synchronizing multiple tasks at multiple locations in an instruction stream |
| US7493615B2 (en) * | 2003-05-01 | 2009-02-17 | Sun Microsystems, Inc. | Apparatus and method for synchronizing multiple threads in an out-of-order microprocessor |
| US20060026388A1 (en) * | 2004-07-30 | 2006-02-02 | Karp Alan H | Computer executing instructions having embedded synchronization points |
| US8817029B2 (en) | 2005-10-26 | 2014-08-26 | Via Technologies, Inc. | GPU pipeline synchronization and control system and method |
| US8656143B2 (en) | 2006-03-13 | 2014-02-18 | Laurence H. Cooke | Variable clocked heterogeneous serial array processor |
| US7882307B1 (en) | 2006-04-14 | 2011-02-01 | Tilera Corporation | Managing cache memory in a parallel processing environment |
| US8633936B2 (en) | 2008-04-21 | 2014-01-21 | Qualcomm Incorporated | Programmable streaming processor with mixed precision instruction execution |
| US9529596B2 (en) * | 2011-07-01 | 2016-12-27 | Intel Corporation | Method and apparatus for scheduling instructions in a multi-strand out of order processor with instruction synchronization bits and scoreboard bits |
Also Published As
| Publication number | Publication date |
|---|---|
| US9652235B1 (en) | 2017-05-16 |
| US20170147352A1 (en) | 2017-05-25 |
| US9916163B2 (en) | 2018-03-13 |
| US9569215B1 (en) | 2017-02-14 |
Similar Documents
| Publication | Title |
|---|---|
| US9395996B2 (en) | Pipelining out-of-order instructions |
| US8316219B2 (en) | Synchronizing commands and dependencies in an asynchronous command queue |
| US10129018B2 (en) | Hybrid SM3 and SHA acceleration processors |
| US10691456B2 (en) | Vector store instruction having instruction-specified byte count to be stored supporting big and little endian processing |
| US10970079B2 (en) | Parallel dispatching of multi-operation instructions in a multi-slice computer processor |
| US10691453B2 (en) | Vector load with instruction-specified byte count less than a vector size for big and little endian processing |
| US9916163B2 (en) | Method of synchronizing independent functional unit |
| US10282207B2 (en) | Multi-slice processor issue of a dependent instruction in an issue queue based on issue of a producer instruction |
| US10127047B2 (en) | Operation of a multi-slice processor with selective producer instruction types |
| CN113407351A (en) | Method, apparatus, chip, device, medium and program product for performing operations |
| US9606891B2 (en) | Tracing data from an asynchronous interface |
| US10936321B2 (en) | Instruction chaining |
| US10338921B2 (en) | Asynchronous instruction execution apparatus with execution modules invoking external calculation resources |
| US10671399B2 (en) | Low-overhead, low-latency operand dependency tracking for instructions operating on register pairs in a processor core |
| US9716646B2 (en) | Using thresholds to gate timing packet generation in a tracing system |
| US8375155B2 (en) | Managing concurrent serialized interrupt broadcast commands in a multi-node, symmetric multiprocessing computer |
| US9563484B2 (en) | Concurrent computing with reduced locking requirements for shared data |
| US20110153702A1 (en) | Multiplication of a vector by a product of elementary matrices |
Legal Events
| Code | Title | Description |
|---|---|---|
| AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KIM, CHANGHOAN;REEL/FRAME:037132/0137 Effective date: 20151118 |
| AS | Assignment | Owner name: U.S. DEPARTMENT OF ENERGY, DISTRICT OF COLUMBIA Free format text: CONFIRMATORY LICENSE;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:039891/0525 Effective date: 20160322 |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| FP | Lapsed due to failure to pay maintenance fee | Effective date: 20210516 |