CN114816529B - Apparatus and method for configuring cooperative thread bundles in a vector computing system - Google Patents
Apparatus and method for configuring cooperative thread bundles in a vector computing system
- Publication number: CN114816529B
- Application number: CN202210479765.2A
- Authority
- CN
- China
- Prior art keywords
- warp
- thread
- warps
- instruction
- thread bundle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30123—Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
- G06F9/3888—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
Abstract
The invention relates to an apparatus and method for configuring cooperative thread bundles in a vector computing system, the apparatus comprising a general-purpose register, an arithmetic logic unit, a thread bundle instruction scheduler, and a plurality of thread bundle resource registers. According to thread bundle allocation instructions in a program core, the thread bundle instruction scheduler lets each of a plurality of thread bundles contain a section of relatively independent instructions in the program core; at execution time, according to the configuration set by software, it lets each thread bundle access all of the general-purpose register, or a specified portion of it, through the arithmetic logic unit, which completes each thread bundle's operations. By enabling software to dynamically allocate portions of the general-purpose register to different thread bundles in this way, the invention adapts more broadly to different applications, such as big data and artificial intelligence computation.
Description
The application is a divisional application of the Chinese patent application filed on October 21, 2020, with application number 202011131448.9, entitled "Apparatus and method for configuring cooperative thread bundles in a vector computing system".
Technical Field
The present invention relates to vector computing devices, and more particularly to an apparatus and method for configuring cooperative thread bundles in a vector computing system.
Background
A vector computer is a computer equipped with specialized vector instructions to increase the speed of vector processing. Vector computers can process the data computation of multiple thread bundles (Warps) simultaneously, and are therefore much faster than scalar computers at processing thread bundle data. However, multiple thread bundles may conflict when accessing the general-purpose register file (GPR File); the present invention therefore proposes an apparatus and method for configuring cooperative thread bundles in a vector computing system.
Disclosure of Invention
In view of this, how to alleviate or eliminate the above-mentioned drawbacks of the related art is a problem that needs to be solved.
An embodiment of the invention relates to an apparatus for configuring cooperative thread bundles in a vector computing system, comprising a general-purpose register, an arithmetic logic unit, a thread bundle instruction scheduler, and a plurality of thread bundle resource registers. According to thread bundle allocation instructions in a program core, the thread bundle instruction scheduler lets each of a plurality of thread bundles contain a section of relatively independent instructions in the program core; at execution time, according to the configuration set by software, each thread bundle accesses all of the general-purpose register, or a specified portion of it, through the arithmetic logic unit, which completes each thread bundle's operations. Each thread bundle resource register is associated with one thread bundle and, according to its contents, maps that thread bundle's data accesses to a specified portion of the general-purpose register, where the specified portions mapped to different thread bundles do not overlap.
Embodiments of the present invention also relate to a method of configuring cooperative thread bundles in a vector computing system, comprising: letting each of a plurality of thread bundles contain a section of relatively independent instructions in a program core according to thread bundle allocation instructions in the program core; letting each thread bundle, at execution time and according to the configuration set by software, access all of a general-purpose register or a specified portion of it through an arithmetic logic unit; and completing the operations of each thread bundle through the arithmetic logic unit. The method further comprises mapping the data accesses of each thread bundle to a specified portion of the general-purpose register according to the contents of a respective thread bundle resource register, where the specified portions mapped to different thread bundles do not overlap.
One advantage of the above embodiments is that, by enabling software to dynamically allocate portions of the general-purpose register to different thread bundles as described above, the system adapts more broadly to different applications, such as big data and artificial intelligence operations.
Another advantage of the above embodiments is that, by dynamically assigning multiple sections of relatively independent instructions in the program core to different thread bundles, interference between thread bundles can be avoided, improving pipeline utilization.
Other advantages of the present invention will be explained in more detail in connection with the following description and accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application.
FIG. 1 is a block diagram of a vector operation system according to an embodiment of the present invention.
Fig. 2 is a block diagram of a streaming multiprocessor according to an embodiment of the present invention.
FIG. 3 is a split schematic diagram of general registers of some embodiments.
FIG. 4 is a diagram illustrating dynamic partitioning of a general purpose register with thread bundle resource registers according to an embodiment of the present invention.
FIG. 5 is a flow chart of a cooperative thread bundle applied to execute tasks in parallel, in accordance with an embodiment of the present invention.
FIG. 6 is a schematic diagram of a collaboration thread bundle for a producer and consumer in accordance with an embodiment of the present invention.
FIG. 7 is a flow chart of a cooperative thread bundle for performing producer-consumer tasks in accordance with an embodiment of the present invention.
Reference numerals in the drawings are briefly described as follows:
10: electronic device; 100: stream multiprocessor; 210: arithmetic logic unit; 220: thread bundle instruction scheduler; 230: general-purpose register; 240: instruction cache; 250: barrier register; 260, 260#0–260#7: thread bundle resource registers; 300#0–300#7: general-purpose register memory blocks; Base#0–Base#7: base locations; S510–S540: method steps; 610: consumer thread bundle; 621, 663: barrier instructions; 623, 661: series of instructions; 650: producer thread bundle; S710–S770: method steps.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings. In the drawings, like reference numerals designate identical or similar components or process flows.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, values, method steps, operation processes, components, and/or groups, but do not preclude the addition of further features, values, method steps, operation processes, components, and/or groups.
In the present invention, terms such as "first," "second," "third," and the like are used for modifying elements of the claims, and are not used for describing a priority order, a precedence order, or a temporal order in which elements of one method step are performed or are used for distinguishing between elements having the same name.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Conversely, when an element is described as being "directly connected" or "directly coupled" to another element, there are no intervening elements present. Other words used to describe the relationship between components may also be interpreted in a similar fashion, such as "between" versus "directly between," or "adjacent" versus "directly adjacent," etc.
Reference is made to fig. 1. The electronic device 10 may be implemented in a mainframe, workstation, personal computer, notebook computer (Laptop PC), tablet computer, mobile phone, digital camera, digital video camera, etc. The electronic device 10 may be provided with a stream multiprocessor cluster (Streaming Multiprocessor Cluster, SMC) in a vector computing system, including a plurality of stream multiprocessors (Streaming Multiprocessor, SM) 100; instruction execution between different stream multiprocessors 100 may be synchronized using signals. The stream multiprocessors 100 are programmed to perform a variety of application tasks, including but not limited to linear and nonlinear data transformation, database operations, big data operations, artificial intelligence computation, encoding and decoding, modeling operations on audio and video data, image rendering operations, etc. Each stream multiprocessor 100 may simultaneously execute multiple thread bundles (Warps), where each thread bundle consists of a group of threads (Group of Threads); a thread is the smallest unit of operation in hardware and has its own lifecycle. The thread bundles may be associated with single instruction multiple data (Single Instruction Multiple Data, SIMD) instructions, single instruction multiple thread (Single Instruction Multiple Thread, SIMT) techniques, and the like. Execution between different thread bundles may be independent or sequential. A thread may represent a task associated with one or more instructions. For example, each stream multiprocessor 100 may concurrently execute 8 thread bundles, each thread bundle including 32 threads. Although fig. 1 depicts 4 stream multiprocessors 100, those skilled in the art may arrange more or fewer stream multiprocessors in a vector computing system according to different needs, and the invention is not limited thereto.
Reference is made to fig. 2. Each stream multiprocessor 100 includes an instruction cache (Instruction Cache) 240 for storing the instructions of a program core (Kernel). Each stream multiprocessor 100 further comprises a thread bundle instruction scheduler (Warp Instruction Scheduler) 220 for fetching a series of instructions for each thread bundle and storing them in the instruction cache 240, and for fetching the instruction to be executed next for each thread bundle from the instruction cache 240 according to its program counter. Each thread bundle has a separate program counter (Program Counter, PC) register for recording the location (i.e., the instruction address) of the instruction currently being executed. Each time an instruction is fetched from the instruction cache for a thread bundle, the corresponding program counter is incremented. At appropriate points in time, the thread bundle instruction scheduler 220 sends instructions, defined in the instruction set architecture (Instruction Set Architecture, ISA) of the particular computing system, to the arithmetic logic unit (Arithmetic Logic Unit, ALU) 210 for execution. The arithmetic logic unit 210 may perform a wide variety of operations, such as integer and floating-point addition and multiplication, comparison, Boolean operations, bit shifting, and algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions). During execution, the arithmetic logic unit 210 may read data from specified locations (also referred to as source addresses) of the general-purpose registers (General-Purpose Registers, GPRs) 230 and write execution results back to specified locations (also referred to as destination addresses) of the general-purpose registers 230.
Each stream multiprocessor 100 also includes a barrier register (Barriers Register) 250, which software can use to synchronize execution among different thread bundles, and respective thread bundle resource registers (Resource-per-Warp Registers) 260, which can be used to dynamically configure the extent of the general-purpose registers 230 that each thread bundle may use during execution. Although fig. 2 only lists components 210 through 260, this is merely to briefly illustrate the features of the present invention; those skilled in the art will appreciate that each stream multiprocessor 100 also includes many more components.
In some embodiments, the general-purpose registers 230 may be physically or logically divided into blocks (Blocks), with each block of memory space allocated for access by only one thread bundle. The memory spaces of different blocks do not overlap, which avoids access conflicts between different thread bundles. Referring to FIG. 3, for example, when one stream multiprocessor 100 can process the data of eight thread bundles and the general-purpose register 230 contains 256 kilobytes (Kilobyte, KB) of memory, the memory of the general-purpose register 230 may be divided into eight blocks 300#0 to 300#7, each containing a non-overlapping 32KB of memory reserved for a specified thread bundle. However, since vector computing systems are often applied to big data and artificial intelligence computation, where the amount of data to process is huge, such a fixed partition may leave a single thread bundle with too little space to satisfy the computing requirements of a large amount of data. For big data and artificial intelligence applications, these embodiments may be modified so that each stream multiprocessor 100 processes the data of only one thread bundle, with the entire memory space of the general-purpose register 230 serving that thread bundle alone. However, when two consecutive instructions have a data dependency, that is, when the input of the second instruction is the output of the first, the arithmetic logic unit 210 does not operate efficiently: the second instruction must wait until the execution result of the first instruction is ready in the general-purpose register 230 before it can start.
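The fixed split described above can be sketched as follows (a minimal illustration using the 256KB/8-warp figures from this example; the tuple representation is purely hypothetical):

```python
# Fixed partitioning sketch: a 256 KB register file split evenly into
# eight 32 KB blocks, one per thread bundle (warp). Each block is a
# (start_kb, end_kb) half-open range.
GPR_KB = 256
N_WARPS = 8

blocks = [(w * GPR_KB // N_WARPS, (w + 1) * GPR_KB // N_WARPS)
          for w in range(N_WARPS)]

print(blocks[0], blocks[7])   # (0, 32) (224, 256): non-overlapping 32 KB each
```

Because each warp is confined to its own block, no two warps can touch the same register address, at the cost of a fixed 32KB ceiling per warp.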
For example, assuming that each instruction requires 8 clock cycles to pass through the pipeline from initiation to writing its result to the general-purpose register 230, the second instruction must wait for the execution result of the first instruction and can only begin execution from the 9th clock cycle. The instruction execution latency (Instruction Execution Latency) is then 8 clock cycles, resulting in very low pipeline utilization. Furthermore, because the stream multiprocessor 100 processes only one thread bundle, instructions that could otherwise be executed in parallel must be arranged for sequential execution, which is inefficient.
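The latency arithmetic above can be made concrete with a toy timing model (the 8-cycle latency and the round-robin interleaving of 8 warps are assumptions taken from this example, not hardware specifics):

```python
# Toy model: each instruction occupies one issue slot, and its result
# becomes visible LATENCY cycles after issue (assumed 8, as above).
LATENCY = 8

def cycles_single_warp(n_dependent_instrs):
    # Dependent chain in one warp: each instruction waits the full
    # latency of its predecessor, so one instruction retires per 8 cycles.
    return n_dependent_instrs * LATENCY

def cycles_interleaved(n_warps, n_dependent_instrs):
    # Round-robin over n_warps independent chains: by the time warp 0
    # issues its next instruction, the latency has already elapsed,
    # so the pipeline issues one instruction every cycle.
    issue_slots = n_warps * n_dependent_instrs
    return max(issue_slots, LATENCY * n_dependent_instrs)

print(cycles_single_warp(4))     # 32 cycles to retire 4 instructions
print(cycles_interleaved(8, 4))  # 32 cycles to retire 32 instructions
```

In the same 32 cycles, the interleaved schedule retires eight times as many instructions, which is exactly the bubble-hiding effect the multi-warp design targets.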
To solve the above problems, in one aspect, the thread bundle instruction scheduler 220, according to the configuration set by software at execution time, lets each of the plurality of thread bundles access all or a specified portion of the data in the general-purpose register 230 through the arithmetic logic unit 210, and completes the operations of each thread bundle through the arithmetic logic unit 210. By enabling software to dynamically allocate portions of the general-purpose registers to different thread bundles as described above, the system adapts more broadly to different applications, such as big data and artificial intelligence operations.
In another aspect, embodiments of the present invention provide an environment that enables software to determine the instruction segment that each thread bundle contains. In some embodiments, a program core may divide its instructions into multiple segments, each independent of the others and executed by one thread bundle. Table 1 is an example of pseudocode for a program core:
TABLE 1
Assuming that each stream multiprocessor 100 runs at most eight thread bundles and that each thread bundle has a unique identifier, when the thread bundle instruction scheduler 220 fetches the instructions of the program core shown in Table 1, it can check the thread bundle identifier for a particular thread bundle, jump to the instruction segment associated with this thread bundle and store it in the instruction cache 240, and then fetch instructions from the instruction cache 240 according to the corresponding program counter values and send them to the arithmetic logic unit 210 to complete the particular computation. In this way, each thread bundle performs its task independently, and all thread bundles can run simultaneously, keeping the pipeline in the arithmetic logic unit 210 as busy as possible to avoid bubbles (Bubbles). The instructions of each segment in the same program core may be referred to as relatively independent instructions. While the example of Table 1 adds conditional checks to achieve the segmentation of instructions in a program core, those skilled in the art may use other instructions that achieve the same or similar results. In a program core, the instructions used to assign instruction segments to multiple thread bundles may also be referred to as thread bundle allocation instructions. In general, the thread bundle instruction scheduler 220 may, according to the thread bundle allocation instructions in the program core, let each of the plurality of thread bundles contain a portion of the relatively independent instructions in the program core, so that the arithmetic logic unit 210 executes the plurality of thread bundles independently and in parallel.
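As a hypothetical illustration of such thread bundle allocation instructions (the function name, segment operations, and data are invented for this sketch and are not taken from Table 1), a segmented program core branching on the warp identifier might look like:

```python
def kernel(warp_id, data):
    # Hypothetical segmented program core: each thread bundle executes
    # only the segment selected by its identifier, so the segments are
    # relatively independent and can run concurrently.
    if warp_id == 0:
        return [x + 1 for x in data]   # segment 0: elementwise add
    elif warp_id == 1:
        return [x * 2 for x in data]   # segment 1: elementwise multiply
    else:
        return data                    # remaining warps: pass-through

# All eight warps can be dispatched with their own segment of the core.
results = [kernel(w, [1, 2, 3]) for w in range(8)]
print(results[0], results[1])   # [2, 3, 4] [2, 4, 6]
```

Each branch is the analogue of one instruction segment; a real scheduler would jump past the untaken branches rather than evaluate them.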
The barrier register 250 may store information for synchronizing the execution of different thread bundles, including the number of thread bundles that must reach the barrier before execution continues, and the number of thread bundles currently waiting to continue execution. To coordinate the execution of different thread bundles, software may set the contents of the barrier register 250 to record the number of thread bundles whose completion must be awaited. Each instruction segment in the program core may include a barrier instruction (Barrier Instruction) at appropriate points, depending on system requirements. When the thread bundle instruction scheduler 220 fetches a barrier instruction for a thread bundle, it increments the number of waiting thread bundles recorded in the barrier register 250 by 1 and puts that thread bundle into a waiting state. Next, the thread bundle instruction scheduler 220 examines the contents of the barrier register 250 to determine whether the number of thread bundles currently waiting is equal to or greater than the required number. If so, the thread bundle instruction scheduler 220 wakes up all waiting thread bundles, allowing them to continue execution; otherwise, it fetches the instructions of the next thread bundle from the instruction cache.
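The barrier bookkeeping described above can be sketched as follows (class and field names are hypothetical; only the arrive-count/required-count logic is taken from the text):

```python
class BarrierRegister:
    # Toy model of barrier register 250: 'required' is the number of
    # warps that must arrive (set by software); 'arrived' counts warps
    # currently parked at the barrier.
    def __init__(self, required):
        self.required = required
        self.arrived = 0
        self.waiting = []

    def on_barrier_instruction(self, warp_id):
        self.arrived += 1                  # scheduler bumps the count...
        self.waiting.append(warp_id)       # ...and parks the warp
        if self.arrived >= self.required:  # all arrived: wake everyone
            released, self.waiting = self.waiting, []
            self.arrived = 0
            return released                # warps resumed together
        return []                          # scheduler moves to next warp

bar = BarrierRegister(required=3)
assert bar.on_barrier_instruction(0) == []
assert bar.on_barrier_instruction(1) == []
print(bar.on_barrier_instruction(2))   # [0, 1, 2]: all three resume
```

The empty return models the scheduler simply switching to another warp while the barrier is not yet full.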
Furthermore, the partitioning of the embodiment depicted in FIG. 3 is preconfigured in the stream multiprocessor 100 and cannot be modified by software. However, the memory space required by different thread bundles may not be uniform: for some thread bundles the pre-partitioned memory space may be more than needed, while for others it may be insufficient.
In another aspect, while one stream multiprocessor 100 is capable of executing multiple thread bundles that all execute the same program core, embodiments of the present invention do not pre-partition the general-purpose registers 230 among the thread bundles. Specifically, to accommodate different applications more broadly, the stream multiprocessor 100 does not fix the division of the general-purpose registers 230 into multiple blocks of memory space for multiple thread bundles, but instead provides an environment that enables software to dynamically allocate the general-purpose registers 230 to different thread bundles, so that software can let each thread bundle use all or part of the general-purpose registers 230 depending on the needs of the application.
In other embodiments, each stream multiprocessor 100 may include respective thread bundle resource registers 260 for storing the base location of each thread bundle, each base location pointing to a particular address in the general-purpose register 230. To let different thread bundles access non-overlapping memory spaces in the general-purpose registers 230, software may dynamically change the contents of the thread bundle resource registers 260 to set the base location of each thread bundle. For example, referring to FIG. 4, software may divide the general-purpose register 230 into eight blocks for eight thread bundles, with block 0 (associated with the 0th thread bundle) covering the address range Base#0 through Base#1-1, block 1 (associated with the 1st thread bundle) covering Base#1 through Base#2-1, and so on. The software may set the contents of the respective thread bundle resource registers 260#0 to 260#7, before or at the beginning of execution of the program core, to point to the base address associated with each thread bundle in the general-purpose register 230. After the thread bundle instruction scheduler 220 fetches an instruction from the instruction cache 240 for the ith thread bundle, the source and destination addresses of the instruction may be adjusted according to the contents of the thread bundle resource register 260#i, mapping them to the memory space dynamically allocated to the ith thread bundle in the general-purpose register 230. For example, the original instruction is:
Dest_addr=Instr_i(Src_addr0,Src_addr1)
where Instr_i represents the opcode (OpCode) of the instruction assigned to the ith thread bundle, Src_addr0 represents the 0th source address, Src_addr1 represents the 1st source address, and Dest_addr represents the destination address.
The thread bundle instruction scheduler 220 modifies the instructions as described above to become:
Base#i+Dest_addr=Instr_i(Base#i+Src_addr0,Base#i+Src_addr1)
where Base#i represents the base location recorded in the thread bundle resource register 260#i. That is, the thread bundle instruction scheduler 220 adjusts the source and destination addresses of each instruction according to the contents of the respective thread bundle resource register 260, so that the specified portions of the general-purpose registers mapped to different thread bundles do not overlap.
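The Base#i rewriting above can be sketched in a few lines (the 256KB size and the equal split into eight bases are illustrative software choices, not requirements of the scheme):

```python
# Sketch of per-warp base-address remapping: software picks Base#0..#7,
# and the scheduler adds Base#i to every register address of warp i,
# so different warps touch non-overlapping regions of the file.
GPR_SIZE = 256 * 1024                             # 256 KB register file
bases = [w * (GPR_SIZE // 8) for w in range(8)]   # software-chosen Base#i

def remap(warp_id, dest, src0, src1):
    # Base#i + Dest_addr = Instr_i(Base#i + Src_addr0, Base#i + Src_addr1)
    b = bases[warp_id]
    return (b + dest, b + src0, b + src1)

print(remap(1, 4, 0, 8))   # (32772, 32768, 32776) with Base#1 = 32768
```

Because the bases are plain register contents, software can redistribute the blocks (e.g., give one warp a larger share) simply by writing different values into the thread bundle resource registers before launching the program core.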
In still other embodiments, both techniques are combined: the program core divides its instructions into multiple mutually independent sections, each executed by one thread bundle, and in addition the software sets the contents of the respective thread bundle resource registers 260#0 through 260#7, before or at the beginning of execution of the program core, to point to the base address associated with each thread bundle in the general-purpose register 230.
The invention is applicable to cooperative thread bundles (Cooperative Warps) that perform tasks in parallel; refer to the example flowchart shown in FIG. 5.
In step S510, the thread bundle instruction scheduler 220 starts fetching instructions for each thread bundle and stores them in the instruction cache 240.
Steps S520 to S540 form a loop. The thread bundle instruction scheduler 220 may use a scheduling method (e.g., a round-robin scheduling algorithm) to fetch the specified instructions one by one from the instruction cache 240 according to each thread bundle's program counter, and send them to the arithmetic logic unit 210 for execution. The thread bundle instruction scheduler 220 may sequentially fetch from the instruction cache 240 the instruction indicated by the program counter of thread bundle 0, the instruction indicated by the program counter of thread bundle 1, the instruction indicated by the program counter of thread bundle 2, and so on.
In step S520, the thread bundle instruction scheduler 220 fetches the instruction of the 0th or the next thread bundle from the instruction cache 240.
In step S530, the thread bundle instruction scheduler 220 sends the fetched instruction to the arithmetic logic unit 210.
In step S540, the arithmetic logic unit 210 performs the specified operation according to the input instruction. In the instruction execution pipeline, the arithmetic logic unit 210 fetches data from the source addresses of the general purpose registers 230, performs the specified operation on the fetched data, and stores the result of the operation to the destination address in the general purpose registers 230.
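The scheduling loop of steps S520-S540 can be sketched as a small software model; the function name `round_robin_issue` and the list-based representation of the instruction cache are illustrative assumptions, not part of the patent:

```python
from collections import deque

def round_robin_issue(warp_programs):
    """Model of steps S520-S540: in round-robin order, fetch the
    instruction indicated by each warp's program counter and issue it
    to the ALU, until every warp has exhausted its instructions."""
    pcs = [0] * len(warp_programs)           # one program counter per warp
    ready = deque(range(len(warp_programs))) # rotation of warp IDs
    issue_trace = []                         # (warp_id, instruction) pairs
    while ready:
        w = ready.popleft()
        program = warp_programs[w]
        if pcs[w] < len(program):
            issue_trace.append((w, program[pcs[w]]))  # send to ALU (S530/S540)
            pcs[w] += 1
            ready.append(w)                  # warp rejoins the rotation
    return issue_trace

trace = round_robin_issue([["i0", "i1"], ["j0", "j1"]])
print(trace)  # [(0, 'i0'), (1, 'j0'), (0, 'i1'), (1, 'j1')]
```

The interleaved trace shows why the per-warp address adjustment matters: instructions from different warps reach the arithmetic logic unit back to back and must not collide in the general purpose registers.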
To avoid conflicts when different thread bundles access the general purpose registers 230, in some embodiments of step S510 the thread bundle instruction scheduler 220 may, in accordance with the thread bundle allocation instructions in the program core (e.g., the example shown in Table 1), make each thread bundle responsible for processing multiple instructions of a specified segment in the same program core; these instructions (also referred to as relatively independent instructions) are arranged to be independent of each other and may be executed in parallel.
In other embodiments, before an instruction is provided to the arithmetic logic unit 210 (i.e., prior to step S530), the thread bundle instruction scheduler 220 may adjust the source and destination addresses in the instruction according to the contents of the corresponding thread bundle resource register 260, mapping them to the space in the general purpose registers 230 dynamically allocated to that thread bundle.
In still other embodiments, in addition to having each thread bundle be responsible for processing multiple instructions of a specified segment in the same program core in step S510 according to the associated instructions in the program core (as illustrated in the example of Table 1), the thread bundle instruction scheduler 220 may also adjust the source and destination addresses in each instruction according to the contents of the corresponding thread bundle resource register 260 before sending it to the arithmetic logic unit 210 (before step S530).
The present invention is also applicable to cooperative thread bundles that perform producer-consumer tasks. Referring to FIG. 6, assume that thread bundle 610 acts as the consumer of data and thread bundle 650 acts as the producer. In other words, execution of a portion of the instructions in thread bundle 610 requires referencing the execution results of a portion of the instructions in thread bundle 650. In some embodiments, software may configure the contents of the thread bundle resource registers corresponding to thread bundles 610 and 650 at execution time, so that when executing the instructions of thread bundles 610 and 650, the arithmetic logic unit 210 accesses overlapping blocks in the general purpose registers 230.
For the details of producer-consumer task execution, refer to the flowchart example shown in FIG. 7.
In step S710, the thread bundle instruction scheduler 220 starts fetching instructions for each thread bundle and stores them in the instruction cache 240. According to the relevant instructions in the program core (e.g., the example shown in Table 1), the thread bundle instruction scheduler 220 may make each thread bundle responsible for processing multiple instructions of a specified segment in the same program core; these segments are arranged to form a producer-consumer relationship.
In step S720, the thread bundle instruction scheduler 220 obtains the barrier instruction (Barrier Instruction) 621 of the consumer thread bundle 610 from the instruction cache 240, and accordingly brings the consumer thread bundle 610 into a wait state.
In step S730, the thread bundle instruction scheduler 220 fetches a series of instructions 661 of the producer thread bundle 650 from the instruction cache 240 and sequentially sends the fetched instructions to the arithmetic logic unit 210.
In step S740, the arithmetic logic unit 210 performs the specified operations according to the input instructions 661. In the instruction execution pipeline, the arithmetic logic unit 210 fetches data from the source addresses of the general purpose registers 230, performs the specified operations on the fetched data, and stores the results of the operations to the destination addresses in the general purpose registers 230.
In step S750, thread bundle instruction scheduler 220 obtains barrier instruction 663 of producer thread bundle 650 from instruction cache 240 and wakes up consumer thread bundle 610 accordingly. In some embodiments, the thread bundle instruction scheduler 220 may also put the producer thread bundle 650 into a wait state.
In step S760, the thread bundle instruction scheduler 220 fetches a series of instructions 623 of the consumer thread bundle 610 from the instruction cache 240 and sequentially passes the fetched instructions to the arithmetic logic unit 210.
In step S770, the arithmetic logic unit 210 performs the specified operations according to the input instructions 623. In the instruction execution pipeline, the arithmetic logic unit 210 fetches data from the source addresses of the general purpose registers 230 (including the data previously generated by the producer thread bundle 650), performs the specified operations on the fetched data, and stores the results of the operations to the destination addresses in the general purpose registers 230.
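The barrier-driven flow of steps S720-S770 can be mimicked with ordinary software threads. In this hedged sketch, a Python `threading.Event` stands in for the hardware barrier instructions 621 and 663, and `shared_regs` stands in for the overlapping block in the general purpose registers 230; the function names are illustrative assumptions:

```python
import threading

shared_regs = {}              # stands in for the overlapping GPR block
barrier = threading.Event()   # stands in for barrier instructions 621/663

def producer():               # models thread bundle 650
    shared_regs["partial"] = sum(range(10))  # instructions 661 write results
    barrier.set()             # barrier 663: wake the waiting consumer

def consumer(out):            # models thread bundle 610
    barrier.wait()            # barrier 621: enter wait state until woken
    out.append(shared_regs["partial"] * 2)   # instructions 623 consume data

results = []
threads = [threading.Thread(target=consumer, args=(results,)),
           threading.Thread(target=producer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [90]
```

Even though the consumer is started first, the barrier guarantees it only reads the shared block after the producer has written it, mirroring the ordering enforced by steps S720 through S760.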
It should be appreciated that the content of steps S730, S740, S760, and S770 is merely a brief description for ease of understanding; during the execution of steps S730, S740, S760, and S770, the thread bundle instruction scheduler 220 may also obtain instructions of other thread bundles (i.e., any thread bundle that is neither thread bundle 610 nor thread bundle 650) from the instruction cache 240 and drive the arithmetic logic unit 210 to perform operations.
Although the components described above are included in FIGS. 1 and 2, the use of additional components to achieve better technical effects is not excluded, provided the spirit of the invention is not violated. In addition, although the flowcharts of FIGS. 5 and 7 are executed in the order specified, those skilled in the art may modify the order among these steps without departing from the spirit of the invention, and therefore the invention is not limited to the order described above. Furthermore, one skilled in the art may integrate several steps into one step, or perform additional steps sequentially or in parallel, and the invention should not be limited thereby.
The above description covers only preferred embodiments of the present application and is not intended to limit it; any person skilled in the art may make further modifications and variations without departing from the spirit and scope of the present application, and the scope of protection is defined by the appended claims.
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210479765.2A CN114816529B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundles in a vector computing system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210479765.2A CN114816529B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundles in a vector computing system |
| CN202011131448.9A CN112214243B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundle in vector operation system |
Related Parent Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011131448.9A Division CN112214243B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundle in vector operation system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114816529A CN114816529A (en) | 2022-07-29 |
| CN114816529B true CN114816529B (en) | 2025-07-18 |
Family
ID=74056291
Family Applications (3)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210479765.2A Active CN114816529B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundles in a vector computing system |
| CN202011131448.9A Active CN112214243B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundle in vector operation system |
| CN202210480192.5A Active CN114968358B (en) | 2020-10-21 | 2020-10-21 | Device and method for configuring cooperative thread warps in vector computing system |
Family Applications After (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011131448.9A Active CN112214243B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundle in vector operation system |
| CN202210480192.5A Active CN114968358B (en) | 2020-10-21 | 2020-10-21 | Device and method for configuring cooperative thread warps in vector computing system |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20220121444A1 (en) |
| CN (3) | CN114816529B (en) |
| TW (1) | TWI793568B (en) |
Families Citing this family (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114035847B (en) * | 2021-11-08 | 2023-08-29 | 海飞科(南京)信息技术有限公司 | Method and apparatus for parallel execution of kernel programs |
| CN114896079B (en) * | 2022-05-26 | 2023-11-24 | 上海壁仞智能科技有限公司 | Instruction execution method, processor and electronic device |
| CN116483536B (en) * | 2023-04-24 | 2024-05-10 | 上海芷锐电子科技有限公司 | Data scheduling method, computing chip and electronic equipment |
| CN116360708B (en) * | 2023-05-26 | 2023-08-11 | 摩尔线程智能科技(北京)有限责任公司 | Data writing method and device, electronic equipment and storage medium |
| CN118732958B (en) * | 2024-09-02 | 2025-02-28 | 山东浪潮科学研究院有限公司 | Warp-aware memory controller |
| CN119440774B (en) * | 2025-01-08 | 2025-05-13 | 山东浪潮科学研究院有限公司 | A conflicting thread warp scheduling method and a GPGPU register file access method and system |
| CN121166401A (en) * | 2025-11-20 | 2025-12-19 | 上海壁仞科技股份有限公司 | Inter-thread data sharing methods, electronic devices, storage media, and application products |
| CN121144256A (en) * | 2025-11-20 | 2025-12-16 | 上海壁仞科技股份有限公司 | Artificial intelligence chips and collaborative thread beam calculation methods |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101344842A (en) * | 2007-07-10 | 2009-01-14 | 北京简约纳电子有限公司 | Multithreading processor and multithreading processing method |
| CN106575219A (en) * | 2014-09-26 | 2017-04-19 | 英特尔公司 | Instruction and logic for a vector format for processing computations |
Family Cites Families (39)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6925643B2 (en) * | 2002-10-11 | 2005-08-02 | Sandbridge Technologies, Inc. | Method and apparatus for thread-based memory access in a multithreaded processor |
| US7472258B2 (en) * | 2003-04-21 | 2008-12-30 | International Business Machines Corporation | Dynamically shared group completion table between multiple threads |
| US7290261B2 (en) * | 2003-04-24 | 2007-10-30 | International Business Machines Corporation | Method and logical apparatus for rename register reallocation in a simultaneous multi-threaded (SMT) processor |
| CN1278227C (en) * | 2004-06-25 | 2006-10-04 | 中国科学院计算技术研究所 | A processor multithreading method and apparatus based on MIPS instruction set |
| TWI296387B (en) * | 2005-12-13 | 2008-05-01 | Nat Univ Tsing Hua | Scheduling method for remote object procedure call and system thereof |
| US8321849B2 (en) * | 2007-01-26 | 2012-11-27 | Nvidia Corporation | Virtual architecture and instruction set for parallel thread computing |
| US7685409B2 (en) * | 2007-02-21 | 2010-03-23 | Qualcomm Incorporated | On-demand multi-thread multimedia processor |
| US9639479B2 (en) * | 2009-09-23 | 2017-05-02 | Nvidia Corporation | Instructions for managing a parallel cache hierarchy |
| US10360039B2 (en) * | 2009-09-28 | 2019-07-23 | Nvidia Corporation | Predicted instruction execution in parallel processors with reduced per-thread state information including choosing a minimum or maximum of two operands based on a predicate value |
| US9710275B2 (en) * | 2012-11-05 | 2017-07-18 | Nvidia Corporation | System and method for allocating memory of differing properties to shared data objects |
| US20140258680A1 (en) * | 2013-03-05 | 2014-09-11 | Qualcomm Incorporated | Parallel dispatch of coprocessor instructions in a multi-thread processor |
| CN103955356B (en) * | 2014-04-24 | 2017-05-10 | 深圳中微电科技有限公司 | General-purpose register bank distribution method and device in multithreaded processor |
| US9804666B2 (en) * | 2015-05-26 | 2017-10-31 | Samsung Electronics Co., Ltd. | Warp clustering |
| CN106325996B (en) * | 2015-06-19 | 2019-11-19 | 华为技术有限公司 | A method and system for allocating GPU resources |
| GB2539958B (en) * | 2015-07-03 | 2019-09-25 | Advanced Risc Mach Ltd | Data processing systems |
| KR102545176B1 (en) * | 2015-11-16 | 2023-06-19 | 삼성전자주식회사 | Method and apparatus for register management |
| CN106648545A (en) * | 2016-01-18 | 2017-05-10 | 天津大学 | Register file structure used for branch processing in GPU |
| US10115175B2 (en) * | 2016-02-19 | 2018-10-30 | Qualcomm Incorporated | Uniform predicates in shaders for graphics processing units |
| US10592466B2 (en) * | 2016-05-12 | 2020-03-17 | Wisconsin Alumni Research Foundation | Graphic processor unit providing reduced storage costs for similar operands |
| US20170371662A1 (en) * | 2016-06-23 | 2017-12-28 | Intel Corporation | Extension of register files for local processing of data in computing environments |
| US10929944B2 (en) * | 2016-11-23 | 2021-02-23 | Advanced Micro Devices, Inc. | Low power and low latency GPU coprocessor for persistent computing |
| US10558460B2 (en) * | 2016-12-14 | 2020-02-11 | Qualcomm Incorporated | General purpose register allocation in streaming processor |
| GB2558220B (en) * | 2016-12-22 | 2019-05-15 | Advanced Risc Mach Ltd | Vector generating instruction |
| US20180203694A1 (en) * | 2017-01-16 | 2018-07-19 | Intel Corporation | Execution Unit with Selective Instruction Pipeline Bypass |
| GB201717303D0 (en) * | 2017-10-20 | 2017-12-06 | Graphcore Ltd | Scheduling tasks in a multi-threaded processor |
| US10866806B2 (en) * | 2017-11-14 | 2020-12-15 | Nvidia Corporation | Uniform register file for improved resource utilization |
| US11163578B2 (en) * | 2018-02-23 | 2021-11-02 | Intel Corporation | Systems and methods for reducing register bank conflicts based on a software hint bit causing a hardware thread switch |
| CN108595258B (en) * | 2018-05-02 | 2021-07-27 | 北京航空航天大学 | A dynamic extension method of GPGPU register file |
| CN108733492A (en) * | 2018-05-20 | 2018-11-02 | 北京工业大学 | A kind of batch scheduling memory method divided based on Bank |
| US11138009B2 (en) * | 2018-08-10 | 2021-10-05 | Nvidia Corporation | Robust, efficient multiprocessor-coprocessor interface |
| US10698689B2 (en) * | 2018-09-01 | 2020-06-30 | Intel Corporation | Recompiling GPU code based on spill/fill instructions and number of stall cycles |
| GB2580327B (en) * | 2018-12-31 | 2021-04-28 | Graphcore Ltd | Register files in a multi-threaded processor |
| GB2584268B (en) * | 2018-12-31 | 2021-06-30 | Graphcore Ltd | Load-Store Instruction |
| CN111562976B (en) * | 2019-02-13 | 2023-04-18 | 同济大学 | GPU (graphics processing unit) acceleration method and system for radar imaging of electrically large target |
| WO2020177229A1 (en) * | 2019-03-01 | 2020-09-10 | Huawei Technologies Co., Ltd. | Inter-warp sharing of general purpose register data in gpu |
| WO2020186630A1 (en) * | 2019-03-21 | 2020-09-24 | Huawei Technologies Co., Ltd. | Serializing divergent accesses using peeling |
| CN110716755B (en) * | 2019-10-14 | 2023-05-02 | 浙江诺诺网络科技有限公司 | Thread exit method, device, equipment and readable storage medium |
| CN111124492B (en) * | 2019-12-16 | 2022-09-20 | 成都海光微电子技术有限公司 | Instruction generation method and device, instruction execution method, processor and electronic equipment |
| US11934867B2 (en) * | 2020-07-23 | 2024-03-19 | Nvidia Corp. | Techniques for divergent thread group execution scheduling |
- 2020
  - 2020-10-21 CN CN202210479765.2A patent/CN114816529B/en active Active
  - 2020-10-21 CN CN202011131448.9A patent/CN112214243B/en active Active
  - 2020-10-21 CN CN202210480192.5A patent/CN114968358B/en active Active
- 2021
  - 2021-04-14 TW TW110113318A patent/TWI793568B/en active
  - 2021-07-02 US US17/366,588 patent/US20220121444A1/en active Pending
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101344842A (en) * | 2007-07-10 | 2009-01-14 | 北京简约纳电子有限公司 | Multithreading processor and multithreading processing method |
| CN106575219A (en) * | 2014-09-26 | 2017-04-19 | 英特尔公司 | Instruction and logic for a vector format for processing computations |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112214243B (en) | 2022-05-27 |
| TWI793568B (en) | 2023-02-21 |
| CN114968358B (en) | 2025-04-25 |
| CN114816529A (en) | 2022-07-29 |
| TW202217601A (en) | 2022-05-01 |
| US20220121444A1 (en) | 2022-04-21 |
| CN112214243A (en) | 2021-01-12 |
| CN114968358A (en) | 2022-08-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN114816529B (en) | Apparatus and method for configuring cooperative thread bundles in a vector computing system | |
| CN111310910B (en) | Computing device and method | |
| JP6895484B2 (en) | Multithreaded processor register file | |
| JP6944974B2 (en) | Load / store instructions | |
| Kapasi et al. | The Imagine stream processor | |
| CN112381220B (en) | Neural network tensor processor | |
| CN108376097B (en) | Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines | |
| US20170371660A1 (en) | Load-store queue for multiple processor cores | |
| US11720332B2 (en) | Compiling a program from a graph | |
| TWI794789B (en) | Apparatus and method for vector computing | |
| US20100145992A1 (en) | Address Generation Unit Using Nested Loops To Scan Multi-Dimensional Data Structures | |
| CN111651203A (en) | A device and method for performing vector arithmetic | |
| US20230305844A1 (en) | Implementing specialized instructions for accelerating dynamic programming algorithms | |
| CN117808048A (en) | Operator execution method, device, equipment and storage medium | |
| Yang et al. | A case for a flexible scalar unit in SIMT architecture | |
| WO2017185404A1 (en) | Apparatus and method for performing vector logical operation | |
| US20100146241A1 (en) | Modified-SIMD Data Processing Architecture | |
| US9003165B2 (en) | Address generation unit using end point patterns to scan multi-dimensional data structures | |
| Forsell et al. | An extended PRAM-NUMA model of computation for TCF programming | |
| CN120469721B (en) | Vector core module of artificial intelligence chip and its operation method | |
| US6785743B1 (en) | Template data transfer coprocessor | |
| Stepchenkov et al. | Recurrent data-flow architecture: features and realization problems | |
| US11822541B2 (en) | Techniques for storing sub-alignment data when accelerating Smith-Waterman sequence alignments | |
| Hussain et al. | Mvpa: An fpga based multi-vector processor architecture | |
| US10996960B1 (en) | Iterating single instruction, multiple-data (SIMD) instructions |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| CB02 | Change of applicant information | ||
| CB02 | Change of applicant information |
Country or region after: China
Address after: 201114, Room 1302, 13/F, Building 16, 2388 Chenhang Road, Minhang District, Shanghai
Applicant after: Shanghai Bi Ren Technology Co.,Ltd.
Address before: 201114, Room 1302, 13/F, Building 16, 2388 Chenhang Road, Minhang District, Shanghai
Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.
Country or region before: China
| GR01 | Patent grant | ||
| GR01 | Patent grant |