CN114816529B - Apparatus and method for configuring cooperative thread bundles in a vector computing system - Google Patents
Apparatus and method for configuring cooperative thread bundles in a vector computing system
- Publication number: CN114816529B
- Application number: CN202210479765.2A
- Authority
- CN
- China
- Prior art keywords
- warp
- thread
- warps
- instruction
- thread bundle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30123—Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
- G06F9/3888—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
Abstract
The invention relates to an apparatus and method for configuring cooperative thread bundles in a vector computing system, the apparatus comprising a general-purpose register, an arithmetic logic unit, a thread bundle instruction scheduler, and a plurality of thread bundle resource registers. According to thread bundle allocation instructions in a program core, the thread bundle instruction scheduler lets each of a plurality of thread bundles contain a section of relatively independent instructions in the program core; at execution time, according to the configuration set by software, it lets each thread bundle access all of the general-purpose register, or a specified portion of it, through the arithmetic logic unit, which completes each thread bundle's operations. By enabling software to dynamically allocate portions of the general-purpose register to different thread bundles in this way, the invention adapts more broadly to different applications, such as big data and artificial intelligence computation.
Description
The application is a divisional application of the Chinese patent application filed on October 21, 2020, with application number 202011131448.9, entitled "Apparatus and method for configuring cooperative thread bundles in a vector computing system".
Technical Field
The present invention relates to vector computing devices, and more particularly to an apparatus and method for configuring cooperative thread bundles in a vector computing system.
Background
A vector computer is a computer equipped with specialized vector instructions to increase the speed of vector processing. Vector computers can process the data computation of multiple thread bundles (Warps) simultaneously, and are therefore much faster than scalar computers at processing thread bundle data. However, multiple thread bundles may conflict when accessing the general-purpose register file (GPR File); the present invention therefore proposes an apparatus and method for configuring cooperative thread bundles in a vector computing system.
Disclosure of Invention
In view of this, how to alleviate or eliminate the above-mentioned drawbacks of the related art is a problem that needs to be solved.
An embodiment of the invention relates to an apparatus for configuring cooperative thread bundles in a vector computing system, comprising a general-purpose register, an arithmetic logic unit, a thread bundle instruction scheduler, and a plurality of thread bundle resource registers. According to thread bundle allocation instructions in a program core, the thread bundle instruction scheduler lets each of a plurality of thread bundles contain a section of relatively independent instructions in the program core; at execution time, according to the configuration set by software, each thread bundle accesses all of the general-purpose register, or a specified portion of it, through the arithmetic logic unit, which completes each thread bundle's operations. Each thread bundle resource register is associated with one thread bundle and, according to its contents, maps that thread bundle's data accesses to a specified portion of the general-purpose register, where the specified portions mapped to different thread bundles do not overlap.
Embodiments of the present invention also relate to a method of configuring cooperative thread bundles in a vector computing system, comprising: letting each of a plurality of thread bundles contain a section of relatively independent instructions in a program core according to thread bundle allocation instructions in the program core; letting each thread bundle, at execution time and according to the configuration set by software, access all of a general-purpose register or a specified portion of it through an arithmetic logic unit; and completing the operations of each thread bundle through the arithmetic logic unit. The method further comprises mapping the data accesses of each thread bundle to a specified portion of the general-purpose register according to the contents of a respective thread bundle resource register, where the specified portions mapped to different thread bundles do not overlap.
One advantage of the above embodiments is that, by enabling software to dynamically allocate portions of the general-purpose register to different thread bundles as described above, the system adapts more broadly to different applications, such as big data and artificial intelligence operations.
Another advantage of the above embodiments is that, by dynamically assigning multiple sections of relatively independent instructions in the program core to different thread bundles, interference between thread bundles can be avoided, improving pipeline utilization.
Other advantages of the present invention will be explained in more detail in connection with the following description and accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application.
FIG. 1 is a block diagram of a vector operation system according to an embodiment of the present invention.
Fig. 2 is a block diagram of a streaming multiprocessor according to an embodiment of the present invention.
FIG. 3 is a split schematic diagram of general registers of some embodiments.
FIG. 4 is a diagram illustrating dynamic partitioning of a general purpose register with thread bundle resource registers according to an embodiment of the present invention.
FIG. 5 is a flow chart of a cooperative thread bundle applied to execute tasks in parallel, in accordance with an embodiment of the present invention.
FIG. 6 is a schematic diagram of a collaboration thread bundle for a producer and consumer in accordance with an embodiment of the present invention.
FIG. 7 is a flow chart of a cooperative thread bundle for performing producer-consumer tasks in accordance with an embodiment of the present invention.
Reference numerals in the drawings are briefly described as follows:
10: electronic device; 100: stream multiprocessor; 210: arithmetic logic unit; 220: thread bundle instruction scheduler; 230: general-purpose register; 240: instruction cache; 250: barrier register; 260, 260#0–260#7: thread bundle resource registers; 300#0–300#7: general-purpose register memory blocks; Base#0–Base#7: base locations; S510–S540: method steps; 610: consumer thread bundle; 621, 663: barrier instructions; 623, 661: series of instructions; 650: producer thread bundle; S710–S770: method steps.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings. In the drawings, like reference numerals designate identical or similar components or process flows.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, values, method steps, operation processes, components, and/or groups, but do not preclude the addition of further features, values, method steps, operation processes, components, and/or groups.
In the present invention, terms such as "first," "second," "third," and the like are used for modifying elements of the claims, and are not used for describing a priority order, a precedence order, or a temporal order in which elements of one method step are performed or are used for distinguishing between elements having the same name.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Conversely, when an element is described as being "directly connected" or "directly coupled" to another element, there are no intervening elements present. Other words used to describe the relationship between components may also be interpreted in a similar fashion, such as "between" versus "directly between," or "adjacent" versus "directly adjacent," etc.
Reference is made to fig. 1. The electronic device 10 may be implemented in a mainframe, workstation, personal computer, notebook computer (Laptop PC), tablet computer, mobile phone, digital camera, digital video camera, etc. The electronic device 10 may be provided with a stream multiprocessor cluster (Streaming Multiprocessor Cluster, SMC) in a vector computing system, including a plurality of stream multiprocessors (Streaming Multiprocessor, SM) 100; instruction execution between different stream multiprocessors 100 may be synchronized using signals. The stream multiprocessors 100 are programmed to perform a variety of application tasks, including but not limited to linear and nonlinear data transformation, database operations, big data operations, artificial intelligence computation, encoding and decoding, modeling operations on audio and video data, image rendering operations, etc. Each stream multiprocessor 100 may simultaneously execute multiple thread bundles (Warps), where each thread bundle consists of a group of threads (Group of Threads); a thread is the smallest unit of operation in hardware and has its own lifecycle. The thread bundles may be associated with single instruction multiple data (Single Instruction Multiple Data, SIMD) instructions, single instruction multiple thread (Single Instruction Multiple Thread, SIMT) techniques, and the like. Execution between different thread bundles may be independent or sequential. A thread may represent a task associated with one or more instructions. For example, each stream multiprocessor 100 may concurrently execute 8 thread bundles, each thread bundle including 32 threads. Although fig. 1 depicts 4 stream multiprocessors 100, those skilled in the art may arrange more or fewer stream multiprocessors in a vector computing system according to different needs, and the invention is not limited thereto.
Reference is made to fig. 2. Each stream multiprocessor 100 includes an instruction cache (Instruction Cache) 240 for storing the instructions of a program core (Kernel). Each stream multiprocessor 100 further comprises a thread bundle instruction scheduler (Warp Instruction Scheduler) 220 for fetching a series of instructions for each thread bundle and storing them in the instruction cache 240, and for fetching the instruction to be executed next for each thread bundle from the instruction cache 240 according to its program counter. Each thread bundle has a separate program counter (Program Counter, PC) register for recording the location (i.e., the instruction address) of the instruction currently being executed. Each time an instruction is fetched from the instruction cache for a thread bundle, the corresponding program counter is incremented. At appropriate points in time, the thread bundle instruction scheduler 220 sends instructions, defined in the instruction set architecture (Instruction Set Architecture, ISA) of the particular computing system, to the arithmetic logic unit (Arithmetic Logic Unit, ALU) 210 for execution. The arithmetic logic unit 210 may perform a wide variety of operations, such as integer and floating-point addition and multiplication, comparison, Boolean operations, bit shifting, and algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions). During execution, the arithmetic logic unit 210 may read data from specified locations (also referred to as source addresses) of the general-purpose registers (General-Purpose Registers, GPRs) 230 and write execution results back to specified locations (also referred to as destination addresses) of the general-purpose registers 230.
Each stream multiprocessor 100 also includes a barrier register (Barriers Register) 250, which software can use to synchronize execution among different thread bundles, and respective thread bundle resource registers (Resource-per-Warp Registers) 260, which can be used to dynamically configure the extent of the general-purpose registers 230 that each thread bundle may use during execution. Although fig. 2 only lists components 210 through 260, this is merely to briefly illustrate the features of the present invention; those skilled in the art will appreciate that each stream multiprocessor 100 also includes many more components.
In some embodiments, the general-purpose registers 230 may be physically or logically divided into blocks (Blocks), with each block of memory space allocated for access by only one thread bundle. The memory spaces of different blocks do not overlap, which avoids access conflicts between different thread bundles. Referring to FIG. 3, for example, when one stream multiprocessor 100 can process the data of eight thread bundles and the general-purpose register 230 contains 256 kilobytes (Kilobyte, KB) of memory, the memory of the general-purpose register 230 may be divided into eight blocks 300#0 to 300#7, each containing a non-overlapping 32KB of memory reserved for a specified thread bundle. However, since vector computing systems are often applied to big data and artificial intelligence computation, where the amount of data to process is huge, such a fixed partition may leave a single thread bundle with too little space to satisfy the computing requirements of a large amount of data. For big data and artificial intelligence applications, these embodiments may be modified so that each stream multiprocessor 100 processes the data of only one thread bundle, with the entire memory space of the general-purpose register 230 serving that thread bundle alone. However, when two consecutive instructions have a data dependency, that is, when the input of the second instruction is the output of the first, the arithmetic logic unit 210 does not operate efficiently: the second instruction must wait until the execution result of the first instruction is ready in the general-purpose register 230 before it can start.
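The fixed split described above can be sketched as follows (a minimal illustration using the 256KB/8-warp figures from this example; the tuple representation is purely hypothetical):

```python
# Fixed partitioning sketch: a 256 KB register file split evenly into
# eight 32 KB blocks, one per thread bundle (warp). Each block is a
# (start_kb, end_kb) half-open range.
GPR_KB = 256
N_WARPS = 8

blocks = [(w * GPR_KB // N_WARPS, (w + 1) * GPR_KB // N_WARPS)
          for w in range(N_WARPS)]

print(blocks[0], blocks[7])   # (0, 32) (224, 256): non-overlapping 32 KB each
```

Because each warp is confined to its own block, no two warps can touch the same register address, at the cost of a fixed 32KB ceiling per warp.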
For example, assuming that each instruction requires 8 clock cycles to pass through the pipeline from initiation to writing its result to the general-purpose register 230, the second instruction must wait for the execution result of the first instruction and can only begin execution from the 9th clock cycle. The instruction execution latency (Instruction Execution Latency) is then 8 clock cycles, resulting in very low pipeline utilization. Furthermore, because the stream multiprocessor 100 processes only one thread bundle, instructions that could otherwise be executed in parallel must be arranged for sequential execution, which is inefficient.
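The latency arithmetic above can be made concrete with a toy timing model (the 8-cycle latency and the round-robin interleaving of 8 warps are assumptions taken from this example, not hardware specifics):

```python
# Toy model: each instruction occupies one issue slot, and its result
# becomes visible LATENCY cycles after issue (assumed 8, as above).
LATENCY = 8

def cycles_single_warp(n_dependent_instrs):
    # Dependent chain in one warp: each instruction waits the full
    # latency of its predecessor, so one instruction retires per 8 cycles.
    return n_dependent_instrs * LATENCY

def cycles_interleaved(n_warps, n_dependent_instrs):
    # Round-robin over n_warps independent chains: by the time warp 0
    # issues its next instruction, the latency has already elapsed,
    # so the pipeline issues one instruction every cycle.
    issue_slots = n_warps * n_dependent_instrs
    return max(issue_slots, LATENCY * n_dependent_instrs)

print(cycles_single_warp(4))     # 32 cycles to retire 4 instructions
print(cycles_interleaved(8, 4))  # 32 cycles to retire 32 instructions
```

In the same 32 cycles, the interleaved schedule retires eight times as many instructions, which is exactly the bubble-hiding effect the multi-warp design targets.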
To solve the above problems, in one aspect, the thread bundle instruction scheduler 220, according to the configuration set by software at execution time, lets each of the plurality of thread bundles access all or a specified portion of the data in the general-purpose register 230 through the arithmetic logic unit 210, and completes the operations of each thread bundle through the arithmetic logic unit 210. By enabling software to dynamically allocate portions of the general-purpose registers to different thread bundles as described above, the system adapts more broadly to different applications, such as big data and artificial intelligence operations.
In another aspect, embodiments of the present invention provide an environment that enables software to determine the instruction segment that each thread bundle contains. In some embodiments, a program core may divide its instructions into multiple segments, each independent of the others and executed by one thread bundle. Table 1 is an example of pseudocode for a program core:
TABLE 1
Assuming that each stream multiprocessor 100 runs at most eight thread bundles and that each thread bundle has a unique identifier, when the thread bundle instruction scheduler 220 fetches the instructions of the program core shown in Table 1, it can check the thread bundle identifier for a particular thread bundle, jump to the instruction segment associated with this thread bundle and store it in the instruction cache 240, and then fetch instructions from the instruction cache 240 according to the corresponding program counter values and send them to the arithmetic logic unit 210 to complete the particular computation. In this way, each thread bundle performs its task independently, and all thread bundles can run simultaneously, keeping the pipeline in the arithmetic logic unit 210 as busy as possible to avoid bubbles (Bubbles). The instructions of each segment in the same program core may be referred to as relatively independent instructions. While the example of Table 1 adds conditional checks to achieve the segmentation of instructions in a program core, those skilled in the art may use other instructions that achieve the same or similar results. In a program core, the instructions used to assign instruction segments to multiple thread bundles may also be referred to as thread bundle allocation instructions. In general, the thread bundle instruction scheduler 220 may, according to the thread bundle allocation instructions in the program core, let each of the plurality of thread bundles contain a portion of the relatively independent instructions in the program core, so that the arithmetic logic unit 210 executes the plurality of thread bundles independently and in parallel.
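As a hypothetical illustration of such thread bundle allocation instructions (the function name, segment operations, and data are invented for this sketch and are not taken from Table 1), a segmented program core branching on the warp identifier might look like:

```python
def kernel(warp_id, data):
    # Hypothetical segmented program core: each thread bundle executes
    # only the segment selected by its identifier, so the segments are
    # relatively independent and can run concurrently.
    if warp_id == 0:
        return [x + 1 for x in data]   # segment 0: elementwise add
    elif warp_id == 1:
        return [x * 2 for x in data]   # segment 1: elementwise multiply
    else:
        return data                    # remaining warps: pass-through

# All eight warps can be dispatched with their own segment of the core.
results = [kernel(w, [1, 2, 3]) for w in range(8)]
print(results[0], results[1])   # [2, 3, 4] [2, 4, 6]
```

Each branch is the analogue of one instruction segment; a real scheduler would jump past the untaken branches rather than evaluate them.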
The barrier register 250 may store information for synchronizing the execution of different thread bundles, including the number of thread bundles that must reach the barrier before execution continues, and the number of thread bundles currently waiting to continue execution. To coordinate the execution of different thread bundles, software may set the contents of the barrier register 250 to record the number of thread bundles whose completion must be awaited. Each instruction segment in the program core may include a barrier instruction (Barrier Instruction) at appropriate points, depending on system requirements. When the thread bundle instruction scheduler 220 fetches a barrier instruction for a thread bundle, it increments the number of waiting thread bundles recorded in the barrier register 250 by 1 and puts that thread bundle into a waiting state. Next, the thread bundle instruction scheduler 220 examines the contents of the barrier register 250 to determine whether the number of thread bundles currently waiting is equal to or greater than the required number. If so, the thread bundle instruction scheduler 220 wakes up all waiting thread bundles, allowing them to continue execution; otherwise, it fetches the instructions of the next thread bundle from the instruction cache.
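The barrier bookkeeping described above can be sketched as follows (class and field names are hypothetical; only the arrive-count/required-count logic is taken from the text):

```python
class BarrierRegister:
    # Toy model of barrier register 250: 'required' is the number of
    # warps that must arrive (set by software); 'arrived' counts warps
    # currently parked at the barrier.
    def __init__(self, required):
        self.required = required
        self.arrived = 0
        self.waiting = []

    def on_barrier_instruction(self, warp_id):
        self.arrived += 1                  # scheduler bumps the count...
        self.waiting.append(warp_id)       # ...and parks the warp
        if self.arrived >= self.required:  # all arrived: wake everyone
            released, self.waiting = self.waiting, []
            self.arrived = 0
            return released                # warps resumed together
        return []                          # scheduler moves to next warp

bar = BarrierRegister(required=3)
assert bar.on_barrier_instruction(0) == []
assert bar.on_barrier_instruction(1) == []
print(bar.on_barrier_instruction(2))   # [0, 1, 2]: all three resume
```

The empty return models the scheduler simply switching to another warp while the barrier is not yet full.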
Furthermore, the partitioning of the embodiment depicted in FIG. 3 is preconfigured in the stream multiprocessor 100 and cannot be modified by software. However, the memory space required by different thread bundles may not be uniform: for some thread bundles the pre-partitioned memory space may be more than needed, while for others it may be insufficient.
In another aspect, while one stream multiprocessor 100 is capable of executing multiple thread bundles that all execute the same program core, embodiments of the present invention do not pre-partition the general-purpose registers 230 among the thread bundles. Specifically, to accommodate different applications more broadly, the stream multiprocessor 100 does not fix the division of the general-purpose registers 230 into multiple blocks of memory space for multiple thread bundles, but instead provides an environment that enables software to dynamically allocate the general-purpose registers 230 to different thread bundles, so that software can let each thread bundle use all or part of the general-purpose registers 230 depending on the needs of the application.
In other embodiments, each stream multiprocessor 100 may include respective thread bundle resource registers 260 for storing the base location of each thread bundle, each base location pointing to a particular address in the general-purpose register 230. To let different thread bundles access non-overlapping memory spaces in the general-purpose registers 230, software may dynamically change the contents of the thread bundle resource registers 260 to set the base location of each thread bundle. For example, referring to FIG. 4, software may divide the general-purpose register 230 into eight blocks for eight thread bundles, with block 0 (associated with the 0th thread bundle) covering the address range Base#0 through Base#1-1, block 1 (associated with the 1st thread bundle) covering Base#1 through Base#2-1, and so on. The software may set the contents of the respective thread bundle resource registers 260#0 to 260#7, before or at the beginning of execution of the program core, to point to the base address associated with each thread bundle in the general-purpose register 230. After the thread bundle instruction scheduler 220 fetches an instruction from the instruction cache 240 for the ith thread bundle, the source and destination addresses of the instruction may be adjusted according to the contents of the thread bundle resource register 260#i, mapping them to the memory space dynamically allocated to the ith thread bundle in the general-purpose register 230. For example, the original instruction is:
Dest_addr=Instr_i(Src_addr0,Src_addr1)
where Instr_i represents the opcode (OpCode) of the instruction assigned to the ith thread bundle, Src_addr0 represents the 0th source address, Src_addr1 represents the 1st source address, and Dest_addr represents the destination address.
The thread bundle instruction scheduler 220 modifies the instructions as described above to become:
Base#i+Dest_addr=Instr_i(Base#i+Src_addr0,Base#i+Src_addr1)
where Base#i represents the base location recorded in the thread bundle resource register 260#i. That is, the thread bundle instruction scheduler 220 adjusts the source and destination addresses of each instruction according to the contents of the respective thread bundle resource register 260, so that the specified portions of the general-purpose registers mapped to different thread bundles do not overlap.
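The Base#i rewriting above can be sketched in a few lines (the 256KB size and the equal split into eight bases are illustrative software choices, not requirements of the scheme):

```python
# Sketch of per-warp base-address remapping: software picks Base#0..#7,
# and the scheduler adds Base#i to every register address of warp i,
# so different warps touch non-overlapping regions of the file.
GPR_SIZE = 256 * 1024                             # 256 KB register file
bases = [w * (GPR_SIZE // 8) for w in range(8)]   # software-chosen Base#i

def remap(warp_id, dest, src0, src1):
    # Base#i + Dest_addr = Instr_i(Base#i + Src_addr0, Base#i + Src_addr1)
    b = bases[warp_id]
    return (b + dest, b + src0, b + src1)

print(remap(1, 4, 0, 8))   # (32772, 32768, 32776) with Base#1 = 32768
```

Because the bases are plain register contents, software can redistribute the blocks (e.g., give one warp a larger share) simply by writing different values into the thread bundle resource registers before launching the program core.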
In still other embodiments, both techniques are combined: the program core divides its instructions into multiple mutually independent sections, each executed by one thread bundle, and in addition the software sets the contents of the respective thread bundle resource registers 260#0 through 260#7, before or at the beginning of execution of the program core, to point to the base address associated with each thread bundle in the general-purpose register 230.
The invention is applicable to cooperative thread bundles (Cooperative Warps) that perform tasks in parallel; refer to the example flowchart shown in FIG. 5.
In step S510, the thread bundle instruction scheduler 220 starts fetching instructions for each thread bundle and stores them in the instruction cache 240.
Steps S520 to S540 form a loop. The thread bundle instruction scheduler 220 may use a scheduling method (e.g., a round-robin scheduling algorithm) to fetch the specified instructions one by one from the instruction cache 240 according to each thread bundle's program counter, and send them to the arithmetic logic unit 210 for execution. The thread bundle instruction scheduler 220 may sequentially fetch from the instruction cache 240 the instruction indicated by the program counter of thread bundle 0, the instruction indicated by the program counter of thread bundle 1, the instruction indicated by the program counter of thread bundle 2, and so on.
In step S520, the thread bundle instruction scheduler 220 fetches the instruction of the 0th or the next thread bundle from the instruction cache 240.
In step S530, the thread bundle instruction scheduler 220 sends the fetched instruction to the arithmetic logic unit 210.
In step S540, the arithmetic logic unit 210 performs the specified operation according to the input instruction. In the instruction execution pipeline, the arithmetic logic unit 210 fetches data from the source addresses of the general purpose registers 230, performs the specified operation on the fetched data, and stores the result of the operation to the destination address in the general purpose registers 230.
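The scheduling loop of steps S520-S540 can be sketched as a small software model; the function name `round_robin_issue` and the list-based representation of the instruction cache are illustrative assumptions, not part of the patent:

```python
from collections import deque

def round_robin_issue(warp_programs):
    """Model of steps S520-S540: in round-robin order, fetch the
    instruction indicated by each warp's program counter and issue it
    to the ALU, until every warp has exhausted its instructions."""
    pcs = [0] * len(warp_programs)           # one program counter per warp
    ready = deque(range(len(warp_programs))) # rotation of warp IDs
    issue_trace = []                         # (warp_id, instruction) pairs
    while ready:
        w = ready.popleft()
        program = warp_programs[w]
        if pcs[w] < len(program):
            issue_trace.append((w, program[pcs[w]]))  # send to ALU (S530/S540)
            pcs[w] += 1
            ready.append(w)                  # warp rejoins the rotation
    return issue_trace

trace = round_robin_issue([["i0", "i1"], ["j0", "j1"]])
print(trace)  # [(0, 'i0'), (1, 'j0'), (0, 'i1'), (1, 'j1')]
```

The interleaved trace shows why the per-warp address adjustment matters: instructions from different warps reach the arithmetic logic unit back to back and must not collide in the general purpose registers.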
To avoid conflicts when different thread bundles access the general purpose registers 230, in some embodiments of step S510 the thread bundle instruction scheduler 220 may, in accordance with the thread bundle allocation instructions in the program core (e.g., the example shown in Table 1), make each thread bundle responsible for processing multiple instructions of a specified segment in the same program core; these instructions (also referred to as relatively independent instructions) are arranged to be independent of each other and may be executed in parallel.
In other embodiments, before an instruction is provided to the arithmetic logic unit 210 (i.e., prior to step S530), the thread bundle instruction scheduler 220 may adjust the source and destination addresses in the instruction according to the contents of the corresponding thread bundle resource register 260, mapping them to the space in the general purpose registers 230 dynamically allocated to that thread bundle.
In still other embodiments, in addition to having each thread bundle be responsible for processing multiple instructions of a specified segment in the same program core in step S510 according to the associated instructions in the program core (as illustrated in the example of Table 1), the thread bundle instruction scheduler 220 may also adjust the source and destination addresses in each instruction according to the contents of the corresponding thread bundle resource register 260 before sending it to the arithmetic logic unit 210 (before step S530).
The present invention is also applicable to cooperative thread bundles that perform producer-consumer tasks. Referring to FIG. 6, assume that thread bundle 610 acts as the consumer of data and thread bundle 650 acts as the producer. In other words, execution of a portion of the instructions in thread bundle 610 requires referencing the execution results of a portion of the instructions in thread bundle 650. In some embodiments, software may configure the contents of the thread bundle resource registers corresponding to thread bundles 610 and 650 at execution time, so that when executing the instructions of thread bundles 610 and 650, the arithmetic logic unit 210 accesses overlapping blocks in the general purpose registers 230.
For the details of producer-consumer task execution, refer to the flowchart example shown in FIG. 7.
In step S710, the thread bundle instruction scheduler 220 starts fetching instructions for each thread bundle and stores them in the instruction cache 240. According to the relevant instructions in the program core (e.g., the example shown in Table 1), the thread bundle instruction scheduler 220 may make each thread bundle responsible for processing multiple instructions of a specified segment in the same program core; these segments are arranged to form a producer-consumer relationship.
In step S720, the thread bundle instruction scheduler 220 obtains the barrier instruction (Barrier Instruction) 621 of the consumer thread bundle 610 from the instruction cache 240, and accordingly brings the consumer thread bundle 610 into a wait state.
In step S730, the thread bundle instruction scheduler 220 fetches a series of instructions 661 of the producer thread bundle 650 from the instruction cache 240 and sequentially sends the fetched instructions to the arithmetic logic unit 210.
In step S740, the arithmetic logic unit 210 performs the specified operations according to the input instructions 661. In the instruction execution pipeline, the arithmetic logic unit 210 fetches data from the source addresses of the general purpose registers 230, performs the specified operations on the fetched data, and stores the results of the operations to the destination addresses in the general purpose registers 230.
In step S750, thread bundle instruction scheduler 220 obtains barrier instruction 663 of producer thread bundle 650 from instruction cache 240 and wakes up consumer thread bundle 610 accordingly. In some embodiments, the thread bundle instruction scheduler 220 may also put the producer thread bundle 650 into a wait state.
In step S760, the thread bundle instruction scheduler 220 fetches a series of instructions 623 of the consumer thread bundle 610 from the instruction cache 240 and sequentially passes the fetched instructions to the arithmetic logic unit 210.
In step S770, the arithmetic logic unit 210 performs the specified operations according to the input instructions 623. In the instruction execution pipeline, the arithmetic logic unit 210 fetches data from the source addresses of the general purpose registers 230 (including the data previously generated by the producer thread bundle 650), performs the specified operations on the fetched data, and stores the results of the operations to the destination addresses in the general purpose registers 230.
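The barrier-driven flow of steps S720-S770 can be mimicked with ordinary software threads. In this hedged sketch, a Python `threading.Event` stands in for the hardware barrier instructions 621 and 663, and `shared_regs` stands in for the overlapping block in the general purpose registers 230; the function names are illustrative assumptions:

```python
import threading

shared_regs = {}              # stands in for the overlapping GPR block
barrier = threading.Event()   # stands in for barrier instructions 621/663

def producer():               # models thread bundle 650
    shared_regs["partial"] = sum(range(10))  # instructions 661 write results
    barrier.set()             # barrier 663: wake the waiting consumer

def consumer(out):            # models thread bundle 610
    barrier.wait()            # barrier 621: enter wait state until woken
    out.append(shared_regs["partial"] * 2)   # instructions 623 consume data

results = []
threads = [threading.Thread(target=consumer, args=(results,)),
           threading.Thread(target=producer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [90]
```

Even though the consumer is started first, the barrier guarantees it only reads the shared block after the producer has written it, mirroring the ordering enforced by steps S720 through S760.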
It should be appreciated that the content of steps S730, S740, S760, and S770 is merely a brief description for ease of understanding; during the execution of steps S730, S740, S760, and S770, the thread bundle instruction scheduler 220 may also obtain instructions of other thread bundles (i.e., any thread bundle that is neither thread bundle 610 nor thread bundle 650) from the instruction cache 240 and drive the arithmetic logic unit 210 to perform operations.
Although the components described above are included in FIGS. 1 and 2, the use of additional components to achieve better technical effects is not excluded, provided the spirit of the invention is not violated. In addition, although the flowcharts of FIGS. 5 and 7 are executed in the order specified, those skilled in the art may modify the order among these steps without departing from the spirit of the invention, and therefore the invention is not limited to the order described above. Furthermore, one skilled in the art may integrate several steps into one step, or perform additional steps sequentially or in parallel, and the invention should not be limited thereby.
The above description covers only preferred embodiments of the present application and is not intended to limit it; any person skilled in the art may make further modifications and variations without departing from the spirit and scope of the present application, and the scope of protection is defined by the appended claims.
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210479765.2A CN114816529B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundles in a vector computing system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210479765.2A CN114816529B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundles in a vector computing system |
| CN202011131448.9A CN112214243B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundle in vector operation system |
Related Parent Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011131448.9A Division CN112214243B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundle in vector operation system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114816529A CN114816529A (en) | 2022-07-29 |
| CN114816529B true CN114816529B (en) | 2025-07-18 |
Family
ID=74056291
Family Applications (3)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210479765.2A Active CN114816529B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundles in a vector computing system |
| CN202011131448.9A Active CN112214243B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundle in vector operation system |
| CN202210480192.5A Active CN114968358B (en) | 2020-10-21 | 2020-10-21 | Device and method for configuring cooperative thread warps in vector computing system |
Family Applications After (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011131448.9A Active CN112214243B (en) | 2020-10-21 | 2020-10-21 | Apparatus and method for configuring cooperative thread bundle in vector operation system |
| CN202210480192.5A Active CN114968358B (en) | 2020-10-21 | 2020-10-21 | Device and method for configuring cooperative thread warps in vector computing system |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20220121444A1 (en) |
| CN (3) | CN114816529B (en) |
| TW (1) | TWI793568B (en) |
Families Citing this family (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114035847B (en) * | 2021-11-08 | 2023-08-29 | 海飞科(南京)信息技术有限公司 | Method and apparatus for parallel execution of kernel programs |
| CN114896079B (en) * | 2022-05-26 | 2023-11-24 | 上海壁仞智能科技有限公司 | Instruction execution method, processor and electronic device |
| CN116483536B (en) * | 2023-04-24 | 2024-05-10 | 上海芷锐电子科技有限公司 | Data scheduling method, computing chip and electronic equipment |
| CN116360708B (en) * | 2023-05-26 | 2023-08-11 | 摩尔线程智能科技(北京)有限责任公司 | Data writing method and device, electronic equipment and storage medium |
| CN118732958B (en) * | 2024-09-02 | 2025-02-28 | 山东浪潮科学研究院有限公司 | Warp-aware memory controller |
| CN119440774B (en) * | 2025-01-08 | 2025-05-13 | 山东浪潮科学研究院有限公司 | A conflicting thread warp scheduling method and a GPGPU register file access method and system |
| CN121166401A (en) * | 2025-11-20 | 2025-12-19 | 上海壁仞科技股份有限公司 | Inter-thread data sharing methods, electronic devices, storage media, and application products |
| CN121144256A (en) * | 2025-11-20 | 2025-12-16 | 上海壁仞科技股份有限公司 | Artificial intelligence chips and collaborative thread beam calculation methods |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101344842A (en) * | 2007-07-10 | 2009-01-14 | 北京简约纳电子有限公司 | Multithreading processor and multithreading processing method |
| CN106575219A (en) * | 2014-09-26 | 2017-04-19 | 英特尔公司 | Instruction and logic for a vector format for processing computations |
Family Cites Families (39)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6925643B2 (en) * | 2002-10-11 | 2005-08-02 | Sandbridge Technologies, Inc. | Method and apparatus for thread-based memory access in a multithreaded processor |
| US7472258B2 (en) * | 2003-04-21 | 2008-12-30 | International Business Machines Corporation | Dynamically shared group completion table between multiple threads |
| US7290261B2 (en) * | 2003-04-24 | 2007-10-30 | International Business Machines Corporation | Method and logical apparatus for rename register reallocation in a simultaneous multi-threaded (SMT) processor |
| CN1278227C (en) * | 2004-06-25 | 2006-10-04 | 中国科学院计算技术研究所 | A processor multithreading method and apparatus based on MIPS instruction set |
| TWI296387B (en) * | 2005-12-13 | 2008-05-01 | Nat Univ Tsing Hua | Scheduling method for remote object procedure call and system thereof |
| US8321849B2 (en) * | 2007-01-26 | 2012-11-27 | Nvidia Corporation | Virtual architecture and instruction set for parallel thread computing |
| US7685409B2 (en) * | 2007-02-21 | 2010-03-23 | Qualcomm Incorporated | On-demand multi-thread multimedia processor |
| US9639479B2 (en) * | 2009-09-23 | 2017-05-02 | Nvidia Corporation | Instructions for managing a parallel cache hierarchy |
| US10360039B2 (en) * | 2009-09-28 | 2019-07-23 | Nvidia Corporation | Predicted instruction execution in parallel processors with reduced per-thread state information including choosing a minimum or maximum of two operands based on a predicate value |
| US9710275B2 (en) * | 2012-11-05 | 2017-07-18 | Nvidia Corporation | System and method for allocating memory of differing properties to shared data objects |
| US20140258680A1 (en) * | 2013-03-05 | 2014-09-11 | Qualcomm Incorporated | Parallel dispatch of coprocessor instructions in a multi-thread processor |
| CN103955356B (en) * | 2014-04-24 | 2017-05-10 | 深圳中微电科技有限公司 | General-purpose register bank distribution method and device in multithreaded processor |
| US9804666B2 (en) * | 2015-05-26 | 2017-10-31 | Samsung Electronics Co., Ltd. | Warp clustering |
| CN106325996B (en) * | 2015-06-19 | 2019-11-19 | 华为技术有限公司 | A method and system for allocating GPU resources |
| GB2539958B (en) * | 2015-07-03 | 2019-09-25 | Advanced Risc Mach Ltd | Data processing systems |
| KR102545176B1 (en) * | 2015-11-16 | 2023-06-19 | 삼성전자주식회사 | Method and apparatus for register management |
| CN106648545A (en) * | 2016-01-18 | 2017-05-10 | 天津大学 | Register file structure used for branch processing in GPU |
| US10115175B2 (en) * | 2016-02-19 | 2018-10-30 | Qualcomm Incorporated | Uniform predicates in shaders for graphics processing units |
| US10592466B2 (en) * | 2016-05-12 | 2020-03-17 | Wisconsin Alumni Research Foundation | Graphic processor unit providing reduced storage costs for similar operands |
| US20170371662A1 (en) * | 2016-06-23 | 2017-12-28 | Intel Corporation | Extension of register files for local processing of data in computing environments |
| US10929944B2 (en) * | 2016-11-23 | 2021-02-23 | Advanced Micro Devices, Inc. | Low power and low latency GPU coprocessor for persistent computing |
| US10558460B2 (en) * | 2016-12-14 | 2020-02-11 | Qualcomm Incorporated | General purpose register allocation in streaming processor |
| GB2558220B (en) * | 2016-12-22 | 2019-05-15 | Advanced Risc Mach Ltd | Vector generating instruction |
| US20180203694A1 (en) * | 2017-01-16 | 2018-07-19 | Intel Corporation | Execution Unit with Selective Instruction Pipeline Bypass |
| GB201717303D0 (en) * | 2017-10-20 | 2017-12-06 | Graphcore Ltd | Scheduling tasks in a multi-threaded processor |
| US10866806B2 (en) * | 2017-11-14 | 2020-12-15 | Nvidia Corporation | Uniform register file for improved resource utilization |
| US11163578B2 (en) * | 2018-02-23 | 2021-11-02 | Intel Corporation | Systems and methods for reducing register bank conflicts based on a software hint bit causing a hardware thread switch |
| CN108595258B (en) * | 2018-05-02 | 2021-07-27 | 北京航空航天大学 | A dynamic extension method of GPGPU register file |
| CN108733492A (en) * | 2018-05-20 | 2018-11-02 | 北京工业大学 | A kind of batch scheduling memory method divided based on Bank |
| US11138009B2 (en) * | 2018-08-10 | 2021-10-05 | Nvidia Corporation | Robust, efficient multiprocessor-coprocessor interface |
| US10698689B2 (en) * | 2018-09-01 | 2020-06-30 | Intel Corporation | Recompiling GPU code based on spill/fill instructions and number of stall cycles |
| GB2580327B (en) * | 2018-12-31 | 2021-04-28 | Graphcore Ltd | Register files in a multi-threaded processor |
| GB2584268B (en) * | 2018-12-31 | 2021-06-30 | Graphcore Ltd | Load-Store Instruction |
| CN111562976B (en) * | 2019-02-13 | 2023-04-18 | 同济大学 | GPU (graphics processing unit) acceleration method and system for radar imaging of electrically large target |
| WO2020177229A1 (en) * | 2019-03-01 | 2020-09-10 | Huawei Technologies Co., Ltd. | Inter-warp sharing of general purpose register data in gpu |
| WO2020186630A1 (en) * | 2019-03-21 | 2020-09-24 | Huawei Technologies Co., Ltd. | Serializing divergent accesses using peeling |
| CN110716755B (en) * | 2019-10-14 | 2023-05-02 | 浙江诺诺网络科技有限公司 | Thread exit method, device, equipment and readable storage medium |
| CN111124492B (en) * | 2019-12-16 | 2022-09-20 | 成都海光微电子技术有限公司 | Instruction generation method and device, instruction execution method, processor and electronic equipment |
| US11934867B2 (en) * | 2020-07-23 | 2024-03-19 | Nvidia Corp. | Techniques for divergent thread group execution scheduling |
- 2020
  - 2020-10-21 CN CN202210479765.2A patent/CN114816529B/en active Active
  - 2020-10-21 CN CN202011131448.9A patent/CN112214243B/en active Active
  - 2020-10-21 CN CN202210480192.5A patent/CN114968358B/en active Active
- 2021
  - 2021-04-14 TW TW110113318A patent/TWI793568B/en active
  - 2021-07-02 US US17/366,588 patent/US20220121444A1/en active Pending
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101344842A (en) * | 2007-07-10 | 2009-01-14 | 北京简约纳电子有限公司 | Multithreading processor and multithreading processing method |
| CN106575219A (en) * | 2014-09-26 | 2017-04-19 | 英特尔公司 | Instruction and logic for a vector format for processing computations |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112214243B (en) | 2022-05-27 |
| TWI793568B (en) | 2023-02-21 |
| CN114968358B (en) | 2025-04-25 |
| CN114816529A (en) | 2022-07-29 |
| TW202217601A (en) | 2022-05-01 |
| US20220121444A1 (en) | 2022-04-21 |
| CN112214243A (en) | 2021-01-12 |
| CN114968358A (en) | 2022-08-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN114816529B (en) | Apparatus and method for configuring cooperative thread bundles in a vector computing system | |
| CN111310910B (en) | Computing device and method | |
| JP6895484B2 (en) | Multithreaded processor register file | |
| JP6944974B2 (en) | Load / store instructions | |
| Kapasi et al. | The Imagine stream processor | |
| CN112381220B (en) | Neural network tensor processor | |
| CN108376097B (en) | Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines | |
| US20170371660A1 (en) | Load-store queue for multiple processor cores | |
| US11720332B2 (en) | Compiling a program from a graph | |
| TWI794789B (en) | Apparatus and method for vector computing | |
| US20100145992A1 (en) | Address Generation Unit Using Nested Loops To Scan Multi-Dimensional Data Structures | |
| CN111651203A (en) | A device and method for performing vector arithmetic | |
| US20230305844A1 (en) | Implementing specialized instructions for accelerating dynamic programming algorithms | |
| CN117808048A (en) | Operator execution method, device, equipment and storage medium | |
| Yang et al. | A case for a flexible scalar unit in SIMT architecture | |
| WO2017185404A1 (en) | Apparatus and method for performing vector logical operation | |
| US20100146241A1 (en) | Modified-SIMD Data Processing Architecture | |
| US9003165B2 (en) | Address generation unit using end point patterns to scan multi-dimensional data structures | |
| Forsell et al. | An extended PRAM-NUMA model of computation for TCF programming | |
| CN120469721B (en) | Vector core module of artificial intelligence chip and its operation method | |
| US6785743B1 (en) | Template data transfer coprocessor | |
| Stepchenkov et al. | Recurrent data-flow architecture: features and realization problems | |
| US11822541B2 (en) | Techniques for storing sub-alignment data when accelerating Smith-Waterman sequence alignments | |
| Hussain et al. | Mvpa: An fpga based multi-vector processor architecture | |
| US10996960B1 (en) | Iterating single instruction, multiple-data (SIMD) instructions |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| CB02 | Change of applicant information | ||
| CB02 | Change of applicant information |
Country or region after: China
Address after: 201114, Room 1302, 13/F, Building 16, 2388 Chenhang Road, Minhang District, Shanghai
Applicant after: Shanghai Bi Ren Technology Co.,Ltd.
Address before: 201114, Room 1302, 13/F, Building 16, 2388 Chenhang Road, Minhang District, Shanghai
Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.
Country or region before: China
| GR01 | Patent grant | ||
| GR01 | Patent grant |