Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In modern parallel computing architectures such as the GPGPU, a single-instruction multiple-thread (Single Instruction Multiple Threads, SIMT) execution mode is often employed to increase computing efficiency. In this mode, multiple threads are organized into a thread bundle (warp, typically comprising 32 threads), and threads within the same thread bundle execute the same instruction in parallel. Thread bundles are in turn organized into a thread block (thread block), which is scheduled for execution on a streaming multiprocessor.
To better organize and manage complex collaborative computing tasks, multiple thread bundles within the same thread block may be further divided into one or more thread bundle groups (warp groups). Each thread bundle group typically contains multiple thread bundles and is assigned to perform a particular sub-task in a larger computing task. For example, in complex operations such as matrix multiply-accumulate, the thread bundles within a thread block may be divided into thread bundle groups that are responsible for core matrix computation and thread bundle groups that are responsible for processing computation results (e.g., performing data type conversion, writing back to memory, etc.), each of which may be composed of four consecutive thread bundles (warp). This partitioning allows the computing units of different functions to efficiently cooperate in a pipelined fashion.
To accomplish these collaborative computing tasks, efficient data exchange and synchronization is often required between different threads, particularly threads belonging to different thread bundles, and even to different thread bundle groups. At present, inter-thread data sharing mainly depends on the following modes:
The first approach is to use shared memory (Shared Memory). Shared memory is an on-chip readable and writable memory area that is visible to all threads (including all thread bundles and thread bundle groups) within the same thread block, and threads may exchange data by reading and writing addresses in it. However, although the access speed of shared memory is far faster than that of global memory, a shared memory access is still a memory access: its latency is higher than that of a direct register access, and the correct ordering of data reads and writes must be ensured through explicit synchronization instructions. This causes a certain performance overhead, especially in scenarios where data needs to be exchanged frequently.
The second approach is to use intra-thread-bundle instructions, such as a shuffle instruction. These allow data to be exchanged directly between threads within the same thread bundle through registers, without going through shared memory. Because the registers are operated on directly, the speed is very fast and the latency is low.
However, register-level data exchange based on instructions such as shuffle is limited to threads within a single thread bundle. When the threads that need to exchange data belong to different thread bundles, and in particular to different thread bundle groups, this approach is no longer applicable. In such a scenario, for example, when the thread bundle group responsible for the matrix computation needs to pass its computation result to the thread bundle group responsible for processing that result, the transfer can only be completed through shared memory, which limits the overall performance of the parallel computation.
In addition, even within a thread bundle, shuffle-type instructions have deficiencies. They tend to have high instruction overhead and long execution latency, and they limit the number of threads that can participate in a data exchange (e.g., to fewer than 32), which affects programming flexibility and program execution efficiency to some extent.
In this regard, the present invention provides a method for sharing data between threads. The method first constructs cooperative thread groups that cross the boundaries of traditional thread groupings through a new thread grouping mode, then provides each cooperative thread group with a register-based shared data channel that is isolated from those of other groups, and finally defines a lightweight group-level synchronization mechanism to ensure the correctness of data reads and writes. Threads that belong to different thread bundles, or even to different thread bundle groups, and that originally needed to communicate through the slower shared memory, can thereby share data directly through high-speed registers. This remarkably reduces the latency of data exchange between threads across thread bundles or even thread bundle groups, improves the execution efficiency of parallel computation, and overcomes the above defects.
It should be noted that, in the inter-thread data sharing method provided by the present invention, the execution subject may be a computing device with parallel processing capability. The computing device may be hardware with parallel computing capability, such as a GPGPU, a graphics processor (Graphics Processing Unit, GPU), a graphics processing unit integrated in a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), and the like.
Taking the GPGPU architecture as an example, the execution subject may specifically be one or more streaming multiprocessors or computing units in the GPGPU. When a user-written parallel computing program (commonly referred to as a kernel) is scheduled to execute on the computing device, the instructions within the program are executed by the compute cores in these streaming multiprocessors or computing units. The program logic is broken down into a large number of threads, which are organized into thread blocks and thread bundles and distributed onto the streaming multiprocessors or computing units. The method of the present invention is mainly executed in the streaming multiprocessors or computing units for a plurality of threads in a thread block, and realizes efficient inter-thread data sharing by regrouping the threads and providing new data exchange and synchronization mechanisms.
FIG. 1 is a flow chart of a method for sharing data between threads according to the present invention, as shown in FIG. 1, the method includes:
Step S10, dividing a plurality of threads in a thread block into at least one cooperative thread group based on a thread index of each thread in the thread block, wherein the thread block is divided into a plurality of thread bundle groups, and the threads in each cooperative thread group are from at least two different thread bundle groups.
In particular, the purpose of this step is to logically regroup multiple threads within a thread block across thread bundle group boundaries to form cooperative thread groups. Specifically, it is first necessary to determine the unique identity, i.e., the thread index, of each thread within the thread block. The index is a logical, linear number. For example, in a three-dimensionally organized thread block, the thread index of each thread can be calculated by the following formula:
thread_rank = threadIdx.z × blockDim.y × blockDim.x + threadIdx.y × blockDim.x + threadIdx.x
Wherein, thread_rank represents the thread index of a thread in the thread block to which it belongs, blockDim.x and blockDim.y represent the number of threads of the thread block in the x and y dimensions respectively, and threadIdx.x, threadIdx.y and threadIdx.z represent the index of the thread in the x, y and z dimensions of the thread block to which it belongs, that is, the three-dimensional coordinates of the thread within its thread block.
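The linearization above can be checked with a small script. This is a minimal sketch: the parameter names mirror CUDA's threadIdx/blockDim fields, but the function itself is just the arithmetic from the formula, and the 4×2×2 block size is an arbitrary example.

```python
def thread_rank(tx, ty, tz, bdx, bdy):
    """Linear thread index within a 3-D thread block.

    tx, ty, tz: the thread's coordinates (threadIdx.x/y/z);
    bdx, bdy:   block dimensions in x and y (blockDim.x/y).
    The z-dimension size is not needed for the linearization itself.
    """
    return tz * bdy * bdx + ty * bdx + tx

# A 4 x 2 x 2 thread block: ranks enumerate 0..15 in x-fastest order.
ranks = [thread_rank(x, y, z, 4, 2)
         for z in range(2) for y in range(2) for x in range(4)]
print(ranks)  # [0, 1, 2, ..., 15]
```

As expected, enumerating the coordinates x-fastest yields consecutive ranks, which is what makes the index a valid linear numbering.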
After the unique thread index of each thread is obtained, the threads can be distributed into different cooperative thread groups according to a preset partitioning rule. A key feature of this partitioning rule is that it can partition threads belonging to different thread bundles into the same cooperative thread group. This breaks through the limitations of the conventional art where collaboration is typically limited to the interior of a single thread bundle or group of thread bundles.
Illustratively, it is assumed that a thread bundle group consists of 4 consecutive thread bundles, each thread bundle comprising 32 threads, so that threads numbered 0-127 within a thread block constitute thread bundle group one, threads numbered 128-255 constitute thread bundle group two, and so on. The grouping mode of the embodiment of the invention may divide the thread with index i and the thread with index 128+i (where 0 ≤ i < 128) into the same cooperative thread group. It is apparent that thread i belongs to thread bundle group one and thread 128+i belongs to thread bundle group two; they come from different thread bundle groups, yet logically belong to the same cooperative thread group. This division across thread bundle groups lays the foundation for more flexible direct data interaction among threads.
Step S20, when receiving a write operation of a first thread in any cooperative thread group, writing target data corresponding to the write operation into a shared register resource associated with the any cooperative thread group.
Specifically, after dividing the threads into cooperative thread groups, this step aims to provide an efficient data sharing channel for the threads within the group using shared register resources.
In a specific implementation, for each cooperative thread group, when a first thread in the cooperative thread group (i.e., a thread performing a write operation, which may be any thread in the cooperative thread group) needs to share target data, calculated by itself or obtained from other storage levels, with the other threads in its cooperative thread group, it performs a write operation. The write operation here is not directed to conventional global memory or shared memory, but rather to a shared register resource uniquely associated with the cooperative thread group.
The shared register resource here is a special storage resource. At the hardware level, it may be part of a register file physically present in a streaming multiprocessor or computing unit of the GPGPU. It should be appreciated that in some GPGPU architectures there may be underutilized register resources, which embodiments of the present invention may repurpose as inter-thread shared storage space. Moreover, this block of resources is shared by, and accessible to, all threads within the same cooperative thread group, while different cooperative thread groups are isolated from one another. In other words, the threads in cooperative thread group A cannot access the data written by the threads in cooperative thread group B, so that the isolation of data is ensured. The association between the shared register resource and the cooperative thread group ensures that the range of data sharing is precisely limited to the inside of the cooperative thread group, and data collision and interference are avoided.
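The association and isolation described in Step S20 can be modeled in a few lines of Python. This is purely an illustrative stand-in — the real mechanism is a hardware register file, and the dictionary keyed by group identifier merely mimics the "one isolated slot per cooperative thread group" property:

```python
class SharedRegisterFile:
    """Toy model of per-cooperative-thread-group shared registers.

    Each group identifier maps to its own isolated slot; this stands
    in for the group-associated region of a physical register file.
    """
    def __init__(self):
        self._slots = {}

    def write(self, group_id, value):
        # Step S20: a write by any thread of a group lands only in
        # the slot associated with that group.
        self._slots[group_id] = value

    def read(self, group_id):
        # A thread can only name its own group's slot, so data
        # written by another group is invisible to it.
        return self._slots.get(group_id)

regs = SharedRegisterFile()
regs.write(0, "result-from-group-0")
regs.write(1, "result-from-group-1")
print(regs.read(0))  # the two groups' data never mix
```

The point of the model is the keying discipline: because every access carries the group identifier, group A's writes can never collide with group B's reads.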
Step S30, when receiving a read operation of the second thread in the any cooperative thread group for the target data, executing a synchronization operation for the any cooperative thread group, and after the execution of the synchronization operation is completed, reading the target data from the shared register resource.
It should be noted that, in the parallel execution mode, the execution progress of different threads cannot be guaranteed due to the uncertainty of hardware scheduling. A thread performing a read operation (referred to herein as the second thread) may start before the write operation of the first thread has completed. If the second thread attempts to read before the first thread finishes writing the data, it will read invalid or stale data.
To solve this problem, this step introduces a synchronization operation. When a second thread (which may be any thread in the cooperative thread group other than the first thread) needs to read target data written by another member of the group (such as the first thread), a synchronization operation for the cooperative thread group in which the second thread is located must be triggered first. Here, synchronization is primarily used to suspend the execution of a thread requesting synchronization (e.g., the second thread) until all threads within the cooperative thread group have reached this synchronization point. The synchronization operation ensures that all write operations initiated by any thread in the group before the synchronization point (in particular, write operations directed to the shared register resource) have been performed, and that their results are visible to all threads in the group.
After the synchronization operation is completed, the consistency of the data is ensured. At this point, the second thread may safely perform a read operation, reading the target data written by the first thread from the same shared register resource associated with the cooperating thread group. Because the first thread and the second thread belong to the same cooperative thread group, the first thread and the second thread access the same physical register space, and therefore data transmission can be completed correctly.
It will be appreciated that the above synchronization operation targets the smaller granularity of the cooperative thread group, rather than the entire thread block. Compared with traditional thread-block-level synchronization, group-level synchronization has a smaller scope of influence and fewer participating threads, so lower synchronization overhead and latency can be achieved.
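The write / synchronize / read ordering of steps S20-S30 can be sketched with ordinary Python threads and a barrier scoped to one two-thread cooperative group. This is a simulation of the group-level synchronization point, not of real GPGPU hardware; the value 42 and the group size are arbitrary illustration choices:

```python
import threading

shared_reg = {}                       # stands in for the group's shared register
group_barrier = threading.Barrier(2)  # sync point for a 2-thread cooperative group

def first_thread():                   # the writing thread
    shared_reg["target"] = 42         # Step S20: write the target data
    group_barrier.wait()              # group-level synchronization point

results = []

def second_thread():                  # the reading thread
    group_barrier.wait()              # Step S30: synchronize before reading
    results.append(shared_reg["target"])

t1 = threading.Thread(target=first_thread)
t2 = threading.Thread(target=second_thread)
t2.start(); t1.start()                # start order deliberately reversed
t1.join(); t2.join()
print(results)  # [42]: the read is ordered after the write
```

Note that the reader is started first on purpose: without the barrier it could race ahead and read a missing value, which is exactly the hazard the synchronization operation eliminates.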
The method provided by the embodiment of the invention constructs, through a new thread grouping mode, cooperative thread groups that cross traditional thread bundle group boundaries; it is not limited by the traditional division into thread bundles or thread bundle groups, and flexibly organizes threads from at least two different thread bundle groups into the same cooperative thread group according to the thread index. Because a thread bundle group is composed of a plurality of thread bundles, the fact that the threads in a cooperative thread group cross different thread bundle groups necessarily means that they also cross different thread bundles. When threads across thread bundles or thread bundle groups need to exchange data, the invention provides a register-based high-speed channel: the data to be exchanged can be written into the shared register resource associated with the cooperative thread group instead of the traditional shared memory, thereby avoiding the high latency and high power consumption caused by using shared memory and improving the efficiency of data exchange. In addition, the invention adopts a finer-grained synchronization mechanism: the synchronization operation is executed for the cooperative thread group instead of globally synchronizing the whole thread block. This lightweight synchronization mode reduces unnecessary thread waiting, further reduces synchronization overhead, and makes the whole data sharing process more efficient.
Based on the above embodiment, step S10 specifically includes:
Step S11, performing a modulo operation based on the thread index of each thread in the thread block and a preset thread number, so as to obtain a cooperative thread group identifier of each thread, wherein the preset thread number is the number of threads included in each thread bundle group;
Step S12, dividing the threads having the same cooperative thread group identifier into the same cooperative thread group.
Specifically, when grouping multiple threads within a thread block, a unique, linear thread index is first determined for each thread within the thread block. At the same time, the system defines a preset thread number, which may specifically be defined as the number of threads included in each thread bundle group. For example, in a specific hardware architecture, if a thread bundle includes 32 threads and a thread bundle group is composed of 4 consecutive thread bundles, the preset thread number is 32×4=128.
After the thread index and the preset number of threads are obtained, the system performs modulo operation on each thread. Specifically, for a thread with a thread index thread_rank, the collaborative thread group identifier to which it belongs can be calculated by the following formula:
cooperative_thread_group_id = thread_rank % N
Wherein cooperative_thread_group_id represents the cooperative thread group identifier, which is an integer that uniquely marks a cooperative thread group, and N represents the preset thread number.
After the modulo operation, all threads that obtain the same cooperative thread group identifier are logically classified into the same cooperative thread group. This means that they will share the same set of associated shared register resources and participate in the same group-level synchronization operations.
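Steps S11-S12 amount to one modulo per thread. The sketch below uses the 256-thread, N = 128 layout from the examples in this description; it only reproduces the arithmetic, not any scheduler behavior:

```python
N = 128  # preset thread number: threads per thread bundle group (32 x 4)

def group_id(thread_rank):
    # Step S11: cooperative thread group identifier via modulo.
    return thread_rank % N

# Step S12: threads sharing an identifier form one cooperative group.
groups = {}
for rank in range(256):  # a 256-thread block (warp0..warp7)
    groups.setdefault(group_id(rank), []).append(rank)

print(groups[0])  # [0, 128]: thread 0 (warp0) pairs with thread 128 (warp4)
```

With this layout the 256 threads fall into 128 cooperative groups of two threads each, and every group spans the two thread bundle groups, which is exactly the cross-group pairing the method relies on.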
Fig. 2 is a schematic diagram of inter-thread data sharing across thread bundle groups according to the present invention. As shown in Fig. 2, it illustrates a scenario in which the threads in a certain thread block are scheduled for execution on different execution units in a processing unit, where the processing unit may be a streaming multiprocessor or a computing unit in a GPGPU architecture, and the execution units may specifically be compute cores. Assume that the thread block includes 256 threads with thread indexes from 0 to 255, divided into 8 thread bundles, namely warp0 to warp7, each thread bundle including 32 threads. Four consecutive thread bundles form one thread bundle group: warp0 to warp3 form one thread bundle group, and warp4 to warp7 form another. At this time, the preset thread number (i.e., the size of one thread bundle group) is 128.
Data exchange and sharing can be performed between different threads in the same thread bundle through registers, whereas threads belonging to different thread bundles or different thread bundle groups need to exchange data through shared memory. In order to improve the efficiency of inter-thread data exchange across thread bundle groups, the embodiment of the invention provides a new grouping mode: a modulo operation is performed on the thread index of each thread with the preset thread number, so that threads in different thread bundle groups are divided into the same cooperative thread group, and inter-thread data exchange across thread bundle groups can then be realized rapidly through the shared register resources.
For example, the thread indexes of the threads in warp0 are 0 to 31, and the thread indexes of the threads in warp4 are 128 to 159. The cooperative thread group identifier is 0 % 128 = 0 for the thread with thread index 0 in warp0 and 128 % 128 = 0 for the thread with thread index 128 in warp4, so the thread with thread index 0 in warp0 and the thread with thread index 128 in warp4 are divided into the same cooperative thread group, and these two threads can exchange data through the shared registers associated with that cooperative thread group. Similarly, the thread with thread index 1 in warp0 and the thread with thread index 129 in warp4 are also divided into the same cooperative thread group and exchange data through the corresponding register resources. By analogy, each pair of corresponding threads in warp0 and warp4 can exchange data through the corresponding register resources, as can each pair of corresponding threads in warp1 and warp5, in warp2 and warp6, and in warp3 and warp7.
Since warp0 and warp4 belong to different thread bundles and different thread bundle groups, as do warp1 and warp5, the modulo-based grouping mode provided by the embodiment of the invention can simply and effectively divide threads from different thread bundle groups into the same cooperative thread group. The method is computationally simple and extremely low in cost, and establishes the logical cooperative relationship across thread bundle groups in a deterministic and regular manner, so that the hardware or the scheduler can easily identify the members of each cooperative thread group, providing a clear and consistent basis for subsequent data sharing and synchronization operations.
Based on any of the above embodiments, the first thread and the second thread belong to different thread bundles.
Specifically, according to the grouping scheme in the above embodiment, the members of a cooperative thread group come from different thread bundle groups, and naturally also from different thread bundles. Thus, the first thread (the writing thread) and the second thread (the reading thread) belong to different thread bundles.
Illustratively, as shown in Fig. 2, assume that the first thread is the thread with index 0 in warp0 and the second thread is the thread with index 128 in warp4; the two belong to different thread bundles. When the first thread completes a computational task and needs to pass a result (i.e., the target data) to the second thread, the first thread performs a write operation, writing the target data into the shared register resource associated with its cooperative thread group. Since the first thread and the second thread are located in different thread bundles, their execution progress may not be synchronized at all. The second thread therefore first performs a synchronization operation for the cooperative thread group when the target data is required. This synchronization ensures that the write operation of the first thread has completed and that the data is visible to the members of the group. After synchronization is completed, the second thread can safely and correctly read the target data from the same shared register resource associated with the cooperative thread group.
Through the above flow, inter-thread-bundle communication over the longer path of writing to shared memory, thread block synchronization, and reading from shared memory is replaced by short-path communication of writing to a shared register, cooperative-group synchronization, and reading from the shared register. Because the access speed of registers is far faster than that of shared memory, and the synchronization overhead at the cooperative thread group level is usually far smaller than that at the thread block level, the efficiency of data sharing among threads across thread bundle groups can be improved.
Based on any of the above embodiments, before the first thread performs the write operation, the method further comprises:
In response to a variable declaration of a preset memory type, allocating corresponding register resources for each cooperative thread group as the shared register resource among the threads in that cooperative thread group, wherein at least two threads in each cooperative thread group share data by accessing the same storage location in the shared register resource.
It should be noted that this embodiment of the present invention specifically describes the source and allocation of the shared register resource associated with each cooperative thread group; the steps described here generally occur when a thread block is scheduled onto a streaming multiprocessor or computing unit, and before any actual data read or write operation occurs.
Specifically, in actual parallel program development, a developer needs a way to inform the compiler and hardware that a certain variable is to adopt the sharing mechanism proposed by the present invention. To achieve this, embodiments of the present invention introduce a preset memory type or type qualifier, which may be a new keyword, such as "__thread_shared_register__".
Illustratively, a developer may use the above type qualifier to declare a variable when writing parallel computing kernel (Kernel) code. When the compiler compiles the code and parses the preset memory type declaration, it recognizes that the variable is neither a normal thread-private variable nor a conventional thread-block-level shared memory variable, and generates specific instructions to trigger the subsequent resource allocation procedure.
Upon receipt of such specific instructions, the processing unit of the hardware (e.g., a streaming multiprocessor or computing unit) or the runtime system performs the resource allocation operation. Specifically, the system allocates a corresponding register resource for each cooperative thread group. The above embodiment has described how all threads within a thread block are divided into M (M is an integer of 1 or more) cooperative thread groups by the modulo operation. Thus, in this step, the system reserves M physical registers from the physical register file (or M blocks of register space, if the variable size exceeds one register) and binds them one-to-one to the M cooperative thread groups. This portion of the allocated register resources constitutes the shared register resource within each cooperative thread group.
It will be appreciated that allocating an associated register resource for each cooperative thread group provides the physical basis of inter-thread data sharing. All threads within each cooperative thread group may share data by accessing the same storage location in the shared register resource associated with that cooperative thread group. In the embodiment of the invention, by providing an explicit and convenient programming interface (namely, the preset memory type), a user can quickly utilize the efficient data sharing mechanism provided by the hardware.
Based on any of the above embodiments, the variable declared with the preset memory type is used as an array, and the threads in each cooperative thread group access the shared register resource of that cooperative thread group by means of an array index.
Specifically, in the embodiment of the present invention, the variable declared with the preset memory type may take the form of an array. This design enhances the flexibility and practicality of the invention, because in many complex parallel computing algorithms, such as parallel reduction and histogram computation, a set of structured data needs to be exchanged between threads rather than a single datum. Designing the shared variable in array form satisfies this requirement.
Illustratively, a developer may declare an array tsr of size n via the preset memory type qualifier, such as "__thread_shared_register__ int tsr[n]". When the declaration is compiled and executed, the system allocates, for each cooperative thread group, a block of physically or logically contiguous register resources capable of holding n integers.
Accordingly, the threads within a cooperative thread group may access the shared register resource of that group by way of an array index. For example, a thread within a cooperative thread group may read from and write to a particular location in the group's shared registers through tsr[j] (where 0 ≤ j < n).
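The array semantics of this and the following embodiment can be modeled as one n-element block per cooperative thread group. The qualifier name and the array tsr[n] follow the document's own illustration; the Python dictionary is only a stand-in for the per-group register blocks, and n = 4 with two groups is an arbitrary example:

```python
n = 4  # size declared as, e.g., __thread_shared_register__ int tsr[n]

# One n-element block of "registers" per cooperative thread group.
tsr = {gid: [0] * n for gid in (0, 1)}

def write(gid, j, value):
    assert 0 <= j < n      # array-index access, 0 <= j < n
    tsr[gid][j] = value

def read(gid, j):
    assert 0 <= j < n
    return tsr[gid][j]

write(0, 0, 11)  # thread A (cooperative group one) writes tsr[0]
write(1, 0, 22)  # thread B (cooperative group two) writes tsr[0] of ITS group
print(read(0, 0), read(1, 0))  # same index, isolated physical locations
```

The same index j names the same location for threads inside one group, but a different, isolated location across groups, matching the visibility rules the next embodiment describes.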
The method provided by the embodiment of the invention, by declaring the variable in the form of an array and accessing it with an array index, improves data processing capacity and breadth of application: it is no longer limited to the exchange of a single datum, and can support efficient, register-level cooperative operation on structured data sets among the members of a cooperative thread group. This allows many complex parallel computing patterns (e.g., reduction) to be implemented at the very-low-latency register level, enhancing the execution performance and resource utilization efficiency of the algorithm. In addition, the user can directly access the shared register resource through the array index without concern for the complex underlying register index mapping, thereby simplifying programming.
Based on any of the above embodiments, different threads within the same collaborative thread group access the same physical memory location using the same array index;
Threads within different cooperative thread groups access different physical memory locations that are isolated from each other using the same array index.
Specifically, for different threads in the same cooperative thread group, the accessible shared register resource is the same, so using the same array index accesses the same physical storage unit, thereby reading and writing the same data. When two or more threads within the same cooperative thread group attempt to access the same index location of the shared array, the hardware's address translation mechanism directs all of these access requests to a unique, identical physical storage location. This ensures that when one thread writes data to that location, the other threads in the group can read this new data from exactly the same location, thus achieving data sharing.
For threads in different cooperative thread groups, because the shared register resources they can access are different, even if they use the same array index they cannot read or write the same data, for they access different, mutually isolated physical storage units; data races in the parallel computation are thereby avoided. For example, assume that thread A belongs to cooperative thread group one and thread B belongs to cooperative thread group two. When thread A accesses tsr[0] while thread B also accesses tsr[0], the hardware maps them to two entirely different physical storage locations, although they use exactly the same variable name and index at the code level. The access of thread A is directed to the 0th shared register allocated for cooperative thread group one, while the access of thread B is directed to the 0th shared register allocated for cooperative thread group two; the two physical registers are independent of each other.
Thus, the write operation of thread A to tsr[0] is completely invisible to thread B, and vice versa. This design ensures that data operations among the cooperative thread groups do not interfere with one another; each group has its own private shared space, so the correctness and stability of the program are ensured.
Based on any of the above embodiments, the synchronization operation is triggered by invoking a preset synchronization interface function for the collaborative thread group.
In particular, in parallel computing, it is critical to ensure the correct order of data dependencies. The synchronization operation of embodiments of the present invention is to ensure that the write operation is completed and visible to all members of the group before the shared data is read.
In particular, the synchronization operation may be triggered by invoking a preset synchronization interface function for the cooperative thread group. This means that embodiments of the present invention provide a well-defined application programming interface (Application Programming Interface, API) that allows developers to insert synchronization points in code. This API may be a built-in function or a library function, which may be named "__sync_concurrent_threads()" or similar.
When a thread (e.g., the second thread) within the cooperative thread group executes the line of code that invokes the synchronization interface function, the thread's execution flow is suspended. At the same time, the synchronization control unit of the processor begins to check the execution state of all other member threads in the cooperative thread group to which the thread belongs. Only when all threads within the cooperative thread group have reached this synchronization point (i.e., have called the synchronization interface function), and the results of all memory write operations preceding the synchronization point (in particular, write operations to the shared register resources) have become globally visible, will the synchronization control unit release all threads of the cooperative thread group, allowing them to continue executing the instructions following the synchronization point.
It will be appreciated that the synchronization interface function operates at the granularity of the cooperative thread group. A conventional thread block level synchronization function suspends execution of all threads within the entire thread block until every thread in the block reaches the synchronization point. The synchronization operation of the embodiment of the invention affects only the cooperative thread group containing the thread that invokes it; threads of other cooperative thread groups continue to execute without being affected. Because the synchronization scope is smaller and fewer threads participate, synchronization at the cooperative thread group level typically has lower latency and overhead.
By providing a preset synchronization interface function, the embodiment of the invention offers the user a simple, intuitive and efficient means of controlling the execution flow within a cooperative thread group and ensuring data consistency. Compared with traditional coarse-grained block level synchronization, this fine-grained synchronization mechanism can significantly reduce unnecessary thread waiting time and improve processor utilization and the overall throughput of parallel programs.
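The group-scoped write-then-sync-then-read discipline described above can be modeled with one barrier per cooperative thread group. This is a hedged Python sketch of the semantics only (the hardware synchronization control unit is not software); the group sizes, the worker function, and all names are illustrative assumptions, and the per-group threading.Barrier stands in for the preset synchronization interface function.

```python
# Sketch of group-scoped synchronization: each cooperative thread group
# has its own barrier, so a sync in one group never stalls another group.
import threading

NUM_GROUPS = 2
THREADS_PER_GROUP = 4
barriers = [threading.Barrier(THREADS_PER_GROUP) for _ in range(NUM_GROUPS)]
shared = [[0] * THREADS_PER_GROUP for _ in range(NUM_GROUPS)]  # per-group storage
results = []
lock = threading.Lock()


def worker(group_id: int, lane: int) -> None:
    shared[group_id][lane] = lane * 10   # write phase: each thread writes its slot
    barriers[group_id].wait()            # group-scoped synchronization point
    total = sum(shared[group_id])        # read phase: all group writes now visible
    with lock:
        results.append((group_id, total))


threads = [threading.Thread(target=worker, args=(g, lane))
           for g in range(NUM_GROUPS) for lane in range(THREADS_PER_GROUP)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread in a group observes the complete set of its group's writes.
assert all(total == 0 + 10 + 20 + 30 for _, total in results)
```

Because each group waits only on its own barrier, a slow thread in group one delays only group one's readers, mirroring the lower latency claimed for group-level synchronization over block-level synchronization.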
The inter-thread data sharing device provided by the invention is described below; the inter-thread data sharing device described below and the inter-thread data sharing method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 3 is a schematic structural diagram of an inter-thread data sharing device according to the present invention, and as shown in fig. 3, the device includes:
A thread grouping unit 310, configured to divide a plurality of threads in a thread block into at least one cooperative thread group based on a thread index of each thread in the thread block, where the thread block is divided into a plurality of thread bundle groups, and threads in each of the cooperative thread groups are from at least two different thread bundle groups;
A data writing unit 320, configured to, when receiving a write operation of a first thread in any one of the cooperative thread groups, write target data corresponding to the write operation into a shared register resource associated with that cooperative thread group;
A data reading unit 330, configured to, when receiving a read operation of a second thread in that cooperative thread group for the target data, perform a synchronization operation for that cooperative thread group, and read the target data from the shared register resource after the synchronization operation is completed.
The device provided by the embodiment of the invention constructs, through a new thread grouping mode, cooperative thread groups that cross the boundaries of traditional thread bundles and thread bundle groups: unconstrained by the traditional division into thread bundles or thread bundle groups, it dynamically organizes threads from at least two different thread bundle groups into the same cooperative thread group according to the thread index. Because a thread bundle group is composed of a plurality of thread bundles, threads in a cooperative thread group that span different thread bundle groups necessarily also span thread bundles. When threads across thread bundles or thread bundle groups need to exchange data, the invention provides a register-based high-speed channel: the data to be exchanged can be written into the shared register resource associated with the cooperative thread group instead of the traditional shared memory, thereby avoiding the high latency and high power consumption caused by using shared memory and improving the efficiency of data exchange. In addition, the invention adopts a finer-grained synchronization mechanism: the synchronization operation is executed for the cooperative thread group rather than globally synchronizing the entire thread block. This lightweight synchronization mode reduces unnecessary thread waiting, further reduces synchronization overhead, and makes the entire data sharing process more efficient.
Based on any of the above embodiments, the thread grouping unit 310 is specifically configured to:
performing a modulo operation based on the thread index of each thread in the thread block and a preset thread quantity, so as to obtain a cooperative thread group identifier for each thread, where the preset thread quantity is the number of threads included in each thread bundle group;
And dividing the threads with the same cooperative thread group identification into the same cooperative thread group.
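The grouping rule above can be sketched in a few lines. This is an illustrative Python model, not device code; the value of THREADS_PER_WARP_GROUP (128, i.e., four 32-thread warps per thread bundle group) is an assumed example consistent with the warp-group description earlier in this document, not a value mandated by the embodiments.

```python
# Sketch of the modulo-based grouping rule: thread index modulo the
# preset thread quantity yields the cooperative thread group identifier.
THREADS_PER_WARP_GROUP = 128  # assumed: 4 warps x 32 threads per warp group


def cooperative_group_id(thread_index: int) -> int:
    """Cooperative thread group identifier of a thread, computed from
    its thread index and the preset thread quantity."""
    return thread_index % THREADS_PER_WARP_GROUP


# Thread 5 lies in thread bundle group 0 and thread 133 in thread bundle
# group 1, yet both get identifier 5, so the resulting cooperative thread
# group spans at least two different thread bundle groups.
assert cooperative_group_id(5) == cooperative_group_id(133) == 5
```

Threads sharing a remainder are one warp-group stride apart, which is exactly how a cooperative thread group draws its members from different thread bundle groups.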
Based on any of the above embodiments, the first thread and the second thread belong to different thread bundles.
Based on any of the foregoing embodiments, the apparatus further includes a resource allocation unit, where the resource allocation unit is configured to:
In response to a variable declaration of the preset memory type, allocating corresponding register resources to each cooperative thread group as shared register resources among the threads in that cooperative thread group, at least two threads in each cooperative thread group sharing data by accessing the same storage location in the shared register resources.
Based on any of the above embodiments, the variable declared with the preset memory type is used as an array, and the threads in each of the cooperative thread groups access the shared register resource of the cooperative thread group by means of an array index.
Based on any of the above embodiments, different threads within the same cooperative thread group use the same array index to access the same physical memory location;
Using the same array index, threads within different cooperative thread groups access different physical memory locations that are isolated from each other.
Based on any of the above embodiments, the synchronization operation is triggered by invoking a preset synchronization interface function for the cooperative thread group.
Fig. 4 illustrates a physical schematic diagram of an electronic device. As shown in fig. 4, the electronic device may include a processor 410, a communication interface 420, a memory 430, and a communication bus 440, wherein the processor 410, the communication interface 420, and the memory 430 communicate with each other via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform an inter-thread data sharing method, which includes: dividing a plurality of threads in a thread block into at least one cooperative thread group based on a thread index of each thread in the thread block, wherein the thread block is divided into a plurality of thread bundle groups and the threads in each cooperative thread group are from at least two different thread bundle groups; when a write operation of a first thread in any one of the cooperative thread groups is received, writing target data corresponding to the write operation into a shared register resource associated with that cooperative thread group; and when a read operation of a second thread in that cooperative thread group for the target data is received, executing a synchronization operation for that cooperative thread group and reading the target data from the shared register resource after the synchronization operation is completed.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention, essentially or the part contributing to the related art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
In another aspect, the invention also provides a computer program product comprising a computer program storable on a non-transitory computer readable storage medium, the computer program when executed by a processor being capable of executing the inter-thread data sharing method provided by the methods described above, the method comprising: dividing a plurality of threads within a thread block into at least one cooperative thread group based on a thread index of each thread within the thread block, wherein the thread block is divided into a plurality of thread bundle groups and the threads in each cooperative thread group are from at least two different thread bundle groups; when a write operation of a first thread in any one of the cooperative thread groups is received, writing target data corresponding to the write operation into a shared register resource associated with that cooperative thread group; and when a read operation of a second thread in that cooperative thread group for the target data is received, performing a synchronization operation for that cooperative thread group, and reading the target data from the shared register resource after completion of the synchronization operation.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the inter-thread data sharing method provided by the methods described above, the method comprising: dividing a plurality of threads within a thread block into at least one cooperative thread group based on a thread index of each thread within the thread block, wherein the thread block is divided into a plurality of thread bundle groups and the threads in each cooperative thread group are from at least two different thread bundle groups; when a write operation of a first thread in any one of the cooperative thread groups is received, writing target data corresponding to the write operation into a shared register resource associated with that cooperative thread group; and when a read operation of a second thread in that cooperative thread group for the target data is received, performing a synchronization operation for that cooperative thread group, and reading the target data from the shared register resource after completion of the synchronization operation.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.