
CN119003137A - Method, computing device, medium, and program product for generating thread random emissions within a thread bundle - Google Patents


Publication number
CN119003137A
CN119003137A CN202411479522.4A CN202411479522A
Authority
CN
China
Prior art keywords
thread
program
program counter
basic block
executed
Prior art date
Legal status
Granted
Application number
CN202411479522.4A
Other languages
Chinese (zh)
Other versions
CN119003137B (en)
Inventor
Name withheld at the applicant's request
Current Assignee
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202411479522.4A priority Critical patent/CN119003137B/en
Publication of CN119003137A publication Critical patent/CN119003137A/en
Application granted granted Critical
Publication of CN119003137B publication Critical patent/CN119003137B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/524 Deadlock detection or avoidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The present invention relates to a method, computing device, medium, and program product for generating random emission of threads within a thread bundle. The method comprises the following steps: configuring a thread local register within the thread bundle for maintaining the program counter that each thread in the thread bundle expects to execute next, the thread bundle sharing one program counter; in response to determining that execution of a program basic block is complete, selecting an active thread as the representative thread; and taking the program counter that the representative thread expects to execute next as the program counter that the thread bundle expects to execute next. The invention can avoid thread deadlock through a program solution, and the transformation can be completed automatically at the compiler.

Description

Method, computing device, medium, and program product for generating thread random emissions within a thread bundle
Technical Field
Embodiments of the present invention relate generally to the field of artificial intelligence and, more particularly, to a method, computing device, computer-readable storage medium, and computer program product for generating random emission of threads within a thread bundle.
Background
Single Instruction Multiple Threads (SIMT) is a common execution model in GPU programming. For example, consider a thread bundle (warp) consisting of 32 threads that share one Program Counter (PC). Each instruction is executed by the 32 threads together, with 32 Thread Local Registers (TLR) carrying each thread's data. A Graphics Processing Unit (GPU) provides a mask register to control whether each thread needs to execute. The mask register is a 32-bit Warp Scalar Register (WSR) in which each bit corresponds to the active state of one thread: when the corresponding bit is "1", the thread is active and executes the instruction; when the bit is "0", the thread is inactive and does not execute it.
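The bit-per-thread convention of the mask register can be sketched as follows. This is an illustrative host-side Python model, not the patent's implementation; the names `WARP_SIZE` and `active_threads` are assumptions introduced here.

```python
# Illustrative model of the 32-bit warp mask register described above:
# bit i set to 1 means thread i is active and executes the instruction.
WARP_SIZE = 32

def active_threads(mask: int):
    """Return the list of thread ids whose mask bit is set."""
    return [tid for tid in range(WARP_SIZE) if (mask >> tid) & 1]

full_mask = 0xFFFFFFFF      # all 32 threads active
partial_mask = 0b11         # only threads 0 and 1 active
```

For instance, `active_threads(partial_mask)` yields `[0, 1]`, mirroring a mask in which only the two lowest bits are "1".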
It should be appreciated that the intermediate representation to be executed (Intermediate Representation, or IR, i.e., intermediate code: an intermediate data structure used by the compiler before generating object code) and its basic blocks typically contain thread-branch instructions. Conventional methods for supporting thread branching generally use a mutual exclusion lock (mutex) shared by all threads. It should be appreciated that a mutex (also referred to as a "thread lock") is a mechanism used in multi-threaded programming to prevent two threads from simultaneously reading and writing the same shared resource (e.g., a global variable). The above approach of employing a mutex shared by all threads is prone to thread deadlock when faced with loop instructions. "Thread deadlock" refers to a phenomenon in which two or more threads block during execution because they compete for resources: each blocked thread waits for some resource to be released, so the program cannot terminate properly. In a conventional program solution for avoiding thread deadlock, only one thread per thread bundle contends for the resource, and the threads in the thread bundles are then separated to execute a single-threaded portion, ensuring that only one thread per thread bundle enters the loop that acquires the resource. However, because the usage patterns of mutual exclusion locks are very numerous, conventional program solutions for avoiding thread deadlock cannot be transformed automatically at the compiler.
In summary, the conventional methods have the following disadvantage: it is difficult to avoid thread deadlock through a program solution while also completing the transformation automatically at the compiler.
Disclosure of Invention
The present invention provides a method, computing device, computer-readable storage medium, and computer program product for generating random emission of threads within a thread bundle, which not only avoid thread deadlock through a program solution but also allow the transformation to be completed automatically at the compiler.
According to a first aspect of the present invention, there is provided a method for generating random emission of threads within a thread bundle. The method comprises the following steps: configuring a thread local register within a thread bundle for maintaining the program counter that each thread in the thread bundle expects to execute next, the thread bundle sharing one program counter; in response to determining that execution of a program basic block is complete, selecting an active thread as the representative thread; and taking the program counter that the representative thread expects to execute next as the program counter that the thread bundle expects to execute next.
In some embodiments, the method for generating thread random emissions within a thread bundle further comprises: determining whether a program basic block in the function contains an instruction which needs to be synchronized in a thread bundle through the back end of the compiler; in response to determining that a program basic block within a function contains instructions for which intra-thread bundle synchronization is required, a first sequence of leading instructions is inserted at the beginning of the program basic block to be executed for completing synchronization between threads within the thread bundle by the first sequence of leading instructions.
In some embodiments, the method for generating thread random emissions within a thread bundle further comprises: in response to determining that the program basic block within the function does not contain instructions requiring intra-thread bundle synchronization, a second sequence of leading instructions is inserted at the beginning of the program basic block to be executed.
In some embodiments, the method for generating thread random emissions within a thread bundle further comprises: an end instruction sequence is generated for executing the generated end instruction sequence after execution of the original instruction of the program basic block and before execution of the jump instruction.
In some embodiments, the method for generating thread random emissions within a thread bundle further comprises: calculating a jump condition; updating a program counter that is expected to be executed subsequently for each thread using the program counter identification of the first jump target program basic block or the program counter identification of the second jump target program basic block based on the state of whether the jump condition is satisfied; using a program counter currently representing a subsequent desired execution of the thread as a program counter for the subsequent desired execution of the thread bundle; selecting the next active thread as the representative thread; and jumping to a program counter of a subsequent desired execution of the thread bundle to begin execution of the subsequent instruction.
In some embodiments, generating the second sequence of preamble instructions for the program basic block comprises: determining whether a program counter of the current thread is identical to a starting program counter of a basic block of a program to be executed; updating the mask in response to determining that the program counter of the current thread is the same as the starting program counter of the basic block of program to be executed; and executing the instructions within the basic block of the program to be executed.
In some embodiments, generating the first sequence of preamble instructions for the program basic block comprises: determining whether there are threads whose program counter to be executed next is smaller than the program counter of the program basic block that needs to be synchronized; in response to determining that there is a thread whose program counter to be executed next is less than the program counter of the program basic block that needs to be synchronized, executing the management instruction; in response to determining that there is no such thread, updating the mask according to whether each thread's program counter to be executed next is the same as the starting program counter of the program basic block that needs to be synchronized; and executing the original instructions of the program basic block that needs to be synchronized.
In some embodiments, executing the management instructions includes: screening threads whose program counter to be executed later is smaller than that of the program basic block to be synchronized so as to update the mask; selecting a representative thread based on the updated mask; using a program counter representing a subsequent desired execution of the thread as a subsequent program counter for the thread bundle; and jumping to a program counter of a subsequent desired execution of the thread bundle to begin execution of the subsequent instruction.
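The end-of-block steps described in the embodiments above (compute the jump condition, update each thread's expected-next program counter, select the next representative, take its counter as the warp's next counter) can be sketched as a single routine. This is a hedged Python simulation under stated assumptions; the names `end_sequence`, `tpc`, and `WARP_SIZE` are illustrative and do not appear in the patent.

```python
# Hedged simulation of the end-instruction sequence described above.
WARP_SIZE = 32

def end_sequence(tpc, mask, cond, taken_pc, fallthrough_pc, last_rep):
    # Step 1: each active thread updates the PC it expects to execute next,
    # based on whether its jump condition is satisfied.
    for tid in range(WARP_SIZE):
        if (mask >> tid) & 1:
            tpc[tid] = taken_pc if cond(tid) else fallthrough_pc
    # Step 2: select the next representative thread by round-robin,
    # starting just after the previous representative.
    rep = None
    for off in range(1, WARP_SIZE + 1):
        tid = (last_rep + off) % WARP_SIZE
        if (mask >> tid) & 1:
            rep = tid
            break
    # Step 3: the warp's next PC is the representative's expected next PC.
    return rep, tpc[rep]
```

With all 32 threads active, `cond = tid < 16`, and the previous representative being thread 31, the routine selects thread 0 and returns the taken-branch PC as the warp's next counter.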
In some embodiments, selecting an active thread as the representative thread includes: selecting one active thread from among the active threads in the current region as the representative thread in a round-robin scheduling mode.
In some embodiments, configuring a thread local register within a thread bundle for maintaining a program counter that each thread in the thread bundle subsequently expects to execute includes: confirming whether a mode of mutual exclusion lock exists in the intermediate representation through the front end and the middle end of the compiler; and in response to confirming that the mode of the mutual exclusion lock exists in the intermediate representation, configuring a thread local register within the thread bundle for maintaining a program counter that each thread in the thread bundle subsequently expects to execute.
According to a second aspect of the present invention, there is also provided a computing device. The computing device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the computing device to perform the method of the first aspect of the invention.
According to a third aspect of the present invention, there is also provided a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a machine, performs the method of the first aspect of the invention.
According to a fourth aspect of the present invention there is also provided a computer program product comprising a computer program which when executed by a machine performs the method of the first aspect of the present invention.
The invention has the following beneficial effects: not only can thread deadlock be avoided through the program solution, but the transformation can also be completed automatically at the compiler, since the program counter that the thread bundle executes next is the program counter that the selected active representative thread expects to execute next, which is not affected by the usage pattern of the mutual exclusion lock.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
FIG. 1 schematically illustrates a schematic diagram of a computing device implementing a method for generating random emissions of threads within a thread bundle, in accordance with an embodiment of the present invention.
FIG. 2 illustrates a flow chart of a method for generating random emissions of threads within a thread bundle, according to an embodiment of the invention.
FIG. 3 illustrates a flow chart of a method for generating an ending instruction sequence according to an embodiment of the invention.
Fig. 4 illustrates a flow chart of a method for inserting a sequence of preamble instructions at the beginning of a basic block of a program to be executed, according to some embodiments of the invention.
Fig. 5 illustrates a flowchart of a method for inserting a second sequence of preamble instructions at the beginning of a basic block of a program to be executed, according to some embodiments of the invention.
Fig. 6 illustrates a flowchart of a method for inserting a first sequence of preamble instructions at the beginning of a basic block of a program to be executed, according to some embodiments of the invention.
Fig. 7 schematically shows a schematic view of an area in a program flow diagram.
Like or corresponding reference characters indicate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are illustrated in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The term "comprising" and its variations as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first", "second", and the like may refer to different or the same objects.
Fig. 7 schematically shows a schematic diagram of one region 700 in a program flow graph. The region 700 includes, for example: a first program basic block 710 (or "BB0"), a second program basic block 720 (or "BB1"), a third program basic block 730 (or "BB2"), and a fourth program basic block 740 (or "BB3"). As shown in Fig. 7, the first 16 threads satisfy a predetermined condition (for example, "cond = tid < 16", where tid indicates a thread identifier) and therefore need to execute the second program basic block ("BB1"), while the last 16 threads do not satisfy the condition ("cond = tid < 16") and therefore need to execute the third program basic block ("BB2"). The GPU thus executes BB1 and BB2 separately: when executing BB1 the mask is, for example, 0x0000FFFF, i.e., the first 16 threads are active; when executing BB2 the mask is, for example, 0xFFFF0000, i.e., the last 16 threads are active. It should be appreciated that for a SIMT-model GPU, the 32 threads share the same program counter (or "PC"), so thread branching of the above type must be supported by a hardware or software implementation. Conventional methods for supporting thread branching mainly fall into two types: one is supported entirely by a hardware implementation, the other by a software implementation. The hardware-supported approach has the advantage that each thread's program counter can be saved and updated in real time, and at any point in time execution can switch to other threads, executing the corresponding thread's program counter and updating the mask at that time. However, this approach places high demands on the hardware, which must fully manage the program counters and masks, and recording and updating the program counters and masks also incurs an additional performance penalty.
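The two branch masks for this example can be computed directly from the condition. The following is an illustrative Python sketch, assuming a 32-bit mask as introduced earlier; the function name `branch_masks` is not from the patent.

```python
# Illustrative: computing the BB1/BB2 masks for the Fig. 7 example,
# where cond = (tid < 16) over a 32-thread warp.
WARP_SIZE = 32

def branch_masks(cond):
    """Return (taken_mask, not_taken_mask) for a branch condition."""
    taken = sum(1 << tid for tid in range(WARP_SIZE) if cond(tid))
    not_taken = taken ^ ((1 << WARP_SIZE) - 1)  # complement over 32 bits
    return taken, not_taken

bb1_mask, bb2_mask = branch_masks(lambda tid: tid < 16)
# bb1_mask is 0x0000FFFF (first 16 threads); bb2_mask is 0xFFFF0000.
```

The two masks are complementary: together they cover all 32 threads of the warp exactly once.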
The software-supported approach mainly includes two specific schemes. The first mainly includes: recording the mask of each program basic block (or "basic block") and computing, via bit operations, the mask of the basic block to be executed next. The second mainly includes: saving the position of the PC that each thread expects to execute next and, after jumping to a new program basic block, judging whether a thread is active in the current program basic block by comparing whether the current position matches its expected execution position.
Studies have shown that in the above conventional methods for supporting thread branching, the mutual exclusion lock problem must be solved. Conventional approaches for supporting thread branching typically utilize a mutual exclusion lock shared by all threads. It should be appreciated that mutual exclusion is a mechanism used in multi-threaded programming to prevent two threads from simultaneously reading and writing the same shared resource (e.g., a global variable). Its main function is to guarantee single-threaded access to shared resources within critical sections by slicing the code into critical sections one by one.
In general, the mutex lock is a global variable. Each thread may attempt to acquire the mutex lock via an atomic instruction (e.g., atomicCAS). Table 1 below schematically illustrates exemplary pseudocode for acquiring the mutex lock via AtomicCAS.
TABLE 1
It should be appreciated that the AtomicCAS(&mutex, 1, 0) instruction compares the variable pointed to by the first parameter (e.g., "mutex") with the second parameter (e.g., "1") and, if they are equal, assigns the third parameter (e.g., "0") to it. The return value is the original value of the variable pointed to by the first parameter. For example, the atomic instruction AtomicCAS first checks whether the resource is available, i.e., whether the mutex is "1" ("1" indicating the lock is free); if so, it replaces it with "0" ("0" indicating locked) and returns the original value "1"; otherwise it returns "0". In the while loop, a thread that has not acquired the resource keeps executing in the loop, while the thread that acquired the resource executes the subsequent single-threaded portion, finally releases the resource via the atomic instruction AtomicExch, and resets mutex to "1". It should be appreciated that the AtomicExch(&mutex, 1) instruction assigns the second parameter (e.g., "1") to the variable pointed to by the first parameter (e.g., "mutex").
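The pseudocode of Table 1 is not reproduced in this text; the following is a hedged single-threaded Python reconstruction of the acquire/release pattern the surrounding description attributes to it, using the convention stated above (mutex value "1" means free, "0" means locked). The helper names `atomic_cas` and `atomic_exch` model the semantics described for AtomicCAS and AtomicExch and are assumptions of this sketch.

```python
# Hedged reconstruction of the Table 1 mutex pattern described above.
def atomic_cas(mem, key, compare, new):
    """Model of AtomicCAS(&mem[key], compare, new): if mem[key] equals
    compare, store new; always return the original value."""
    old = mem[key]
    if old == compare:
        mem[key] = new
    return old

def atomic_exch(mem, key, new):
    """Model of AtomicExch(&mem[key], new): store new, return old value."""
    old = mem[key]
    mem[key] = new
    return old

mem = {"mutex": 1}  # 1 = free, 0 = locked
# Acquire: spin until AtomicCAS returns 1 (the lock was free and is now held).
while atomic_cas(mem, "mutex", 1, 0) != 1:
    pass
# ... single-threaded critical section ...
# Release: reset the mutex to 1.
atomic_exch(mem, "mutex", 1)
```

In a real GPU kernel these would be hardware atomic instructions executed concurrently by many threads; the model above only illustrates the value transitions of the mutex variable.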
It should be appreciated that only threads that acquire the resource may leave the loop. In the thread branching process, when thread branches merge, execution usually waits until both branches have completed before executing the subsequent block. For example, as shown in Fig. 7, BB3 starts executing only after both BB1 and BB2 have completed, so as to reduce the number of times BB3 is repeatedly executed and to avoid splitting the execution of special instructions in BB3. Under this scheme, the program counter of the thread bundle leaves a loop only when all threads desire to leave it. In the mutex case, only the one thread that acquired the resource desires to leave the loop, while the other threads have not acquired the resource and cannot leave. Consequently, the program counter of the thread bundle keeps executing in the loop and never reaches the instruction that subsequently releases the resource, resulting in thread deadlock.
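The deadlock condition just described reduces to a simple invariant: under the conventional scheme the shared warp PC leaves the loop only when every active thread wants to leave. The toy Python model below makes that explicit; it is an illustrative sketch, not the patent's mechanism.

```python
# Toy model of the deadlock described above: under the conventional scheme
# the warp's shared PC leaves the loop only if ALL threads want to leave.
WARP_SIZE = 32

def warp_leaves_loop(wants_to_leave):
    return all(wants_to_leave)

# With a mutex, exactly one thread acquires the lock and wants to leave;
# the other 31 threads are still spinning.
wants_to_leave = [tid == 0 for tid in range(WARP_SIZE)]
stuck = not warp_leaves_loop(wants_to_leave)
# stuck is True: the warp PC never reaches the release instruction.
```

The release instruction sits outside the loop, so the one lock-holding thread can never execute it, and the other 31 threads spin forever: a deadlock.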
It has been shown that, to avoid thread deadlock, one program solution is to modify the code so that, for example, only one thread per thread bundle contends for the resource, and the threads in the thread bundle are then separated to execute the single-threaded portion, thereby ensuring that only one thread per thread bundle enters the loop that acquires the resource. However, because the usage patterns of mutual exclusion locks are very numerous, this program solution for avoiding thread deadlock cannot be transformed automatically at the compiler. The program modification must be done by the user according to the pattern of each particular mutex, so the above program solution is not a general one.
In summary, the conventional methods have the following disadvantage: it is difficult to avoid thread deadlock through a program solution while also completing the transformation automatically at the compiler.
To at least partially address one or more of the above problems, as well as other potential problems, example embodiments of the present invention propose a scheme for generating random emission of threads within a thread bundle. In this scheme, thread local registers within a thread bundle are configured for maintaining the program counter that each thread in the thread bundle expects to execute next; when execution of a program basic block is complete, an active thread is selected as the representative thread; and the program counter that the representative thread expects to execute next is taken as the program counter that the thread bundle executes next. Thus, when the current program basic block completes and the next program basic block is to be executed, the invention can select an active thread as the representative thread to acquire resources, so random thread emission within a thread bundle (warp) can be realized in software. The program counter that the warp executes next (WPC) is tied to the program counter that the selected representative thread expects to execute next (TLR.TPC) and does not depend on all threads in the thread bundle leaving a loop, so the representative thread and the warp can acquire resources without being affected by the loop or by the usage pattern of the mutual exclusion lock. Therefore, the invention can avoid thread deadlock through a program solution; and because the program counter that the thread bundle executes next is the program counter that the selected active representative thread expects to execute next, unaffected by the mutex pattern, the transformation can be completed automatically by the compiler.
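The key idea above, that the warp's next PC simply follows the representative thread's expected PC, can be sketched in a few lines. This is a hedged Python illustration; the names `next_warp_pc`, `LOOP_PC`, and `CRITICAL_PC` are assumptions introduced here, not identifiers from the patent.

```python
# Hedged sketch: the warp's next PC (WPC) is the expected-next PC (TPC)
# of the selected representative thread, so the warp can follow the
# lock-holding thread out of the loop even while the other threads'
# TPCs still point back into the loop.
LOOP_PC, CRITICAL_PC = 0x10, 0x20

def next_warp_pc(tpc, rep):
    """WPC := representative thread's expected-next PC."""
    return tpc[rep]

tpc = [LOOP_PC] * 32
tpc[5] = CRITICAL_PC                # thread 5 acquired the lock
wpc = next_warp_pc(tpc, rep=5)      # representative is the lock holder
# wpc equals CRITICAL_PC: the warp proceeds past the loop with only
# thread 5 active; the remaining threads re-enter the loop later.
```

Because the warp PC is decoupled from "all threads want to leave", the release instruction after the loop becomes reachable, which is precisely what breaks the deadlock.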
In addition, conventional methods for avoiding thread deadlock need to rely on topological order; that is, a program basic block that depends on another can be executed only after the program basic block it depends on has completed, so threads earlier in the topological order are prioritized. In the method of the present invention, the program counter that the representative thread (selected when the previous program basic block completed execution) expects to execute next is used as the program counter that the thread bundle executes next. Further, therefore, the present invention need not depend on topological order; that is, a thread in an earlier, dependent program basic block may acquire a resource released by the program basic block it depends on.
FIG. 1 schematically illustrates a computing device 100 implementing a method for generating random emission of threads within a thread bundle, in accordance with an embodiment of the present invention. As shown in Fig. 1, computing device 100 may have one or more processing units, including special-purpose processing units such as a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), or a General-Purpose GPU (GPGPU), and general-purpose processing units such as a CPU. The computing device 100 further includes at least: a thread local register configuration unit 102, a representative thread selection unit 104, and a next program counter determination unit 106 for the thread bundle. It should be appreciated that the thread local register configuration unit 102, the representative thread selection unit 104, and the next program counter determination unit 106 described above may be software modules, for example, configured to operate on one or more processing units of the computing device 100.
With respect to the thread local register configuration unit 102, it is used to configure the thread local registers within a thread bundle for maintaining a program counter that each thread in the thread bundle subsequently desires to execute, the thread bundle sharing one program counter.
Regarding the representative thread selection unit 104, it is used to determine whether the execution of the basic block of the program is completed; and in response to determining that the program basic block execution is complete, selecting an active thread as the representative thread.
The next program counter determination unit 106 regarding the thread bundle is configured to take a program counter representing the next desired execution of the thread as a program counter for the next desired execution of the thread bundle.
A method 200 for generating thread random emissions within a thread bundle according to an embodiment of the present invention will be described below in conjunction with fig. 2. FIG. 2 illustrates a flow chart of a method 200 for generating random transmissions of threads within a thread bundle according to an embodiment of the invention. It should be appreciated that the method 200 may be performed, for example, at the computing device 100 depicted in fig. 1. Method 200 may also include additional acts not shown and/or may omit acts shown, the scope of the present invention being not limited in this respect.
At step 202, the computing device 100 configures a thread local register within a thread bundle for maintaining a program counter that each thread in the thread bundle subsequently desires to execute, the thread bundle sharing one program counter.
Regarding thread bundles (warp), which are the smallest unit of execution of a GPU program, threads within the same warp execute the same instruction, i.e., SIMT. Each warp shares a Program Counter (PC).
With respect to a Program Counter (PC), it indicates the instruction position of program execution; specifically, it indicates the address of the first instruction of a program basic block.
With respect to a Thread Local Register (TLR), it is a register unique to each thread, configured here to maintain the program counter that each thread in the thread bundle expects to execute next.
In some embodiments, a method for configuring a thread local register within a thread bundle, for example, includes: confirming whether a mode of mutual exclusion lock exists in the intermediate representation through the front end and the middle end of the compiler; and in response to confirming that the mode of the mutual exclusion lock exists in the intermediate representation, configuring a thread local register within the corresponding thread bundle for maintaining a program counter that each thread in the thread bundle subsequently expects to execute. For example, in some embodiments, if an atomic instruction is present in the intermediate representation, and the data result of the atomic instruction is a predicate condition for a loop jump, then there may be a pattern of exclusive locks (e.g., a corresponding identification is present to indicate the pattern of exclusive locks). If the mode of mutual exclusion lock is present in the intermediate representation is confirmed via the compiler front-end and middle-end, step 202 is performed. In some embodiments, the relevant identification for indicating the mode of the exclusive lock may also be noted via manual annotation.
At step 204, the computing device 100 determines whether execution of the program basic block is complete. In some embodiments, if the computing device 100 determines that the program basic block has not completed execution, it waits at step 204.
With respect to program basic blocks (Basic Block, BB), a basic block is a linear (sequentially executed) sequence of instructions. A program basic block has exactly one entry and one exit, and each operation in the basic block is executed in order without any branching. The entry of a program basic block is its first statement, which may be reached by a conditional branch statement or an unconditional branch statement; the exit of a program basic block is its last statement. It should be appreciated that the program flow graph takes program basic blocks as its nodes. For example, fig. 7 indicates a first basic block BB0, a second basic block BB1, a third basic block BB2, and a fourth basic block BB3, where BB1 or BB2 is a predecessor basic block of BB3, and BB3 is a successor basic block of BB1 or BB2.
In some embodiments, computing device 100 determines whether the current program basic block (e.g., without limitation, BB1) has completed execution. In some embodiments, the computing device 100 determines, for each program basic block in the current region, whether execution of that basic block is complete.
At step 206, if the computing device 100 determines that the program basic block execution is complete, an active thread is selected as the representative thread.
For example, if the computing device 100 determines that the current program basic block has completed execution and the next program basic block needs to be executed, it selects one active thread as the representative thread from among the active threads in the current region based on round-robin scheduling.
With respect to round-robin scheduling (Round-Robin), it generally refers to sequentially allocating equal-length time slices to each process in a fixed order. In the present invention, round-robin scheduling is used to make each thread serve in turn as the representative thread (leading thread) in a sequential cycle.
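As a hedged illustration (the function name `round_robin`, the Python data layout, and the active-thread mask are our own; only the rotation behaviour comes from the text above), selecting the next leading thread in round-robin fashion among the active threads of the current region can be sketched as:

```python
def round_robin(current_leader: int, active: list[bool]) -> int:
    """Pick the next active thread after current_leader, wrapping around.

    active[i] is True when thread i is an active thread in the
    current region; the leading thread is always chosen from these.
    """
    n = len(active)
    for step in range(1, n + 1):
        candidate = (current_leader + step) % n
        if active[candidate]:
            return candidate
    raise RuntimeError("no active thread in the current region")

# With four threads all active, leadership rotates 0 -> 1 -> 2 -> 3 -> 0,
# so every thread eventually gets a turn as the leading thread.
leader = 0
order = []
for _ in range(4):
    leader = round_robin(leader, [True] * 4)
    order.append(leader)
```

Because the rotation visits every active thread, no single thread can monopolize the resource guarded by the mutual exclusion lock, which is what allows deadlock to be avoided.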
With respect to the region, it is a combination of program basic blocks that contains exactly one program basic block serving as the entry and one program basic block serving as the exit. It should be understood that a region may correspond to a function or a portion of a function. For example, fig. 7 indicates a region 700. The region 700 includes a first program basic block 710 (i.e., BB0), a second program basic block 720 (i.e., BB1), a third program basic block 730 (i.e., BB2), and a fourth program basic block 740 (i.e., BB3), where BB0 is the only entry of region 700 and BB3 is the only exit of region 700.
For example, the jump condition of BB0 is "Cond = tid < 16", and thus the program basic blocks that the first 16 threads (whose thread identifications are, for example, "0" to "15") are expected to execute are, for example: BB0 -> BB1 -> BB3, while the program basic blocks that the latter 16 threads (whose thread identifications are, for example, "16" to "31") are expected to execute are, for example: BB0 -> BB2 -> BB3.
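The divergent paths in this example can be reproduced with a small sketch (the helper name `expected_path` is ours; the block names and the jump condition `tid < 16` come from the example above):

```python
def expected_path(tid: int) -> list[str]:
    # BB0 jumps on Cond = tid < 16: true -> BB1, false -> BB2;
    # both branches rejoin at BB3, the single exit of the region.
    branch = "BB1" if tid < 16 else "BB2"
    return ["BB0", branch, "BB3"]

# All 32 threads of the warp and the basic-block path each expects.
paths = {tid: expected_path(tid) for tid in range(32)}
```

Threads 0 to 15 therefore expect BB0 -> BB1 -> BB3, while threads 16 to 31 expect BB0 -> BB2 -> BB3, matching the two paths described above.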
With respect to the representative thread (leading thread), it is, for example, one active thread selected from a plurality of active threads in the current region.
With respect to active threads, they are, for example, threads that satisfy the execution conditions (or are "selected") and whose required resources (e.g., registers and shared memory) are in place (e.g., resources have been allocated to the threads). For example, for BB1, the threads identified as "0" through "15" are active threads; for BB2, the threads identified as "16" through "31" are active threads.
At step 208, the computing device 100 uses the program counter that the representative thread subsequently expects to execute as the program counter that the thread bundle subsequently expects to execute.
For example, the program counter that the representative thread subsequently expects to execute is LTN.TPC. The computing device 100 makes the program counter (WPC) that the thread bundle (warp) subsequently expects to execute the same as LTN.TPC.
In the above scheme, a thread local register is configured within the thread bundle for maintaining the program counter that each thread in the thread bundle subsequently expects to execute; when execution of a program basic block is complete, an active thread is selected as the representative thread; and the program counter that the representative thread subsequently expects to execute is used as the program counter that the thread bundle subsequently expects to execute. In this way, when the current program basic block has completed execution and the next program basic block needs to be executed, the present invention can select one active thread in the current region as the representative thread to obtain resources, so that random thread emission within a thread bundle (warp) can be realized in software. The program counter (WPC) that the warp subsequently expects to execute is tied to the program counter (LTN.TPC) that the selected representative thread subsequently expects to execute, and does not depend on all threads in the thread bundle leaving a loop; the representative thread and the warp can therefore acquire resources without being affected by loops or the mutual exclusion lock pattern. Consequently, the present invention can avoid thread deadlock through a software solution, and because the program counter that the thread bundle subsequently expects to execute is that of the selected active representative thread and is not affected by the mutual exclusion lock pattern, the transformation can be completed automatically by a compiler.
In some embodiments, the method 200 further comprises: the computing device 100 sorting all program basic blocks in topological order according to the program flow graph, such that, except for loop back edges, the program counter of each predecessor of a current program basic block is smaller than the program counter of the current program basic block. By adopting this means, the present invention can arrange the program basic blocks in topological order, which facilitates generating the preamble instruction sequences and tail instruction sequences for the program basic blocks within a function.
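One way to obtain such an ordering is Kahn's topological sort over the program flow graph; the sketch below is our own and assumes the loop back edges have already been identified and removed from the edge set:

```python
from collections import deque

def topo_order(blocks, edges):
    """Return blocks in topological order; `edges` excludes loop back edges."""
    indeg = {b: 0 for b in blocks}
    succs = {b: [] for b in blocks}
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1
    ready = deque(b for b in blocks if indeg[b] == 0)
    order = []
    while ready:
        b = ready.popleft()
        order.append(b)
        for s in succs[b]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

# Region 700 of Fig. 7: BB0 -> {BB1, BB2} -> BB3.
order = topo_order(["BB3", "BB1", "BB0", "BB2"],
                   [("BB0", "BB1"), ("BB0", "BB2"),
                    ("BB1", "BB3"), ("BB2", "BB3")])
# Assign program counters in topological order: every predecessor's
# PC is then smaller than its successor's PC, as required.
pc = {b: i for i, b in enumerate(order)}
```

Assigning program counters in this order guarantees that, back edges aside, every predecessor's PC is smaller than the PC of the basic block it jumps to.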
In some embodiments, method 200 further includes a method 300 for generating a tail instruction sequence. For example, the computing device 100 generates a tail instruction sequence that is executed after the original instructions of the program basic block have executed and before the jump instruction executes.
A method 300 for generating an ending instruction sequence according to an embodiment of the invention will be described below in conjunction with fig. 3. FIG. 3 illustrates a flow chart of a method 300 for generating an ending instruction sequence according to an embodiment of the invention. It should be appreciated that the method 300 may be performed, for example, at the computing device 100 depicted in fig. 1. Method 300 may also include additional acts not shown and/or may omit acts shown, the scope of the present invention being not limited in this respect.
It should be appreciated that the method 300 needs to be performed after the original instructions at the end of the program basic block have executed, and before the jump instruction executes.
At step 302, the computing device 100 calculates a jump condition.
For example, as shown in Table 2, the jump condition is "Cond = tid < 16" (i.e., the jump condition is that the thread identification is less than 16). The computing device 100 calculates whether the jump condition "Cond = tid < 16" is satisfied according to the current thread identification.
At step 304, the computing device 100 updates the program counter that is expected to be executed subsequently for each thread based on the status of whether the jump condition is satisfied, using the program counter identification of the first jump target program basic block or the program counter identification of the second jump target program basic block.
With respect to the program counter, it is a register in a processor (e.g., a GPU) that contains the address (location) of the instruction currently being executed. As each instruction is fetched, the program counter is incremented so that it points to the next instruction in sequence.
Take the second program basic block BB1 in FIG. 7 as the first jump target program basic block and the third program basic block BB2 as the second jump target program basic block. For example, as indicated by the instruction "TPC = Cond ? BB1_PC : BB2_PC", the computing device 100 updates the program counter (TPC) that each thread subsequently expects to execute with the program counter identification of BB1 (i.e., BB1_PC, the program counter identification of the first jump target program basic block) or the program counter identification of BB2 (i.e., BB2_PC, the program counter identification of the second jump target program basic block), according to whether the jump condition "Cond = tid < 16" is satisfied. That is, the TPC of each thread is updated using either BB1_PC or BB2_PC as an operand. For example, if the current thread identification (tid) is 10, which is less than 16, the TPC is updated with BB1_PC; if the current thread identification (tid) is 20, which is not less than 16, the TPC is updated with BB2_PC.
At step 306, the computing device 100 uses the program counter currently representing the subsequent desired execution of the thread as the program counter for the subsequent desired execution of the thread bundle.
That is, the WPC of the thread bundle is updated with the program counter (LTN.TPC) that the current representative thread subsequently expects to execute. For example, as indicated by the instruction "WPC = LTN.TPC", the program counter (i.e., LTN.TPC) that the current representative thread (with representative thread identification LTN) subsequently expects to execute is used as the subsequent program counter (i.e., WPC) of the thread bundle.
Regarding the current representative thread, it is the one active thread currently selected in the current region.
At step 308, the computing device 100 selects the next active thread as the representative thread. It should be appreciated that switching threads is accomplished by selecting the next active thread as the representative thread.
For example, the computing device 100 selects the next active thread as the representative thread based on round-robin scheduling (Round-Robin), as indicated by the instruction "LTN = Round-Robin(LTN)" in Table 2. It should be appreciated that the next thread selected should be an active thread in the current region.
At step 310, the computing device 100 jumps to a program counter of a subsequent desired execution of the thread bundle to begin execution of subsequent instructions.
For example, as indicated by instruction "JUMP WPC" in table 2, computing device 100 JUMPs to the position of WPC for the thread bundle to begin executing subsequent instructions.
Table 2 below schematically illustrates exemplary pseudocode for a trailing instruction sequence.
TABLE 2
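Putting steps 302 through 310 together, the tail instruction sequence can be simulated per thread as below (a sketch of our own: the PC values 100 and 200 are invented, and Python lists stand in for the TPC thread local registers):

```python
NUM_THREADS = 32
BB1_PC, BB2_PC = 100, 200  # assumed starting PCs of the two jump targets

def tail_sequence(tpc, ltn):
    """Table 2 as Python: update each thread's TPC from the jump
    condition, publish the leader's TPC as the warp's WPC, then
    rotate the leading thread (round-robin over all threads here)."""
    for tid in range(NUM_THREADS):
        cond = tid < 16                        # Cond = tid < 16
        tpc[tid] = BB1_PC if cond else BB2_PC  # TPC = Cond ? BB1_PC : BB2_PC
    wpc = tpc[ltn]                             # WPC = LTN.TPC
    ltn = (ltn + 1) % NUM_THREADS              # LTN = Round-Robin(LTN)
    return tpc, wpc, ltn                       # Jump WPC happens next

tpc = [0] * NUM_THREADS
tpc, wpc, ltn = tail_sequence(tpc, ltn=10)
```

With the leading thread LTN = 10 (which satisfies tid < 16), the warp's WPC becomes BB1_PC and leadership passes to thread 11, so the warp next executes BB1 on behalf of the first sixteen threads.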
By adopting the above means, the present invention can not only realize random thread emission within a thread bundle, but also complete the jump when the program basic block ends.
In some embodiments, the method 200 further comprises: a method for inserting a sequence of preamble instructions at the beginning of a basic block of a program to be executed. Fig. 4 illustrates a flow chart of a method 400 for inserting a sequence of preamble instructions at the beginning of a basic block of a program to be executed, according to some embodiments of the invention. It should be appreciated that the method 400 may be performed, for example, at the computing device 100 depicted in fig. 1. Method 400 may also include additional acts not shown and/or may omit acts shown, the scope of the present invention being not limited in this respect.
At step 402, the computing device 100 determines, via the compiler back-end, whether a program basic block within a function contains instructions that require intra-thread bundle synchronization.
For example, if the compiler back end determines that a program basic block within a function includes at least one of the following instructions: a shuffle instruction, a synchronization instruction, or a return instruction, then it determines that the program basic block within the function contains instructions that require intra-thread-bundle synchronization.
At step 404, if the computing device 100 determines that the program basic block within the function contains instructions for which intra-thread bundle synchronization is required, a first sequence of leading instructions is inserted at the beginning of the program basic block to be executed for completing synchronization between threads within the thread bundle by the first sequence of leading instructions.
With respect to the method for inserting the first preamble instruction sequence at the beginning of the program basic block to be executed, it includes, for example: the computing device 100 determining whether there is a thread whose program counter of subsequent expected execution is smaller than the program counter of the program basic block that needs to be synchronized; in response to determining that such a thread exists, executing a management instruction; in response to determining that no such thread exists, updating the mask according to whether the program counter of subsequent expected execution of each thread is the same as the starting program counter of the program basic block that needs to be synchronized; and executing the original instructions of the program basic block that needs to be synchronized. The method 600 for inserting the first preamble instruction sequence at the beginning of the program basic block to be executed will be described in detail below in conjunction with fig. 6, and will not be repeated here.
At step 406, if the computing device 100 determines that the program basic block within the function does not contain instructions that require intra-thread bundle synchronization, a second sequence of leading instructions is inserted at the beginning of the program basic block to be executed.
With respect to the method for inserting the second preamble instruction sequence at the beginning of the program basic block to be executed, it includes, for example: the computing device 100 determining whether the program counter of the current thread is the same as the starting program counter of the program basic block to be executed; in response to determining that the program counter of the current thread is the same as the starting program counter of the program basic block to be executed, updating the mask; and executing the instructions within the program basic block to be executed. The method 500 for inserting the second preamble instruction sequence at the beginning of the program basic block to be executed will be described in detail below in conjunction with fig. 5, and will not be repeated here.
By adopting the means, the method can generate different preamble instruction sequences based on whether the instructions of the program basic blocks in the functions need the synchronization in the thread bundles or not so as to be inserted into the beginning of the program basic blocks.
A method 500 for inserting a second sequence of preamble instructions at the beginning of a basic block of a program to be executed is described below in connection with fig. 5 and 7. Fig. 5 illustrates a flowchart of a method 500 for inserting a second sequence of preamble instructions at the beginning of a basic block of a program to be executed, according to some embodiments of the invention. It should be appreciated that the method 500 may be performed, for example, at the computing device 100 depicted in fig. 1. Method 500 may also include additional acts not shown and/or may omit acts shown, the scope of the present invention being not limited in this respect.
At step 502, the computing device 100 determines whether the program counter of the current thread is the same as the starting program counter of the basic block of programs to be executed.
Regarding the program basic block to be executed, it is, for example, BB1. As shown in fig. 7, BB1 includes, for example, two instructions: "c = a + b" and "Jump BB3".
At step 504, if the computing device 100 determines that the program counter of the current thread is the same as the starting program counter of the program basic block to be executed, the mask is updated.
For example, if the computing device 100 determines that the TPC of the current thread is the same as BB1_PC, the Mask is updated. For example, the Mask is set to "1" (e.g., "1" indicates "true"), as indicated by the instruction "Mask = (TPC == BB1_PC)" in Table 3.
At step 506, the computing device 100 executes instructions within the basic block of the program to be executed.
For example, computing device 100 executes the instructions within BB1, e.g., "c = a + b".
By employing the various means of method 500 described above, the present invention can be implemented where the second sequence of preamble instructions is inserted at the beginning of the program basic block.
It should be appreciated that after the instruction "c = a + b" in BB1 has executed, the following tail instruction sequence also needs to be executed before the jump instruction "Jump BB3" executes.
First, the computing device 100 uses the program counter identification of the jump target program basic block to update the program counter that each thread is expected to execute subsequently.
With respect to the jump target program basic block, it is, for example, the fourth program basic block (BB3) shown in fig. 7.
For example, as indicated by the instruction "TPC = BB3_PC" in Table 3, the computing device 100 updates the program counter (TPC) that each thread subsequently expects to execute using the program counter identification (BB3_PC) of the jump target program basic block (which is, for example, BB3).
The computing device 100 then uses the program counter currently representing the subsequent desired execution of the thread as the subsequent program counter for the thread bundle.
For example, as indicated by the instruction "wpc=ltn.tpc" in table 3, the computing device 100 uses the program counter (ltn.tpc) currently representing the subsequent desired execution of the thread as the subsequent program counter (WPC) of the thread bundle (warp).
Furthermore, the computing device 100 selects the next active thread as the representative thread.
For example, as indicated by the instruction "LTN = Round-Robin(LTN)" in Table 3, the computing device 100 selects the next active thread as the representative thread based on round-robin scheduling.
Thereafter, the computing device 100 jumps to a program counter of a subsequent desired execution of the thread bundle to begin execution of the subsequent instructions. For example, as indicated by instruction "JUMP WPC".
Table 3 below schematically shows exemplary pseudocode for the case where the second preamble instruction sequence is inserted at the beginning of program basic block BB1 and the generated tail instruction sequence is executed after the original instructions of program basic block BB1 and before the jump instruction.
Table 3
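Under the same kind of assumptions as before (invented PC values; Python lists standing in for the Mask, TPC, and operand registers), the second preamble sequence plus the tail sequence of BB1 can be simulated as:

```python
BB1_PC, BB3_PC = 100, 300  # assumed starting PCs

def run_bb1(tpc, ltn, a, b):
    """Preamble: Mask = (TPC == BB1_PC) selects the threads that
    actually belong in BB1; body: c = a + b for masked threads only;
    tail: TPC = BB3_PC, WPC = LTN.TPC, LTN = Round-Robin(LTN)."""
    n = len(tpc)
    mask = [tpc[i] == BB1_PC for i in range(n)]
    c = [a[i] + b[i] if mask[i] else None for i in range(n)]
    for i in range(n):
        if mask[i]:
            tpc[i] = BB3_PC   # only masked threads advance to BB3
    wpc = tpc[ltn]            # WPC = LTN.TPC
    ltn = (ltn + 1) % n       # LTN = Round-Robin(LTN)
    return c, wpc, ltn

# Threads 0..15 sit at BB1_PC, threads 16..31 at an assumed BB2_PC of 200.
tpc = [100 if tid < 16 else 200 for tid in range(32)]
c, wpc, ltn = run_bb1(tpc, ltn=0, a=list(range(32)), b=[1] * 32)
```

Only the threads whose TPC equals BB1_PC execute "c = a + b" and advance to BB3_PC; the other threads' TPCs are untouched, so they will be scheduled into BB2 on a later turn of the representative thread.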
It should be appreciated that some instructions in a GPU program depend on the thread bundle as a whole, such as intra-thread-bundle data exchange instructions (shuffle), which allow threads to access each other's registers, and synchronization instructions (sync); these are instructions that require intra-warp synchronization. If such instructions are executed in multiple passes, semantics may change and thread deadlock may even result. For program basic blocks containing these instructions, a synchronization-related preamble instruction sequence (e.g., the first preamble instruction sequence) needs to be inserted at the beginning of the program basic block to be executed, so that synchronization between threads within the thread bundle is completed by the first preamble instruction sequence.
A method 600 for inserting a first sequence of preamble instructions at the beginning of a basic block of a program to be executed is described below in connection with fig. 6 and 7. Fig. 6 illustrates a flow chart of a method 600 for inserting a first sequence of preamble instructions at the beginning of a basic block of a program to be executed, according to some embodiments of the invention. It should be appreciated that the method 600 may be performed, for example, at the computing device 100 depicted in fig. 1. Method 600 may also include additional acts not shown and/or may omit acts shown, the scope of the present invention being not limited in this respect.
At step 602, the computing device 100 determines whether there are threads for which the program counter to be subsequently executed is smaller than the program counter of the program basic block that needs to be synchronized.
With respect to the program basic block that needs synchronization, it is, for example but not limited to, the fourth program basic block (BB3) in fig. 7. In some embodiments, BB3 includes, for example, the instruction "Return c". In some embodiments, when the "Return c" instruction is invoked, the threads that were active at the beginning of the function's execution must all be active threads at the end of the function's execution; that is, the threads involved in executing the function "return" together.
For example, the computing device 100 determines whether there is a thread whose TPC is less than the program counter (e.g., BB3_PC) of the program basic block (e.g., BB3) that needs synchronization, as indicated by the instruction "Cond = TPC < BB3_PC" in Table 4.
At step 604, if the computing device 100 determines that there is a program counter for the subsequent desired execution of the thread that is less than the program counter for the program basic block that needs to be synchronized, the management instruction is executed.
For example, as indicated by the instruction "Branch Cond, Handle, BB3_PC_new" in Table 4, if the jump condition Cond is satisfied, the management instruction (e.g., the Handle instruction) is executed; if the jump condition Cond is not satisfied, the instruction (e.g., BB3_PC_new) that updates the program counter of the program basic block needing synchronization is executed.
Executing the management instruction includes, for example, steps 606 through 612.
At step 606, computing device 100 screens the threads whose program counter of subsequent expected execution is smaller than the program counter of the program basic block that needs synchronization, in order to update the mask.
For example, as indicated by the instruction "Mask = TPC < BB3_PC" in Table 4, the computing device 100 screens the threads with TPC < BB3_PC and updates the Mask based on those threads.
At step 608, the computing device 100 selects a representative thread based on the updated mask.
For example, the computing device 100 selects a representative thread among the threads that have not yet executed to BB3, based on round-robin scheduling, as indicated by the instruction "LTN = Round-Robin(LTN, Mask)" in Table 4.
At step 610, computing device 100 uses a program counter representing a subsequent desired execution of the thread as a subsequent program counter for the thread bundle.
For example, computing device 100 uses LTN.TPC as the WPC of the warp, as indicated by the instruction "WPC = LTN.TPC" in Table 4.
At step 612, the computing device 100 jumps to a program counter of a subsequent desired execution of the thread bundle to begin execution of the subsequent instructions. As indicated by instruction "Jump WPC" in table 4.
At step 614, if the computing device 100 determines that there is no program counter for the subsequent desired execution of the thread that is less than the program counter for the program basic block that needs to be synchronized, the mask is updated based on whether the program counter for the subsequent desired execution of the thread is the same state as the starting program counter for the program basic block that needs to be synchronized.
For example, if the computing device 100 determines that there is no thread with TPC less than BB3_PC (i.e., all threads have reached BB3), the Mask is updated according to whether TPC is the same as BB3_PC, as indicated by the instruction "Mask = (TPC == BB3_PC)" in Table 4.
At step 616, the computing device 100 executes the native instructions of the program basic block that require synchronization.
For example, as indicated by the instruction "Return c" in Table 4, computing device 100 exits BB3; the TPC no longer needs to be updated, and the function returns directly.
Table 4 below schematically illustrates exemplary pseudocode in which the first preamble instruction sequence is inserted at the beginning of the program basic block to be executed, so that synchronization between threads within the thread bundle is completed by the first preamble instruction sequence.
Table 4
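To illustrate the synchronization behaviour, the sketch below (our own; the patent's Branch and Handle instructions are modelled as plain Python control flow, and 300 is an invented value for BB3_PC) keeps rescheduling lagging threads until the whole warp reaches BB3:

```python
BB3_PC = 300  # assumed starting PC of the block that needs synchronization

def bb3_preamble(tpc, ltn):
    """First preamble sequence for BB3 (steps 602-616, Table 4).
    Returns ('handle', new_ltn) while any thread still lags behind
    BB3, or ('return', mask) once every thread has reached it."""
    lagging = [pc < BB3_PC for pc in tpc]
    if any(lagging):                       # Cond = TPC < BB3_PC
        n = len(tpc)
        # Round-robin over the *lagging* threads only (the Mask).
        for step in range(1, n + 1):
            cand = (ltn + step) % n
            if lagging[cand]:
                return "handle", cand      # Jump WPC, with WPC = LTN.TPC
        raise AssertionError("unreachable: some thread lags")
    mask = [pc == BB3_PC for pc in tpc]    # Mask = (TPC == BB3_PC)
    return "return", mask                  # Return c executes under mask

# Thread 3 still sits in an earlier block; it must be scheduled first.
tpc = [BB3_PC] * 8
tpc[3] = 100
action, chosen = bb3_preamble(tpc, ltn=0)
```

The warp thus keeps handing leadership to threads that have not reached BB3, and only once every thread's TPC equals BB3_PC does the synchronized "Return c" execute, so all threads of the function return together.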
In the above scheme, for program basic blocks that require thread synchronization within the thread bundle, the present invention achieves synchronization between threads within the thread bundle by inserting the first preamble instruction sequence at the beginning of the program basic block to be executed.
The various processes and treatments described above, such as methods 200 through 600, may be performed at a computing device. The computing device includes, for example: at least one processor (at least one graphics processor and at least one central processor); and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the methods described above to be performed. In some embodiments, the methods 200 through 600 may be implemented as a computer software program or program product tangibly embodied on a machine-readable medium. In some embodiments, part or all of the computer program may be loaded and/or installed onto the computing device via read-only memory (ROM) and/or a communication unit. One or more of the acts of the methods 200 through 600 described above may be performed when the computer program is loaded into random-access memory (RAM) and executed by the GPU and the CPU.
The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention. The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a central processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the central processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors.

Claims (13)

1. A method for generating random emission of threads within a thread bundle, comprising:
configuring a thread-local register within a thread bundle for maintaining a program counter that each thread in the thread bundle expects to execute next, the thread bundle sharing a single program counter;
in response to determining that execution of a program basic block is complete, selecting an active thread as a representative thread; and
using the program counter that the representative thread expects to execute next as the program counter that the thread bundle is to execute next.
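As a rough illustration of the mechanism in claim 1, the following Python sketch simulates a thread bundle whose threads each keep an "expected next PC" in a thread-local register while the bundle shares a single program counter; when a basic block completes, one active thread becomes the representative and the bundle adopts its next PC. All class, variable, and method names here are invented for illustration and are not taken from the patent.

```python
class Warp:
    """Toy model of a thread bundle with one shared PC and per-thread next-PC registers."""

    def __init__(self, num_threads, entry_pc=0):
        self.pc = entry_pc                       # single shared program counter
        self.next_pc = [entry_pc] * num_threads  # per-thread "expected next PC" registers
        self.active = [True] * num_threads       # active mask

    def select_representative(self):
        # Stand-in selection policy: first active thread (the patent leaves the
        # policy open; claim 9 describes a polling/round-robin variant).
        for tid, act in enumerate(self.active):
            if act:
                return tid
        return None

    def on_block_complete(self):
        # After a basic block finishes, the bundle's shared PC follows the
        # representative thread's expected next PC.
        rep = self.select_representative()
        if rep is not None:
            self.pc = self.next_pc[rep]
        return rep


warp = Warp(num_threads=4)
warp.next_pc = [0x40, 0x20, 0x40, 0x20]
warp.active = [False, True, True, True]
rep = warp.on_block_complete()
print(rep, hex(warp.pc))  # thread 1 is the representative, warp PC becomes 0x20
```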
2. The method of claim 1, further comprising:
determining, by a back end of a compiler, whether a program basic block in a function contains an instruction requiring synchronization within the thread bundle; and
in response to determining that the program basic block within the function contains an instruction requiring intra-thread-bundle synchronization, inserting a first preamble instruction sequence at the beginning of the program basic block to be executed, so that synchronization between threads within the thread bundle is completed by the first preamble instruction sequence.
3. The method of claim 2, further comprising:
in response to determining that the program basic block within the function does not contain an instruction requiring intra-thread-bundle synchronization, inserting a second preamble instruction sequence at the beginning of the program basic block to be executed.
4. The method of claim 1, further comprising:
generating a trailing instruction sequence, the generated trailing instruction sequence being executed after the original instructions of the program basic block and before a jump instruction.
5. The method of claim 4, wherein generating the trailing instruction sequence comprises:
calculating a jump condition;
updating, for each thread, the program counter that the thread expects to execute next with the program counter identifier of a first jump-target program basic block or of a second jump-target program basic block, according to whether the jump condition is satisfied;
using the program counter that the current representative thread expects to execute next as the program counter that the thread bundle is to execute next;
selecting a next active thread as the representative thread; and
jumping to the program counter that the thread bundle is to execute next, to begin executing subsequent instructions.
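The end-of-block steps in claim 5 can be sketched as follows: each thread evaluates the jump condition and records which target block's starting PC it expects next, then the bundle jumps to the PC chosen by a representative thread. This is a hedged simulation under assumed names (none of the identifiers come from the patent), not the patented instruction sequence itself.

```python
def end_of_block(next_pc, active, conditions, taken_pc, fallthrough_pc):
    """Simulate a trailing instruction sequence for a two-way branch."""
    # 1) Per-thread: update the expected-next PC from the jump condition.
    for tid, cond in enumerate(conditions):
        if active[tid]:
            next_pc[tid] = taken_pc if cond else fallthrough_pc
    # 2) Pick the next representative among active threads (first active here).
    rep = next(tid for tid, act in enumerate(active) if act)
    # 3) The bundle's shared PC jumps to the representative's expected PC.
    return next_pc[rep]


next_pc = [0, 0, 0, 0]
warp_pc = end_of_block(next_pc, [True] * 4, [True, False, True, False],
                       taken_pc=0x80, fallthrough_pc=0x90)
print(hex(warp_pc), [hex(p) for p in next_pc])  # bundle follows thread 0 to 0x80
```

Note how divergence is preserved rather than resolved here: threads 1 and 3 still record the fall-through PC, so they will be picked up when a later representative carries the bundle to 0x90.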
6. The method of claim 3, wherein generating the second preamble instruction sequence for a program basic block comprises:
determining whether the program counter of a current thread is the same as the starting program counter of the program basic block to be executed;
in response to determining that the program counter of the current thread is the same as the starting program counter of the program basic block to be executed, updating a mask; and
executing instructions within the program basic block to be executed.
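A minimal sketch of the second preamble sequence of claim 6: on entry to a block, each thread compares its expected-next PC against the block's starting PC, and only matching threads join the execution mask. Function and variable names are illustrative assumptions, not the patent's own.

```python
def second_preamble(next_pc, block_start_pc):
    """Build the execution mask for a block: a thread participates only if
    the PC it expects to execute next equals the block's starting PC."""
    return [pc == block_start_pc for pc in next_pc]


mask = second_preamble([0x20, 0x40, 0x20, 0x40], block_start_pc=0x20)
print(mask)  # threads 0 and 2 execute this block
```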
7. The method of claim 2, wherein generating the first preamble instruction sequence for a program basic block comprises:
determining whether there is a thread whose program counter to be executed next is smaller than the program counter of the program basic block requiring synchronization;
in response to determining that there is a thread whose program counter to be executed next is smaller than the program counter of the program basic block requiring synchronization, executing a management instruction;
in response to determining that there is no thread whose program counter to be executed next is smaller than the program counter of the program basic block requiring synchronization, updating a mask according to whether the program counter that each thread is to execute next is the same as the starting program counter of the program basic block requiring synchronization; and
executing the original instructions of the program basic block requiring synchronization.
8. The method of claim 7, wherein executing the management instruction comprises:
screening the threads whose program counter to be executed next is smaller than the program counter of the program basic block requiring synchronization, so as to update the mask;
selecting a representative thread based on the updated mask;
using the program counter that the representative thread expects to execute next as the subsequent program counter of the thread bundle; and
jumping to the program counter that the thread bundle is to execute next, to begin executing subsequent instructions.
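The management-instruction steps of claim 8 can be simulated as below: threads whose expected PC still lies before the synchronization block are screened into the mask, and the bundle jumps to a representative of those lagging threads so they can catch up before the synchronized block runs. This is a sketch under assumed names, not the patented instruction encoding.

```python
def manage(next_pc, sync_pc):
    """Screen lagging threads, pick a representative, and return the new bundle PC."""
    mask = [pc < sync_pc for pc in next_pc]  # screen threads still before the sync block
    rep = next((tid for tid, lag in enumerate(mask) if lag), None)
    # Bundle jumps to the representative laggard; if none lag, fall through
    # to the synchronization block itself.
    warp_pc = next_pc[rep] if rep is not None else sync_pc
    return mask, rep, warp_pc


mask, rep, warp_pc = manage([0x10, 0x30, 0x30, 0x18], sync_pc=0x30)
print(mask, rep, hex(warp_pc))  # threads 0 and 3 lag; bundle jumps to 0x10
```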
9. The method of claim 1, wherein selecting an active thread as the representative thread comprises:
selecting, in a round-robin (polling) manner, one active thread among the active threads in a current region as the representative thread.
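The round-robin selection of claim 9 can be sketched as a circular scan starting just past the previous representative, so that every active thread eventually gets a turn. The function below is an illustrative assumption about one way to realize the policy.

```python
def round_robin(active, prev_rep):
    """Return the next active thread after prev_rep, scanning circularly."""
    n = len(active)
    for step in range(1, n + 1):
        tid = (prev_rep + step) % n
        if active[tid]:
            return tid
    return None  # no active threads remain


active = [True, False, True, True]
print(round_robin(active, prev_rep=0))  # -> 2 (skips inactive thread 1)
print(round_robin(active, prev_rep=2))  # -> 3
print(round_robin(active, prev_rep=3))  # -> 0 (wraps around)
```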
10. The method of claim 1, wherein configuring the thread-local register within the thread bundle for maintaining the program counter that each thread in the thread bundle expects to execute next comprises:
determining, by a front end and a middle end of the compiler, whether a mutual-exclusion-lock pattern exists in an intermediate representation; and
in response to determining that the mutual-exclusion-lock pattern exists in the intermediate representation, configuring the thread-local register within the thread bundle for maintaining the program counter that each thread in the thread bundle expects to execute next.
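A very rough sketch of the detection pass in claim 10: the compiler front/middle end scans the intermediate representation for a mutex-like pattern; here, naively, an atomic compare-and-swap inside a loop (a spin-lock acquire shape). The IR encoding and opcode names below are invented for illustration; the patent does not specify them.

```python
def has_mutex_pattern(ir_blocks):
    """Naive IR scan: flag a block containing an atomic CAS plus a backward branch."""
    for block in ir_blocks:
        ops = [inst["op"] for inst in block]
        if "atomic_cas" in ops and "branch_back" in ops:
            return True
    return False


ir = [
    [{"op": "load"}],
    [{"op": "atomic_cas"}, {"op": "branch_back"}],  # spin-lock acquire loop
    [{"op": "store"}],
]
print(has_mutex_pattern(ir))  # -> True
```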
11. A computing device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
12. A computer-readable storage medium having stored thereon a computer program which, when executed by a machine, performs the method according to any one of claims 1-10.
13. A computer program product comprising a computer program which, when executed by a machine, performs the method according to any of claims 1-10.
CN202411479522.4A 2024-10-22 2024-10-22 Method, computing device, medium, and program product for generating thread random emissions within a thread bundle Active CN119003137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411479522.4A CN119003137B (en) 2024-10-22 2024-10-22 Method, computing device, medium, and program product for generating thread random emissions within a thread bundle


Publications (2)

Publication Number Publication Date
CN119003137A true CN119003137A (en) 2024-11-22
CN119003137B CN119003137B (en) 2025-03-07

Family

ID=93482152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411479522.4A Active CN119003137B (en) 2024-10-22 2024-10-22 Method, computing device, medium, and program product for generating thread random emissions within a thread bundle

Country Status (1)

Country Link
CN (1) CN119003137B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119902968A (en) * 2025-04-01 2025-04-29 Shanghai Bi Ren Technology Co ltd Multi-threaded bundle debugging method, device and artificial intelligence chip
CN120973503A (en) * 2025-10-20 2025-11-18 Shanghai Bi Ren Technology Co ltd A method, device, and storage medium for instruction processing in thread bundle branching execution
CN121300859A (en) * 2025-12-01 2026-01-09 Shanghai Bi Ren Technology Co ltd Methods and apparatus for scheduling threads, compilation methods and apparatus, electronic devices, storage media

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104375804A (en) * 2013-08-13 2015-02-25 三星电子株式会社 Multiple threads execution processor and operating method thereof
US20170185406A1 (en) * 2015-12-29 2017-06-29 Mediatek Inc. Methods and systems for managing an instruction sequence with a divergent control flow in a simt architecture
CN108830777A (en) * 2017-04-27 2018-11-16 辉达公司 For synchronizing the technology of execution thread comprehensively
US20190073241A1 (en) * 2017-09-06 2019-03-07 Arm Limited Apparatus and method of executing thread groups
CN116610366A (en) * 2023-05-19 2023-08-18 上海交通大学 GPGPU (graphics processing Unit) branch processing architecture and method based on priority
CN118502903A (en) * 2024-05-23 2024-08-16 山东浪潮科学研究院有限公司 Thread bundle scheduling method based on general graphics processor and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Jun; HE Yanxiang; SHEN Fanfan; JIANG Nan; LI Qing'an: "A GPGPU thread block compaction scheduling method based on two-stage synchronization", Journal of Computer Research and Development, vol. 53, no. 06, 15 June 2016 (2016-06-15), pages 1173-1185 *


Also Published As

Publication number Publication date
CN119003137B (en) 2025-03-07

Similar Documents

Publication Publication Date Title
CN119003137B (en) Method, computing device, medium, and program product for generating thread random emissions within a thread bundle
US7080376B2 (en) High performance synchronization of accesses by threads to shared resources
US8776077B2 (en) Method for multithreading an application using partitioning to allocate work to threads
US7243345B2 (en) Multi-thread executing method and parallel processing system
US5408658A (en) Self-scheduling parallel computer system and method
EP3113017A1 (en) Controlling tasks performed by a computing system
EP3066560A1 (en) A data processing apparatus and method for scheduling sets of threads on parallel processing lanes
US20110283093A1 (en) Minimizing program execution time for parallel processing
US5710912A (en) Method and apparatus for enabling a computer system to adjust for latency assumptions
US20060130012A1 (en) Program conversion device, program conversion and execution device, program conversion method, and program conversion and execution method
US20180095766A1 (en) Flushing in a parallelized processor
CN120973503B (en) Instruction processing method, device and storage medium for thread bundle split execution
US20050166202A1 (en) Method and system for determining total code execution time in a data processor
JP5504879B2 (en) Multithread processing method and multithread processing apparatus
US8225326B2 (en) Future scheduling by direct representation of possible dependencies
US20090043991A1 (en) Scheduling Multithreaded Programming Instructions Based on Dependency Graph
KR20100120133A (en) Method for enabling multi-processor synchronization
US9436503B2 (en) Concurrency control mechanisms for highly multi-threaded systems
CN109783243A (en) It is a kind of based on reserved copy and duplicate switching multi-thread data without lock reading/writing method
Baudisch et al. Out-of-order execution of synchronous data-flow networks
Baudisch et al. Evaluation of speculation in out-of-order execution of synchronous dataflow networks
Ivutin et al. Optimization Problem for Heterogeneous Computing Systems
Sattar et al. Towards concurrent data structure development with relaxed synchronization
Islam et al. High Performance Approximate Computing by Adaptive Relaxed Synchronization
EP4679261A1 (en) A concurrent programming compiling method, associated device and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant