Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a circulating body execution method, a circulating body execution system, a circulating body execution program product, a circulating body execution medium and circulating body execution equipment, and mainly solves the problem that an effective AI operation acceleration method which is widely applied and aims at circulating is lacking in the prior art.
The invention aims at realizing the following scheme:
According to the embodiment of the invention, a loop body execution method is provided, which comprises the following steps of S1, responding to detection of a branch instruction, determining whether a jump destination address of the branch instruction is smaller than a program counter of the branch instruction, if yes, recording the program counter of the branch instruction in a loop body end position register, recording two operands in the branch instruction, recording the comparison type of jump conditions in the branch instruction in a loop comparator, S2, jumping to the destination address to fetch a new instruction, determining whether an immediate operation is carried out on one of the two operands, if yes, storing the operand which carries out the immediate operation in a loop data register, storing the other operand in a constant register, storing the immediate operation in a step length register, if not, storing the new instruction in a loop buffer, sequentially fetching the next instruction, S3, judging whether the program counter of the next instruction is smaller than the program counter of the branch instruction, if yes, repeating the step S4, and when the comparison of the program counter of the next instruction in the loop buffer is executed, determining whether the immediate operation is equal to the instruction and accumulating the accumulated condition is met or not.
According to the embodiment of the invention, the method for executing the loop body further comprises the step of detecting whether the instruction in the destination address is in a failure buffer or not in S2, and if so, skipping the rest steps, wherein the failure buffer is used for storing a program counter of the last detected backward jump but not a loop instruction, and the backward jump is that the destination address of the backward jump is smaller than the program counter of the branch instruction.
According to an embodiment of the present invention, the types of comparison include greater than, less than, equal to, greater than or equal to, and less than or equal to.
According to an embodiment of the invention, the immediate operations include an immediate addition operation and an immediate subtraction operation.
According to an embodiment of the invention, the immediate operation is performed using an adder.
According to an embodiment of the present invention, when the immediate operation is an immediate subtraction operation, an addition operation is performed using the complement of the immediate.
According to an embodiment of the invention, the immediate may be any non-zero integer.
According to an embodiment of the present invention, the loop body execution method further includes, in the step S2, when the loop buffer is full and an address of a next instruction is smaller than a value in the loop body end position register, putting the value in the loop body end position register into a failure buffer, and skipping the remaining steps.
According to an embodiment of the present invention, the loop body execution method further includes, in the step S2, when performing a plurality of immediate operations on one of the two operands, placing the value in the loop body end position register into a failure buffer, and skipping the remaining steps.
According to the embodiment of the invention, the method for executing the cyclic body further comprises the steps of putting the value in the end position register of the cyclic body into a failure buffer and skipping the rest steps when the immediate operation is carried out on the one operand and then the immediate operation is carried out on the other operand in S4.
According to a second aspect, according to another embodiment of the present invention, there is provided a loop body execution system based on branch instruction detection, including a loop controller executing a program by using a loop body execution method according to the first aspect, a loop comparator for judging a loop termination condition, a loop adder for accumulating a loop variable, a loop body end position register for storing a program counter of a last branch instruction of a loop body, a loop buffer for temporarily storing an instruction of a loop body, a failure buffer for temporarily storing a program counter of a backward jump branch instruction which has been judged to be non-loop recently, a loop data register for storing a value of a loop accumulation variable, a constant register for storing a value of a loop threshold constant, and a step size register for storing a step size of loop variable accumulation.
In a third aspect, according to a further embodiment of the present invention, there is provided a computer program product comprising computer program code which, when run on an electronic device, causes the electronic device to perform the method according to the first aspect.
In a fourth aspect, according to a further embodiment of the present invention, there is provided a computer readable storage medium having stored thereon a computer program executable by a processor to perform the steps of the method according to the first aspect.
In a fifth aspect, according to a further embodiment of the present invention, there is provided an electronic device comprising one or more processors, and a memory, wherein the memory is for storing executable instructions, the one or more processors being configured to implement the steps of the method according to the first aspect via execution of the executable instructions.
Compared with the prior art, the method has the advantages that the loop body is automatically inferred, no modification is needed to an instruction set or a compiler, no additional requirement is needed to programming, the application range is expanded, the loop body instruction is cached by using the loop buffer, the loop variable accumulation instruction and the comparison instruction are removed, the execution of the loop is accelerated, the instruction is prevented from being fetched from the memory each time, the operation efficiency is further improved, and the power consumption is reduced.
Detailed Description
Before proceeding with the following detailed description, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The terms "coupled," "connected," and derivatives thereof, refer to any direct or indirect communication or connection between two or more elements, whether or not those elements are in physical contact with one another. The terms "transmit," "receive," and "communicate," and derivatives thereof, encompass both direct and indirect communication. The terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation. The term "or" is inclusive, meaning and/or. The phrase "associated with" and its derivatives are intended to include, be included within, interconnect with, contain, be included within, connect to, or be in communication with, mate with, interleave, juxtapose, proximity to, bind to, have an attribute of, have a relationship with, or have a relationship with, etc. the phrase "associated with" and its derivatives are intended to include, be included within, be connected to, or be in communication with the phrase. The term "controller" refers to any device, system, or portion thereof that controls at least one operation. Such a controller may be implemented in hardware, or a combination of hardware and software and/or firmware. The functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. The phrase "at least one," when used with a list of items, means that different combinations of one or more of the listed items may be used, and that only one item in the list may be required. For example, "at least one of A, B, C" includes any one of the combinations A, B, C, A and B, A and C, B and C, A and B and C.
Definitions for other specific words and phrases are provided throughout this patent document. Those of ordinary skill in the art should understand that in many, if not most instances, such definitions apply to prior as well as future uses of such defined words and phrases.
In this patent document, the application combinations of modules and the division levels of sub-modules are for illustration only, and the application combinations of modules and the division levels of sub-modules may have different manners without departing from the scope of the disclosure.
For the purpose of making the technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by way of specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In order to more clearly illustrate the present invention, a review of the prior art loop instruction segment execution method is first made.
For example, one for loop, 100 times, where vector addition of b [ i ] and c [ i ] is performed:
for(i=0;i<100,i++){
A[i]=b[i]+c[i]
}
After compiling, the instruction blocks form a series of instruction codes, and the approximate assembly pseudo codes are as follows:
X10=0; # corresponds to i=0
X20=100; # number of cycles
Loop: # cycle body
LOAD X12, (b_addr) # places B into register X12
LOAD X13, (c_addr) # places C into register X13
ADD X11, X12, x13# calculate a=b+c, and the result is put into X11
STORE (A_ADDR), x11# STOREs results back into memory
Add A_ADDR,1# address accumulation
ADD B_ADDR,1# address accumulation
ADD C_ADDR,1# address accumulation
ADD X10,1#i =i+1, enter the next cycle
BLT X10, X20, loop# if less than 100, returns to computing the next set of elements at Loop
It can be seen that except for the middle ai=b i+c I, the for statement needs to accumulate the variable I and make a decision at the end of the loop. As described in the background art, in a short cycle, the performance is greatly affected by the overhead caused by accumulation of the cycle number and judgment of whether the cycle is ended. Furthermore, executing branch instructions speculatively may further lose performance. Therefore, there is a lack of an efficient AI operation acceleration method for loops that is more widely used in the prior art.
In response to detecting a branch instruction, determining whether a jump destination address of the branch instruction is smaller than a program counter of the branch instruction, if yes, recording the program counter of the branch instruction in a loop body end position register, recording two operands in the branch instruction, recording a comparison type of a jump condition in the branch instruction in a loop comparator, S2 jumping to the destination address to fetch a new instruction, determining whether to perform an immediate operation on one of the two operands, if yes, storing the operand performing the immediate operation in a loop data register, storing the other operand in a constant register, storing the immediate operation in a step length register, if not, storing the new instruction in a loop buffer, sequentially fetching the next instruction, S3 determining whether the program counter of the next instruction is smaller than the branch instruction, if yes, storing the new instruction in the loop buffer, and if yes, and determining whether the program counter of the next instruction is equal to the instruction in the loop buffer, and if the comparison condition is satisfied, and accumulating the instruction is executed from the loop buffer.
The method has the advantages that the loop body is automatically deduced without any modification to an instruction set or a compiler and additional requirements to programming, the application range is enlarged, the loop body instructions are cached by using the loop buffer, the loop variable accumulation instructions and the comparison instructions are removed, the execution of the loop is accelerated, the instruction fetching from the memory each time is avoided, the running efficiency is further improved, and the power consumption is reduced.
In addition, to reduce meaningless comparisons and further increase execution efficiency, according to an embodiment of the present invention, the loop body execution method further includes detecting whether an instruction in the destination address is in a failure buffer, and if so, skipping the remaining steps, wherein the failure buffer is used for storing a program counter of a last detected backward-skip but not loop instruction, and the backward-skip is that the destination address of the skip is smaller than the program counter of the branch instruction.
In order to enhance the adaptability of the loop body execution method, according to an embodiment of the present invention, the types of comparison include greater than, less than, equal to, greater than or equal to, and less than or equal to. In addition, other judging modes for the cycle ending condition can be supported according to the actual use scene.
According to an embodiment of the invention, the immediate operations include an immediate addition operation and an immediate subtraction operation. The immediate operation is used to jump to the next cycle. For positive versatility, whether immediate addition operations and immediate subtraction operations, the immediate operations are performed using adders according to embodiments of the present invention. According to an embodiment of the present invention, when the immediate operation is an immediate subtraction operation, an addition operation is performed using the complement of the immediate.
Furthermore, to increase the adaptability of the loop body execution method, the immediate may be any non-zero integer. That is, 1 may be added (subtracted) each time, or 2 may be added each time, and any non-zero integer may be added, so as to select and set according to the actual use scenario.
In addition, in order to ensure that program execution does not exceed hardware limitations, according to an embodiment of the present invention, the loop body execution method further includes, in the step S2, when the loop buffer is full and an address of a next instruction is smaller than a value in the loop body end position register, placing the value in the loop body end position register into a failure buffer, and skipping the remaining steps. If this happens, this means that the program is beyond what the hardware can withstand, as it may be that the loop body is too large or not loop at all.
In addition, in order to ensure the correctness of the loop, according to the embodiment of the invention, the method for executing the loop body further comprises the steps of putting the value in the loop body end position register into a failure buffer and skipping the rest when performing a plurality of immediate operations on one of the two operands in the step S2. As is known from the general definition, the conditions of the general cycle are not satisfied in this case.
In order to further ensure the correctness of the loop, according to the embodiment of the invention, the method further comprises the steps of putting the value in the loop body end position register into a failure buffer and skipping the rest steps when the immediate operation is performed on the one operand and then the immediate operation is performed on the other operand in the step S4. Likewise, this case does not meet the conditions of the general cycle.
According to another embodiment of the present invention, as shown in FIG. 2, there is provided a loop body execution system based on branch instruction detection, including a loop controller (not shown) executing a program using a loop body execution method as described above; a loop comparator (loop cmp) for determining a loop termination condition (jump? the loop adder (loop add) for accumulating the loop variable, a loop body end position register (not shown) for storing a program counter (loop pc) of the last branch instruction of the loop body, a loop buffer (loop buffer) for temporarily storing the instruction of the loop body, a fail buffer (fail cmp) for temporarily storing a program counter of the backward jump branch instruction which is judged to be non-loop recently, a loop data register (acc_value) for storing the value of the loop accumulation variable, a constant register (const_value) for storing the value of the loop threshold constant, and a step size register (step) for storing the step size of the loop variable accumulation. It can be seen that the loop variable accumulation instruction and loop determination in the original loop body are performed in the adder and comparator, while the other instructions (inst 0-4) are placed in the loop buffer. By the arrangement, the value does not need to be taken in the memory, and the instruction execution in the circulation buffer, the circulation variable accumulation instruction and the circulation judgment instruction can be performed simultaneously, so that the execution efficiency is improved.
According to yet another embodiment of the invention, a computer program product is provided, comprising computer program code which, when run on an electronic device, causes the electronic device to perform a loop body execution method as described above.
According to a further embodiment of the present invention, there is provided a computer readable storage medium having stored thereon a computer program executable by a processor to perform the steps of a method of performing a loop body as described above.
According to yet another embodiment of the present invention, there is provided an electronic device comprising one or more processors and memory, wherein the memory is for storing executable instructions, the one or more processors being configured to implement, via execution of the executable instructions, the steps of a loop body execution method as previously described.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, punch cards or intra-groove protrusion structures such as those having instructions stored thereon, and any suitable combination of the foregoing.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.