CN117311814A - Instruction fetch unit, instruction reading method and chip - Google Patents
- Publication number: CN117311814A
- Application number: CN202311289788.8A
- Authority
- CN
- China
- Prior art keywords
- unit
- address
- instruction
- data selection
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F9/3806 — Instruction prefetching for branches, e.g. hedging, branch folding, using address prediction, e.g. return stack, branch history buffer
- G06F9/30069 — Instruction skipping instructions, e.g. SKIP
- G06F9/30098 — Register arrangements

(All within G06F9/00 — Arrangements for program control, e.g. control units — under G06F, Electric digital data processing.)
Abstract
The application discloses an instruction fetch unit, an instruction reading method and a chip, and relates to the technical field of chips. The instruction fetch unit comprises: a BTB unit, a RAS unit, an address accumulation unit, a data selection unit, a PC register, and an instruction reading unit; the BTB unit, the RAS unit and the address accumulation unit are respectively connected with the data selection unit; the BTB unit is connected with the RAS unit; the data selection unit is connected with the PC register; the PC register is also respectively connected with the BTB unit, the RAS unit and the address accumulation unit; the PC register is connected with the instruction reading unit. In this scheme, the BTB unit and the RAS unit can predict the target address and the jump address of a branch instruction, and the data selection unit selects the correct address for output, so that no bubbles are introduced in the process, the instruction fetch efficiency is improved, and the performance of the processor is improved.
Description
Technical Field
The present disclosure relates to the field of chip technologies, and in particular, to a fetch unit, an instruction reading method, and a chip.
Background
In the processor, the instruction fetch unit at the front end is used for fetching instructions and submitting the instructions to the EXU for execution.
In the related art, since a program may include jump instructions, the instruction fetch unit generally includes a branch prediction unit to determine the jump direction and jump address during the instruction fetch stage. The instruction fetch unit of the processor may include a multi-stage pipeline, and branch prediction is performed simultaneously during the fetch process.
However, in the above scheme, the units for branch prediction are generally divided into multiple stages, and if an instruction that needs to jump is predicted, the current instruction fetch stages need to be flushed, which causes performance loss.
Disclosure of Invention
The embodiments of the present application provide an instruction fetch unit, an instruction reading method and a chip, which can avoid bubbles in the branch prediction process, improve the instruction fetch efficiency, and further improve the performance of the processor. The technical scheme is as follows.
In one aspect, there is provided an instruction fetch unit including: a branch target cache BTB unit, a return address stack RAS unit, an address accumulation unit, a data selection unit, a program counter PC register, and an instruction reading unit;
the BTB unit, the RAS unit and the address accumulation unit are respectively connected with the data selection unit; the BTB unit is connected with the RAS unit;
The data selection unit is connected with the PC register; the PC register is also respectively connected with the BTB unit, the RAS unit and the address accumulation unit;
the PC register is connected with the instruction reading unit;
the BTB unit is used for predicting a target address in the case that the next instruction of the current instruction is a branch instruction according to an input address, and outputting the target address to the data selection unit; the input address is an instruction address of the current instruction;
the RAS unit is used for predicting a return address of a current function after a jump of a next instruction is performed based on the input address and information of the instruction of which the BTB unit predicts a hit, and outputting the return address to the data selection unit;
the address accumulation unit is used for adding 1 to the address of the current instruction to obtain an accumulated address of the next instruction, and outputting the accumulated address to the data selection unit;
the data selecting unit is used for selecting one address from a plurality of input addresses as an instruction address of a next instruction and outputting the instruction address of the next instruction to the PC register;
The PC register is used for caching the instruction address input by the data selection unit, outputting the cached instruction address to the instruction reading unit, and outputting the cached instruction address to the BTB unit and the RAS unit as a new input address when next instruction prediction is performed;
the instruction reading unit is used for reading the instruction based on the instruction address output by the PC register and sending the read instruction to the execution unit connected with the instruction fetch unit.
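The next-address selection described above can be expressed as a behavioral model. The following Python sketch is illustrative only, not the patented circuit: the function and variable names are invented, and the RAS-hit condition is simplified to "BTB hit and stack non-empty" (a real RAS is consulted only for return-type instructions).

```python
def next_pc(pc, btb, ras):
    """Behavioral model of one next-address selection cycle.

    btb: dict mapping branch-instruction address -> predicted target
    ras: list used as a last-in-first-out return-address stack
    """
    btb_hit = pc in btb
    btb_target = btb.get(pc)
    # Simplification: treat any BTB hit with a non-empty stack as
    # an RAS hit; here we only peek at the top of the stack.
    ras_hit = btb_hit and bool(ras)
    ras_target = ras[-1] if ras_hit else None

    # First selector: prefer the RAS prediction over the BTB prediction.
    predicted = ras_target if ras_hit else btb_target
    # Second selector: on a BTB hit take the prediction, otherwise
    # fall through to the sequential (accumulated, PC + 1) address.
    return predicted if btb_hit else pc + 1
```

Because the selection is pure combinational logic on the three candidate addresses, no pipeline bubble is needed: a new address is produced every cycle.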
In another aspect, there is provided an instruction fetch method, the method being performed by an instruction fetch unit as described above, the method comprising:
predicting, by the BTB unit, a target address in a case where a next instruction of a current instruction is a branch instruction according to an input address, and outputting the target address to the data selection unit; the input address is an instruction address of the current instruction;
predicting, by the RAS unit, a return address of a current function of a next instruction after a jump occurs based on the input address and information of the instruction of which the BTB unit predicts a hit, and outputting the return address to the data selecting unit;
Adding 1 to the address of the current instruction through the address accumulation unit to obtain an accumulated address of the next instruction, and outputting the accumulated address to the data selection unit;
selecting, by the data selecting unit, an address from the plurality of addresses inputted as an instruction address of a next instruction, and outputting the instruction address of the next instruction to the PC register;
caching the instruction address input by the data selection unit through the PC register, outputting the cached instruction address to the instruction reading unit, and outputting the cached instruction address to the BTB unit and the RAS unit as a new input address when next instruction prediction is performed;
and reading an instruction based on the instruction address output by the PC register through the instruction reading unit, and sending the read instruction to an execution unit connected with the instruction fetch unit.
In another aspect, a chip is provided, the chip comprising an instruction fetch unit as described above.
The beneficial effects of the technical scheme provided by the embodiments of the present application include at least the following:
the instruction fetching unit comprises a BTB unit, an RAS unit, an address accumulation unit, a data selection unit, a PC register and an instruction reading unit; the RAS unit predicts the return address of the current function after the next instruction jumps under the condition that the BTB unit predicts hit, and the address accumulation unit adds 1 to the address of the current instruction to obtain the accumulated address of the next instruction; the predicted or accumulated address is selected by a data selection unit to obtain the instruction address of the next instruction and is cached in a PC register, so that the instruction reading unit is used for reading the instruction based on the instruction address; in the scheme, the BTB unit and the RAS unit can predict the target address and the jump address of the branch instruction, and the data selection unit selects the correct address for output, so that the situation of introducing bubbles is avoided in the process, the instruction fetching efficiency is improved, and the performance of the processor is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a processor;
FIG. 2 is a schematic diagram of the fetch pipeline of an instruction fetch unit at the front end of a superscalar processor;
FIG. 3 is a schematic diagram of an instruction fetch unit according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an instruction fetch unit according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of an instruction fetch unit according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of an instruction fetch unit according to an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of an instruction fetch unit according to an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of an instruction fetch unit according to an exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method for instruction fetch, provided in one exemplary embodiment of the present application;
FIG. 10 is a frame diagram of the instruction fetch unit according to the present application;
FIG. 11 is a schematic diagram of a field segment in each entry of the BTB to which the present application relates;
FIG. 12 is a schematic diagram of an alternative strategy for BTB in accordance with the present application;
FIG. 13 is a schematic diagram of an update process according to the present application;
FIG. 14 is a schematic diagram of cache writing according to the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that, although the terms first, second, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first parameter may also be referred to as a second parameter, and similarly, a second parameter may also be referred to as a first parameter, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
Some concepts to which this application relates are first described below:
1) Zero bubble: in this application, refers to inserting no bubbles (pipeline stalls) during the branch prediction stage.
2) EXU: execution Unit.
3) BP: branch Predictor.
4) BTB: branch Target Buffer, branch target cache.
5) RAS: return Address Stack, return address stack.
6) PC: point Counter, address Counter, indicate the address that the current instruction corresponds to.
7) Superscalar processor: refers to a processor that can execute more than one instruction in one clock cycle.
Fig. 1 shows a schematic diagram of a processor, as shown in fig. 1:
the processor 100 is composed of a controller 101, an arithmetic unit 102 and a memory 103, wherein the controller 101 includes an instruction fetching unit 101a, the controller 101 is responsible for fetching instructions from a memory through a bus 110, the arithmetic unit 102 is used for executing operations, such as addition and subtraction operations, according to the instructions fetched from the memory by the controller 101, and the memory 103 is used for storing data used in the instructions. The instruction fetch unit 101a includes an instruction counter 101a1 and an instruction register 101a2, where the instruction counter 101a1 is a register for storing the address of an instruction currently being executed, when a computer executes a program, the instructions are executed one by one, the instruction counter 101a1 is used to record the address of the instruction currently being executed, and each time an instruction is executed, the address in the instruction counter 101a1 is updated to point to the instruction to be executed next, so as to implement sequential execution of the program. The instruction register 101a2 is a special register in the computer, and is used for storing the currently executing instruction, the instruction fetching unit 101a mainly acts to read the instruction from the memory and transmit the instruction to the instruction decoder for analysis and execution, when the computer runs the program, the instruction fetching unit 101a reads the instruction from the memory through the bus 110, then stores the instruction into the instruction register 101a2, then the instruction register 101a2 transmits the operation code and the operand of the instruction to the instruction decoder, and the instruction decoder executes the corresponding operation according to the operation code.
That is, the instruction fetch unit 101a fetches an instruction from the memory through the bus 110 and sends it to the instruction decoder; the instruction decoder converts the binary instruction into the corresponding operation code and operand; the operand is loaded from the instruction register 101a2 into the arithmetic unit 102, which performs the arithmetic or logic operation; the result is then stored in the memory 103.
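As a rough illustration of this fetch-decode-execute cycle, the following toy Python model (with invented opcodes and a single accumulator standing in for the arithmetic unit — not the actual processor 100) shows how the instruction counter advances one instruction at a time:

```python
def run(memory, steps):
    """Toy fetch-decode-execute loop (illustrative only).

    memory: list of (opcode, operand) tuples standing in for
    stored instructions; returns the accumulator value.
    """
    pc, acc = 0, 0                     # instruction counter and accumulator
    for _ in range(steps):
        opcode, operand = memory[pc]   # fetch and "decode" the instruction
        if opcode == "add":
            acc += operand             # arithmetic unit executes
        elif opcode == "sub":
            acc -= operand
        pc += 1                        # point at the next instruction
    return acc
```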
In a processor, the front-end instruction fetch unit is used for fetching instructions. Generally, there are two types of instruction fetch units. One is based on an instruction memory: the instructions to be executed are moved into the instruction memory in advance through DMA (direct memory access), and the front-end instruction fetch unit reads the instruction to be executed from that memory according to the current PC value. The other type is based on an instruction cache; for scenarios where the program segment is relatively large, the program is generally stored in memory. If the requested instruction is already in the cache, it is read directly from the cache; otherwise a cache miss occurs, and a read operation to the memory must be initiated to backfill the cache. For high-performance processors, the architecture with a cache is typically chosen.
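The cache-based read path described above can be sketched as follows. This is a simplified, hypothetical model — a dict stands in for the cache, with no eviction, associativity, or line granularity:

```python
def read_instruction(pc, cache, memory):
    """Illustrative instruction-cache lookup with miss backfill.

    cache: dict mapping address -> instruction (a toy cache)
    memory: dict mapping address -> instruction (backing store)
    Returns (instruction, hit_flag).
    """
    if pc in cache:            # cache hit: read directly from the cache
        return cache[pc], True
    instr = memory[pc]         # cache miss: initiate a read to memory...
    cache[pc] = instr          # ...and backfill the cache
    return instr, False
```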
Another critical component of the front-end instruction fetch unit is the branch prediction unit. Because programs include jump instructions, if the jump direction and jump address cannot be determined in the fetch stage, branch prediction must be performed in the fetch stage, that is, predicting whether the current instruction is a branch instruction and, if so, its jump address.
Referring to FIG. 2, a schematic diagram of the fetch pipeline of an instruction fetch unit at the front end of a superscalar processor is shown. As shown in FIG. 2, the instruction fetch stage of the superscalar processor includes a multi-stage pipeline, corresponding to instruction fetch stages 0 to 2 shown in FIG. 2, and branch prediction is performed simultaneously in the instruction fetch process; the units of branch prediction operate in a plurality of stages, corresponding to BP 1 to 2 shown in FIG. 2.
However, in the above-mentioned fig. 2, the units of branch prediction in the superscalar processor are divided into multiple stages, which is time-consuming to predict, and if it is predicted that a jump is required (i.e. the current instruction is a branch instruction that needs to be jumped), then the current instruction fetch stage needs to be flushed, and bubbles are introduced, which causes performance loss.
Referring to FIG. 3, a schematic structural diagram of an instruction fetch unit according to an exemplary embodiment of the present application is shown, where the instruction fetch unit may include:
A branch target cache BTB unit 301, a return address stack RAS unit 302, an address accumulation unit 303, a data selection unit 304, a program counter PC register 305, and an instruction reading unit 306.
The BTB unit 301, the RAS unit 302 and the address accumulating unit 303 are respectively connected to the data selecting unit 304, and one output port of each of the BTB unit 301, the RAS unit 302 and the address accumulating unit 303 is respectively connected to three input ports of the data selecting unit 304, wherein the address of each unit predicted hit in the BTB unit 301, the RAS unit 302 and the address accumulating unit 303 is transferred to one input port of the data selecting unit 304 through one output port corresponding to the unit, so that the address information of each unit predicted hit enters the data selecting unit 304. An output port of the BTB unit 301 is connected to an input port of the data selecting unit 304, and address information of the BTB unit 301 predicted hit is transferred from the output port of the BTB unit 301 to the input port of the data selecting unit 304, so that the address information of the BTB unit 301 predicted hit is input to the data selecting unit 304; an output port of the RAS unit 302 is connected to an input port of the data selection unit 304, and address information of the RAS unit 302 predicted hit is transferred from the output port of the RAS unit 302 to the input port of the data selection unit 304, so that the address information of the RAS unit 302 predicted hit enters the data selection unit 304; an output port of the address accumulating unit 303 is connected to an input port of the data selecting unit 304, the address accumulating unit 303 accumulates the address transferred from the PC register 305 to obtain an accumulated address, and the obtained accumulated address information is transferred from the output port of the address accumulating unit 303 to the input port of the data selecting unit 304, so that the accumulated address information is input to the data selecting unit 304. 
The BTB unit 301 is connected to the RAS unit 302, for example, by one output port of the BTB unit 301 and one input port of the RAS unit 302, and address information of a predicted hit of the BTB unit 301 is transferred from one output port of the BTB unit 301 to one input port of the RAS unit 302, so that address information of a predicted hit of the BTB unit 301 is input into the RAS unit 302.
The data selecting unit 304 is connected to the PC register 305, for example, by an output port of the data selecting unit 304 and an input port of the PC register 305, and information of an address in the data selecting unit 304 is transferred from the output port of the data selecting unit 304 to the input port of the PC register 305, so that the address information is input to the PC register 305.
The PC register 305 is further connected to the BTB unit 301, the RAS unit 302, and the address accumulating unit 303, and the address information of the current instruction is transferred from one output port of the PC register 305 to one input port of the BTB unit 301, the RAS unit 302, and the address accumulating unit 303, respectively, so that the address information of the current instruction is input into the BTB unit 301, the RAS unit 302, and the address accumulating unit 303, respectively.
The address information of the current instruction may also be transferred from different output ports on the PC register to one input port on each of the BTB unit 301, the RAS unit 302 and the address accumulation unit 303. For example, one output port 1 of the PC register 305 is connected to one input port of the BTB unit 301, and address information of the current instruction is transferred from the output port 1 of the PC register 305 to the input port of the BTB unit 301, so that the address information of the current instruction is input to the BTB unit 301; the other output port 2 of the PC register 305 is connected to one input port of the RAS unit 302, and address information of the current instruction is transferred from the above-mentioned output port 2 of the PC register 305 to the above-mentioned input port of the RAS unit 302, so that the address information of the current instruction is input into the RAS unit 302; an output port 3 of the PC register 305 is connected to an input port of the address accumulating unit 303, and address information of the current instruction is transferred from the output port 3 of the PC register 305 to the input port of the address accumulating unit 303, so that the address information of the current instruction is input to the address accumulating unit 303.
The PC register 305 is connected to the instruction fetch unit 306, for example, an output port of the PC register 305 and an input port of the instruction fetch unit 306 may be connected, and address information in the PC register 305 is sent from the output port of the PC register 305 to the input port of the instruction fetch unit 306, so that the address information is input to the instruction fetch unit 306.
The BTB unit 301 predicts a target address in the case where the next instruction of the current instruction is a branch instruction, based on the current instruction address information transferred from the PC register 305, and outputs the target address information to the data selecting unit 304. For example, there may be a cache in the BTB that records a history of previous branch instructions, including the address and target address of the branch instruction. When a new branch instruction is encountered by the BTB, in this embodiment, the instruction is sent by the PC register 305, the BTB unit 301 checks whether the same address as the previous branch instruction address exists in its own cache according to the branch address information of the instruction, and if so, uses the corresponding target address as a prediction result, i.e. BTB prediction hit, and outputs a prediction hit address.
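A minimal sketch of this BTB lookup, assuming a simple address-to-target table (illustrative only; the class and method names are invented, and a real BTB is a small set-associative hardware table, not a Python dict):

```python
class BTB:
    """Toy branch target buffer: records resolved branches so that
    a later fetch of the same address can predict the target."""

    def __init__(self):
        self.table = {}

    def update(self, branch_addr, target_addr):
        # Record a taken branch after it resolves in the back end.
        self.table[branch_addr] = target_addr

    def predict(self, pc):
        # Returns (hit, predicted_target) for the incoming PC:
        # a hit means this address was seen as a branch before.
        if pc in self.table:
            return True, self.table[pc]
        return False, None
```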
The RAS unit 302 is configured to predict, in the case of a BTB unit predicted hit, the return address of the current function after a jump occurs for the next instruction, based on the instruction address sent from the PC register 305 and the information of the instruction for which the BTB unit predicts a hit, and to output the return address to the data selecting unit 304. The RAS may save the address of the instruction following the most recent subroutine call in a last-in first-out memory, and then take the address saved in that memory as the predicted return address when a return instruction is encountered. In this scheme, in the case where the BTB unit 301 hits, the RAS unit 302 fetches the saved address of the most recent subroutine call from the memory as the predicted return address according to the input instruction address, i.e., an RAS hit; on an RAS hit, the hit return address is input into the data selecting unit 304 by the RAS unit 302.
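The last-in first-out behavior described above can be sketched as follows (a toy model with invented names; a hardware RAS has a fixed depth and overflow/underflow handling, which are omitted here):

```python
class RAS:
    """Toy return address stack (last-in, first-out)."""

    def __init__(self):
        self.stack = []

    def push(self, return_addr):
        # On a call, save the address of the instruction after it.
        self.stack.append(return_addr)

    def predict_return(self):
        # On a predicted return, pop the most recently saved address.
        # Returns (hit, address); a miss when the stack is empty.
        if self.stack:
            return True, self.stack.pop()
        return False, None
```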
An address accumulating unit 303 is configured to add 1 to the address of the current instruction sent by the PC register 305 to obtain an accumulated address of the next instruction, and output the accumulated address to the data selecting unit 304.
A data selecting unit 304 for selecting an address from the addresses input from the BTB unit 301, the RAS unit 302, and the address accumulating unit 303 as an instruction address of the next instruction, and outputting the instruction address of the next instruction to the PC register 305.
A PC register 305 for buffering the instruction address outputted from the data selecting unit 304, inputting the buffered instruction address to the instruction reading unit 306, and outputting the buffered instruction address as a new input address to the BTB unit 301, the RAS unit 302, and the address accumulating unit 303, respectively, at the next instruction prediction.
An instruction reading unit 306 is configured to perform a read-instruction operation based on the instruction address output from the PC register 305, and to send the read instruction to an execution unit connected to the instruction fetch unit.
In summary, in the scheme shown in the embodiment of the present application, the instruction fetch unit includes a BTB unit, an RAS unit, an address accumulation unit, a data selection unit, a PC register, and an instruction fetch unit; the RAS unit predicts the return address of the current function after the next instruction jumps under the condition that the BTB unit predicts hit, and the address accumulation unit adds 1 to the address of the current instruction to obtain the accumulated address of the next instruction; the predicted or accumulated address is selected by a data selection unit to obtain the instruction address of the next instruction and is cached in a PC register, so that the instruction reading unit is used for reading the instruction based on the instruction address; in the scheme, the BTB unit and the RAS unit can predict the target address and the jump address of the branch instruction, and the data selection unit selects the correct address for output, so that the situation of introducing bubbles is avoided in the process, the instruction fetching efficiency is improved, and the performance of the processor is improved.
Referring to FIG. 4, based on the instruction fetch unit shown in FIG. 3 above, a schematic structural diagram of the instruction fetch unit according to an exemplary embodiment of the present application is shown. As shown in FIG. 4, the data selecting unit 304 in FIG. 3 includes: a first data selection unit 304a and a second data selection unit 304b.
Wherein one output port of the BTB unit 301 is connected to one input port of the first data selecting unit 304a, address information of a predicted hit of the BTB unit 301 is transferred from the above-mentioned output port of the BTB unit 301 to the above-mentioned input port of the first data selecting unit 304a, the predicted hit address information of the BTB unit 301 is input to the first data selecting unit 304a, one output port of the RAS unit 302 is connected to the other input port of the first data selecting unit 304a, the address information of a predicted hit of the RAS unit 302 is transferred from the above-mentioned output port of the RAS unit 302 to the above-mentioned input port of the first data selecting unit 304a, and the predicted hit address information of the RAS unit 302 is input to the first data selecting unit 304a, the first data selecting unit 304a for selecting one address from the inputted plurality of addresses as an instruction address of a next instruction.
An output port of the first data selecting unit 304a is connected to an input port of the second data selecting unit 304b, information of the address selected by the first data selecting unit 304a is transferred from the output port of the first data selecting unit 304a to the input port of the second data selecting unit 304b, information of the address selected by the first data selecting unit 304a is input to the second data selecting unit 304b, an output port of the address accumulating unit 303 is connected to another input port of the second data selecting unit 304b, information of the accumulated address added by 1 by the address accumulating unit 303 is transferred from the output port of the address accumulating unit 303 to the input port of the second data selecting unit 304b, and information of the accumulated address added by 1 by the address accumulating unit 303 is input to the second data selecting unit 304b, and the second data selecting unit 304b is used for selecting an address from the input plurality of addresses as an instruction address of a next instruction.
A first data selecting unit 304a for selecting and outputting an address input from the RAS unit when the RAS unit 302 predicts a hit, and selecting and outputting an address input from the BTB unit 301 when the RAS unit 302 predicts a miss.
The second data selecting unit 304b is configured to select and output the address input by the first data selecting unit 304a when the BTB unit 301 predicts a hit, and to select and output the accumulated address input by the address accumulation unit 303 when the BTB unit 301 predicts a miss.
In some embodiments, the first data selecting unit 304a and the second data selecting unit 304b may be a Multiplexer (MUX) that selects one of the plurality of input signals to be output through a control signal. The MUX typically has two or more input signals, one or more control signals and an output signal, the control signals determining which signal output to select.
The control signal of the first data selecting unit 304a may be sent by the RAS unit 302. In one case, when the RAS unit 302 predicts a hit, the RAS unit 302 sends a predicted-hit signal to the first data selecting unit 304a, and the first data selecting unit 304a receives the hit signal and outputs the hit address predicted by the RAS unit 302; when the RAS unit 302 predicts a miss, the first data selecting unit 304a outputs the hit address predicted by the BTB unit 301. In another case, regardless of whether the RAS unit 302 predicts a hit, the RAS unit 302 transmits a signal to the first data selecting unit 304a whose value differs: if the RAS unit 302 hits, it sends 1 (a predicted-hit signal); if it misses, it sends 0 (a predicted-miss signal). When the signal is 1, the first data selecting unit 304a outputs the hit address predicted by the RAS unit 302; when the signal is 0, the first data selecting unit 304a outputs the hit address predicted by the BTB unit 301.
The control signal of the second data selection unit 304b may be sent by the BTB unit 301. In one case, the BTB unit 301 sends a predicted-hit signal to the second data selection unit 304b only when it predicts a hit: on receiving the hit signal, the second data selection unit 304b outputs the address input by the first data selection unit 304a; when the BTB unit 301 predicts a miss, the second data selection unit 304b outputs the accumulated address input by the address accumulation unit 303. In another case, the BTB unit 301 sends a signal to the second data selection unit 304b regardless of whether it predicts a hit, with different values for the two outcomes: the BTB unit 301 sends 1 on a hit (1 being the predicted-hit signal) and 0 on a miss (0 being the predicted-miss signal). When the signal is 1, the second data selection unit 304b outputs the address input by the first data selection unit 304a; when the signal is 0, it outputs the accumulated address input by the address accumulation unit 303.
In this embodiment of the present application, through the connections between the first data selection unit 304a and the second data selection unit 304b in the data selection unit 304 and the BTB unit 301, the RAS unit 302, and the address accumulation unit 303, together with the data selection manner of the two units, the address predicted by the RAS unit 302 is output preferentially when the RAS unit 302 predicts a hit; the address predicted by the BTB unit 301 is output when the RAS unit 302 predicts a miss and the BTB unit 301 predicts a hit; and when both the RAS unit 302 and the BTB unit 301 predict a miss, the address accumulation unit 303 performs address accumulation and the accumulated address is output. The correct address is thus selected for output when there are a plurality of predicted addresses.
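The priority just described (RAS over BTB over sequential fetch) can be sketched as a pure function. This is an illustrative software model, not the patented circuit, and it assumes, per the design, that the RAS only predicts after a BTB hit, so a RAS hit implies a BTB hit.

```python
# Sketch (assumption, not RTL) of the two-level MUX priority:
# first MUX 304a picks RAS over BTB, second MUX 304b picks any
# predicted address over the sequential address PC + 1.
def select_next_pc(btb_hit, btb_addr, ras_hit, ras_addr, pc):
    # First MUX (304a): RAS result wins over BTB result.
    stage1 = ras_addr if ras_hit else btb_addr
    # Second MUX (304b): any BTB hit wins over sequential fetch.
    return stage1 if btb_hit else pc + 1

assert select_next_pc(True, 0x80, True, 0x44, 0x10) == 0x44   # RAS hit
assert select_next_pc(True, 0x80, False, 0x44, 0x10) == 0x80  # BTB-only hit
assert select_next_pc(False, 0x80, False, 0x44, 0x10) == 0x11 # both miss
```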
In one possible implementation, BTB unit 301 includes a first prediction record register;
and the first prediction record register is used for caching the prediction record of each branch instruction. The BTB unit records the historical behavior of branch instructions in the program and predicts their target addresses, so as to improve branch execution efficiency: it saves the address of each branch instruction and the corresponding target address. When the processor encounters a branch instruction, the BTB unit is queried first to try to predict the target address of the branch instruction; if a record for the branch instruction exists in the BTB unit, the processor can jump directly to the predicted target address for execution, thereby avoiding the delay of waiting for the branch instruction to be resolved.
The BTB unit 301 predicts the target address of a branch instruction using history records and statistical information, and stores the address of the branch instruction together with the predicted target address in a cache, namely the first prediction record register. When the same branch instruction is encountered again, the target instruction is executed in advance according to the previous prediction result, which avoids pipeline stalls, reduces the performance loss caused by branch misprediction, and improves the execution efficiency of branch instructions.
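As an illustration of this lookup-and-update behavior, a hypothetical software model follows; the real unit is a register file, and the class and method names here are invented.

```python
# Hypothetical model of the first prediction record register: a small
# mapping from branch-instruction address to predicted target address.
class BTBModel:
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}          # branch pc -> predicted target

    def predict(self, pc):
        """Return (hit, target) for the given fetch address."""
        if pc in self.entries:
            return True, self.entries[pc]
        return False, None

    def update(self, pc, target):
        """Record the resolved target of a taken branch."""
        if len(self.entries) >= self.capacity and pc not in self.entries:
            # Crude eviction; the document's actual policy (fig. 12) differs.
            self.entries.pop(next(iter(self.entries)))
        self.entries[pc] = target

btb = BTBModel()
btb.update(0x100, 0x200)
assert btb.predict(0x100) == (True, 0x200)
assert btb.predict(0x104) == (False, None)
```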
In one possible implementation, the RAS unit 302 includes a second prediction record register therein; and the second prediction record register is used for caching the prediction record of each branch instruction with jump.
The function of the RAS unit is to store the return address of a function call so that the execution position can be restored when the function returns. The RAS unit implements this with a stack data structure: each time a function call occurs, the address of the current instruction is pushed onto the stack; when the function returns, the RAS unit pops the address at the top of the stack and sets the program counter to that address, thereby returning to the instruction after the function call. By caching the addresses that record function calls and returns, the RAS unit improves the efficiency of function calls and reduces memory accesses.
When the program executes to a branch instruction, the RAS unit pushes the address of the instruction following the branch instruction onto the stack; then, according to the second prediction record register, caching the prediction records and performing correct jumps reduces the pipeline stalls caused by branch instructions, thereby improving the overall instruction execution speed and the execution efficiency of branch instructions. In this embodiment of the present application, the BTB unit and the RAS unit are prediction units constructed based on registers, and no memory needs to be introduced; since registers are small in size and occupy little space, the areas of the BTB unit and the RAS unit can be reduced, further saving chip space.
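The push-on-call / pop-on-return behavior can be sketched as follows. This is an illustrative model with assumed names; the instruction size is taken as 1 addressing unit, matching the +1 address accumulation used elsewhere in the document.

```python
# Minimal model (an assumption for illustration) of the RAS behavior:
# push the fall-through address on a call, pop it on a return, so the
# predicted return address is available without a memory access.
class RASModel:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, pc, instr_size=1):
        if len(self.stack) == self.depth:   # bounded hardware stack
            self.stack.pop(0)
        self.stack.append(pc + instr_size)  # address of the next instruction

    def on_return(self):
        """Return (hit, predicted_return_address)."""
        if self.stack:
            return True, self.stack.pop()
        return False, None

ras = RASModel()
ras.on_call(0x40)
ras.on_call(0x80)
assert ras.on_return() == (True, 0x81)
assert ras.on_return() == (True, 0x41)
assert ras.on_return() == (False, None)   # empty stack: predicted miss
```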
Referring to fig. 5, which shows a schematic structural diagram of an instruction fetch unit according to an exemplary embodiment of the present application, as shown in fig. 5, based on the instruction fetch unit shown in fig. 3 or fig. 4, the instruction fetch unit further includes a loop prediction unit 307. The loop prediction unit 307 is connected to the data selection unit 304: an output port of the loop prediction unit 307 is connected to an input port of the data selection unit 304, so that information of the loop jump address corresponding to the instruction next to the current instruction can be transferred from the output port of the loop prediction unit 307 to the input port of the data selection unit 304, and the address information in the loop prediction unit 307 is thereby input to the data selection unit 304.
The PC register 305 is further connected to the loop prediction unit 307, and an output port of the PC register 305 is connected to an input port of the loop prediction unit 307, so that information of the current instruction address in the PC register 305 can be transferred from the output port of the PC register 305 to the input port of the loop prediction unit 307, so that information of the current instruction address in the PC register 305 is input to the loop prediction unit 307; the PC register 305 is further configured to output the buffered instruction address as a new input address to the loop prediction unit 307 in next instruction prediction.
The loop prediction unit 307 is configured to predict, according to the input address, a loop jump address corresponding to an instruction next to the current instruction in a case where the current instruction is a last instruction in the loop instruction set, and output the loop jump address to the data selection unit 304.
At this time, the data selection unit 304 may be configured to select one address from the addresses output by the BTB unit 301, the RAS unit 302, the address accumulation unit 303, and the loop prediction unit 307, and output the selected address as the instruction address of the next instruction, so as to ensure that the correct address is output when multiple address predictions coexist.
For an AI processor, the application scenarios/programs usually contain fewer branch instructions and more loops. In this embodiment of the present application, a loop prediction unit is further provided in the instruction fetch unit for predicting whether the next instruction is the instruction reached after a loop jump is triggered in the loop body (i.e., the 1st instruction of the loop body), and for outputting the address of that instruction when the loop jump is predicted. Prediction of the instruction address after a loop jump within a loop body is thereby supported, which further improves instruction fetch efficiency in scenarios where loops exist and improves processor performance.
In some embodiments, a loop-skip address register is included in the loop prediction unit; and a loop prediction unit for determining a predicted hit in the case where there is a record of a loop jump address corresponding to the input instruction in the loop jump address register, and outputting the loop jump address to the data selection unit.
In this embodiment of the present application, the loop prediction unit is a prediction unit constructed based on registers, and no memory needs to be introduced; since registers are small in size and occupy little space, the area of the loop prediction unit can be reduced, further saving chip space.
Referring to fig. 6, which shows a schematic structural diagram of an instruction fetch unit according to an exemplary embodiment of the present application, as shown in fig. 6, based on the instruction fetch unit shown in fig. 3, 4, or 5, the data selection unit 304 further includes a third data selection unit 304c.
An output port of the second data selection unit 304b is connected to one input port of the third data selection unit 304c, so that information of the address selected by the second data selection unit 304b can be transferred from the output port of the second data selection unit 304b to the input port of the third data selection unit 304c and is thereby input into the third data selection unit 304c. An output port of the loop prediction unit 307 is connected to the other input port of the third data selection unit 304c, so that information of the loop jump address corresponding to the instruction next to the current instruction in the loop prediction unit 307 can be transferred from the output port of the loop prediction unit 307 to the input port of the third data selection unit 304c and is thereby input into the third data selection unit 304c.
The third data selecting unit 304c is configured to select and output the address input by the loop predicting unit 307 when the loop predicting unit 307 predicts a hit, and select and output the address input by the second data selecting unit 304b when the loop predicting unit 307 predicts a miss.
The control signal of the third data selection unit 304c may be sent by the loop prediction unit 307. In one case, the loop prediction unit 307 sends a predicted-hit signal to the third data selection unit 304c only when it predicts a hit: on receiving the predicted-hit signal, the third data selection unit 304c outputs the hit address predicted by the loop prediction unit 307; when the loop prediction unit 307 predicts a miss, no predicted-hit signal is sent, and the third data selection unit 304c outputs the address input by the second data selection unit 304b. In another case, the loop prediction unit 307 sends a signal to the third data selection unit 304c regardless of whether it predicts a hit, with different values for the two outcomes: the loop prediction unit 307 sends 1 on a predicted hit (1 being the predicted-hit signal) and 0 on a predicted miss (0 being the predicted-miss signal). When the signal is 1, the third data selection unit 304c outputs the hit address predicted by the loop prediction unit 307; when the signal is 0, it outputs the address input by the second data selection unit 304b.
Specifically, in this embodiment of the present application, through the connections between the first data selection unit 304a, the second data selection unit 304b, and the third data selection unit 304c in the data selection unit 304 and the BTB unit 301, the RAS unit 302, the address accumulation unit 303, and the loop prediction unit 307, together with the data selection manner of the three data selection units, the address predicted by the loop prediction unit 307 is output preferentially when the loop prediction unit 307 predicts a hit; the address predicted by the RAS unit 302 is output when the loop prediction unit 307 predicts a miss and the RAS unit 302 predicts a hit; the address predicted by the BTB unit 301 is output when the loop prediction unit 307 and the RAS unit 302 predict a miss and the BTB unit 301 predicts a hit; and when the loop prediction unit 307, the RAS unit 302, and the BTB unit 301 all predict a miss, the address accumulation unit 303 performs address accumulation and the accumulated address is output. The correct address is thus selected for output when there are a plurality of predicted addresses.
Referring to fig. 7, a schematic structural diagram of an instruction fetch unit according to an exemplary embodiment of the present application is shown, and as shown in fig. 7, the data selection unit 304 is further connected to an execution unit;
The data selecting unit 304 is further configured to output a flushing address to the PC register when the execution unit flushes the instruction fetch unit.
In this embodiment of the present application, the instruction fetch unit can also handle the case of a prediction error: the execution unit initiates a flush of the entire instruction fetch pipeline, that is, writing a flush address into the PC register is supported, and instruction address prediction and instruction extraction are performed again starting from the flush address, so that instruction fetch correctness is guaranteed after an address misprediction.
Referring to fig. 8, which shows a schematic structural diagram of an instruction fetch unit according to an exemplary embodiment of the present application, as shown in fig. 8, based on the instruction fetch unit shown in any of fig. 3 to 7, the data selection unit 304 further includes a fourth data selection unit 304d;
one output port of the third data selection unit 304c is connected to one input port of the fourth data selection unit 304d, so that information of the address selected by the third data selection unit 304c can be transferred from the output port of the third data selection unit 304c to the input port of the fourth data selection unit 304d and is thereby input to the fourth data selection unit 304d. The other input port of the fourth data selection unit 304d is connected to the flush address output port of the execution unit, so that information of the flush address output by the execution unit can be transferred from the output port of the execution unit to the input port of the fourth data selection unit 304d and is thereby input to the fourth data selection unit 304d.
An output port of the fourth data selection unit 304d is connected to an input port of the PC register 305, so that information of the address selected by the fourth data selection unit 304d can be transferred from the output port of the fourth data selection unit 304d to the input port of the PC register 305 and is thereby input into the PC register 305.
The fourth data selecting unit 304d is configured to select and output the flush address when receiving the flush enable signal sent by the executing unit, and select and output the address input by the third data selecting unit 304c when not receiving the flush enable signal sent by the executing unit.
The control signal of the fourth data selecting unit 304d may be a flush enable signal sent by the executing unit, and when the fourth data selecting unit 304d receives the flush enable signal, a flush address is output, and when the fourth data selecting unit 304d does not receive the flush enable signal, a hit address predicted by the third data selecting unit 304c is output.
Specifically, in this embodiment of the present application, through the connections between the first data selection unit 304a, the second data selection unit 304b, the third data selection unit 304c, and the fourth data selection unit 304d in the data selection unit 304 and the BTB unit 301, the RAS unit 302, the address accumulation unit 303, the loop prediction unit 307, and the execution unit, together with the data selection manner of the four data selection units, the flush address is output preferentially when the flush enable signal is present; the address predicted by the loop prediction unit 307 is output when there is no flush enable signal and the loop prediction unit 307 predicts a hit; the address predicted by the RAS unit 302 is output when the loop prediction unit 307 predicts a miss and the RAS unit 302 predicts a hit; the address predicted by the BTB unit 301 is output when the loop prediction unit 307 and the RAS unit 302 predict a miss and the BTB unit 301 predicts a hit; and when the loop prediction unit 307, the RAS unit 302, and the BTB unit 301 all predict a miss, the address accumulation unit 303 performs address accumulation and the accumulated address is output. The correct address is thus selected for output when there are a plurality of predicted addresses.
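The full selection chain described above can be collapsed into one pure function; this is a behavioral sketch of MUXes 304a-304d, not the circuit itself.

```python
# Priority as described: flush > loop predictor > RAS > BTB > PC + 1.
def select_pc(pc, flush_en, flush_addr,
              loop_hit, loop_addr,
              ras_hit, ras_addr,
              btb_hit, btb_addr):
    if flush_en:                 # fourth MUX: flush has top priority
        return flush_addr
    if loop_hit:                 # third MUX: loop-jump address
        return loop_addr
    if ras_hit:                  # first MUX: function-return address
        return ras_addr
    if btb_hit:                  # second MUX: branch target
        return btb_addr
    return pc + 1                # address accumulation unit

assert select_pc(0x10, True, 0xF0, True, 0xA0, True, 0xB0, True, 0xC0) == 0xF0
assert select_pc(0x10, False, 0xF0, True, 0xA0, False, 0, False, 0) == 0xA0
assert select_pc(0x10, False, 0, False, 0, False, 0, False, 0) == 0x11
```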
Referring to fig. 9, a flowchart of an instruction reading method according to an exemplary embodiment of the present application is shown. The method is used in a processor and is executed by the instruction fetch unit in the processor, where the instruction fetch unit may be any one of the instruction fetch units shown in fig. 3 to fig. 8. As shown in fig. 9, the method includes:
step 910: predicting, by the BTB unit, a target address in a case that a next instruction of the current instruction is a branch instruction according to the input address, and outputting the target address to the data selecting unit; the input address is the instruction address of the current instruction.
Step 920: in the case of a predicted hit by the BTB unit, the RAS unit predicts the return address of the current function after the next instruction jumps based on the input address and information of the instruction for which the BTB unit predicts the hit, and outputs the return address to the data selecting unit.
Step 930: and adding 1 to the address of the current instruction through an address accumulation unit to obtain an accumulated address of the next instruction, and outputting the accumulated address to a data selection unit.
Step 940: selecting, by the data selection unit, one address from the plurality of input addresses as the instruction address of the next instruction, and outputting the instruction address of the next instruction to the PC register.
Step 950: the instruction address input by the data selecting unit is buffered through the PC register, the buffered instruction address is output to the instruction reading unit, and the buffered instruction address is output to the BTB unit and the RAS unit as a new input address when next instruction prediction is performed.
Step 960: reading, by the instruction reading unit, the instruction based on the instruction address output by the PC register, and sending the read instruction to an execution unit connected with the instruction fetch unit.
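Steps 910 to 960 can be sketched as one prediction/fetch cycle in software. The names and the stubbed predictors are illustrative assumptions, not the actual hardware.

```python
# One cycle of the method: predict (910-930), select (940), cache the
# next address (950), and read the instruction (960).
def fetch_cycle(pc_reg, btb_predict, ras_predict, imem):
    pc = pc_reg["value"]                     # step 950: cached address
    btb_hit, target = btb_predict(pc)        # step 910: BTB prediction
    ras_hit, ret = (ras_predict(pc) if btb_hit else (False, None))  # step 920
    seq = pc + 1                             # step 930: accumulated address
    if ras_hit:                              # step 940: data selection
        nxt = ret
    elif btb_hit:
        nxt = target
    else:
        nxt = seq
    instr = imem[pc]                         # step 960: instruction read
    pc_reg["value"] = nxt                    # step 950: cache next address
    return instr, nxt

imem = {0x0: "add", 0x1: "call", 0x2: "nop"}
pc_reg = {"value": 0x0}
instr, nxt = fetch_cycle(pc_reg, lambda pc: (False, None),
                         lambda pc: (False, None), imem)
assert (instr, nxt) == ("add", 0x1)
```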
In one possible implementation, the data selection unit includes: a first data selecting unit and a second data selecting unit;
the BTB unit is connected to one input terminal of the first data selection unit, and the RAS unit is connected to the other input terminal of the first data selection unit;
the output terminal of the first data selection unit is connected to one input terminal of the second data selection unit, and the address accumulation unit is connected to the other input terminal of the second data selection unit;
selecting, by a data selecting unit, an address from the plurality of addresses inputted as an instruction address of a next instruction, and outputting the instruction address of the next instruction to a PC register, comprising:
Selecting, by the first data selecting unit, an address input by the RAS unit when the RAS unit predicts a hit, and selecting, when the RAS unit predicts a miss, an address input by the BTB unit;
and selecting and outputting, by the second data selection unit, the address input by the first data selection unit when the BTB unit predicts a hit, and selecting and outputting the address input by the address accumulation unit when the BTB unit predicts a miss.
In one possible implementation, the BTB unit includes a first prediction record register; and the first prediction record register is used for caching the prediction record of each branch instruction.
In one possible implementation, the RAS unit includes a second prediction record register therein; and the second prediction record register is used for caching the prediction record of each branch instruction with jump.
In one possible implementation manner, the instruction fetch unit further comprises a loop prediction unit, and the loop prediction unit is connected with the data selection unit;
the PC register is also connected with the cyclic prediction unit;
the method further comprises the steps of:
predicting, by the loop prediction unit according to the input address, the loop jump address corresponding to the instruction next to the current instruction in the case that the current instruction is the last instruction in the loop instruction body, and outputting the loop jump address to the data selection unit;
And outputting the cached instruction address to the loop prediction unit as a new input address through the PC register when the next instruction is predicted.
In one possible implementation, the loop prediction unit includes a loop jump address register;
and predicting, by the loop prediction unit according to the input address, the loop jump address corresponding to the instruction next to the current instruction in the case that the current instruction is the last instruction in the loop instruction body, and outputting the loop jump address to the data selection unit, including:
determining, by the loop prediction unit, a prediction hit when there is a record of the loop jump address corresponding to the input instruction in the loop jump address register, and outputting the loop jump address to the data selection unit.
In one possible implementation, the data selecting unit further includes a third data selecting unit;
the output end of the second data selection unit is connected with one input end of the third data selection unit, and the cyclic prediction unit is connected with the other input end of the third data selection unit;
selecting, by the data selecting unit, an address from the plurality of addresses inputted as an instruction address of a next instruction, and outputting the instruction address of the next instruction to the PC register, further comprising:
selecting and outputting, by the third data selection unit, the address input by the loop prediction unit when the loop prediction unit predicts a hit, and selecting and outputting the address input by the second data selection unit when the loop prediction unit predicts a miss.
In a possible implementation, the data selection unit is further connected to the execution unit; the method further comprises the steps of:
outputting, by the data selection unit, the flush address to the PC register when the execution unit flushes the instruction fetch unit.
In a possible implementation, the data selection unit further comprises a fourth data selection unit;
the output end of the third data selection unit is connected with one input end of the fourth data selection unit, and the other input end of the fourth data selection unit is connected with the flushing address output end of the execution unit; the flushing address output end is used for outputting a flushing address;
the output end of the fourth data selection unit is connected with the PC register;
selecting, by a data selecting unit, an address from the plurality of addresses inputted as an instruction address of a next instruction, and outputting the instruction address of the next instruction to a PC register, comprising:
and the fourth data selection unit is used for selecting and outputting the flushing address when receiving the flushing enable signal sent by the execution unit and selecting and outputting the address input by the third data selection unit when not receiving the flushing enable signal sent by the execution unit.
In summary, in the scheme shown in the embodiments of the present application, the instruction fetch unit includes a BTB unit, an RAS unit, an address accumulation unit, a data selection unit, a PC register, and an instruction reading unit. The RAS unit predicts the return address of the current function after the next instruction jumps when the BTB unit predicts a hit, and the address accumulation unit adds 1 to the address of the current instruction to obtain the accumulated address of the next instruction. The data selection unit selects among the predicted and accumulated addresses to obtain the instruction address of the next instruction, which is cached in the PC register so that the instruction reading unit can read the instruction based on that instruction address. In this scheme, the BTB unit and the RAS unit can predict the target address and the jump address of branch instructions, and the data selection unit selects the correct address for output; no bubbles are introduced in the process, which improves instruction fetch efficiency and processor performance.
Based on the scheme shown in any one of fig. 3 to 8, the present application can provide a front-end instruction fetch unit design based on zero-bubble prediction, which can be used in AI-processor scenarios with few branch instructions and many loops; the zero-bubble predictor and the hardware loop instruction are specifically designed for such scenarios so as to achieve higher performance.
In the processor field, the front-end instruction fetch unit is an important component. The front-end instruction fetch unit design based on zero-bubble prediction provided by the above scheme of the present application eliminates the generation of prediction bubbles, delivers high performance, and consumes little chip area, effectively improving the competitiveness of the product.
Referring to fig. 10, which shows a block diagram of an instruction fetch unit according to the present application: as shown in fig. 10, the PC register 1004 is the f1-stage PC register, which stores the memory address of the next instruction to be executed. According to the current PC value, the instruction fetch unit can predict the next instruction address; the PC value is input to the BTB 1001 (a hardware branch target buffer, for predicting the target address of a branch instruction) for preliminary prediction.
Referring to fig. 11, which shows a schematic diagram of the field segments in each entry of the BTB: as shown in fig. 11, an entry represents the record of one branch instruction, and the BTB may be provided with a plurality of entries (such as 32 entries). The meaning of each field segment may be as follows:
1) valid: whether the current entry is valid;
2) And (3) tag: the tag information of the current entry is a section in the pc address, for example, the pc address is 32 bits, and the tag can be the upper 20 bits of the pc address;
3) target: a target jump address of the current entry;
4) attr: 3-bit attribute information, for indicating the type of the current instruction, such as a pop instruction, a push instruction, or a direct jump instruction;
5) tcnt: a 2-bit saturation counter (taken counter), for indicating the jump direction;
6) scnt: a 2-bit saturation counter (speculation counter), for indicating prediction accuracy.
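The six field segments above can be modeled, for illustration only, as a small data structure. The widths follow the text, with one assumption made explicit: a 32-bit pc whose upper 20 bits form the tag, i.e. pc >> 12.

```python
# Illustrative model of one BTB entry; field widths per the text.
from dataclasses import dataclass

@dataclass
class BTBEntry:
    valid: bool = False   # 1) entry currently in use
    tag: int = 0          # 2) upper 20 bits of the 32-bit pc (assumed split)
    target: int = 0       # 3) predicted jump target of this entry
    attr: int = 0         # 4) 3-bit type: pop / push / direct jump
    tcnt: int = 0         # 5) 2-bit saturating taken counter
    scnt: int = 0         # 6) 2-bit saturating accuracy counter

def tag_of(pc):
    # Keep the upper 20 of 32 bits (assumption stated in the lead-in).
    return (pc >> 12) & 0xFFFFF

e = BTBEntry(valid=True, tag=tag_of(0x8000_1234), target=0x8000_2000)
assert e.tag == tag_of(0x8000_1FFF)   # same upper-20-bit tag region
```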
Wherein, when a new branch instruction needs to be cached and the BTB 1001 is full, an old branch instruction needs to be selected for replacement. Referring to fig. 12, which shows a schematic diagram of a replacement strategy for the BTB according to the present application: as shown in fig. 12, an invalid entry is first sought by leading-zero detection (the ent_vld lzd unit); if all entries are valid, an entry with scnt_hi of 0 is sought (scnt represents the prediction accuracy of the entry; scnt[1] = 0 indicates inaccurate prediction, so such an entry can be replaced preferentially); and if scnt_hi is 1 for all entries, the least recently used entry is found for replacement by the LRU (Least Recently Used) algorithm. Finally, the number replaced_idx of the BTB entry to be replaced is obtained.
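The victim-selection order described above (invalid entry first, then an entry whose scnt high bit is 0, then LRU) can be sketched behaviorally; this is not the ent_vld lzd circuit itself.

```python
# Behavioral sketch of fig. 12's replacement policy.
def pick_victim(valids, scnt_his, lru_idx):
    for i, v in enumerate(valids):        # leading-zero detect on valid bits
        if not v:
            return i
    for i, hi in enumerate(scnt_his):     # inaccurate entries (scnt[1] == 0)
        if hi == 0:
            return i
    return lru_idx                        # all valid and accurate: use LRU

assert pick_victim([1, 1, 0, 1], [1, 1, 1, 1], 3) == 2  # invalid entry wins
assert pick_victim([1, 1, 1, 1], [1, 0, 1, 1], 3) == 1  # low-accuracy entry
assert pick_victim([1, 1, 1, 1], [1, 1, 1, 1], 3) == 3  # fall back to LRU
```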
Referring to fig. 13, which shows a schematic diagram of the update process according to the present application: as shown in fig. 13, the BTB 1001 adopts the PLRUm algorithm, and a 32-bit register is set for the 32-entry BTB 1001. When a certain entry is written (a new entry is allocated) or read (a BTB hit occurs), the register is updated; if the entire register is all 1s after the update, the bits other than the current position are cleared.
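One reading of this PLRUm bookkeeping, sketched as a helper (an interpretation of the text, not the patented logic): set the accessed entry's bit, and when all 32 bits become 1, keep only the bit just set.

```python
# PLRUm update sketch: bits is the 32-bit register, idx the touched entry.
def plrum_touch(bits, idx, n=32):
    bits |= 1 << idx              # mark the accessed entry as recently used
    if bits == (1 << n) - 1:      # register saturated at all 1s
        bits = 1 << idx           # clear every bit except the current one
    return bits

bits = 0
for i in range(31):
    bits = plrum_touch(bits, i)
assert bits == (1 << 31) - 1      # 31 entries touched, not yet saturated
bits = plrum_touch(bits, 31)      # 32nd access saturates the register
assert bits == 1 << 31            # only the latest position stays set
```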
The RAS 1002 is a return address predictor. It determines whether the current instruction is a pop or a push based on the attribute information from the BTB 1001, and predicts the return address of the function, which is the purpose of the RAS 1002. If the RAS 1002 hits (the requested return address is already held in the RAS 1002 and can be fetched directly, without an additional memory access), the result of the RAS 1002 is selected.
The loop predictor (LP) unit 1003 is a hardware loop processing unit. For a loop instruction, the LP unit 1003 maintains information such as a loop number register and a loop jump address; when the current pc is the last instruction of the loop body, the loop predictor gives the exact next pc address. If the loop predictor hits, the prediction result of the loop predictor is selected.
The loop number register is used to store the number of iterations of a loop; during program execution, its value is decremented as each iteration completes until it reaches zero, whereupon the loop ends.
For example, suppose the loop number register holds a default iteration count of 10: each time the hardware completes one iteration, an instruction is loaded into the LP unit 1003 at the front end and the iteration count in the loop number register is decremented by one; when the count reaches 0, the hardware knows through the instruction that the loop is over and jumps out of the loop.
The loop number register is typically used in combination with a conditional or unconditional branch structure. Before each iteration begins, the program examines the value of the loop number register and decides, based on whether it is zero, whether to continue executing the loop body; when the value of the loop number register is zero, the program jumps to the position where the loop ends, thereby terminating the loop.
The loop number register plays an important role in loop control: it simplifies the control flow of loops and improves program execution efficiency.
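The loop number register behavior described above can be sketched as follows; this is an illustrative software model, and the real counter is hardware.

```python
# Load the iteration count, decrement once per completed iteration,
# and exit the loop when the count reaches zero.
def run_loop(count, body):
    loop_cnt = count              # loop number register
    while loop_cnt != 0:          # check before each iteration
        body()
        loop_cnt -= 1             # hardware decrements on loop completion
    return loop_cnt

hits = []
assert run_loop(10, lambda: hits.append(1)) == 0
assert len(hits) == 10            # body executed exactly 10 times
```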
The instruction addresses input in each prediction process are sequentially recorded in the loop jump address register. When a newly input instruction address is identical to an address already recorded, a jump is indicated; at this time, the address recorded immediately after that same address is output as the loop jump address.
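The loop jump address register mechanism described above can be modeled directly; this is an interpretation for illustration, not the actual register file.

```python
# Record each predicted address; a repeat of a recorded address signals
# a back-edge, and the address recorded right after it is the target.
class LoopJumpRegister:
    def __init__(self):
        self.history = []

    def observe(self, pc):
        """Return (hit, loop_jump_addr) for the newly input address."""
        if pc in self.history:
            i = self.history.index(pc)
            if i + 1 < len(self.history):
                return True, self.history[i + 1]
        self.history.append(pc)
        return False, None

lp = LoopJumpRegister()
for pc in (0x10, 0x11, 0x12):
    assert lp.observe(pc) == (False, None)
assert lp.observe(0x10) == (True, 0x11)   # repeat of 0x10 -> next recorded
```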
The loop predictor may also stall the pipeline, e.g., output a stall signal (such as a loop_stall signal) to stall the PC register 1004.
If none of the BTB 1001, the RAS 1002, and the LP unit 1003 hits, the current pc plus 1 is directly used as the next instruction address. If the back-end EXU needs to flush the entire pipeline, the flush address (flush_addr) is loaded into the PC register 1004; this process can be controlled by the flush enable signal (flush_en).
That is, in fig. 10, the BTB 1001, the RAS 1002, and the LP unit 1003 each output a predicted address; together with the new address obtained from the PC address plus 1 and the flush address, the MUXes 1006 to 1009 select the correct one of these addresses to output to the PC register 1004.
Specifically, in fig. 10, the addresses predicted by the BTB 1001 and the RAS 1002 are input to the MUX 1006, which selects one of them according to whether the RAS 1002 predicts a hit (when the RAS hits, the address predicted by the RAS 1002 is output to the MUX 1007; otherwise, the address predicted by the BTB 1001 is output to the MUX 1007). The new address obtained by adding 1 to the PC address is also input to the MUX 1007, which selects between its two inputs according to whether the BTB 1001 predicts a hit (when the BTB 1001 hits, the address output by the MUX 1006 is output to the MUX 1008; otherwise, the new address obtained by adding 1 to the PC address is output to the MUX 1008). Note that since the RAS 1002 only predicts after the BTB 1001 predicts a hit, a RAS 1002 hit implies a BTB 1001 hit; in that case, the address predicted by the RAS 1002 passes through the MUX 1006 and the MUX 1007 to reach the MUX 1008. The address predicted by the LP unit 1003 is also input to the MUX 1008, which selects between the inputs from the MUX 1007 and the LP unit 1003 according to whether the LP unit 1003 predicts a hit (when the LP hits, the address predicted by the LP unit 1003 is output to the MUX 1009; otherwise, the address input by the MUX 1007 is output to the MUX 1009). In addition, the flush address is also input to the MUX 1009, which selects between the input from the MUX 1008 and the flush address according to the flush enable signal (when the flush enable signal is present, the flush address is output to the PC register 1004; otherwise, the address input by the MUX 1008 is output to the PC register 1004). At the next prediction, the address in the PC register 1004 is transferred to the BTB 1001, the RAS 1002, and the LP 1003 as the new input address, and 1 is added to it to obtain the new incremented address.
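The cascade through the MUXes 1006 to 1009 amounts to a fixed priority: flush, then the loop predictor, then the RAS, then the BTB, and finally PC + 1. A minimal sketch of that selection logic (the function name `next_pc` is illustrative; the signal names follow the description):

```python
# Priority selection performed by MUX 1006-1009 as described above.
# flush_en has the highest priority, then the LP unit, then the RAS
# (which can only hit when the BTB hits), then the BTB, then PC + 1.

def next_pc(pc, btb_hit, btb_addr, ras_hit, ras_addr,
            lp_hit, lp_addr, flush_en, flush_addr):
    mux1006 = ras_addr if ras_hit else btb_addr    # MUX 1006: RAS over BTB
    mux1007 = mux1006 if btb_hit else pc + 1       # MUX 1007: BTB over PC+1
    mux1008 = lp_addr if lp_hit else mux1007       # MUX 1008: LP over the rest
    return flush_addr if flush_en else mux1008     # MUX 1009: flush wins


# No predictor hits: fall through to PC + 1.
assert next_pc(0x40, False, 0, False, 0, False, 0, False, 0) == 0x41
```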
After obtaining the instruction address pc_f1, the instruction fetch unit can fetch instructions from the instruction cache 1005. The scheme shown in fig. 10 uses a 4-way cache, and the PC address is divided into three fields: the high-order bits are the tag, the middle bits are the index, and the low-order bits are the offset. The index is used to read the cache's tag information from the itag array, and the index and offset spliced together are used to read the instruction information from the cache.
The tag read from the itag array is compared with the tag field of the address in the register 1004. If they are identical, a cache hit is indicated, and the instruction read from the cache can be sent to the EXU; otherwise a cache miss occurs, and the missing cache instruction must be read from the bus through the bus interface (bus interface).
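The tag/index/offset split and the 4-way tag compare can be sketched as follows. The field widths are illustrative assumptions, since the text does not specify them:

```python
# Sketch of splitting the PC into tag / index / offset for the 4-way cache
# lookup described above. The field widths are assumed; the patent does
# not give concrete widths.

OFFSET_BITS = 4   # low bits: offset within a cache line (assumed width)
INDEX_BITS = 6    # middle bits: set index into the itag/icache arrays (assumed)

def split_pc(pc):
    offset = pc & ((1 << OFFSET_BITS) - 1)
    index = (pc >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = pc >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(itag, pc):
    """Compare the PC's tag against the 4 tags stored for the indexed set."""
    tag, index, _ = split_pc(pc)
    for way, stored_tag in enumerate(itag[index]):  # 4 ways per set
        if stored_tag == tag:
            return way    # cache hit: instruction can be sent to the EXU
    return None           # cache miss: fetch via the bus interface
```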
After the cache instruction is read from the bus, the next problem is which of the 4 ways to write it into; a tree-based PLRU (PLRU-t) algorithm is adopted here. Please refer to fig. 14, which shows a cache write schematic related to the present application and gives an example of the PLRU-t algorithm. Specifically, a binary tree is maintained in which each node holds 1 bit (i.e., each node corresponds to a 1-bit value): a value of 0 means the left subtree is selected, and a value of 1 means the right subtree is selected. Starting from node 0, the corresponding way is found by following the arrow direction; for example, way0 is selected on the left side of fig. 14, and way2 is selected on the right side of fig. 14. After a way hit, the nodes on the path are made to point in the opposite direction (i.e., each node's value is inverted: a 0 becomes 1, and a 1 becomes 0). The PLRU-t algorithm requires few hardware resources, and since each set in the ICACHE must instantiate one LRU unit, the replacement strategy of the ICACHE in the present application can adopt PLRU-t.
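For one 4-way set, the tree-based PLRU described above can be sketched with three 1-bit nodes. This is an illustrative model; for brevity it folds victim selection and the subsequent path inversion into a single method:

```python
# Sketch of tree-based PLRU (PLRU-t) for one 4-way set, as described above:
# three 1-bit nodes form a binary tree (0 = go left, 1 = go right); after an
# access, the node bits on the traversed path are inverted so the tree points
# away from the way just used.

class TreePLRU4:
    def __init__(self):
        self.node = [0, 0, 0]   # node0 = root, node1 = left pair, node2 = right pair

    def victim(self):
        """Follow the node bits from the root to pick the way to replace,
        then invert the bits on the path."""
        if self.node[0] == 0:                  # left subtree: ways 0/1
            way = 0 if self.node[1] == 0 else 1
            self.node[0] ^= 1                  # invert nodes on the path
            self.node[1] ^= 1
        else:                                  # right subtree: ways 2/3
            way = 2 if self.node[2] == 0 else 3
            self.node[0] ^= 1
            self.node[2] ^= 1
        return way


plru = TreePLRU4()
# all-zero nodes select way0 (left, left), as on the left of fig. 14
first = plru.victim()
```

Starting from all-zero nodes, successive replacements cycle through way0, way2, way1, way3, touching every way before reusing one; this approximation of LRU is what makes the scheme cheap per set.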
The solution shown in the above embodiments of the present application may be applied in a chip, for example in a superscalar processor chip. For instance, the scheme shown in the above embodiments may be applied to an AI chip.
The embodiment of the application also provides a chip, which comprises the instruction fetch unit shown in any one of figures 3 to 8.
The embodiment of the application also provides a computer device, which comprises a chip, wherein the chip comprises the instruction fetch unit shown in any one of figures 3 to 8.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The foregoing description of the preferred embodiments is merely exemplary in nature and is in no way intended to limit the invention; all modifications, equivalents, improvements, and the like that fall within the spirit and scope of the invention are intended to be covered.
Claims (15)
1. An instruction fetch unit, characterized in that the instruction fetch unit comprises: a branch target buffer BTB unit, a return address stack RAS unit, an address accumulation unit, a data selection unit, a program counter PC register, and an instruction reading unit;
The BTB unit, the RAS unit and the address accumulation unit are respectively connected with the data selection unit; the BTB unit is connected with the RAS unit;
the data selection unit is connected with the PC register; the PC register is also respectively connected with the BTB unit, the RAS unit and the address accumulation unit;
the PC register is connected with the instruction reading unit;
the BTB unit is used for predicting, according to an input address, a target address for the case where the next instruction after the current instruction is a branch instruction, and outputting the target address to the data selection unit; the input address is the instruction address of the current instruction;
the RAS unit is used for predicting, based on the input address and information on the instructions for which the BTB unit predicts a hit, the return address of the current function after the next instruction performs a jump, and outputting the return address to the data selection unit;
the address accumulation unit is used for adding 1 to the address of the current instruction to obtain an accumulated address of the next instruction, and outputting the accumulated address to the data selection unit;
The data selecting unit is used for selecting one address from a plurality of input addresses as an instruction address of a next instruction and outputting the instruction address of the next instruction to the PC register;
the PC register is used for caching the instruction address input by the data selection unit, outputting the cached instruction address to the instruction reading unit, and outputting the cached instruction address to the BTB unit, the RAS unit and the address accumulation unit as a new input address when next instruction prediction is performed;
the instruction reading unit is used for reading an instruction based on the instruction address output by the PC register and sending the read instruction to an execution unit connected with the instruction fetch unit.
2. The instruction fetch unit according to claim 1, wherein the data selection unit comprises: a first data selecting unit and a second data selecting unit;
the BTB unit is connected with one input end of the first data selection unit, and the RAS unit is connected with the other input end of the first data selection unit;
the output end of the first data selection unit is connected with one input end of the second data selection unit, and the address accumulation unit is connected with the other input end of the second data selection unit;
The first data selecting unit is used for selecting and outputting the address input by the RAS unit when the RAS unit predicts hit, and selecting and outputting the address input by the BTB unit when the RAS unit predicts miss;
the second data selecting unit is configured to select and output the address input by the first data selecting unit when the BTB unit predicts a hit, and select and output the address input by the address accumulating unit when the BTB unit predicts a miss.
3. The instruction fetch unit according to claim 1 or 2, wherein the BTB unit comprises a first prediction record register;
the first prediction record register is used for caching the prediction record of each branch instruction.
4. The instruction fetch unit according to claim 1 or 2, wherein the RAS unit comprises a second prediction record register;
and the second prediction record register is used for caching the prediction record of each jump-occurring branch instruction.
5. The instruction fetch unit according to claim 2, further comprising a loop prediction unit, the loop prediction unit being coupled to the data selection unit;
The PC register is also connected with the loop prediction unit;
the PC register is also used for outputting the cached instruction address to the loop prediction unit as a new input address when the next instruction is predicted;
the loop prediction unit is configured to predict, according to the input address, a loop jump address corresponding to a next instruction of the current instruction when the current instruction is a last instruction in a loop instruction body, and output the loop jump address to the data selection unit.
6. The instruction fetch unit according to claim 5, wherein the loop prediction unit comprises a loop jump address register;
the loop prediction unit is configured to determine a prediction hit in a case where a record of the loop jump address corresponding to the input instruction exists in the loop jump address register, and output the loop jump address to the data selection unit.
7. The instruction fetch unit according to claim 5, wherein the data selection unit further comprises a third data selection unit;
the output end of the second data selection unit is connected with one input end of the third data selection unit, and the loop prediction unit is connected with the other input end of the third data selection unit;
The third data selecting unit is used for selecting and outputting the address input by the loop prediction unit when the loop prediction unit predicts a hit, and selecting and outputting the address input by the second data selecting unit when the loop prediction unit predicts a miss.
8. The instruction fetch unit according to any one of claims 5 to 7, wherein the data selection unit is further connected to the execution unit;
the data selecting unit is further configured to output a flushing address to the PC register when the executing unit flushes the instruction fetch unit.
9. The instruction fetch unit according to claim 8, wherein the data selection unit further comprises a fourth data selection unit;
the output end of the third data selection unit is connected with one input end of the fourth data selection unit, and the other input end of the fourth data selection unit is connected with the flushing address output end of the execution unit; the flushing address output end is used for outputting the flushing address;
the output end of the fourth data selection unit is connected with the PC register;
the fourth data selecting unit is used for selecting and outputting the flushing address when receiving the flushing enabling signal sent by the executing unit, and selecting and outputting the address input by the third data selecting unit when not receiving the flushing enabling signal sent by the executing unit.
10. An instruction fetch method, the method performed by the instruction fetch unit of claim 1, the method comprising:
predicting, by the BTB unit, a target address in a case where a next instruction of a current instruction is a branch instruction according to an input address, and outputting the target address to the data selection unit; the input address is an instruction address of the current instruction;
predicting, by the RAS unit, the return address of the current function after the next instruction performs a jump, based on the input address and information on the instructions for which the BTB unit predicts a hit, and outputting the return address to the data selection unit;
adding 1 to the address of the current instruction through the address accumulation unit to obtain an accumulated address of the next instruction, and outputting the accumulated address to the data selection unit;
selecting, by the data selecting unit, an address from the plurality of addresses inputted as an instruction address of a next instruction, and outputting the instruction address of the next instruction to the PC register;
caching the instruction address input by the data selection unit through the PC register, outputting the cached instruction address to the instruction reading unit, and outputting the cached instruction address to the BTB unit and the RAS unit as a new input address when next instruction prediction is performed;
And reading an instruction based on the instruction address output by the PC register through the instruction reading unit, and sending the read instruction to an execution unit connected with the instruction fetch unit.
11. The method of claim 10, wherein the data selection unit comprises: a first data selecting unit and a second data selecting unit;
the BTB unit is connected with one input end of the first data selection unit, and the RAS unit is connected with the other input end of the first data selection unit;
the output end of the first data selection unit is connected with one input end of the second data selection unit, and the address accumulation unit is connected with the other input end of the second data selection unit;
the selecting, by the data selecting unit, one address from the plurality of addresses inputted as an instruction address of a next instruction, and outputting the instruction address of the next instruction to the PC register, includes:
selecting, by the first data selecting unit, an address input by the RAS unit when the RAS unit predicts a hit, and selecting, when the RAS unit predicts a miss, an address input by the BTB unit;
And selecting and outputting, by the second data selecting unit, the address input by the first data selecting unit when the BTB unit predicts a hit, and selecting and outputting the address input by the address accumulation unit when the BTB unit predicts a miss.
12. The method of claim 11, wherein the finger fetch unit further comprises a loop prediction unit, the loop prediction unit coupled to the data selection unit;
the PC register is also connected with the loop prediction unit;
the method further comprises the steps of:
under the condition that the current instruction is the last instruction in a loop instruction body, predicting, by the loop prediction unit according to the input address, a loop jump address corresponding to the next instruction after the current instruction, and outputting the loop jump address to the data selection unit;
and outputting the cached instruction address to the loop prediction unit as a new input address through the PC register when the next instruction is predicted.
13. The method of claim 12, wherein the loop prediction unit includes a loop jump address register;
The predicting, by the loop prediction unit according to the input address, the loop jump address corresponding to the next instruction after the current instruction when the current instruction is the last instruction in the loop instruction body, and outputting the loop jump address to the data selection unit, comprises:
and determining, by the loop prediction unit, a predicted hit in a case where there is a record of the loop jump address corresponding to the input instruction in the loop jump address register, and outputting the loop jump address to the data selection unit.
14. The method of claim 12, wherein the data selection unit further comprises a third data selection unit;
the output end of the second data selection unit is connected with one input end of the third data selection unit, and the loop prediction unit is connected with the other input end of the third data selection unit;
the selecting, by the data selecting unit, one address from the plurality of addresses input as an instruction address of a next instruction, and outputting the instruction address of the next instruction to the PC register, further includes:
And selecting and outputting, by the third data selection unit, the address input by the loop prediction unit when the loop prediction unit predicts a hit, and selecting and outputting the address input by the second data selection unit when the loop prediction unit predicts a miss.
15. A chip comprising the instruction fetch unit according to any one of claims 1 to 9.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311289788.8A CN117311814A (en) | 2023-09-28 | 2023-09-28 | Instruction fetch unit, instruction reading method and chip |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311289788.8A CN117311814A (en) | 2023-09-28 | 2023-09-28 | Instruction fetch unit, instruction reading method and chip |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117311814A true CN117311814A (en) | 2023-12-29 |
Family
ID=89259917
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311289788.8A Pending CN117311814A (en) | 2023-09-28 | 2023-09-28 | Instruction fetch unit, instruction reading method and chip |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117311814A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118245187A (en) * | 2024-03-29 | 2024-06-25 | 海光信息技术股份有限公司 | Thread scheduling method and device, electronic device and storage medium |
| CN120743810A (en) * | 2025-08-29 | 2025-10-03 | 苏州元脑智能科技有限公司 | Data caching system and method and electronic equipment |
Application Events
- 2023-09-28: Application CN202311289788.8A filed; patent CN117311814A status: active, pending
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP4027620B2 (en) | Branch prediction apparatus, processor, and branch prediction method | |
| CN110069285B (en) | Method for detecting branch prediction and processor | |
| US8943300B2 (en) | Method and apparatus for generating return address predictions for implicit and explicit subroutine calls using predecode information | |
| KR100974384B1 (en) | Method and apparatus for predicting branch instructions | |
| US7516312B2 (en) | Presbyopic branch target prefetch method and apparatus | |
| JP5579694B2 (en) | Method and apparatus for managing a return stack | |
| JP2009536770A (en) | Branch address cache based on block | |
| CN114116016B (en) | Processor-based instruction prefetching method and apparatus | |
| US7017030B2 (en) | Prediction of instructions in a data processing apparatus | |
| CN117311814A (en) | Instruction fetch unit, instruction reading method and chip | |
| EP3037956A1 (en) | Processor system and method using variable length instruction word | |
| CN114528025A (en) | Instruction processing method and device, microcontroller and readable storage medium | |
| CN118277292A (en) | Data pre-fetching method and data pre-fetching device | |
| US6851033B2 (en) | Memory access prediction in a data processing apparatus | |
| MX2009001747A (en) | Methods and apparatus for reducing lookups in a branch target address cache | |
| US20040225866A1 (en) | Branch prediction in a data processing system | |
| US20030204705A1 (en) | Prediction of branch instructions in a data processing apparatus | |
| US6754813B1 (en) | Apparatus and method of processing information for suppression of branch prediction | |
| US20080172547A1 (en) | Reusing a buffer memory as a microcache for program instructions of a detected program loop | |
| US7346737B2 (en) | Cache system having branch target address cache | |
| CN116149733A (en) | Instruction branch prediction system, method, device, computer equipment and storage medium | |
| CN114116010B (en) | Architecture optimization method and device for processor cycle body | |
| CN115328552B (en) | A Low-Cost and High-Efficiency Branch Predictor Implementation Method | |
| US6360310B1 (en) | Apparatus and method for instruction cache access | |
| CN114217860B (en) | Branch prediction device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |