
HK1112086A - Branch target address cache storing two or more branch target addresses per index


Info

Publication number
HK1112086A
Authority
HK
Hong Kong
Prior art keywords
branch
instruction
address
branch target
cache
Prior art date
Application number
HK08107076.4A
Other languages
Chinese (zh)
Inventor
Rodney Wayne Smith
James Norris Dieffenderfer
Jeffrey Todd Bridges
Thomas Andrew Sartorius
Original Assignee
Qualcomm Incorporated
Application filed by Qualcomm Incorporated
Publication of HK1112086A


Abstract

A Branch Target Address Cache (BTAC) stores at least two branch target addresses in each cache line. The BTAC is indexed by a truncated branch instruction address. An offset obtained from a branch prediction offset table determines which of the branch target addresses is taken as the predicted branch target address. The offset table may be indexed in several ways, including by a branch history, by a hash of a branch history and part of the branch instruction address, by a gshare value, randomly, in a round-robin order, or other methods.

Description

Branch target address cache storing two or more branch target addresses per index
Technical Field
The present invention relates generally to the field of processors, and more specifically to a branch target address cache that stores two or more branch target addresses per index.
Background
Microprocessors perform computational tasks in a wide variety of applications. Improved processor performance is a perennial design goal, driving product improvements through faster operation and/or increased functionality. In many embedded applications, such as portable electronic devices, conserving power and reducing chip size are also important goals in processor design and implementation.
Many modern processors use a pipelined architecture, in which sequential instructions (each having multiple execution steps) overlap in execution. This exploitation of parallelism within a sequential instruction stream can significantly improve processor performance. Under ideal conditions, a processor may complete one instruction per execution cycle.
These ideal conditions are never achieved in practice, due to a variety of factors including data dependencies among instructions (data hazards), control dependencies such as branches (control hazards), processor resource allocation conflicts (structural hazards), interrupts, cache misses, and the like. Accordingly, a common goal of processor design is to avoid these hazards and keep the pipeline "full".
Real-world programs typically include conditional branch instructions, whose actual branching behavior may not be known until the instruction is evaluated deep in the pipeline. This uncertainty creates a control hazard that stalls the pipeline: the processor does not know which instructions to fetch after the branch instruction, and will not know until the conditional branch is evaluated. Most modern processors therefore employ some form of branch prediction, whereby the branching behavior of a conditional branch instruction is predicted early in the pipeline, and the processor speculatively fetches and executes instructions based on that prediction, keeping the pipeline full. If the prediction is correct, performance is maximized and power consumption minimized. When the branch instruction is actually evaluated, if the branch was mispredicted, the speculatively fetched instructions must be flushed from the pipeline and new instructions fetched from the correct branch target address. Mispredicted branches degrade both processor performance and power consumption.
Conditional branch prediction has two components: a condition evaluation and a branch target address. The condition prediction is a binary decision: the branch is taken, causing execution to jump to a different code sequence, or the branch is not taken, in which case the processor executes the next sequential instruction after the branch instruction. If the branch evaluates taken, the branch target address (BTA) is the address of the next instruction to be executed. Some branch instructions include the branch target address, or an offset from which it can easily be calculated, in the instruction op-code. For other branch instructions, the branch target address must be predicted (if the condition evaluation is predicted taken).
One known branch target address prediction technique is the Branch Target Address Cache (BTAC). Typically, a BTAC is a fully associative cache indexed by a Branch Instruction Address (BIA), with each data location (or cache "line") holding a single Branch Target Address (BTA). When a branch instruction evaluates taken in the pipeline and its actual BTA is calculated, the BIA and BTA are written to the BTAC (e.g., during the writeback pipeline stage). The BTAC is accessed in parallel with the instruction cache (or I-cache) when a new instruction is fetched. If the instruction address hits in the BTAC, the processor knows that the instruction is a branch instruction (before the instruction has even been fetched from the I-cache and decoded) and obtains a predicted BTA, which is the actual BTA of a previous execution of the branch instruction. If the branch prediction circuit predicts the branch taken, instruction fetching begins at the predicted BTA; if the branch is predicted not taken, instruction fetching continues sequentially. Note that in the art, the term BTAC is also used to denote a cache that associates a saturation counter with a BIA, thereby providing only a condition evaluation prediction (i.e., branch taken or branch not taken).
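The conventional single-entry BTAC behavior described above can be sketched in software as a simple address-to-target map. This is an illustrative model only (all names are assumptions, and real BTACs are associative hardware structures with tags and limited capacity), but it captures the write-on-taken and lookup-at-fetch protocol:

```python
# Illustrative model of a conventional single-BTA-per-entry BTAC.
# A dict stands in for the associative hardware array.

class SimpleBTAC:
    def __init__(self):
        self.entries = {}  # branch instruction address (BIA) -> branch target address (BTA)

    def update(self, bia, bta):
        """Called at writeback, when a branch evaluates taken and its actual BTA is known."""
        self.entries[bia] = bta

    def lookup(self, fetch_addr):
        """Accessed in parallel with the I-cache; a hit means the instruction
        at fetch_addr was previously a taken branch."""
        return self.entries.get(fetch_addr)  # None models a BTAC miss

btac = SimpleBTAC()
btac.update(0x1000, 0x2000)           # branch at 0x1000 previously took to 0x2000
assert btac.lookup(0x1000) == 0x2000  # hit: 0x2000 becomes the predicted BTA
assert btac.lookup(0x1004) is None    # miss: fetch continues sequentially
```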
A high performance processor may fetch more than one instruction at a time from the I-cache. For example, an entire cache line (which may include, for example, four instructions) may be fetched into an instruction fetch buffer, which feeds these instructions into the pipeline in sequence. Using a BTAC for branch prediction for all four instructions may require four read ports on the BTAC. This would require large and complex hardware and would dramatically increase power consumption.
Disclosure of Invention
A Branch Target Address Cache (BTAC) stores at least two branch target addresses in each cache line. The BTAC is indexed by a truncated branch instruction address. An offset obtained from a branch prediction offset table determines which of the branch target addresses is taken as the predicted branch target address. The offset table may be indexed in several ways, including by a branch history, by a hash of a branch history and part of the branch instruction address, by a gshare value, randomly, in round-robin order, or by other methods.
One embodiment relates to a method of predicting a branch target address of a branch instruction. At least a portion of the instruction address is stored. At least two branch target addresses are associated with the stored instruction address. When a branch instruction is fetched, one of the branch target addresses is selected as the predicted target address of the branch instruction.
Another embodiment relates to a method of predicting a branch target address. Starting with the first instruction address, a block of n sequential instructions is fetched. The branch target address of each branch instruction in the block that evaluates taken is stored in a cache such that up to n branch target addresses are indexed by a portion of the first instruction address.
Another embodiment relates to a processor. The processor includes a branch target address cache indexed by a portion of the instruction address and operable to store two or more branch target addresses per cache line. The processor further includes a branch prediction offset table operable to store a plurality of offsets. The processor additionally includes an instruction execution pipeline that indexes the cache by an instruction address and selects a branch target address from the indicated cache line in response to an offset obtained from the offset table.
Drawings
FIG. 1 is a functional block diagram of a processor.
FIG. 2 is a functional block diagram of a branch target address cache and accompanying circuitry.
Detailed Description
FIG. 1 depicts a functional block diagram of a processor 10. The processor 10 executes instructions in an instruction execution pipeline 12 in accordance with control logic 14. In some embodiments, the pipeline 12 may be a superscalar design with multiple parallel pipelines. The pipeline 12 includes various registers or latches 16, organized in pipe stages, and one or more Arithmetic Logic Units (ALUs) 18. A General Purpose Register (GPR) file 20 provides registers that form the top level of the memory hierarchy.
The pipeline 12 fetches instructions from an instruction cache (I-cache) 22, with memory address translation and permissions managed by an instruction-side translation lookaside buffer (ITLB) 24. In parallel, the pipeline 12 provides instruction addresses to a Branch Target Address Cache (BTAC) 25. If the instruction address hits in the BTAC 25, the BTAC 25 may provide a branch target address to the I-cache 22 to immediately begin fetching instructions from the predicted branch target address. As described more fully below, which of a plurality of possible predicted branch target addresses is provided by the BTAC 25 is determined by an offset from a Branch Prediction Offset Table (BPOT) 23. In one or more embodiments, the index into the BPOT 23 may be generated by a logic circuit 21, whose inputs may include the branch history, the branch instruction address, and other control inputs. The branch history may be provided by a Branch History Register (BHR) 26 that stores the condition evaluation results (e.g., taken or not taken) of a plurality of branch instructions.
Data is accessed from a data cache (D-cache) 26, with memory address translation and permissions managed by a main Translation Lookaside Buffer (TLB) 28. In various embodiments, the ITLB may comprise a copy of part of the TLB. Alternatively, the ITLB and TLB may be integrated. Similarly, in various embodiments of the processor 10, the I-cache 22 and D-cache 26 may be integrated or unified together. Misses in the I-cache 22 and/or the D-cache 26 cause an access to main (off-chip) memory 32, under the control of a memory interface 30.
The processor 10 may include an input/output (I/O) interface 34 that controls access to various external devices 36. Those skilled in the art will recognize that many variations may be made to processor 10. For example, the processor 10 may include a second level (L2) cache for either or both of the I or D caches 22, 26. In addition, one or more of the functional blocks depicted in the processor 10 may be omitted from a particular embodiment.
Conditional branch instructions are common in most code; by some estimates, as many as one in five instructions is a branch. However, branch instructions tend not to be evenly distributed. Rather, they are often clustered together in logical constructs such as if-then-else decision paths, parallel ("case") branches, and the like. For example, the following code fragment compares the contents of two registers and branches to target P or Q based on the result of the comparison:
CMP r7, r8 ; compare GPR7 and GPR8 and set condition codes or flags to reflect the result
BEQ P ; branch to code label P if equal
BNE Q ; branch to code label Q if not equal
Since high-performance processors 10 typically fetch multiple instructions from the I-cache 22 at once, and since branch instructions tend to cluster in code, if a given instruction fetch includes a branch instruction, there is a high likelihood that it also includes additional branch instructions. According to one or more embodiments, multiple Branch Target Addresses (BTAs) are stored in a Branch Target Address Cache (BTAC) 25, associated with a single instruction address. Upon an instruction fetch that hits in the BTAC 25, one of the BTAs is selected by an offset provided by a Branch Prediction Offset Table (BPOT) 23, which may be indexed in various ways.
FIG. 2 depicts a functional block diagram of the BTAC 25 and BPOT 23, according to various embodiments. Each entry in the BTAC 25 includes an index or instruction address field 40. Each entry also includes a cache line 42 comprising two or more BTA fields (FIG. 2 depicts four fields, denoted BTA0-BTA3). When an instruction address fetched from the I-cache 22 hits in the BTAC 25, the offset selects one of the multiple BTA fields of the cache line 42, depicted functionally in FIG. 2 as a multiplexer 44. Note that in various implementations, the select function may be an integral part of the BTAC 25 or, as shown by the multiplexer 44, located external to the BTAC 25. The BPOT 23 provides the offset. As described more fully below, the BPOT 23 may store an indicator of which BTA field in the cache line 42 contains the BTA that was last taken under a particular set of circumstances.
In particular, the state of the BTAC 25 shown in FIG. 2 may be obtained by different iterations of the following exemplary code (where A-C are truncated instruction addresses and T-Z are branch target addresses):
BEQ Z
ADD r1,r3,r4
A:
BNE Y
ADD r6,r3,r7
BEQ X
BNE W
B:
BGE V
B U
CMP r12,r4
BNE T
C:
ADD r3,r8,r9
AND r2,r3,r6
The code is logically divided into blocks of n instructions (n = 4 in the example shown) by truncating one or more LSBs from the instruction address. If any branch instruction in a block evaluates taken, a BTAC 25 entry is written, storing the truncated instruction address in the index field 40 and the BTA of the taken branch instruction in the corresponding BTA field of the cache line 42. For example, referring to FIG. 2, at different times, the block of four instructions with truncated address A is executed. Each branch is evaluated taken at least once, and the respective actual BTA is written to the cache line 42, using the LSBs of the instruction address to select the BTAn field (here, BTA0 and BTA2). Since the instructions corresponding to fields BTA1 and BTA3 are not branch instructions, no data is stored in those fields of the cache line 42 (e.g., the "valid" bit associated with those fields may be 0). As each BTA is written to the BTAC 25 (e.g., at the writeback pipeline stage of the corresponding branch instruction that evaluates taken), the BPOT 23 is updated to store an offset to the relevant BTA field of the cache line 42. In this example, a value of 0 is stored when the BEQ Z branch is executed, and a value of 2 is stored when the BNE Y branch is executed. As described more fully below, these offset values may be stored in the BPOT 23 at locations determined by the processor state at the time.
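The write path just described can be sketched as follows. This is an illustrative software model, not the patent's hardware: it assumes fixed 4-byte instructions and uses a dict keyed by the truncated address, with each value modeling a cache line 42 of n BTA slots (None standing in for a cleared valid bit). The low address bits select the slot; the remaining bits form the truncated index:

```python
# Illustrative model of the multi-BTA write path: the low log2(n) bits of the
# instruction word address select a BTA field; the rest form the truncated index.

N = 4            # instructions per fetch block (n = 4 in the example)
INSTR_BYTES = 4  # assumed fixed instruction width

btac = {}        # truncated instruction address -> list of n BTA slots (None = invalid)

def write_taken_branch(instr_addr, actual_bta):
    word = instr_addr // INSTR_BYTES
    index = word // N   # truncated instruction address (index field 40)
    slot = word % N     # LSBs select BTA0..BTA3 within cache line 42
    line = btac.setdefault(index, [None] * N)
    line[slot] = actual_bta
    return index, slot  # the slot number is the offset recorded in the BPOT

# Block A from the example: branches at slots 0 and 2 evaluate taken.
write_taken_branch(0x100, 0x5000)  # "BEQ Z" analogue -> slot 0
write_taken_branch(0x108, 0x6000)  # "BNE Y" analogue -> slot 2
assert btac[16] == [0x5000, None, 0x6000, None]
```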
Similarly, the block of four instructions sharing truncated instruction address B is also executed multiple times; in this case, each instruction in the block is a branch instruction. Each branch is evaluated taken at least once, and its actual BTA is most recently written to the corresponding BTA field of the cache line 42 indexed by the truncated address B. All four BTA fields of the cache line 42 are valid, each storing one BTA. Entries in the BPOT 23 are updated accordingly to point to the relevant BTA field of the BTAC 25. As another example, FIG. 2 depicts the truncated address C and the BTA T stored in the BTAC 25, corresponding to the BNE T instruction in block C of the example code. Note that a block of n instructions need not begin with a branch instruction.
As these examples indicate, from 1 to n BTAs may be stored in the BTAC 25, indexed by a single truncated instruction address. On a subsequent instruction fetch that hits in the BTAC 25, one of the up to n BTAs must be selected as the predicted BTA. According to various embodiments, the BPOT 23 maintains an offset table that selects one of the up to n BTAs from the indicated cache line 42. The offset is written to the BPOT 23 when the BTA is written to the BTAC 25. The location in the BPOT 23 at which the offset is written may depend on the current and/or recent past state or conditions of the processor at the time the offset is written, as determined by the logic circuit 21 and its inputs. The logic circuit 21 and its inputs may take several forms.
In one embodiment, the processor maintains a Branch History Register (BHR) 26. A simple form of BHR 26 may comprise a shift register. The BHR 26 stores the condition evaluation of each conditional branch instruction as it is evaluated in the pipeline 12; that is, the BHR 26 records whether each branch instruction was taken (T) or not taken (N). The bit width of the BHR 26 determines the depth of the branch evaluation history that is maintained.
According to one embodiment, the BPOT 23 is directly indexed by at least a portion of the BHR 26 to select the offset; that is, in this example, the only input to the logic circuit 21 is the BHR 26, and the logic circuit 21 is simply a "pass-through" circuit. For example, when the branch instruction BEQ in block A evaluates actually taken and produces the actual BTA of Z, the BHR 26 contains the value NNN (in at least the LSB bit positions); i.e., the previous three conditional branches all evaluated "not taken". In that case, a 0, the offset of field BTA0 of the cache line 42 indexed by the truncated instruction address A, is written to the corresponding location in the BPOT 23 (the uppermost location in the example shown in FIG. 2). Similarly, when the branch instruction BNE is executed, the BHR 26 contains the value NNT, and a 2 is written to the second location of the BPOT 23 (corresponding to the BTA Y written to the BTA2 field of the cache line 42 indexed by the truncated instruction address A).
When the BEQ instruction in block A is subsequently fetched, it will hit in the BTAC 25. If the BHR 26 at that point contains NNN, the BPOT 23 will provide an offset of 0, and the contents of the BTA0 field of the cache line 42 (the BTA Z) are provided as the predicted BTA. Alternatively, if the BHR 26 at the time of the fetch contains NNT, the BPOT 23 will provide an offset of 2, and the contents of the BTA2 field (the BTA Y) will be the predicted BTA. The latter case is an example of aliasing, where the wrong BTA is predicted for a branch instruction because the recent branch history at fetch time happens to match the branch history that existed when the BTA of a different branch instruction was written.
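The pass-through embodiment above (write the slot offset at branch writeback, read it back at fetch) can be sketched as a small table indexed directly by the low bits of the branch history. The model below is illustrative (a 3-bit history with N encoded as 0 and T as 1, defaulting unused entries to 0 is an assumption, not the patent's encoding):

```python
# Illustrative model of a BHR-indexed Branch Prediction Offset Table (BPOT).
# History bits alone select which BTA slot of the hit cache line to predict.

BHR_BITS = 3
MASK = (1 << BHR_BITS) - 1

class BPOT:
    def __init__(self):
        self.table = [0] * (1 << BHR_BITS)  # one offset per history pattern

    def update(self, bhr, slot):
        """At writeback of a taken branch, record which BTA field was written."""
        self.table[bhr & MASK] = slot

    def predict(self, bhr):
        """At fetch, the history selects the offset into the cache line."""
        return self.table[bhr & MASK]

bpot = BPOT()
bpot.update(0b000, 0)  # history NNN when BEQ wrote BTA0 (Z)
bpot.update(0b001, 2)  # history NNT when BNE wrote BTA2 (Y)
assert bpot.predict(0b000) == 0  # fetch with history NNN -> predict BTA0
assert bpot.predict(0b001) == 2  # fetch with history NNT -> predict BTA2
```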
In another embodiment, the logic circuit 21 may include a hash function that combines at least a portion of the BHR 26 output with at least a portion of the instruction address, to prevent or reduce aliasing. In one embodiment, instruction address bits may be concatenated with the BHR 26 output, yielding a BPOT 23 index similar to that of the gselect predictor known in the art (in connection with branch condition evaluation prediction); this increases the size of the BPOT 23. In another embodiment, the instruction address bits and the BHR 26 output may be XORed together, yielding a gshare-style BPOT 23 index.
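The two hashed index formations can be sketched as follows. The bit widths chosen (4 history bits, 4 address bits) are illustrative assumptions; the point is that concatenation widens the index (and thus the table), while XOR keeps the index the same width:

```python
# Illustrative gselect-style (concatenate) and gshare-style (XOR) index formation
# for the BPOT, combining branch history bits with instruction address bits.

HIST_BITS = 4
ADDR_BITS = 4
HIST_MASK = (1 << HIST_BITS) - 1
ADDR_MASK = (1 << ADDR_BITS) - 1

def gselect_index(bhr, instr_addr):
    # Concatenate low address bits with low history bits: an 8-bit index,
    # so the BPOT grows, but different branches rarely collide.
    return ((instr_addr & ADDR_MASK) << HIST_BITS) | (bhr & HIST_MASK)

def gshare_index(bhr, instr_addr):
    # XOR address bits into the history: same index width, table size unchanged,
    # aliasing reduced by spreading branches across entries.
    return (instr_addr ^ bhr) & HIST_MASK

assert gselect_index(0b1010, 0b1100) == 0b11001010
assert gshare_index(0b1010, 0b1100) == 0b0110
```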
In one or more embodiments, one or more inputs of the logic circuit 21 may be unrelated to branch history or instruction addresses. For example, the BPOT 23 may be indexed incrementally, yielding a round-robin index. Alternatively, the index may be random. One or more inputs generated, for example, by the pipeline control logic 14 may be combined with one or more of the index generation techniques described above.
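The two history-independent indexings mentioned above are simple to sketch; this model is illustrative (the counter width and wrap behavior are assumptions):

```python
# Illustrative history-independent BPOT indexing: an incrementing counter
# that wraps (round-robin), and a random index.

import random

class RoundRobinIndexer:
    def __init__(self, size):
        self.size = size
        self.counter = 0

    def next_index(self):
        idx = self.counter
        self.counter = (self.counter + 1) % self.size  # wrap: round-robin order
        return idx

def random_index(size):
    return random.randrange(size)  # random BPOT index variant

rr = RoundRobinIndexer(4)
assert [rr.next_index() for _ in range(6)] == [0, 1, 2, 3, 0, 1]
assert 0 <= random_index(4) < 4
```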
According to one or more embodiments described herein, by matching the number of BTAn fields in a cache line 42 of the BTAC 25 to the number of instructions in a cache line of the I-cache 22, BTAC 25 accesses may be kept synchronized with instruction fetches from the I-cache 22. To select one of up to n possible BTAs as the predicted BTA, the processor state (e.g., recent branch history) at fetch time is matched against the state that existed when the BTA was written to the BTAC 25. The various embodiments of indexing the BPOT 23 to generate an offset for BTA selection provide a rich set of tools that may be suited to a particular architecture or application.
Although the present invention has been described herein with reference to particular features, aspects and embodiments thereof, it will be apparent that numerous variations, modifications, and other embodiments are possible within the broad scope of the present invention, and accordingly, all variations, modifications and embodiments are to be regarded as being within the scope of the invention. The present embodiments are therefore to be construed in all aspects as illustrative and not restrictive and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.

Claims (19)

1. A method of predicting a branch target address of a branch instruction, comprising:
storing at least a portion of the instruction address;
associating at least two branch target addresses with the stored instruction address; and
selecting, when a branch instruction is fetched, one of the branch target addresses as the predicted target address of the branch instruction.
2. The method of claim 1, wherein storing at least a portion of an instruction address comprises writing at least a portion of the instruction address as an index into a cache.
3. The method of claim 2, wherein associating at least two branch target addresses with the instruction address comprises: upon execution of each of the at least two branch instructions, a branch target address of the respective branch instruction is written as data in a cache line indexed by the index.
4. The method of claim 1, further comprising accessing a branch prediction offset table to obtain an offset, and wherein selecting one of the branch target addresses as the predicted target address comprises selecting a branch target address corresponding to the offset.
5. The method of claim 4, wherein accessing a branch prediction offset table comprises indexing the branch prediction offset table by branch history.
6. The method of claim 4, wherein accessing a branch prediction offset table comprises indexing the branch prediction offset table by a hash function of a branch history and the instruction address.
7. The method of claim 4, wherein accessing a branch prediction offset table comprises randomly indexing the branch prediction offset table.
8. The method of claim 4, wherein accessing a branch prediction offset table comprises incrementally indexing the branch prediction offset table to generate a round-robin selection.
9. The method of claim 4, further comprising: writing an offset to the branch prediction offset table when a branch instruction evaluates taken, the offset indicating which of the at least two branch target addresses is associated with the taken branch instruction.
10. The method of claim 1, wherein storing at least a portion of the instruction address comprises: truncating the instruction address by at least one bit such that the truncated instruction address references a block of n instructions.
11. A method of predicting a branch target address, comprising:
fetching a block of n sequential instructions referenced by the truncated instruction address; and
storing the branch target address of each branch instruction in the block that evaluates taken in a cache, such that up to n branch target addresses are indexed by the truncated instruction address.
12. The method of claim 11, further comprising: upon subsequent fetching of one of the branch instructions in the block, a branch target address is selected from the cache.
13. The method of claim 12, wherein selecting a branch target address from the cache comprises:
obtaining an offset from an offset table;
indexing the cache with the truncated instruction address; and
selecting one of up to n branch target addresses according to the offset.
14. The method of claim 13, wherein obtaining an offset from an offset table comprises indexing the offset table with a branch history.
15. A processor, comprising:
a branch target address cache indexed by a truncated instruction address and operative to store two or more branch target addresses per cache line;
a branch prediction offset table operative to store a plurality of offsets; and
an instruction execution pipeline operative to index the cache with a truncated instruction address and to select a branch target address from the indexed cache line in response to an offset obtained from the offset table.
16. The processor of claim 15 further comprising an instruction cache having an instruction fetch bandwidth of n instructions, and wherein the truncated instruction address addresses a block of n instructions.
17. The processor of claim 16 wherein the branch target address cache is operative to store up to n branch target addresses per cache line.
18. The processor of claim 15 further comprising a branch history register operative to store an indication of state evaluation for a plurality of state branch instructions, the contents of the branch history register indexing the branch prediction offset table to obtain the offset to select a branch target address from the indexed cache line.
19. The processor of claim 18, wherein the contents of the branch history register are combined with the truncated instruction address prior to indexing the branch prediction offset table.
HK08107076.4A 2005-03-23 2006-03-23 Branch target address cache storing two or more branch target addresses per index HK1112086A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/089,072 2005-03-23

Publications (1)

Publication Number Publication Date
HK1112086A 2008-08-22

