WO2000008551A1

WO2000008551A1 - Software directed target address cache and target address register

Info

Publication number: WO2000008551A1
Application number: PCT/US1999/017135
Authority: WO
Inventors: Tse-Yu Yeh; Harshvardhan Sharangpani; Judge K. Arora
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 1998-08-06
Filing date: 1999-07-28
Publication date: 2000-02-17
Anticipated expiration: 2001-02-06
Also published as: TW455814B; AU5239599A

Abstract

A branch prediction system is provided that includes a first, low-latency storage structure (230), a second, higher-latency storage structure (240), and a branch prediction manager (BPM) (260) for updating the first and second storage structures according to software provided hint information (524). For one embodiment, the BPM (260) identifies branch hint information from branch related instructions and writes the identified branch hint information according to the state of an importance bit (524) in the instruction. When the importance bit (524) is in a first state, branch prediction information is stored in the first storage structure (230). When the hint information (524) is in a second state, the branch prediction information is stored in a second structure (240).

Description

SOFTWAREDIRECTEDTARGETADDRESS CACHE

ANDTARGETADDRESS REGISTER

Background of the Invention

Field of the Invention The present invention relates to the field of branch prediction, and in particular, to systems and methods for accessing prediction information related to branch instructions.

Background Art Advanced processors employ pipelining techniques to execute instructions at very high speeds. On such processors, the overall machine is organized as a pipeline consisting of several cascaded stages of hardware. Instruction processing is divided into a sequence of operations, and each operation is performed by hardware in a corresponding pipeline stage ("pipe stage"). Independent operations from several instructions may be processed simultaneously by different pipe stages, increasing the instruction throughput of the processor. Where a pipelined processor includes multiple execution resources in each pipe stage, the throughput of the processor can exceed one instruction per clock cycle. To make full use of this instruction execution capability, the execution resources of the processor must be provided with sufficient instructions from the correct execution path.

Branch instructions pose major challenges to keeping the pipeline filled with instructions from the correct execution path. When a branch instruction is executed and the branch condition met, control flow of the processor jumps to a new code sequence, and instructions from the new code sequence are transferred to the pipeline. Branch execution typically occurs at the back end of the pipeline, while instructions are fetched at the front end of the pipeline. If instruction fetching relies on branch execution to cletermine the correct execution path, the processor pipeline may fill with instructions from the wrong execution path before the branch condition is resolved. These instructions would then have to be flushed from the pipeline, leaving resources m the effected pipe stages idle while instructions from the correct execution path are fetched. The idle pipe stages are referred to as pipeline bubbles, since they provide no useful output until they are filled by instructions from the correct execution path. Modem processors incorporate branch prediction modules at the front ends of their pipelines to reduce the number of pipeline bubbles. When a branch instruction enters the front end of the pipeline, the branch prediction module forecasts whether the branch instruction will be taken when it is executed at the back end of the pipeline. If the branch is predicted taken, the branch prediction module provides a branch target address to the instruction fetch module. The fetch module, which is also located at the front end of the pipeline, begins fetching instructions from the target address.

Conventional branch prediction modules employ branch target buffers (BTBs) to store prediction information such as whether a branch will be taken and the likely target address when the branch is taken. Looking up an instruction in the BTB, determining whether the branch is taken, and providing a target address to the fetch module on a taken prediction delays resteering the processor to the target address. This delay allows instructions from the wrong execution path to enter and propagate down the pipeline. Since these instructions do not add to forward progress on the predicted execution path, they create "bubbles" in the pipeline when they are flushed. The greater the number of clock cycles required to resteer the pipeline, the greater the number of bubbles created in the pipeline. In addition, more accurate and complete the branch prediction algorithms take longer to complete and generate greater delays in the resteer process.

Currently available branch prediction techniques typically require two or more clock cycles to resteer the processor pipeline, reducing but not eliminating pipeline bubbles. When these bubbles occur in selected branch instructions, such as those controlling tight loops, the performance degradation can be significant. For example, if a bubble of one cycle is introduced in a loop that executes in four clock cycles, execution of the loop may be degraded by 20%.

Summary of the Invention

The present invention is a hierarchical branch prediction system in which branch prediction information is provided to selected branch prediction structures under software direction . In accordance with the present invention, a branch prediction system includes first and second storage structures to store branch prediction information for first and second categories of branch instructions, respectively. A branch prediction manager (BPM) provides branch prediction information for a branch to the first or second storage structure according to an importance hint associated with the branch .

In one embodiment of the invention, the first storage structure is a register file that can be accessed in a single clock cycle and the second storage structure is a cache that can be accessed in two clock cycles.

Brief Description of the Drawings

The present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are provided to illustrate selected embodiments of the present invention and are not intended to limit the scope of the invention.

Fig. 1 is a block diagram of the front end stages of a processor pipeline including a conventional branch prediction module.

Fig. 2 is a block diagram of a processor pipeline including a branch prediction system in accordance with the present invention.

Fig. 3 is a block diagram of one embodiment of a branch prediction manager (BPM) for updating branch storage structures of the branch prediction system in Fig. 1.

Fig. 4 is a block diagram of a write pipeline for updating the TAR and TAC of Fig. 2.

Figs. 5A-4C are block diagrams of embodiments of branch and branch-related instructions that are suitable for providing importance hint information to the branch prediction system of Fig. 2.

Fig. 6 is a flowchart of a method for updating branch prediction information in accordance with the present invention. Fig. 7 is a flowchart of a method for providing branch prediction information from a hierarchy of branch prediction structures in accordance with the present invention.

Detailed Discussion of the Invention

The following discussion sets forth num ous specific details to provide a thorough understanding of the invention. However, those of ordinary skill in the art, having the benefit of this disclosure, will appreciate that the invention may be practiced without these specific details. In addition, various well known methods, procedures, components, and circuits have not been described in detail in order to focus attention on the features of the present invention.

Fig. 1 shows various elements in front end pipe stages 102, 104 of a conventional processor pipeline 100. In the disclosed example, stage 102 includes resources for indicating an instruction to be provided to pipeline 100, while stage 104 includes resources for fetching the indicated instruction. Instructions are typically indicated by a corresponding instruction pointer (IP). Accordingly, stage 102 includes an IP multiplexer (MUX) 130 and portions of an instruction cache (I-cache) 110 and branch prediction module 120. Remaining portions of I-cache 110 and branch prediction module 120 extend into fetch stage 104. The positions and sizes of I-cache 110 and branch prediction module 120 in pipeline 100 indicate when they receive the IP and the time they require to process the received IP, respectively. For example, IP multiplexer 130 selects an IP in the first half of stage 102. I-cache 110 and branch prediction module 120 receive the IP approximately halfway through IPG stage 102 and finish processing it during stage 104.

IP multiplexer (MUX) 130 receives IPs from various sources, including branch prediction module 120. Depending on inputs from branch prediction module 120 and other control circuitry (not shown), IP MUX 130 couples the IP at one of its inputs to I- cache 120 and branch prediction module 120.

On receipt of the selected IP, I-cache 110 and branch prediction module 120 initiate look up procedures to fetch information related to the selected IP. In particular, I-cache 110 stores copies of selected instructions, indexed by their corresponding IPs. I-cache 110 compares the received IP with its entries to determine whether it has a copy of the corresponding instruction. When the IP hits, i.e. matches an entry, in I-cache 1 10, the corresponding instruction is passed to circuitry in the next stage of the pipeline (not shown). If the IP misses in I-cache 110, the instruction is retrieved by a longer latency transaction to the memory subsystem (not shown).

Branch prediction module 120 stores branch prediction information for selected branch instructions, indexed by the IPs of the branch instructions. This information includes, for example, an indication as to whether the corresponding branch is likely to be taken and a predicted target address (IP) to which pipeline 100 is steered if the branch is predicted taken. When the IP forwarded by IP MUX 130 hits in branch prediction module 120, the branch prediction information associated with the hit entry is accessed and read to determine whether the branch is predicted taken. If it is, the corresponding target address (IP) is coupled back to IP MUX 130, which resteers the pipeline to the code sequence beginning at the target address. A staging latch 122 controls the timing with which signals from branch prediction module 120 are coupled to MUX 130.

Branch instructions are relatively common in computer code, occurring on average once every 5 to 12 instructions. In order to accommodate prediction information for a reasonable fraction of frequently encountered branch instructions, branch prediction module 120 must be a relatively large structure. The size of branch prediction module 120 is limited by timing consideration in pipeline 100. If branch prediction module 120 is too large, it will have a correspondingly longer access time. In addition to size, the accuracy with which branches are predicted impacts the speed of the branch prediction system. More accurate branch prediction algorithms tend to be more complex, and require additional time to provide a prediction for a branch instruction.

As noted above, each additional clock cycle required by branch prediction module 120 to access its data for a predicted taken branch allows additional instruction(s) from the wrong execution path ("bubble") to enter pipeline 100. For this reason, branch prediction module 120 is typically sized and its prediction algorithm selected so that pipeline 100 can be resteered in a few clock cycles following a branch instruction. There is thus a trade off in conventional branch prediction strategies between providing rapid resteering of the processor pipeline and providing accurate branch prediction hints for the relatively large number of branches present in most computer code.

The tradeoffs between speed and size/accuracy can have significant performance consequences. For example, when resteer requires two clock cycles (1 bubble resteer) a loop having a large repetition count incurs a bubble in the pipeline on each iteration. If there are only a few instructions in the loop body, e.g. a "tight loop", the bubble can represent a significant reduction in resource usage.

The present invention is a hierarchical branch prediction system that optimizes or improves the trade-off between speed and size/accuracy for selected branch instructions. In accordance with the present invention, the branch prediction system includes first and second storage structures and write logic for allocating branch prediction information to the first and second storage structures according to compiler- generated hint information. The first storage structure stores prediction information for selected branch instructions, while the second storage structures stores branch prediction information for a larger group of branch instructions. Alternatively, the second storage structure may implement a more accurate branch prediction algorithm or some combination of more accurate branch prediction and greater branch prediction capacity.

For one embodiment, the first storage structure provides single cycle access to branch prediction information for the selected branches, and the second storage structure provides access to branch prediction information in two or more cycles. The BPM stores branch prediction information in the first or second storage structure according to information provided through hints in various branch-related instructions.

Branch prediction information may be earmarked for storage in the first or second storage structure according to an importance field in a branch-related instruction specifying the branch prediction information. In the disclosed embodiment, a single bit importance field is used to indicate in which of the two storage structures the prediction information should be stored. More generally, an n-bit importance field may be used to store branch prediction information selectively in a 2ⁿ level hierarchy of branch prediction storage structures.

For one embodiment of the invention, branch-related instructions include branch prediction instructions (BRPs), branch instructions (BR), and MOV_TO_BR instructions (MOV). BRPs specify predicted target addresses, predicted directions (taken/not taken), and importance hints for associated IP-relative branches. EP-relative BR instructions specify their target addresses and may also include importance hints. MOV instructions specify target addresses for indirect branches and may also include importance hints. In addition to importance hints, each of these branch-related instructions may provide prediction hints indicating which resources should be used to predict the associated branch. For example, prediction hints may indicate that a branch be predicted statically or dynamically. Static prediction uses compile time information provided through the branch-related instructions, while dynamic prediction uses information gathered during execution of the program code.

Providing branch prediction information early in the processor pipeline facilitates rapid fetch and subsequent execution of instructions along the appropriate instruction path. This strategy is beneficial as long as the structures that store this information do not load critical paths in the processor pipeline or become so unwieldy as to introduce unnecessary pipeline bubbles into frequently taken inner loop branches. By providing a hierarchy of structures for storing branch prediction information, the present invention promotes the use of branch prediction information for all branch instructions without impeding access to branch prediction information for a critical category of branch instructions.

Fig. 2 is a block diagram of a portion of a processor pipeline 200 that includes branch prediction system 280 in accordance with the present invention. An IP MUX 250, an Instruction-cache (I-cache) 210, an instruction buffer 214, dispersal logic 216, and a branch execution unit (BRU) 270 are also shown. IP MUX 250 receives instruction pointers (IP) from various sources and couples a selected IP to I-cache 210 and branch prediction system 280 for processing. Instructions from I-cache 210 or bypasses (not shown) are queued in buffer 214 and routed to execution resources, e.g. BRU 270, through dispersal logic 216. Instructions may also be transferred directly from I-cache 210 to dispersal logic 216 when buffer 214 is empty.

For one embodiment of pipeline 200, instructions that may be processed concurrently are grouped in bundles. A template associated with each bundle indicates instruction types to dispersal logic 216. which routes the instructions to appropriate execution resources. Instruction bundles may include multiple branch instructions, in which case, BRU 270 includes resources for processing multiple branch instructions concurrently. Certain features of the branch prediction system 280 are illustrated with embodiments designed for bundled instructions, but the present invention is independent of how instructions are provided to pipeline 200.

Pipeline 200 is represented as a series of pipeline ("pipe") stages 201- 20x to indicate when different elements of branch prediction system 280 operate on a given instruction. Except as noted, signals propagate from left to right, so that the response of circuitry in, e.g., pipe stage 201 on CLK cycle N is propagated to the circuitry of pipe stage 202 on CLK cycle N+l . Staging latches 218 control the flow of signals between pipe stages 201-20x. Other embodiments of the present invention may employ different relative configurations of branch prediction elements and staging latches 218. The present invention does not depend on the specific configuration employed.

In the disclosed embodiment, branch prediction system 280 includes a target address register (TAR) 230, a target address cache (TAC) 240, a branch prediction table (BPT) 220, selection logic 254, and branch prediction manager (BPM) 260. TAR 230, TAC 240, and BPT 220 (collectively, "branch storage structures") are coupled to receive an EP from IP MUX 250 and provide branch prediction information to selection logic 254, when the received IP matches one of their entries. Selection logic 254 routes branch prediction information from one of the branch storage structures back to IP MUX 250 for processing. The branch prediction structures thus operate in conjunction with selection logic 254 and IP MUX 250 to resteer pipeline 200 when a branch that is predicted taken hits in branch prediction system 280.

The extent of TAR 220, TAC 230, I-cache 210, and BPT 220 with respect to stages 201 and 202 indicate the time required by each structure to process the received IP. In the disclosed embodiment, TAR 230 is designed to process a received EP and provide associated branch prediction information to selection logic 254 in a single clock cycle This is accomplished, for example, by limiting the size of TAR 230 to reduce its access latency In one embodiment, TAR 230 stores predicted target addresses for four branch instructions (BR) in four, fully associative entπes that are indexed by partial address (IP) tags. A hit in TAR 230 on clock cycle N provides a predicted target address to IP MUX 250 by clock cycle N+l, I e in time for a zero bubble resteer

Because it supports zero bubble resteers, TAR 230 is suitable for storing branch prediction information for branch instructions that have the greatest potential to impact processor performance For one embodiment, branch prediction information for loop branch instructions that are predicted taken are earmarked for storage in TAR 230 As discussed below, BPM 260 processes the branch prediction information and writes it to TAR 230 and TAC 240 according to an indication of the branch's importance to processor performance In this embodiment, TAR 230 need not store predicted branch resolutions since each branch instruction for which TAR 230 provides a target address is predicted taken. Selection logic 254 may forward a target address from TAR 230 to MUX 250 whenever an IP hits in TAR 230.

The small size of TAR 230 is compensated in part by BPT 220 and TAC 240, which may be sized to accommodate branch prediction information for a relatively large number of branch instructions. The larger sizes of BPT 220 and TAC 240 prevent them from responding until stage 202 is partially completed. For the disclosed embodiment, this results in two clock cycles of branch prediction latency and single bubble resteers of pipeline 200 for branch instructions that hit in TAC 240 Thus, while the outputs of TAR 230, TAC 240, and BPT 220 are coupled to selection logic 254 in stage 201, those of BPT 220 and TAC 240 are generated a full clock cycle after that of TAR 230. TAC 240 and BPT 220 are shown as separate structures in the disclosed embodiment, but they may also be embodied in a unified structure

In one embodiment of the invention, BPT 220 and TAC 240 are capable of stonng predicted resolution and target address information, respectively, for 64 entπes in a four way set associative configuration. When a branch IP provided by IP MUX 230 on clock cycle N misses in TAR 230, a hit in BPT 220 and/or TAC 240 will be registered in clock cycle N+l, and the corresponding branch prediction information will be available from MUX 250 at clock cycle N+2. Depending on how branch prediction information is updated, BPT 220 and TAC 240 need not register hits for the same IPs.

BPM 260 processes branch prediction information and updates TAR 230. TAC 240, and BPT 220 accordingly. In particular, BPM 260 identifies branch prediction information from various branch-related instructions and provides the identified information to branch storage structures according to hints included with the branch prediction information. BPM 260 may also provide prediction information for branch instructions that miss in TAR 230, TAC 240, and BPT 220 (static prediction), check the accuracy of branch prediction information from TAR 230, TAC 240, and BPT 220, and provide IPs to resteer pipeline 200 when errors are detected.

BPM 260 may also receive branch prediction information from branch execution unit 270. For example, BRU 270 may decode hint and target address information from BR and MOV_TO_BR instructions and provide this information to BPM 260 for updating TAR 230, TAC 240, and BPT 22. This information provides a dynamic component to the branch prediction information that complements the static component provided by software directed hints.

BPM 260 operates in conjunction with TAR 230, TAC 240, and BPT 220 to provide a hierarchy of branch prediction structures. Table 1 summarizes different resteer events, conditions, and targets implemented by one embodiment of branch prediction system 280. For the embodiment represented by TABLE 1, N instructions are grouped into bundles for concurrent processing, multiple branch instructions may be included in a bundle, and instructions are designated as syllable 1-N according to their relative order in the program code.

TABLE 1

In TABLE 1, TK indicates that a branch is predicted/resolved taken, TA refers to target addresses, FTB refers to the first taken branch, and RSB refers to a return stack buffer for return TAs.

Fig. 3 is a block diagram of one embodiment of BPM 260 in accordance with the present invention. For purposes of illustration, the disclosed embodiment includes the check features noted above and is capable of processing up to two instruction bundles concurrently, including bundles with multiple branch instructions("multiway branches"). Persons skilled in the art and having the benefit of this disclosure will recognize the modifications to BPM 260 for processing more or fewer instructions.

The disclosed embodiment of BPM 260 includes a first branch address correction unit (BACl) 310, a second branch address correction unit (BAC2) 360, and routing module 370. Also shown is a way MUX 306 that couples instructions and decoded information from previous stages to BPM 260 and an instruction buffer 308. BACl 310 and BAC2 360 process IP and instruction decode information to identify branch prediction information, check the identified information for errors, and correct the information if it is determined to be erroneous. Routing module 370Routing module processes hint and branch prediction information from BACl 310 and BAC2 360 to determine an appropriate storage structure for the prediction information. Routing module 370Routing module is discussed in greater detail in conjunction with Fig. 4.

One embodiment of BAC 1 310 determines preliminarily a target address for a bundle and a first branch instruction in the bundle(s) that is predicted taken. It also identifies false branch predictions, i.e. hits in TAR 230, TAC 240, or BPT 220 when no branch instruction is present. BACl 310 includes merge logic 320, logic to find a first taken branch in a bundle with multiple branch instructions (FFT logic 324), and a decoder 328 to evaluate (predicted) branch directions. BACl 310 includes, adders 312, 314, 316, 318, and MUX 338 to evaluate branch target addresses. An alias detector 334 and MUXs 330 are included to identify false branch predictions.

Decoder 312 receives instruction information from, e.g. I cache 210, including branch hint information. Hint information may include an indication of whether a branch should be predicted dynamically or statically. If static prediction is indicated, the hint information may also indicate the predicted branch direction (TK/NT). Merge logic 320 combines the hint information with any predicted branch directions from TAR 320 and TAC 340, and FFT logic 324 determines a first taken branch from the decoded and predicted information. FFT logic 324 uses MUX 338 to select an offset for calculating the target address of the (predicted) first taken branch in BAC2 360.

Adders 312 and 316 determine branch target addresses using decoded information from branch predict instructions (BRPs) in first and second instruction bundles, respectively. Target addresses are determined using IPs provided by MUX 250 and address offsets specified in the BRPs. Adders 314 and 318 determine branch target addresses for EP-relative branch instructions using the instruction IP and decoded offset information, e.g. for static resteers. For one embodiment of the invention, adders 314 and 318 may be simplified, e.g. selected carry logic eliminated, to meet the timing constraints necessary to provide resteer IPs to selection logic 254. If an address calculation generates a carry, the correct address can be determined by a full adder (358) in BAC2 360.

To limit the size of BPM 260, not all instruction slots in a bundle need to be provided with adders. For the disclosed embodiment, adders 312 and 316 calculate target addresses for BRPs in specified instruction slots of the first and second bundles, respectively. Similarly, adders 314 and 318 calculate target addresses for BRs in specified instruction slots of the first and second bundles, respectively. For example, where bundles are three instructions wide, the third instruction slot of each concurrently processed bundle may be coupled to adders 312, 314, 316, 318 and branch instructions may be preferentially routed to these instruction slots. In this embodiment, target addresses for branch instructions that are not assigned to the third slot are evaluated by an adder (adder 358) in BAC2 260.

Alias detector 334 ensures that IPs that hit in TAR 230, TAC 240, and BPT 220 correspond to branch-related instructions. For example, TAC 230, TAR 240, and BPT 220 may employ only a portion of an IP to index the branch prediction information they store. This saves silicon area, but the partial IP may correspond to the addresses of more than one instruction ("aliasing"). Detector 334 uses instruction decode information to ensure that the IP that generated a hit in branch storage structures corresponds to a branch-related instruction. MUX 330 selects an IP under control of alias detector 334 and couples it back to selection logic 254 to resteer pipeline 200 in the event an error is detected in branch prediction information (BACl Resteer).

One embodiment of B AC2 360 performs validation checks on preliminary results generated by BACl 310 and updates TAR 230, TAC 240, and BPT 220, accordingly. In the disclosed embodiment, TAC/BPT check module 344 determines whether a branch predicted taken by BPT 220 matches the branch associated with the target address provided by TAC 240. If the results disagree, TAC/BPT check module 344 uses the address generated by BAC2 and resteers based upon static prediction information derived from the branch. Adder check module 348 is included to detect overflow errors when carry logic is eliminated from adders 314, 318 to meet timing constraints. Adder check module 348 determines whether a target address calculated by adder 314 or 318 generated a carry and triggers full adder 358 to recalculate the target address when a carry is detected.

Mask validation module 350 determines when a predicted first taken branch is not in one of the slots serviced by the adders in BAC 1 310. For one embodiment of BPM 260, a mask indicates the slots that are not serviced. Mask validation module 350 uses MUX 354 to select the target address from adder 358 when the taken branch is in a masked slot. The output of MUX 354 provides the target address to selection logic 254 for resteering the pipeline (BAC2 resteer). Routing module 370 updates TAR 230, TAC 240, and BPT 220 with data from decoded BR and BRP instructions.

Fig. 4 represents a pipeline 400 for updating TAC 230 and TAR 240. While TAC 230 and TAR 240 are physically located ahead of BPM 260 and BRU 270 in pipeline 200, they are shown following BPM 260 in Fig. 4 to emphasize the relative timing of the various decode, execute, and update operations. In addition, the disclosed embodiment of write pipeline 400 is suitable for the case in which processor pipeline 200 (Fig. 2) processes up to two bundles of instructions concurrently. Accordingly, TAR 230 and TAC 240 are each coupled to two read/write ports (portO, portl) to accommodate two concurrent update operations.

In the disclosed embodiment, BACl 310 and BAC2 360 of BPM 260 are shown in pipe stages 402 and 403. TAR 230 is represented as a TAR read module 430 and a TAR write module 434, which are accessed in stages 403 and 404, respectively. Similarly, TAC 240 is represented as a TAC read module 440 and a TAC write module 444, which are accessed in stages 403 and 404, respectively. Routing module 370 includes source select MUXs 412, 414, write control MUXs 460, 464, and their control logic (not shown).

Source select MUXs 412 and 414 select target addresses from various sources and couple the selected target addresses to portO and portl, respectively, of TAR 230 and TAC 240. In the disclosed embodiment, BACl 310 provides target addresses from decoded BRP instructions to MUX 412, 414, and BRU 170 provides target addresses from decoded BR and MOV_TO_BR instructions to MUX 412 and MUX 414, respectively. MUX 414 may also receive a target address from BAC2 360 when a predicted target address is incorrect, or a target address calculation exceeds the capacity of BACl adders 314, 316.

Write control MUXs 460, 464 use decoded hint information from various sources to control whether target addresses from MUXs 412, 414 are written to TAR 230, TAC 240, or both. For one embodiment, BAC 1 360 provides importance hints from decoded BPR instructions, while BRU 270 provides importance hints from BR and MOV_TO_BR instructions.

Control signals to MUXs 412, 414 select a target address source according to events detected in pipeline 200, e.g. instruction types, prediction hits, check misses, etc. The selected target addresses are provided to portO and portl of TAR read module 430 through staging latches 418. TAR read module 430 determines whether the selected target address matches ("hits") one of its entries. If a hit is detected and the corresponding hint bit is set, the target address is written to the hit entry in TAR write module 434. If no hit is detected, but the corresponding hint bit is set, an entry is allocated in TAR write module 434 and the target address is written to the allocated entry. In one embodiment, TAR 230 implements a first-in, first-out (FIFO) allocation algorithm. Since TAR 230 is dual ported, two entries can be read and written concurrently.

A similar write procedure is implemented with TAC read and write modules 440, 444, respectively. The target address is compared with entries in TAC read module 440. If a hit is detected, the target address is written to the hit entry of TAC write module 440. If no hit is detected, an entry is allocated and the target address is written to the allocated entry. In one embodiment, TAC 240 implements a least recently used (LRU) allocation algorithm.

Write pipeline 400 thus allows branch prediction information to be written to TAR 230 and TAC 240 under control of various branch-related instructions (BR, BPR, MOV_TO_BR)- For example, importance fields may be included in these instructions indicate whether branch prediction information specified in the instruction should be allocated to TAR 230 or TAC 240. A compiler may set one or more bits in the importance fields of these instructions according to information about the associated branch that is available at compile time. Various criteria may be employed to earmark branch prediction information for TAR 230 or TAC 240. These criteria are related to characteristics of the branch with which the prediction information is associated. For example, since TAR 230 provides efficient resteering of pipeline 200 when a branch is taken, branch prediction information may be advantageously stored in TAR 230 when the associated branch is predicted taken. Further, since branch prediction information is encoded by the compiler, branches that can be predicted taken based on information available at compile time are particularly suitable candidates.

Due to its relatively small size, TAR 230 can not accommodate prediction information for all branches that are predicted taken. Additional criteria can be employed to reduce the number of candidate branches. For example, the branches that control loops are accessed repeatedly until the loop terminates. Loops that are repeated through a taken branch, e.g. counted top loops (CTOP) and while top loops (WTOP), benefit from zero bubble resteers each time the loop is repeated. This is particularly true for top loops that have relatively few instructions in their loop bodies ("tight loops"). In these cases, even a single pipeline bubble can represent a significant portion of the iteration time. Accordingly, branches associated with top loops (counted or while) are good candidates to have their prediction information stored in TAR 230.

Branches for which TAC 240 provides branch prediction information are those that are less critical to processor performance than the branches associated with TAR 230. TAC 230 may also accommodate branch prediction information that has been displaced from TAR 220 by subsequently provided branch prediction information. The availability of TAC 240 thus allows the widespread use of software hints to allocate branch prediction information in branch prediction system 280 without degrading branch prediction speed/accuracy for branches in performance-critical code segments.

Fig. 5A is a block diagram of a branch prediction instruction (BRP) 500 suitable for conveying branch prediction information to branch prediction system 280. BPR 500 includes an opcode field 510, a "whether" field 514, an importance hint field 518, a target field 520, and a tag field 524. Opcode field 510 indicates that the instruction is a branch prediction instruction. Whether field 514 indicates how the branch should be predicted, e.g. taken/not taken, dynamic/static prediction. Tag field 518 indicates an address of the associated branch instruction (BR), and target field 520 indicates a predicted target address for the BR instruction. Importance hint field 524 indicates the relative importance of providing low latency branch prediction for the associated branch. BPM 260 uses importance hint field 524 to route data from target field 520 to one or more branch storage structures

Fig. 5B is a block diagram of a BR instruction 540 suitable of conveying importance information to branch prediction system 280 for IP-relative branches. BR 540 includes an opcode field 544, a branch type field 548, a target field 550, a whether field 554, and an importance field 558. Opcode field 544 identifies BR 540 as an indirect BR instruction while branch type field 548 indicates the type of branch, e.g. call, return, counted loop, modulo scheduled counted loop, modulo-scheduled while loop, etc. BPM 260 routes data from target field 550 to one or more of branch storage structures according to the status of importance field 558.

Fig. 5C is a block diagram of a MOV_TO_BR instruction 560 suitable for conveying branch hint information for indirect branches to branch prediction system 280. MOV instruction 560 includes an opcode field 564, a branch type field 568, a target field 570, a register field 574, an importance field 578, and a whether field 580. Opcode field 564 identifies the instruction as a MOV_TO_BR instruction, target field 570 specifies the target address, and register field 574 specifies the branch register in which the target address is to be stored. Data from target field 570 is provided to TAR 230 and or TAC 240 according to the status of importance field 578.

Fig. 6 is a flowchart of a method 600 for stoπng branch prediction information in accordance with the present invention. Ln the disclosed embodiment, method 600 may be initiated, for example, when BPM 260 or BRU 270 decodes a branch-related instruction. When a suitable instruction is detected 610, included branch prediction information is extracted 620 and it is determined 630 whether the importance bit in the instruction is set. If the importance bit is set 630, the branch prediction information is stored in the lowest latency branch prediction structure, e.g. TAR 230. In one embodiment, the branch prediction information may also be stored in TAC 240 m case it is subsequently evicted from TAR 230. If the importance bit is not set 630, the branch prediction information is stored in a higher latency branch prediction structure, e.g. TAC 240 and BPT 220.

In some instances, BR instructions may closely follow their associated BRP instructions through processor pipeline 200, and there may be insufficient time to store the branch prediction information from the BPR instruction before it is needed by the BR instruction. In these case, the branch prediction information may be coupled directly to IP MUX 250 through bypass structures (not shown).

Fig. 7 is a flowchart of a method 700 for using branch prediction information in accordance with the present invention. Method 700 is initiated when a new IP is sent 710 to the branch prediction structures, e.g. in stage 202, during a first clock cycle. If the IP hits 720 an entry in the low latency branch prediction structure (TAR 230), a predicted target EP associated with the entry is returned in time for the next clock cycle. If the IP misses 720 in the low latency structure, method 700 waits 740 for a response from the higher latency structure (TAC 240, BPT 220).

If the higher latency structure(s) indicates a hit 740, a target IP associated with the hit entry is returned in time for the next clock cycle plus one. A miss 740 in the lower latency structure, indicates that the IP does not correspond to a BR instruction or there is no branch prediction information available for the BR instruction in the lower latency structure. If additional storage structures are available for branch prediction information, method 700 continues looking for hits in those structures. If the new IP does represent a branch instruction 770, e.g. as determined by BPM 260, branch prediction information contained in the instruction may be used to update 780 the first or second BPS according to the included hint information.

The present invention has been described for a system in which branch instructions are indexed by their corresponding instruction pointers (IP). However, this is not necessary, and a number of representations may be used for the branch instruction for this purpose, including, for example, the branch instruction opcode. In addition, the invention has been described for the case of a branch prediction hierarchy that includes two branch prediction structures. Those skilled in the art will recognize that the invention is readily applicable to branch prediction hierarchies having more than two levels of branch prediction structures. In these cases, BPR instructions will employ correspondingly larger hint fields and additional categories will be provided for the BR instructions.

There has thus been provided a system and method for speeding branch prediction operations, using a hierarchy of branch prediction structures that operate under software direction. Branch prediction information for a first category of branch instructions is stored in a small, fast branch prediction structure that can be accessed in a single clock cycle. Branch prediction information for another category of branch instructions is stored in a larger, slower branch prediction structure. Branch instructions are assigned to the first and second categories. When a branch instruction hits in the first structure, a target IP is provided to the first stage of the pipeline in the clock cycle following the one in which the branch instruction began. The disclosed invention provides single cycle turnaround of branch predictions for the most significant branches, even for processors that run at high frequencies.

Claims

What is claimed is:

1. A branch prediction system comprising: a first storage structure for branch target addresses; a second storage structure for branch target addresses; and a branch prediction manager (BPM) coupled to the first and second storage structures, the BPM being capable of identifying a branch target address and an importance indicator in an instruction and writing the branch target address to the first or second storage structure according to the importance indicator.

2. The branch prediction system of claim 1 , wherein the first storage structure has a response latency that is shorter than a response latency of the second storage structure.

3. The branch prediction system of claim 1, wherein the BPM includes decode logic to identify instructions that include branch prediction information and extract branch prediction information from the decoded instructions.

4. The branch prediction system of claim 1 , wherein the BPM is coupled to receive from a branch execution unit branch prediction information for selected instructions.

5. A processor comprising: an execution pipeline; a target address register coupled to the execution pipeline, the target address register being capable of storing selected branch prediction information to alter an instruction flow through the execution pipeline; a target address cache coupled to the execution pipeline, the target address cache being capable of storing branch prediction information to alter the instruction flow through the execution pipeline; and a branch prediction manager (BPM) coupled to the execution pipeline, the TAR, and the TAC, the execution pipeline detecting branch prediction information in instructions in the execution pipeline and providing the detected branch prediction information to the TAR and the TAC according to an importance hint specified by the instruction.

6. The processor of claim 5, wherein the execution unit includes a branch execution unit, the branch execution unit being capable of extracting branch prediction information from instructions and coupling the extracted branch prediction information to the BPM.

7. The processor of claim 5, wherein the target address register is coupled to a fetch stage of the execution pipeline to receive an instruction pointer and to return a target address corresponding to the IP to the fetch stage in a single clock cycle.

8. The processor of claim 7, wherein the target address cache is coupled to the fetch stage to receive an instruction pointer and to return a target address corresponding to the P to the fetch stage in two or more clock cycles.

9. The processor of claim 8, wherein the target address cache suppresses returning the target address when the target address register returns the target address.

10. A processor comprising: an instruction fetch module; a target address register (TAR) coupled to receive an instruction pointer (EP) from the fetch module and return a target IP in a single clock cycle when the IP matches a TAR entry; a target address cache (TAC) coupled to receive an EP from the fetch module and return a target IP in two or more clock cycles when the IP matches a TAC entry; and a branch prediction manager (BPM)to identify branch prediction information and update the TAR and TAC with a target EP from the branch prediction information according to an importance indicator in the branch prediction information.

11. The processor of claim 10, wherein the BPM writes the target IP to the TAR and the TAC when the importance indicator is set and to the TAC alone when the importance indicator is not set.

12. The processor of claim 10, wherein the BPM includes a decoder that is coupled to receive instructions from the fetch module, identify branch related instructions, and extract branch prediction information from the identified branch related instructions.

13. The processor of claim 11, further comprising a branch execution unit coupled to the fetch module and the BPM, the branch execution unit being capable of extracting branch prediction information from branch and branch-related instructions and providing the extracted information to the BPM.

14. A branch prediction system comprising: a target address register (TAR) to store target addresses for a first group of branch instructions; a target address cache (TAC) to store target addresses for a second group of branch instructions; and a branch prediction manager(BPM)to identify branch prediction information associated with the first and second groups of branch instructions and update the TAC and TAR with a target address from the identified branch prediction information.

15. The branch prediction system of claim 14, wherein the branch prediction information includes an importance bit and the BPM writes the target address to the TAR when the importance bit is set.

16. A branch prediction manager for a hierarchical branch prediction system, the branch prediction manager comprising: a branch address calculator to identify branch prediction information in a branch-related instruction; and a routing module to provide branch prediction information from the branch address calculator to an indicated location in the hierarchical branch prediction system.

17. The branch prediction manager of claim 16, wherein the branch prediction information includes hint and target address information, and the routing module provides the target address information to a location indicated by the hint information..

18. The branch prediction manager of claim 17, wherein the branch prediction information is provided by a branch predict instruction.

19. The branch prediction manager of claim 16, wherein the branch address calculator comprises a first branch address calculator to determine preliminarily a predicted branch direction and target address from the branch prediction information and a second branch address calculator to validate the preliminarily determined branch direction and target address.

20. The branch prediction manager of claim 19, wherein the branch prediction information is provided by a branch instruction.

21. The branch prediction manager of claim 16, wherein the indicated location in the hierarchical branch prediction system is an entry in a first or second branch prediction structure.

22. The branch prediction manager of claim 21 , wherein the first branch prediction structure is a target address register file that supports single-cycle pipeline resteers for selected branches and the second branch prediction structure is a target address cache that supports pipline resteers in two or more cycles .

23. The branch prediction manager of claim 22, wherein a predicted target address is routed to the target address register file or the target address cache according to a field in the branch prediction information.

24. A processor comprising: first means for storing branch prediction information for a first group of branch instructions; second means for storing branch information for a second group of branch instructions; and means for managing branch prediction information to detect branch prediction information in a branch-related instruction and route the branch prediction information to the first or second storing means according to hint information in the branch-related instruction.

25. The processor of claim 24, wherein the managing means includes: means for generating a branch target address from the branch prediction information; and means for routing the branch target address to the first or second storing means according to the hint information.

26. The processor of claim 25, wherein multiple branches may be processed concurrently and the generating means includes a plurality of adders to calculate target addresses for multiple branches and logic to identify a first taken branch among the multiple branches.

27. The processor of claim 26, wherein the first taken branch is determined using predicted branch information.

28. The processor of claim 25, wherein the managing means includes: a first address calculator calculator to determine a preliminary target address and branch identity from branch prediction information; and a second address calculator to validate the preliminary target address and branch identity.

29. The processor of claim 28, wherein the branch prediction information is provided by a branch instruction.

30. The processor of claim 25, wherein the first storage means comprises a plurality of registers that may provide a target address to resteer the processor in a single clock cycle and the second storage means comprises a target address cache that may provide a target address to resteer the processor in two or more clock cycles.