CN1208716C - Processor with replay architecture with fast and slow replay paths

Info

Publication number: CN1208716C
Application number: CNB008194211A
Authority: CN
Inventors: M. D. Upton; D. A. Sager; D. D. Boggs; G. J. Hinton
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2000-02-14
Filing date: 2000-12-29
Publication date: 2005-06-29
Anticipated expiration: 2020-12-29
Also published as: GB2376328B; HK1048872B; GB2376328A; KR20030007425A; DE10085438B4; AU2001224640A1; WO2001061480A1; DE10085438T1; KR100508320B1; GB0221325D0; CN1452736A; HK1048872A1

Abstract

According to one aspect of the invention, a microprocessor is provided that includes an execution core, a first replay mechanism and a second replay mechanism. The execution core performs data speculation in executing a first instruction. The first replay mechanism is used to replay the first instruction via a first replay path if an error of a first type is detected which indicates that the data speculation is erroneous. The second replay mechanism is used to replay the first instruction via a second replay path if an error of a second type is detected which indicates that the data speculation is erroneous.

Description

Processor with replay architecture with fast and slow replay paths

Cross-Reference to Related Applications

This application is a continuation-in-part of application Ser. No. 09/222,805, filed 12/30/98, which is itself a continuation-in-part of application Ser. No. 08/746,547, filed 11/23/96, now U.S. Pat. No. 5,966,544. This application and the applications identified above are assigned to Intel Corporation of Santa Clara, California.

Background of the Invention

1. Field of the Invention

The present invention relates generally to the field of processors, and more particularly to a replay architecture with fast and slow replay paths for facilitating data speculation operations.

2. Background Information

FIG. 1 shows a block diagram of one embodiment of a processor 100 disclosed in U.S. Pat. No. 5,966,544. The processor 100 shown in FIG. 1 includes an I/O ring 111 operating at a first clock frequency (the I/O clock), a latency-tolerant execution core 121 operating at a second clock frequency (e.g., a slow clock), a latency-intolerant execution sub-core 131 operating at a third clock frequency (e.g., a medium clock), and a latency-critical execution sub-core 141 operating at a fourth clock frequency (e.g., a fast clock). The processor 100 also includes clock multiplier and/or divider units 110, 120, and 130 which, as taught in the prior applications, may be configured to provide appropriate clocking to the different portions or sub-cores of the processor 100. The teaching of the prior applications that is most pertinent here is that an execution core may include two or more portions (sub-cores) operating at different clock rates.

In operation, the I/O ring 111 communicates with the rest of the computer system (not shown) by performing various I/O operations, such as memory reads and writes, at the I/O clock frequency. For example, the processor 100 may perform an I/O operation on the I/O ring 111 at the I/O clock frequency to read in data from an external memory device. The execution sub-cores 121, 131, and 141 can perform different functions or operations at their respective clock frequencies according to the input instructions and/or input data. For example, the latency-tolerant execution sub-core 121 may perform an execution operation on the input data to produce a first result. The latency-intolerant sub-core 131 may perform an execution operation on the first result to produce a second result. Likewise, the latency-critical execution sub-core 141 may perform another execution operation on the second result to produce a third result. The operations performed by the different execution sub-cores include arithmetic operations, logical operations, and other operations. Those skilled in the art will understand and appreciate that these operations need not be performed in the hierarchical order of the execution sub-cores. For example, input data may go directly to the innermost sub-core, and the result obtained there may go from the innermost sub-core to any other sub-core or back to the I/O ring 111 for writeback. In addition, as disclosed and taught in the prior applications, an on-chip cache structure may be split across two or more portions of the processor 100. Thus, certain operations and/or functions may be performed at one clock frequency with respect to one aspect of the data stored in the on-chip cache, while other operations and/or functions are performed at a different clock frequency with respect to another aspect of that data. For example, way predictor miss detection for the on-chip cache may be performed in one sub-core at one clock frequency, while TLB hit/miss detection and/or page miss detection are performed in another sub-core at a different frequency. As a result, certain errors and conditions can be detected earlier in the course of execution than others.

FIG. 2 depicts a block diagram of one embodiment of a processor 200 disclosed in the prior applications, which includes a generic replay architecture to facilitate data speculation operations. In this embodiment, the processor 200 includes a scheduler 231 coupled to a multiplexer 241 to provide instructions received from an instruction cache (I-cache) 211 to an execution core 251 for execution. The execution core 251 may perform data speculation in executing the instructions received from the multiplexer 241. The processor 200 shown in FIG. 2 includes a checker unit 281 that sends a copy of an executed instruction back to the execution core 251 for re-execution (replay) if the data speculation is determined to be erroneous. In this generic replay architecture, however, the checker unit 281 is positioned after the execution core 251, the TLB and tag logic 261, and the cache hit/miss logic 271. Some instructions may already be known to have executed incorrectly (i.e., because the data speculation was erroneous) before this checker position allows the error to be acted upon. Specifically, certain errors and conditions indicating that the data speculation is erroneous can in some cases be detected earlier, even before the TLB/tag logic 261 and the hit/miss logic 271 have been exercised. Unfortunately, because of where the checker unit 281 is positioned, instructions that executed incorrectly due to erroneous data speculation are not sent back to the execution core 251 for re-execution or replay until they reach the checker 281. There is therefore an unnecessary delay between the time an instruction is known to have executed incorrectly and the time it is sent back for re-execution. System performance is consequently not as good as it would be if such incorrectly executed instructions were re-executed or replayed earlier.

Summary of the Invention

According to one aspect of the invention, a microprocessor is provided that includes an execution core, a first replay mechanism, and a second replay mechanism. The execution core performs data speculation in executing a first instruction. The first replay mechanism replays the first instruction via a first replay path if an error of a first type, indicating that the data speculation is erroneous, is detected. The second replay mechanism replays the first instruction via a second replay path if an error of a second type, indicating that the data speculation is erroneous, is detected.

Brief Description of the Drawings

The features and advantages of the present invention will be more fully understood by reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of one embodiment of a processor including multiple sub-cores operating at different frequencies;

FIG. 2 shows a block diagram of one embodiment of a processor with a generic replay architecture;

FIG. 3 depicts a block diagram of one embodiment of a processor pipeline in which the teachings of the present invention are implemented;

FIG. 4 shows a block diagram of one embodiment of a processor with first and second replay mechanisms;

FIG. 5 shows a more detailed block diagram of one embodiment of a processor with first and second replay mechanisms;

FIG. 6 shows a flow diagram of one embodiment of a method in accordance with the teachings of the present invention.

Detailed Description

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, those skilled in the art will appreciate that the present invention may be practiced without these specific details.

In the discussion below, the teachings of the present invention are used to implement a method, apparatus, and system for facilitating data speculation in executing instructions. To reduce execution time, an execution unit performs data speculation in executing an input instruction. If the data speculation is erroneous, the input instruction is re-executed by the execution unit until the correct result is obtained. In one embodiment, the data speculation is determined to be erroneous if an error of a first type or an error of a second type is detected. Errors of the first type can be detected earlier than errors of the second type. In one embodiment, a first checker is responsible for sending a first copy of the input instruction back to the execution unit for re-execution or replay if an error of the first type is detected in the execution of the input instruction. A second checker is responsible for sending a second copy of the input instruction back to the execution unit for re-execution or replay if an error of the second type is detected in the execution of the input instruction. In one embodiment, a selector provides a subsequent input instruction, the first copy of the incorrectly executed instruction, or the second copy of that instruction to the execution unit for execution according to a predetermined priority scheme. The teachings of the present invention apply to any processor or machine that performs data speculation in executing instructions. The invention is not limited to such processors or machines, however, and can be applied to any processor or machine in which a multi-level replay mechanism is needed.

FIG. 3 is a block diagram of one embodiment of a processor pipeline 300 in which the present invention may be implemented. For the purposes of this description, the term "processor" refers to any machine capable of executing a sequence of instructions, including but not limited to general-purpose microprocessors, special-purpose microprocessors, graphics controllers, audio processors, video processors, multimedia controllers, and microcontrollers. The processor pipeline 300 includes a number of processing stages, beginning with a fetch stage 310. In this stage, instructions are fetched and supplied to the pipeline 300. For example, a macroinstruction may be fetched from a cache memory that is integrated in or closely coupled to the processor, or from external memory via a system bus. Instructions fetched in the fetch stage 310 are then passed to a decode stage 320, where the instructions or macroinstructions are decoded into microinstructions or micro-operations (also referred to herein as UOPs or uops) for execution by the processor. In an allocate stage 330, the processor resources needed to execute the microinstructions are allocated. The next stage in the pipeline is a rename stage 340, where references to external registers are converted into internal register references to eliminate false dependencies caused by register reuse. In a schedule/dispatch stage 350, each microinstruction or UOP is scheduled and dispatched to an execution unit. The microinstruction or UOP is then executed in an execute stage 360. After execution, the microinstruction or UOP is retired in a retire stage 370.

In one embodiment, the stages described above can be organized into three phases. The first phase, which may be referred to as the in-order front end, includes the fetch stage 310, decode stage 320, allocate stage 330, and rename stage 340. In the in-order front-end phase, instructions proceed through the pipeline 300 in their original program order. The second phase, which may be referred to as the out-of-order execution phase, includes the schedule/dispatch stage 350 and the execute stage 360. In this phase, each instruction is scheduled, dispatched, and executed as soon as its data dependencies are resolved and an appropriate execution unit is available, regardless of its sequential position in the original program. The third phase, referred to as the in-order retirement phase, includes the retire stage 370, in which instructions are retired in their original, sequential program order to preserve the integrity and semantics of the program.
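The stage and phase organization of pipeline 300 can be summarized in a short Python sketch. The stage numbers come from the description of FIG. 3; the enum, the dictionary, and the function name are illustrative choices made here and are not part of the disclosure.

```python
from enum import Enum


class Stage(Enum):
    """Pipeline 300 stages, keyed by the reference numerals used in FIG. 3."""
    FETCH = 310
    DECODE = 320
    ALLOCATE = 330
    RENAME = 340
    SCHEDULE_DISPATCH = 350
    EXECUTE = 360
    RETIRE = 370


# Grouping of stages into the three phases described in the text.
PHASES = {
    "in-order front end": [Stage.FETCH, Stage.DECODE, Stage.ALLOCATE, Stage.RENAME],
    "out-of-order execution": [Stage.SCHEDULE_DISPATCH, Stage.EXECUTE],
    "in-order retirement": [Stage.RETIRE],
}


def phase_of(stage: Stage) -> str:
    """Return the name of the phase that contains the given stage."""
    for name, stages in PHASES.items():
        if stage in stages:
            return name
    raise ValueError(f"stage {stage} is not assigned to a phase")
```

Only the middle phase may reorder instructions; the first and last phases keep program order, which is why the replay paths discussed later loop back into the execution phase rather than into the front end.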

In a processor with a replay architecture, certain unconventional actions can be taken in scheduling and executing input instructions. For example, an input UOP can be dispatched to an execution unit even though its source data is not yet ready or known. If execution of the input UOP shows that the data speculation was erroneous, the UOP is sent back to the execution unit for re-execution (replay) until the correct result is obtained. It is, of course, desirable to limit the amount of replay or re-execution, since each replayed UOP consumes available resources and degrades overall system performance. Nevertheless, a net performance gain can be obtained by taking such chances. For example, if most UOPs execute correctly in a reduced number of cycles and only a few UOPs have to be replayed, overall throughput improves relative to the lowest-common-denominator case in which every UOP waits as long as the worst case could take.
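The throughput argument can be checked with simple arithmetic. The Python sketch below is an illustration only: the function names and all cycle counts are hypothetical, and it assumes for simplicity that a mis-speculated UOP is replayed exactly once. Under that assumption the average cost per UOP is the fast-case latency plus the replay rate times the replay penalty, and speculation pays off whenever that average is below the worst-case wait.

```python
def avg_cycles_with_speculation(fast_cycles: float,
                                replay_rate: float,
                                replay_penalty: float) -> float:
    """Average cycles per UOP when UOPs are dispatched speculatively.

    Simplifying assumption (not from the source): a UOP whose speculation
    fails is replayed once, costing `replay_penalty` extra cycles.
    """
    if not 0.0 <= replay_rate <= 1.0:
        raise ValueError("replay_rate must be a probability")
    return fast_cycles + replay_rate * replay_penalty


def speculation_pays_off(fast_cycles: float,
                         replay_rate: float,
                         replay_penalty: float,
                         worst_case_cycles: float) -> bool:
    """True if speculating beats making every UOP wait for the worst case."""
    average = avg_cycles_with_speculation(fast_cycles, replay_rate, replay_penalty)
    return average < worst_case_cycles
```

With hypothetical figures of 3 cycles for the fast case, an 8-cycle replay penalty, and a 6-cycle worst-case wait, a replay rate of 1 in 8 gives an average of 4 cycles and speculation wins, while a replay rate of 1 in 2 gives 7 cycles and it does not. This is why limiting the number of replays matters.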

As taught in the prior applications, an on-chip data cache (also referred to as the level-zero or L0 data cache) may be split such that its data storage array resides in a higher clock domain than the logic that provides hit/miss determination for the data storage array. The TLB and tag logic may also reside in a slower clock domain than the data storage array. The TLB and tag logic may be in the same clock domain as the hit/miss logic, but this is not required.

One case in which a net performance gain can be obtained is where execution of a UOP depends on or uses data from the L0 data cache. Rather than making all UOPs wait until their source data is determined to be valid, some UOPs are dispatched and executed speculatively earlier in the process, on the assumption (not yet confirmed, but considered likely) that their source data resides in the L0 data cache. In the majority of cases, the L0 data cache will be hit and valid data will be used as the source data. Only in a minority of cases will the data speculation be erroneous and the UOP have to be replayed. Thus, most UOPs execute correctly in a reduced number of cycles, which improves overall performance.

FIG. 4 is a block diagram of one embodiment of a processor 400 having a first (also referred to as fast or early) replay path and a second (also referred to as slow or late) replay path to facilitate data speculation in executing instructions. As shown in FIG. 4, the processor 400 includes a scheduler/dispatcher 411 coupled to an instruction cache (not shown) to schedule and dispatch instructions received from the instruction cache to an execution core 431, via a selector (or multiplexer) 421, for execution. In one embodiment, the execution core 431 performs data speculation in executing an input instruction. As mentioned above, an input instruction can be dispatched even though its source data may not yet be ready or known. For example, execution of the input instruction may require source data that may or may not be in the L0 cache. As explained above, however, net performance can be improved by speculating that the source data needed to execute the input instruction resides in the L0 data cache. The processor 400 also includes a first replay mechanism 441 to have the input instruction re-executed if an error of a first type, indicating that the data speculation is erroneous, is detected. In one embodiment, errors of the first type are detectable within a first time period. The processor 400 further includes a second replay mechanism 451 to have the input instruction re-executed if an error of a second type, indicating that the data speculation is erroneous, is detected. In one embodiment, errors of the second type are detectable within a second time period that is longer than the first. Thus, when an error of the first type has been detected, the invention allows the incorrectly executed instruction to be re-executed much sooner than if the instruction had to wait until an error of the second type could be detected. As shown in FIG. 4, if the input instruction is determined to have executed incorrectly because an error of the first type has been detected, the first replay mechanism (the fast or early checker) 441 sends the instruction back to the execution core 431 through the multiplexer 421 for re-execution (replay). Likewise, if the input instruction is determined to have executed incorrectly because an error of the second type, indicating erroneous data speculation or some other error condition, has been detected, the second replay mechanism (the slow or late checker) 451 sends the instruction back to the execution core 431 through the multiplexer 421 for re-execution (replay). The functions and operation of the first and second replay mechanisms shown in FIG. 4 are described in detail below.
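To make the latency difference between the two replay paths concrete, here is a minimal Python sketch. The cycle counts are hypothetical placeholders (the text gives no figures) and the function names are illustrative; the point is only that an error of the first type lets the instruction re-enter the execution core without waiting for the later check.

```python
from typing import Optional

# Hypothetical detection latencies in cycles; the source gives no numbers.
EARLY_CHECK_CYCLES = 4    # errors of the first type are detectable by this point
LATE_CHECK_CYCLES = 12    # errors of the second type are detectable by this point


def cycles_until_replay(first_type_error: bool,
                        second_type_error: bool) -> Optional[int]:
    """Cycles from dispatch until the instruction is sent back for replay.

    Returns None when neither checker flags an error (no replay needed).
    """
    if first_type_error:
        return EARLY_CHECK_CYCLES   # fast/early replay path
    if second_type_error:
        return LATE_CHECK_CYCLES    # slow/late replay path
    return None


def cycles_saved_by_fast_path() -> int:
    """Delay avoided when an error is caught by the early check rather than the late one."""
    return LATE_CHECK_CYCLES - EARLY_CHECK_CYCLES
```

In a single-checker design such as the one in FIG. 2, every replay would pay the late-check latency; the early path removes that delay for the errors it can see.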

FIG. 5 shows a more detailed block diagram of one embodiment of a processor 500 having the first and second replay paths described above with reference to FIG. 4. As shown in FIG. 5, the processor 500 includes a scheduler 511 that schedules and dispatches instructions received from an instruction cache (not shown) to an execution core 531, via a multiplexer 521, for execution. The function and operation of the multiplexer 521 are described in detail below. In one embodiment, the execution core 531 performs data speculation in executing an input instruction received from the multiplexer 521. The processor 500 also includes a first delay unit 541 to generate a first copy of the input instruction and hold it for at least one clock cycle in a first clock domain. The processor 500 further includes a first checker 545 coupled to the first delay unit 541 and the execution core 531. In one embodiment, the first checker 545 is configured to determine whether the data speculation is correct with respect to a first set of error types and, if it is not, to send the first copy of the input instruction back to the execution core 531 through a first buffer 547 for re-execution. As shown in FIG. 5, the processor 500 also includes a second delay unit 551 coupled to the first delay unit 541; in one embodiment, it is configured to generate a second copy of the input instruction and hold it for at least one clock cycle in a second clock domain. The processor 500 includes a second checker 555 coupled to the second delay unit 551 and the first checker 545. In one embodiment, the second checker 555 is configured to determine whether execution of the instruction was erroneous with respect to a second set of error types and, if so, to send the second copy of the input instruction back to the execution core 531 through a second buffer 557 for re-execution. As shown in FIG. 5, the multiplexer 521 is coupled to the scheduler 511, the execution core 531, the first delay unit 541, the first checker 545, the second checker 555, the first buffer 547, and the second buffer 557. In one embodiment, the multiplexer 521 is configured to receive the input instruction and subsequent instructions from the instruction cache, the first copy of the input instruction from the first checker, and the second copy of the input instruction from the second checker. In one embodiment, the multiplexer 521 is further configured to selectively provide a subsequent instruction, the first copy of the input instruction, or the second copy of the input instruction to the execution core 531 for execution according to a predetermined priority scheme. In one embodiment, the second copy of the input instruction is given a first execution priority, the first copy is given a second execution priority, and subsequent instructions are given a third execution priority, where the first priority is higher than the second and the second is higher than the third. In one embodiment, the first set of error types is a subset of the second set of error types. In another embodiment, the first set of error types is the complement of the second set. In one embodiment, the first set of error types includes errors indicating a level-0 cache way predictor miss, errors indicating a level-0 cache CAM extension mismatch, and errors indicating that store forwarding buffer data is unknown. In one embodiment, the second set of error types includes errors indicating a TLB miss, errors indicating a page miss, and any other errors indicating that an instruction executed incorrectly and needs to be re-executed. In one embodiment, the first delay unit 541 is configured to provide the first copy of the input instruction to the first checker 545 after a predetermined number of clock cycles in the first clock domain. In one embodiment, that predetermined number of clock cycles corresponds approximately to the time it takes the input instruction to pass through the execution core.
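The role of the first delay unit 541 and the first checker 545 can be sketched in Python as a fixed-length staging queue followed by a set-membership test. This is a behavioral illustration under assumptions made here, not the disclosed circuit: the class and function names, the string labels for the error types, and the representation of a UOP as a string are all hypothetical.

```python
from collections import deque
from typing import Iterable, Optional

# First set of error types named in the text (labels are illustrative).
EARLY_ERROR_TYPES = frozenset({
    "way_predictor_miss",
    "cam_extension_mismatch",
    "store_forward_data_unknown",
})


class DelayUnit:
    """Holds a copy of each UOP for `latency` cycles, like a staging queue."""

    def __init__(self, latency: int) -> None:
        if latency < 1:
            raise ValueError("latency must be at least one cycle")
        # Pre-fill with bubbles so the first copy emerges after `latency` ticks.
        self._slots = deque([None] * latency, maxlen=latency)

    def tick(self, uop: Optional[str] = None) -> Optional[str]:
        """Advance one cycle: accept a new copy (or a bubble), emit the oldest."""
        emerging = self._slots[0]
        self._slots.append(uop)  # maxlen drops the oldest slot automatically
        return emerging


def first_checker(uop_copy: Optional[str],
                  observed_errors: Iterable[str]) -> Optional[str]:
    """Return the copy for fast replay if an early-detectable error was seen."""
    if uop_copy is not None and EARLY_ERROR_TYPES & set(observed_errors):
        return uop_copy
    return None
```

Choosing the delay to match the execution core latency means the copy reaches the checker in the same cycle as the execution result it must be compared against, so no extra buffering is needed between them.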

In some cases, another unit in the processor 500 may generate its own instructions to perform its function. For example, a memory control unit or memory execution unit (not shown) in the processor 500 may at times need to dispatch instructions of its own into the pipeline for execution, including full store operations or UOPs to handle page splits, TLB reloads, and so on. Instructions of this kind are referred to as manufactured instructions because they are generated or manufactured by a unit in the processor 500 and are not in the instruction stream from the instruction cache. In one embodiment, the multiplexer 521 is also coupled to receive manufactured instructions and send them to the execution core 531 for execution. Because the multiplexer 521 can receive instructions from different paths at the same time, a predetermined priority scheme is needed to arbitrate among them. For example, in the same processing or clock cycle the multiplexer may receive a subsequent instruction from the scheduler 511, a first copy of an input instruction to be replayed from the first checker 545, another input instruction to be replayed from the second checker 555, and a manufactured instruction from another unit (e.g., the memory control or execution unit). In one embodiment, the multiplexer 521 gives low priority to instructions from the instruction cache, medium priority to replay instructions from the first checker, high priority to replay instructions from the second checker, and the highest priority to manufactured instructions.
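The priority scheme for multiplexer 521 amounts to a fixed-priority arbiter. The Python sketch below follows the ordering given in the text; the source labels, the dictionary interface, and the function name are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Tuple

# Highest priority first, following the ordering given in the text.
PRIORITY_ORDER = (
    "manufactured",   # instructions generated by, e.g., the memory control unit
    "slow_replay",    # replays from the second (late) checker
    "fast_replay",    # replays from the first (early) checker
    "scheduler",      # new instructions from the instruction cache via the scheduler
)


def mux_select(pending: Dict[str, Optional[str]]) -> Optional[Tuple[str, str]]:
    """Pick the instruction to send to the execution core this cycle.

    `pending` maps a source label to the UOP waiting on that input, or None
    if the input is idle. Returns (source, uop), or None if nothing is waiting.
    """
    for source in PRIORITY_ORDER:
        uop = pending.get(source)
        if uop is not None:
            return source, uop
    return None
```

For example, `mux_select({"scheduler": "add", "fast_replay": "load_A"})` selects the fast-replay UOP, and the scheduler's instruction waits for a later cycle. Ranking replays above new work keeps older, already-issued UOPs moving toward retirement.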

As indicated above, in one embodiment the error conditions detected by the first checker 545 may be a subset of the error conditions detected by the second checker 555. In that case, the second checker 555 needs to provide robust checking, because a UOP can no longer be replayed once it has passed the second checker 555. In another embodiment, the error conditions handled by the first checker 545 may be the complement of those handled by the second checker 555. In that case, the first checker 545 needs to provide robust checking on its own set of error conditions, rather than the "highly confident but not guaranteed" checking described above, because the later checker will not re-check the results of the earlier checker. The subset mode is therefore preferred.
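The difference between the subset and complement embodiments can be illustrated with Python sets. The error labels and function names below are hypothetical; the sketch only shows which errors would go unreplayed if the early checker failed to flag something in its own set.

```python
ALL_ERRORS = frozenset({
    "way_predictor_miss",
    "cam_extension_mismatch",
    "store_forward_data_unknown",
    "tlb_miss",
    "page_miss",
})
EARLY_ERRORS = frozenset({
    "way_predictor_miss",
    "cam_extension_mismatch",
    "store_forward_data_unknown",
})


def late_checker_set(mode: str) -> frozenset:
    """Errors covered by the second checker in each embodiment."""
    if mode == "subset":       # early set is a subset of the late set
        return ALL_ERRORS
    if mode == "complement":   # late set covers only what the early set does not
        return ALL_ERRORS - EARLY_ERRORS
    raise ValueError(f"unknown mode: {mode}")


def escaped_errors(mode: str, missed_by_early: set) -> frozenset:
    """Errors that slip past both checkers if the early checker misses them."""
    return frozenset(missed_by_early) - late_checker_set(mode)
```

In subset mode an error missed by the early checker is still caught later, so nothing escapes. In complement mode the same miss escapes entirely, which is why the early checker must then be exact rather than merely confident.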

As mentioned earlier, the second checker 555 can provide additional and/or complementary checking for error conditions not handled by the first checker 545. Which checker handles which conditions can be decided based on a number of factors including, but not limited to, processor performance considerations, design complexity, and die area. In one embodiment, the second checker 555 is responsible for replaying instructions because of TLB misses and various other problems that may arise in the memory control unit (not shown) of the processor 500. These may include problems or errors that are difficult to detect in a short time, such as cache misses based on a full physical address check and incorrect store forwarding based on a full physical address check.

In one embodiment, the first checker 545 and the second checker 555 cooperate to control the operation of the multiplexer 521. As shown in FIG. 5, the multiplexer 521 performs its function according to select signals received from the first checker 545 and the second checker 555, and an optional further select signal received from another unit, such as a memory control unit (not shown) that generates manufactured instructions that are not in the instruction stream from the instruction cache. If more than one instruction from multiple different paths is waiting to be executed, the multiplexer 521 uses these select signals to determine which instruction is sent to the execution core 531 for execution in a given processing cycle. In one embodiment, manufactured instructions are given a first execution priority; instructions from the second checker 555 are given a second priority, which is lower than the first priority; instructions from the first checker 545 are given a third priority, which is lower than the second priority; and subsequent instructions from the instruction cache via the scheduler 511 are given a fourth priority, the lowest of the four.
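The priority scheme of this embodiment can be sketched as a simple arbiter. This is a behavioral illustration only, not a description of the actual select-signal logic; the `Source` enumeration and the `select` function are hypothetical names.

```python
from enum import IntEnum
from typing import Dict, Optional, Tuple


class Source(IntEnum):
    """Instruction sources feeding the multiplexer; a lower value means higher priority."""
    MANUFACTURED = 1  # e.g. generated by a memory control unit
    SLOW_REPLAY = 2   # replayed by the second (late) checker
    FAST_REPLAY = 3   # replayed by the first (early) checker
    SCHEDULER = 4     # new instructions from the instruction cache via the scheduler


def select(pending: Dict[Source, Optional[str]]) -> Optional[Tuple[Source, str]]:
    """Pick the waiting instruction from the highest-priority source for this cycle."""
    ready = [src for src, uop in pending.items() if uop is not None]
    if not ready:
        return None  # nothing is waiting in this cycle
    winner = min(ready)  # smallest enum value = highest priority
    return winner, pending[winner]
```

For example, when a fast-replay UOP and a newly scheduled UOP compete in the same cycle, `select` returns the fast-replay UOP; a slow-replay UOP would win over both.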

In one embodiment, once a particular UOP has been sent out for fast replay by the first checker 545, the same instance of that UOP will not be sent out for slow replay by the second checker 555, because duplicates would otherwise exist. To prevent this, in one embodiment each UOP may include special fields that the first checker 545 and the second checker 555 use to coordinate replay activity between the two checkers. For example, in one embodiment a UOP may include a field called NEEDS_FAST_REPLAY, which is set by the first checker 545 to indicate that the first checker 545 wants to send the UOP out for fast replay. The UOP may also include another field called GOT_FAST_REPLAY, which, in one embodiment, is set through cooperation between the first checker 545 and the second checker 555. For example, assume that the first checker 545 wants to send a first instruction for fast replay because an error of the first type has been detected. In this case, the first checker 545 sets the NEEDS_FAST_REPLAY field of the corresponding UOP to indicate that this particular UOP needs to be replayed on the fast replay path. If, in the same clock cycle, the second checker 555 wants to send a second UOP for slow replay, the GOT_FAST_REPLAY field of the first instruction is cleared and the multiplexer 521 is controlled to select the slow-replay UOP instead of the one seeking fast replay. Then, when the first UOP reaches the second checker 555, it is sent out for replay on the slow replay path because its NEEDS_FAST_REPLAY field has been set.
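The coordination between the two checkers can be sketched as follows. This is a minimal model under stated assumptions: the `Uop` class and both function names are hypothetical, and the rule that the second checker skips an instance whose GOT_FAST_REPLAY field is set is an inference from the duplicate-avoidance goal stated above rather than something the text spells out.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    """A micro-operation carrying the two replay-coordination fields."""
    name: str
    needs_fast_replay: bool = False  # NEEDS_FAST_REPLAY: first checker wants a fast replay
    got_fast_replay: bool = False    # GOT_FAST_REPLAY: the fast replay was actually granted


def request_fast_replay(uop: Uop, slow_replay_pending: bool) -> bool:
    """First-checker side: request a fast replay for `uop`.

    The request loses if the second checker sends a slow replay in the same
    cycle, in which case GOT_FAST_REPLAY stays clear.
    """
    uop.needs_fast_replay = True
    uop.got_fast_replay = not slow_replay_pending
    return uop.got_fast_replay


def second_checker_should_replay(uop: Uop, late_error: bool) -> bool:
    """Second-checker side: decide whether this instance goes out on the slow path."""
    if uop.got_fast_replay:
        # Assumption: this instance was already replayed on the fast path,
        # so replaying it again would create a duplicate.
        return False
    # Replay on a late error, or if a fast replay was requested but not granted.
    return late_error or uop.needs_fast_replay
```

In this sketch, a UOP that loses arbitration to a slow replay keeps NEEDS_FAST_REPLAY set with GOT_FAST_REPLAY clear, so the second checker later replays it on the slow path.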

FIG. 6 depicts a flowchart of one embodiment of a method 600 that uses fast and slow replay paths to facilitate data speculation. The method 600 begins at block 601 and proceeds to block 605. At block 605, an execution core or unit performs data speculation in executing an input instruction. The method 600 then proceeds from block 605 to block 609. At block 609, it is determined whether an error of the first type has been detected. As explained above, in one embodiment an error of the first type occurs on an L0 data cache way predictor miss (in which case the data cannot be in the L0 data cache), an L0 data cache CAM extension mismatch (i.e., the way predictor hits but the tag does not match), store forwarding buffer data being unknown (i.e., the data is assumed to be forwarded from the store forwarding buffer but the store data is not available), and so on. At block 613, the input instruction is re-executed if an error of the first type has been detected. As described above, when an error of the first type is detected, the first checker unit (i.e., the fast or early checker) sends a copy of the input instruction for replay or re-execution on the fast replay path. The method 600 then proceeds to block 617. At block 617, it is determined whether an error of the second type has been detected. In this embodiment, the second checker (i.e., the slow or late checker) is responsible for determining whether an error of the second type has occurred. At block 621, the input instruction is re-executed if an error of the second type has occurred. As described above, if an error of the second type has occurred, the second checker is responsible for sending a copy of the input instruction for replay on the slow replay path.
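The flow of method 600 can be sketched as a loop over execution attempts. This is a simplified behavioral model: the callback parameters, the `max_attempts` bound, and the "retire" and "gave_up" outcomes are assumptions added for illustration and do not appear in the flowchart description.

```python
from typing import Callable, List


def run_method_600(
    execute: Callable[[int], None],
    early_error: Callable[[int], bool],
    late_error: Callable[[int], bool],
    max_attempts: int = 8,
) -> List[str]:
    """Trace one input instruction through blocks 605 to 621.

    Each callback receives the attempt number. The returned list records
    which path was taken after each attempt.
    """
    trace: List[str] = []
    for attempt in range(max_attempts):
        execute(attempt)                 # block 605: execute with data speculation
        if early_error(attempt):         # block 609: error of the first type?
            trace.append("fast_replay")  # block 613: re-execute via the fast replay path
            continue
        if late_error(attempt):          # block 617: error of the second type?
            trace.append("slow_replay")  # block 621: re-execute via the slow replay path
            continue
        trace.append("retire")           # assumption: no error, so the instruction completes
        return trace
    trace.append("gave_up")              # assumption: bound added so the sketch always terminates
    return trace
```

For instance, an instruction that hits a first-type error on its first attempt, a second-type error on its second attempt, and no error on its third would produce a trace of one fast replay, one slow replay, and then completion.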

The invention has been described in connection with the preferred embodiments. In view of the foregoing description, numerous alternatives, modifications, variations, and uses will be apparent to those skilled in the art.

Claims

1. A microprocessor comprising:

an execution core to perform data speculation in executing a first instruction;

a first replay mechanism to determine whether an error of a first type, indicating that the data speculation is erroneous, has been detected and, if the error of the first type has been detected, to send a first copy of the first instruction back to the execution core for replay; and

a second replay mechanism to determine whether an error of a second type has been detected and, if the error of the second type has been detected, to send a second copy of the first instruction back to the execution core for replay.

2. The microprocessor of claim 1, wherein the error of the first type is detectable within a first period, the error of the second type is detectable within a second period, and the second period is longer than the first period.

3. The microprocessor of claim 1, further comprising:

a first delay unit to generate the first copy of the first instruction and to hold the first copy for at least one clock cycle in a first clock domain.

4. The microprocessor of claim 3, further comprising:

a second delay unit to generate the second copy of the first instruction and to hold the second copy for at least one clock cycle in a second clock domain.

5. The microprocessor of claim 4, further comprising:

an instruction cache to store the first instruction and a subsequent instruction and to provide them to the execution core.

6. The microprocessor of claim 5, further comprising:

a selector coupled to receive the subsequent instruction from the instruction cache, the first copy of the first instruction from the first replay mechanism, and another instruction from the second replay mechanism, the selector providing to the execution core for execution, according to a predetermined priority scheme, the subsequent instruction from the instruction cache, the first copy of the first instruction from the first replay mechanism, or the other instruction from the second replay mechanism.

7. The microprocessor of claim 6, wherein the selector comprises a multiplexer.

8. The microprocessor of claim 6, wherein the other instruction is given a first execution priority, the first copy of the first instruction is given a second execution priority, and the subsequent instruction is given a third execution priority, the second execution priority being lower than the first execution priority and the third execution priority being lower than the second execution priority.

9. The microprocessor of claim 1, wherein the errors of the first type are a subset of the errors of the second type.

10. The microprocessor of claim 1, wherein the errors of the first type are the complement of the errors of the second type.

11. The microprocessor of claim 1, wherein the error of the first type is selected from a group of errors consisting of an error indicating a level-0 cache way predictor miss, an error indicating a level-0 cache CAM extension mismatch, and an error indicating that store forwarding buffer data is unknown.

12. The microprocessor of claim 1, wherein the error of the second type is selected from a group of errors consisting of an error indicating a TLB miss and an error indicating incorrect forwarding from a store based on a full physical address check.

13. The microprocessor of claim 3, wherein the first delay unit is to provide the first copy of the first instruction after a predefined number of clock cycles in the first clock domain, the predefined number of clock cycles corresponding approximately to the delay of the first instruction through the execution core in the first clock domain.

14. The microprocessor of claim 6, further comprising:

means for manufacturing instructions that are not in the instruction stream from the instruction cache.

15. The microprocessor of claim 14, wherein the selector is coupled to receive the manufactured instructions and to send them to the execution core for execution.

16. The microprocessor of claim 15, wherein the selector gives low priority to instructions from the instruction cache, medium priority to replayed instructions from a first checker, high priority to replayed instructions from a second checker, and the highest priority to the manufactured instructions.

17. The microprocessor of claim 5, further comprising:

a scheduler, coupled to the instruction cache and the execution core, to schedule and dispatch the first instruction received from the instruction cache for execution by the execution core.

18. A method comprising:

performing data speculation in executing a first instruction in an execution core;

detecting an error of a first type indicating that the data speculation is erroneous;

in response to detecting the error of the first type, sending a first copy of the first instruction via a first replay path for execution by the execution core;

detecting an error of a second type indicating that the data speculation is erroneous;

when the error of the first type is detected, re-executing the first copy of the first instruction in the execution core;

in response to detecting the error of the second type, sending a second copy of the first instruction via a second replay path for execution by the execution core; and

when the error of the second type is detected, re-executing the second copy of the first instruction.

19. the method for claim 18, wherein the triplicate of article one instruction is assigned with execution priority, and the first authentic copy of article one instruction is assigned with execution priority.

20. the method for claim 19, wherein the execution priority of the triplicate of article one instruction is higher than the execution priority of the first authentic copy of article one instruction.