HK1246441B - Bulk allocation of instruction blocks to a processor instruction window
Description
Background Art
Designers of instruction set architectures (ISAs) and processors make trade-offs between power consumption and performance. For example, if a designer selects an ISA whose instructions deliver higher performance, the processor's power consumption may also be higher. Alternatively, if the designer selects an ISA whose instructions consume less power, the performance may be lower. Power consumption can be related to the amount of the processor's hardware resources, such as arithmetic logic units (ALUs), cache lines, or registers, that the instructions use during execution. Using a large amount of such hardware resources can deliver higher performance at the cost of higher power consumption, while using a small amount of such hardware resources can yield lower power consumption at the cost of lower performance. A compiler may be used to compile high-level code into instructions compatible with the ISA and the processor architecture.
Summary of the Invention
A processor core in an instruction-block-based microarchitecture includes a control unit that allocates instructions into an instruction window in bulk by fetching an instruction block and its associated resources, including control bits and operands, at the same time. Such bulk allocation supports increased efficiency of processor core operation by enforcing consistent management and policies across all of the instructions in a block during execution. For example, when an instruction block branches back on itself, it can be reused in a refresh process rather than re-fetched from the instruction cache. Because all of the instruction block's resources are in one place, the instructions can be left in place and only the valid bits need to be cleared. Bulk allocation also supports operand sharing by instructions in a block and explicit messaging between instructions.
This Summary is provided to introduce in simplified form a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows an illustrative computing environment in which a compiler provides encoded instructions that run on an architecture that includes multiple processor cores;
FIG. 2 is a block diagram of an illustrative microarchitecture for an exemplary processor core;
FIG. 3 shows an illustrative arrangement of a block header; and
FIGS. 4-15 are flowcharts of illustrative methods.
Like reference numerals indicate like elements in the drawings. Elements are not drawn to scale unless otherwise indicated.
DETAILED DESCRIPTION
FIG. 1 shows an illustrative computing environment 100 with which the present bulk allocation of instruction blocks may be utilized. The environment includes a compiler 105 that may be used to generate encoded machine-executable instructions 110 from a program 115. The instructions 110 can be handled by a processor architecture 120 that is configured to process blocks of instructions of variable size containing, for example, between 4 and 128 instructions.
The processor architecture 120 typically includes multiple processor cores (representatively indicated by reference numeral 125) in a tiled configuration that are interconnected by an on-chip network (not shown) and further interoperate with one or more level-2 (L2) caches (representatively indicated by reference numeral 130). While the number and configuration of cores and caches can vary by implementation, it is noted that the physical cores can be merged together, during runtime of the program 115, into one or more larger logical processors in a process termed "composing" that can enable more processing power to be devoted to program execution. Alternatively, when program execution supports suitable thread-level parallelism, the cores 125 can be split, in a process called "decomposing," to work independently and execute instructions from independent threads.
FIG. 2 is a simplified block diagram of a portion of an illustrative processor core 125. As shown, the processor core 125 may include a front-end control unit 202, an instruction cache 204, a branch predictor 206, an instruction decoder 208, an instruction window 210, a left operand buffer 212, a right operand buffer 214, an arithmetic logic unit (ALU) 216, another ALU 218, registers 220, and a load/store queue 222. In some cases, the buses (indicated by the arrows) may carry data and instructions, while in other cases the buses may carry data (e.g., operands) or control signals. For example, the front-end control unit 202 may communicate, via a bus that carries only control signals, with other control networks. Although FIG. 2 shows a certain number of illustrative components for the processor core 125 that are arranged in a particular arrangement, there may be more or fewer components arranged differently depending on the needs of a particular implementation.
The front-end control unit 202 may include circuitry configured to control the flow of information through the processor core and circuitry to coordinate activities within the core. The front-end control unit 202 may also include circuitry to implement a finite state machine (FSM) in which the states enumerate each of the operating configurations that the processor core may take. Using the opcodes (described below) and/or other inputs (e.g., hardware-level signals), the FSM circuits in the front-end control unit 202 can determine the next state and control the outputs.
Accordingly, the front-end control unit 202 can fetch instructions from the instruction cache 204 for processing by the instruction decoder 208. The front-end control unit 202 may exchange control information with other portions of the processor core 125 over control networks or buses. For example, the front-end control unit may exchange control information with a back-end control unit 224. The front-end and back-end control units may be integrated into a single control unit in some implementations.
The front-end control unit 202 may also coordinate and manage control of the various cores and other parts of the processor architecture 120 (FIG. 1). Accordingly, for example, instruction blocks may be executed simultaneously on several cores, and the front-end control unit 202 may exchange control information via control networks with other cores to ensure synchronization, as needed, of execution of the various instruction blocks.
The front-end control unit 202 may further process control information and meta-information regarding blocks of instructions that are executed atomically. For example, the front-end control unit 202 can process block headers that are associated with blocks of instructions. As discussed below in more detail, a block header may include control information and/or meta-information regarding the block of instructions. Accordingly, the front-end control unit 202 can include combinational logic, state machines, and temporary storage units, such as flip-flops, to process the various fields in the block header.
The front-end control unit 202 can fetch and decode a single instruction or multiple instructions per clock cycle. The decoded instructions can be stored in the instruction window 210, which is implemented in processor core hardware as a buffer. The instruction window 210 can support an instruction scheduler 230 that, in some implementations, may keep a ready state of each decoded instruction's inputs, such as predications and operands. For example, when all of its inputs (if any) are ready, a given instruction may be woken up by the instruction scheduler 230 and be ready to issue.
Before an instruction is issued, any operands required by the instruction may be stored in the left operand buffer 212 and/or the right operand buffer 214, as needed. Depending on the opcode of the instruction, operations may be performed on the operands using the ALU 216 and/or the ALU 218 or other functional units. The outputs of an ALU may be stored in an operand buffer or stored in one or more registers 220. Store operations that issue in a data flow order may be queued in the load/store queue 222 until the block of instructions commits. When the block of instructions commits, the load/store queue 222 may write the committed block's stores to memory. The branch predictor 206 may process block header information relating to branch exit types and factor that information into its branch predictions.
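By way of a non-limiting illustration, the following C++ sketch models how such dataflow-style wakeup might behave: an entry becomes ready to issue once its operands and, if it is predicated, its predicate have arrived. The structure, field names, and helper function are assumptions made for the example and do not represent the implementation of the instruction scheduler 230.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-instruction ready state kept by a scheduler. An
// instruction wakes up once every operand it expects (and its
// predicate, if any) has been delivered.
struct SchedulerEntry {
    uint8_t operands_needed = 0;   // 0, 1, or 2 operand slots
    uint8_t operands_ready  = 0;
    bool    predicated      = false;
    bool    predicate_ready = false;
    bool    issued          = false;

    bool ready() const {
        return !issued &&
               operands_ready == operands_needed &&
               (!predicated || predicate_ready);
    }
};

// Deliver an operand (or predicate) produced by another instruction in
// the block directly to a target entry, then collect everything that
// can now issue.
std::vector<int> deliver_and_wake(std::vector<SchedulerEntry>& window,
                                  int target, bool is_predicate) {
    if (is_predicate) window[target].predicate_ready = true;
    else              window[target].operands_ready++;

    std::vector<int> issue_list;
    for (int i = 0; i < static_cast<int>(window.size()); ++i) {
        if (window[i].ready()) {
            window[i].issued = true;
            issue_list.push_back(i);   // woken up and ready to issue
        }
    }
    return issue_list;
}
```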
As mentioned above, the processor architecture 120 typically utilizes instructions organized in blocks that are fetched, executed, and committed atomically. Thus, a processor core may fetch the instructions belonging to a single block en masse, map them to the execution resources inside the processor core, execute the instructions, and commit their results in an atomic fashion. The processor may either commit the results of all instructions or nullify the execution of the entire block. Instructions inside a block may execute in a data flow order. In addition, the processor may permit the instructions inside a block to communicate directly with each other using messages or other suitable forms of communication. Thus, an instruction that produces a result may, instead of writing the result to a register file, communicate that result to another instruction in the block that consumes the result. As an example, an instruction that adds the values stored in registers R1 and R2 may be expressed as shown in Table 1 below:
Table 1
In this way, the source operands are not specified with the instruction; instead, they are specified by the instructions that target the ADD instruction. The compiler 105 (FIG. 1) may explicitly encode the control and data dependencies during compilation of the instructions 110, thereby freeing the processor core from rediscovering these dependencies at runtime. This may advantageously result in reduced processor load and energy savings during execution of these instructions. As an example, the compiler may use predication to convert all control dependencies into data flow instructions. Using these techniques, the number of accesses to power-hungry register files can be reduced. Table 2, below, shows an example of a general instruction format for such instructions:
Table 2
Each instruction may be of a suitable size, such as 32 bits, 64 bits, or another size. In the example shown in Table 2, each instruction may include an OPCODE field, a PR (predication) field, a BID (broadcast ID) field, an XOP (extended OPCODE) field, a TARGET1 field, and a TARGET2 field. The OPCODE field may specify a unique operation code for an instruction or a block of instructions, such as add, read, write, or multiply. The PR (predication) field may specify any predication associated with the instruction. For example, a two-bit PR field may be used as follows: 00 = not predicated, 01 = reserved, 10 = predicated on false, 11 = predicated on true. Thus, for example, an instruction that executes only if the result of a comparison is true may be predicated on the result of another instruction that performs the comparison. The BID (broadcast ID) field may support sending an operand to any number of consumer instructions in a block. A two-bit BID field may be used to encode the broadcast channel on which the instruction receives one of its operands. The XOP (extended OPCODE) field may support extending the types of opcodes. The TARGET1 and TARGET2 fields may allow up to two target instructions to be encoded. The target fields may specify a consumer instruction of the result of the producer instruction, thus permitting direct communication between instructions.
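As a purely illustrative aid, the following C++ sketch shows one hypothetical way a 32-bit instruction with the fields named above could be unpacked. Only the two-bit widths of the PR and BID fields come from the description; every other field width, and the overall bit layout, is an assumption made for this example and is not the encoding of Table 2.

```cpp
#include <cstdint>

// A hypothetical 32-bit packing of the fields named in the text. Only
// the 2-bit PR and BID widths are stated in the description; the other
// widths are illustrative assumptions, not the patent's layout.
struct EncodedInstruction {
    uint32_t raw;

    uint32_t opcode()  const { return (raw >> 23) & 0x1FF; } // 9-bit OPCODE (assumed)
    uint32_t pr()      const { return (raw >> 21) & 0x3;   } // 2-bit predication
    uint32_t bid()     const { return (raw >> 19) & 0x3;   } // 2-bit broadcast ID
    uint32_t xop()     const { return (raw >> 16) & 0x7;   } // 3-bit extended opcode (assumed)
    uint32_t target1() const { return (raw >> 8)  & 0xFF;  } // first consumer target (assumed width)
    uint32_t target2() const { return  raw        & 0xFF;  } // second consumer target (assumed width)

    // PR encodings from the text: 00 not predicated, 01 reserved,
    // 10 predicated on false, 11 predicated on true.
    bool predicated()         const { return pr() >= 2; }
    bool predicate_polarity() const { return pr() == 3; }
};
```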
Each block of instructions may have certain information associated with the block, such as control information and/or meta-information related to the block. This information may be generated by the compiler 105 during compilation of the program into the instructions 110 for execution on the processor architecture 120. Some of this information may be extracted by the compiler when compiling a block of instructions and by examining the nature of the instructions during runtime.
In addition, the information associated with a block of instructions may be meta-information. As an example, such information may be provided to a processor core using special instructions, or using instructions that provide target encoding related to a register or other memory that may hold the relevant information associated with the block of instructions. In the case of special instructions, the opcode field of such instructions can be used to communicate information relating to the block of instructions. In another example, such information may be maintained as part of the processor status word (PSW). This information may, for example, advantageously help the processor execute the block of instructions more efficiently.
Various types of information can be provided to a processor core using a block header, special instructions, memory-referenced locations, the processor status word (PSW), or various combinations thereof. An illustrative instruction block header 300 is shown in FIG. 3. In this illustrative example, the block header 300 is 128 bits and begins at offset 0 from the block's program counter. The respective beginning and ending of each field is also shown. The fields are described in Table 3 below:
Table 3
While the block header shown in FIG. 3 and described in Table 3 includes a number of fields, it is intended to be illustrative, and other field arrangements may be utilized for particular implementations.
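For illustration only, the following C++ sketch shows one hypothetical 128-bit header layout. Because the contents of Table 3 are not reproduced here, the chosen fields and widths are assumptions inferred from fields mentioned elsewhere in this description (ID, SIZE, exit types, store mask, and write mask) and do not represent the actual arrangement of block header 300.

```cpp
#include <cstdint>

// A hypothetical layout for a 128-bit block header. These fields and
// widths are assumptions based only on the fields the description
// mentions; they are not the layout of Table 3.
struct BlockHeader {
    uint32_t id;           // block identifier usable for lookup
    uint8_t  size_index;   // index into a size table rather than a raw size
    uint8_t  exit_types;   // branch exit type hints for the branch predictor
    uint16_t reserved;
    uint32_t store_mask;   // which store slots the block may use
    uint32_t write_mask;   // which global registers the block may write
};
static_assert(sizeof(BlockHeader) == 16,
              "128-bit header in this sketch (on typical ABIs)");
```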
In an illustrative example, the compiler 105 (FIG. 1) may select information for inclusion in a block header, or for special instructions that can provide such information to a processor core, based on the nature of the instructions and/or based on the nature of the processing requirements, such as high performance or low power. This may advantageously allow a better balancing of trade-offs between performance and power consumption. For certain types of processing applications, such as high-performance computing with a large number of cores, a large amount of information may be a desirable option. Alternatively, for other types of processing applications, such as embedded processors used in the Internet of Things, mobile devices, wearable devices, head-mounted display (HMD) devices, or other embedded computing applications, less information may be a desirable option.
The extent of the information communicated using a block header or special instructions can be tailored depending upon the nature of the instructions in the block. For example, if the block of instructions includes a loop that is executed in a recurring manner, then more extensive information might be needed to encapsulate the control information associated with that block. The additional control information may allow the processor core to execute the loop more efficiently and thereby improve performance.
Alternatively, if there is a block of instructions that will be executed rarely, then relatively less information may suffice. For example, if the block of instructions includes several predicated control loops, then more information may be needed. Similarly, if the block of instructions has an extensive amount of instruction-level parallelism, then more information may be needed as part of the block header or special instructions.
The additional control information in the block header or the special instructions may be used, for example, to effectively exploit the instruction-level parallelism in the block of instructions. If the block of instructions includes several branch predictions, then more information may be needed. The additional control information regarding branch predictions will typically make code execution more efficient, as it can result in fewer pipeline flushes.
It is noted that the functionality corresponding to the fields in the block header may be combined or further separated. Similarly, a special instruction may provide information related to any one of the fields shown in FIG. 3 and Table 3, or it may combine the information from such fields. For example, while the illustrative block header of FIG. 3 and Table 3 includes a separate ID field and SIZE field, these two fields may be combined into a single field.
Likewise, a single special instruction may, when decoded, provide information regarding the size of the block of instructions and the information in the ID field. Unless indicated otherwise, the special instructions may be included anywhere in the block of instructions. For example, a BLOCK_SIZE #size instruction may contain an immediate field that includes a value for the size of the block of instructions. The immediate field may contain an integer value that provides the size information. Alternatively, the immediate field may include an encoded value relating to the size information, so that the size information may be obtained by decoding the encoded value, for example, by looking up the value in a size table that may be expressed using one of logic, register, memory, or code stream. In another example, a BLOCK_ID #id special instruction may convey the block ID number.
A separate mathematical function or a memory-based table may map the block ID to the memory address of a block header. The block ID conveyed as part of such an instruction may be unique to each block of instructions. In another example, a BLOCK_HDR_ID #id instruction may convey the block header ID number. A separate mathematical function or a memory-based table may map the block ID to the memory address of a block header. The block ID conveyed as part of such an instruction may be shared by several blocks of instructions that have the same header structure or fields.
In another example, a BLOCK_INFO #size, #exit types, #store mask, #write mask instruction may provide information regarding the enumerated fields. These fields may correspond to any one of the fields discussed above with respect to Table 3. Other changes may be made to the block header structure and format, and to the special instructions, according to the requirements of a given implementation. For example, additional fields may be provided that include information relating to the characteristics of the block of instructions. Particular fields can be included based on the frequency of execution of the block of instructions.
The fields included in the block header structure, or the information provided via the special instructions or other mechanisms discussed earlier, may be part of a publicly available standard instruction set architecture (ISA) of a particular processor or a family of processors. A subset of the fields may be a proprietary extension to the ISA. Certain bit values in a field may be part of the standard ISA for the processor, but certain other bit values in the field may provide proprietary functionality. Such an exemplary field may allow an ISA designer to add proprietary extensions to the ISA without entirely disclosing the nature and the functionality associated with the proprietary extension. Thus, in this instance, the compiler tools distributed by the ISA designer would support the proprietary bit values in the field, an entirely separate proprietary field, or a special instruction. The use of such fields may be particularly relevant to hardware accelerators that are proprietary to certain processor designs. Thus, a program may include a block header field or a special instruction that is unrecognizable, but the program may further include a recipe to decipher the field or decode the instruction.
The compiler 105 (FIG. 1) may process a block of instructions, which is typically configured to execute atomically by one or more processor cores, in order to generate information about the block of instructions, including meta-information and control information. Some programs may be compiled for only one ISA, for example, an ISA used with processors for the Internet of Things, mobile devices, HMD devices, wearable devices, or other embedded computing environments. The compiler may employ techniques such as static code analysis or code profiling to generate information that is relevant to the block of instructions. In some cases, the compiler may consider factors such as the characteristics of the block of instructions and its frequency of execution. The relevant characteristics of the block of instructions may include, for example, but are not limited to: (1) the instruction-level parallelism, (2) the number of loops, (3) the number of predicated control instructions, and (4) the number of branch predictions.
FIG. 4 is a flowchart of an illustrative method 400 for managing instruction blocks in an instruction window disposed in a processor core. Unless specifically stated, the methods or steps in the flowchart of FIG. 4, and those in the other flowcharts shown in the drawings and described below, are not constrained to a particular order or sequence. In addition, some of the methods or steps thereof can occur or be performed concurrently, and not all the methods or steps have to be performed in a given implementation, depending on the requirements of such implementation, and some methods or steps may be optionally utilized. Likewise, some steps may be eliminated in some implementations to reduce overhead, but this may result in increased brittleness, for example. The various feature, cost, overhead, performance, and robustness trade-offs that may be implemented in any given application may typically be viewed as a matter of design choice.
In step 405, the ages of fetched instruction blocks are explicitly tracked using, for example, an age vector. Thus, rather than relying on instruction block order (i.e., position) in the instruction window, which is typically used to implicitly track age, the control unit maintains explicit state. An age-ordered list of instruction blocks is maintained in step 410. In some implementations, instruction block priority (where the priority may be determined by the compiler in some cases) may also be tracked, and a priority-ordered list of instruction blocks may also be maintained.
In step 415, when a given block is identified for handling, the age-ordered list is searched to find a matching instruction block. In some implementations, the priority-ordered list may also be searched for a match. If a matching instruction block is found, then it can be refreshed in step 420 without having to be re-fetched from the instruction cache, which can improve processor core efficiency. Such refreshing enables an instruction block to be reused in situations where, for example, a program executes in a tight loop and instructions branch back on themselves. Such efficiency increases may also be compounded when multiple processor cores are composed into a large-scale array. When an instruction block is refreshed, the instructions are left in place, and only the valid bits in the operand buffers and load/store queue are cleared.
If a match for the instruction block is not found, then the age-ordered list (or the priority-ordered list) can be utilized again to find an instruction block that can be committed in order to open a slot in the instruction window for the new instruction block. For example, the oldest instruction block or the lowest-priority instruction block may be committed (where a high-priority block may be desirable to keep buffered, since there is a likelihood of its future reuse). In step 425, the new instruction block is mapped into the available slot. The instruction block can be allocated using a bulk allocation process in which the instructions in the block and all the resources associated with the instructions are fetched at once (i.e., en masse).
In step 430, the new instruction block is executed so that its instructions are committed atomically. In step 435, other instruction blocks may be executed in order of age, in a manner similar to a conventional reorder buffer, to commit their respective instructions in an atomic manner.
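The following C++ sketch illustrates, under simplifying assumptions, the control flow of method 400: an explicit age-ordered list is searched for a matching block, a hit is refreshed in place, and a miss commits the oldest block and bulk allocates the new one into the freed slot. The data structures and helper names are illustrative and not the control unit's actual design.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Illustrative bookkeeping: the control unit keeps an explicit
// age-ordered list instead of inferring age from window position.
struct BlockEntry {
    uint64_t tag;      // identifies the instruction block (e.g., its address)
    int      slot;     // slot it occupies in the instruction window
    uint32_t age;      // explicitly tracked age
};

struct InstructionWindow {
    std::deque<BlockEntry> age_ordered;   // front = oldest
    uint32_t next_age = 0;

    // Search the age-ordered list for a block already resident in the window.
    std::optional<int> find(uint64_t tag) const {
        for (const auto& e : age_ordered)
            if (e.tag == tag) return e.slot;
        return std::nullopt;
    }

    // Refresh: reuse the block in place; only the valid bits in the
    // operand buffers and load/store queue would be cleared (not modeled).
    void refresh(int /*slot*/) { /* clear valid bits only */ }

    // Commit the oldest block to free a slot, then map the new block
    // into it using bulk allocation (instructions plus resources at once).
    int evict_oldest_and_map(uint64_t tag) {
        int slot = 0;
        if (!age_ordered.empty()) {
            slot = age_ordered.front().slot;   // commit/replace the oldest
            age_ordered.pop_front();
        }
        age_ordered.push_back({tag, slot, next_age++});
        return slot;
    }
};

void handle_block(InstructionWindow& w, uint64_t tag) {
    if (auto slot = w.find(tag)) w.refresh(*slot);          // hit: no re-fetch
    else                         w.evict_oldest_and_map(tag); // miss: bulk allocate
}
```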
FIG. 5 is a flowchart of an illustrative method 500 that may be performed by an instruction-block-based microarchitecture. In step 505, a control unit in a processor core causes fetched instruction blocks to be buffered with either contiguous replacement or non-contiguous replacement. In step 510, with contiguous instruction block replacement, the buffer can be operated as a circular buffer. In step 515, with non-contiguous instruction block replacement, instruction blocks can be replaced out of order. For example, in step 520, explicit age-based tracking can be performed so that instruction blocks are committed and replaced based on the tracked ages, in a manner similar to that described above. Priority can also be tracked in step 525, and the tracked priority can be used to commit and replace instruction blocks.
FIG. 6 is a flowchart of an illustrative method 600 that may be performed by a control unit disposed in a processor core. In step 605, the state of buffered instruction blocks is tracked, and a list of instruction blocks is maintained using the tracked state in step 610. For example, the state can include age, priority, or other information or context, depending on particular implementation requirements. In step 615, when an instruction block is identified for mapping, the list is checked for a match, as shown in step 620. In step 625, a matching instruction block from the list is refreshed without being re-fetched. When a matching instruction block is not found in the list, then in step 630 the instruction block is fetched from the instruction cache and mapped into an available slot in the instruction window, in a manner similar to that described above.
FIG. 7 is a flowchart of an illustrative method 700 for managing instruction blocks in an instruction window disposed in a processor core. In step 705, a size table of instruction block sizes is maintained in the processor core. The size table can be expressed in various ways, for example, using one of logic, register, memory, code stream, or another suitable structure. In step 710, an index that is encoded in the header of an instruction block is read. The instruction block includes one or more decoded instructions. Accordingly, rather than using the SIZE field shown in FIG. 3 and Table 3 to hard-code an instruction block size, the field may be used to encode or store an index into the size table. That is, the index may operate as a pointer to an entry in the size table to enable a particular size to be associated with the instruction block.
The number of size entries included in the size table can vary by implementation. A greater number of size entries may be used to enable more granularity, which may be beneficial in cases where the distribution of instruction block sizes associated with a given program is relatively wide, but at a cost of increased overhead in typical implementations. In some cases, the number of sizes included in the table can be selected by the compiler to cover a particular distribution of instruction block sizes in a way that optimizes overall instruction packing density and minimizes no-ops. For example, the sizes included in the size table can be selected to match the block instruction sizes commonly used in the program. In step 715, the index is used to look up an instruction block size from the size table. In step 720, the instruction block is mapped into an available slot in the instruction window based on its size.
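As an illustrative sketch, the following C++ fragment shows how a header-encoded index might be used to look up a block size and select a window slot, following steps 710 through 720. The table contents, slot model, and first-fit selection are assumptions made for the example.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Illustrative size table: the header's SIZE field holds an index into
// a small table rather than a raw instruction count. The entries here
// are assumed values, not ones prescribed by the description.
constexpr std::array<uint16_t, 4> kSizeTable = {4, 32, 64, 128};

struct WindowSlot {
    uint16_t capacity;   // how many instructions the slot can hold
    bool     free;
};

// Look up the block size from the header index, then map the block to
// the first free slot whose capacity can accommodate it.
std::optional<int> map_block(std::array<WindowSlot, 8>& slots,
                             uint8_t header_size_index) {
    uint16_t block_size = kSizeTable.at(header_size_index);
    for (int i = 0; i < static_cast<int>(slots.size()); ++i) {
        if (slots[i].free && slots[i].capacity >= block_size) {
            slots[i].free = false;
            return i;        // slot chosen based on the looked-up size
        }
    }
    return std::nullopt;     // no suitable slot currently available
}
```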
In some implementations, as shown in step 725, the instruction window may be segmented into two or more sub-windows that, for example, use two or more different sizes. Such variation in the segmented sub-windows may enable further accommodation of a given distribution of instruction block sizes and may further increase instruction packing density. The segmenting may also be performed dynamically in some scenarios.
FIG. 8 is a flowchart of an illustrative method 800 that may be performed by an instruction-block-based microarchitecture. In step 805, a size table is implemented. As discussed above, the size table may be implemented using one of logic, register, memory, code stream, or another suitable construct, and may include sizes that correspond to those commonly utilized in the distribution of instruction blocks used by a given program. In step 810, an instruction block header is checked for a pointer that refers to an entry in the size table. In step 815, the size identified by the table entry is used to determine placement of the instruction block within the instruction window.
In step 820, the resources associated with the instruction block are bulk allocated. Restrictions designated in the instruction block header are used when mapping the instruction block into the instruction window in step 825. These may include, for example, restrictions on alignment and on the capacity of the instruction window to buffer instruction blocks. In step 830, the order of the instruction blocks in the instruction window is tracked by the control unit, and blocks may be committed out of order in some situations. For example, rather than using a circular buffer of instruction blocks in which blocks are handled based on their position in the instruction window, blocks can be prioritized so that heavily used or particularly important instruction blocks are handled out of order, which can increase processing efficiency.
In step 835, the ages of instruction blocks can be explicitly tracked in some cases, and instruction blocks can be committed based on such explicitly tracked ages. In step 840, an instruction block is refreshed (that is, reused without having to be re-fetched from the instruction cache).
FIG. 9 is a flowchart of an illustrative method 900 that may be performed by a control unit disposed in a processor core. In step 905, an instruction window is configured with multiple segments having two or more different sizes, in a manner similar to that described above. In step 910, a block instruction header is checked for an index that is encoded therein. A lookup is performed in the size table using the index in step 915, and, based on the size lookup, the instruction block is placed into an instruction window segment that is suited to the particular size of the block in step 920. Resources associated with the instruction block are fetched using bulk allocation in step 925.
FIG. 10 is a flowchart of an illustrative method 1000 for managing instruction blocks in an instruction window disposed in a processor core. In step 1005, an instruction block is mapped from the instruction cache into the instruction window. The instruction block includes one or more decoded instructions. In step 1010, resources that are associated with each of the instructions in the instruction block are allocated. The resources typically include control bits and operands, and the allocation may be performed using a bulk allocation process in which all of the resources are obtained or fetched en masse.
Instead of tightly coupling the resources and instructions, the instruction window and operand buffers are decoupled so that they can be operated independently by maintaining one or more pointers between the resources and the decoded instructions in the block, as shown in step 1015. When an instruction block is refreshed in step 1020 (that is, reused without having to be re-fetched from the instruction cache), then in step 1025 the resources can be reused by following the pointers back to the original control state.
Such decoupling can provide increased processor core efficiency, particularly when instruction blocks are refreshed without the re-fetching that typically occurs, for example, when a program executes in a tight loop and instructions are repeatedly utilized. By establishing the control state through the pointers, the resources are effectively pre-validated without additional expenditure of processing cycles and other costs. Such efficiency increases may also be compounded when multiple processor cores are composed into a large-scale array.
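A minimal C++ sketch of this decoupling, under assumed data structures, is shown below: decoded instructions and operand-buffer entries live in separate containers linked only by indices that act as pointers, so a refresh restores control state without re-fetching or re-decoding.

```cpp
#include <cstdint>
#include <vector>

// Illustrative decoupling: instructions and their operand-buffer
// resources are kept in separate structures linked only by indices,
// so the two can be managed independently. Names and fields here are
// assumptions for illustration, not the patent's design.
struct DecodedInstruction {
    uint32_t opcode;
    int      operand_index;   // pointer into the operand buffer
};

struct OperandEntry {
    uint64_t value;
    bool     valid;
};

struct Core {
    std::vector<DecodedInstruction> instruction_window;  // managed independently
    std::vector<OperandEntry>       operand_buffer;      // of the window

    // Refresh a block: the instructions stay in place; following the
    // pointers re-establishes the original control state, and only the
    // valid bits in the operand buffer are cleared (no re-fetch/decode).
    void refresh_block(int first_instr, int count) {
        for (int i = first_instr; i < first_instr + count; ++i)
            operand_buffer[instruction_window[i].operand_index].valid = false;
    }
};
```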
FIG. 11 is a flowchart of an illustrative method 1100 that may be performed by an instruction-block-based microarchitecture. In step 1105, instruction blocks are mapped into the instruction window in a manner in which a new instruction block replaces a committed instruction block. As indicated in step 1110, the mapping may be subject to various restrictions that are designated in the header of the instruction block, for example, restrictions on alignment and on the capacity of the instruction window to buffer instruction blocks. In step 1115, resources are allocated for the new instruction block, which, as described above, is typically implemented using a bulk allocation process.
In step 1120, the order of the instruction blocks in the instruction window is tracked by the control unit, and blocks may be committed out of order in some situations. For example, rather than using a circular buffer of instruction blocks in which blocks are handled based on their position in the instruction window, blocks can be prioritized so that heavily used or particularly important instruction blocks are handled out of order, which can increase processing efficiency.
In step 1125, the instruction window is decoupled from the operand buffers so that, for example, blocks of instructions and blocks of operands are managed independently (i.e., without using a strict correspondence between instructions and operands). As noted above, the decoupling increases efficiency by enabling resources to be pre-validated when an instruction block is refreshed.
FIG. 12 is a flowchart of an illustrative method 1200 that may be performed by a control unit disposed in a processor core. In step 1205, an instruction window is maintained for buffering one or more instruction blocks. In step 1210, one or more operand buffers are maintained for buffering resources associated with the instructions in an instruction block. As noted above, the resources typically include control bits and operands. In step 1215, state is tracked using pointers between the instructions and the resources.
When an instruction block is refreshed, in block 1220, the pointers can be followed back to the tracked state. In step 1225, when an instruction block commits, the control bits in the operand buffer are cleared and new pointers are set. As with the methods discussed above, the instruction window and operand buffers are decoupled in step 1230, so that blocks of instructions and blocks of operands can be maintained by the control unit on a non-corresponding basis.
FIG. 13 is a flowchart of an illustrative method 1300 for managing instruction blocks in an instruction window disposed in a processor core. In step 1305, instruction blocks are allocated using a bulk allocation process in which the instructions in a block and all the resources associated with the instructions are fetched at once (i.e., en masse). In comparison with conventional architectures in which instructions and resources are repeatedly fetched in smaller chunks, the bulk allocation here enables all of the instructions in a block to be managed simultaneously and consistently, which can improve the efficiency of processor core operation. This improvement may be even more significant in situations where a given programming construct (e.g., one that minimizes branching) enables the compiler to generate relatively large instruction blocks. For example, in some implementations an instruction block may contain up to 128 instructions.
The bulk allocation of instruction blocks also enhances processor core efficiency through the refresh feature, in which instruction blocks are reused without being re-fetched as would typically occur, for example, when a program executes in a tight loop and instructions branch back on themselves. Such efficiency increases may also be compounded when multiple processor cores are composed into a large-scale array. When an instruction block is refreshed, the instructions are left in place and only the valid bits in the operand buffers and load/store queue are cleared. This enables the fetching of refreshed instruction blocks to be bypassed entirely.
The bulk allocation of instruction blocks also supports additional processing efficiencies once a group of instructions and resources is in place. For example, operands and explicit messages may be sent from one instruction in the block to another. Such functionality is not supported in conventional architectures, because one instruction cannot send anything to another instruction that has not yet been allocated. Instructions that generate constants can also lock values in the operand buffers so that they remain valid after a refresh, so that these values do not need to be regenerated each time the instruction block executes.
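The following C++ sketch illustrates the constant-locking behavior under assumed structures: a producer delivers an operand directly to an already-allocated consumer slot, a constant-generating instruction marks its slot locked, and a refresh clears ordinary valid bits while leaving locked constants in place. The names and flags are illustrative and not the patent's implementation.

```cpp
#include <cstdint>
#include <vector>

// Illustrative operand-buffer slot: ordinary operands become invalid
// on refresh, while constants deposited by constant-generating
// instructions can be locked so they survive the refresh.
struct OperandSlot {
    uint64_t value = 0;
    bool     valid = false;
    bool     locked = false;   // set by constant-generating instructions
};

using OperandBuffer = std::vector<OperandSlot>;

// A producer delivers directly to a consumer's slot; both are already
// allocated, so direct instruction-to-instruction communication works.
void send_operand(OperandBuffer& ops, int consumer_slot, uint64_t v,
                  bool is_constant) {
    ops[consumer_slot] = {v, /*valid=*/true, /*locked=*/is_constant};
}

// Refresh the block: ordinary operands are invalidated and must be
// re-produced; locked constants stay valid and are not regenerated.
void refresh(OperandBuffer& ops) {
    for (auto& slot : ops)
        if (!slot.locked) slot.valid = false;
}
```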
When the instruction blocks are mapped into the instruction window in step 1310, they are subject to constraints that may be applied, in step 1315, by a mapping policy, by restrictions designated in the block header, or by both. In some cases, the policies can be set by the compiler depending on the particular requirements of a given program. The designated restrictions can include, for example, restrictions on alignment and restrictions on the capacity of the instruction window to buffer instruction blocks.
In step 1320, the instruction window can, in some implementations, be segmented into sub-windows of the same size or of different sizes. As instruction block sizes are often randomly or unevenly distributed for a given program, such variation in the segmented sub-windows may more efficiently accommodate a given distribution of instruction block sizes, thereby increasing instruction packing density in the instruction window. The segmenting may also be performed dynamically in some cases, depending on the distribution of block sizes that is currently being handled by the processor core.
In some implementations, the instruction block header may encode an index or include a pointer to a size table that is implemented using one of logic, register, memory, or code stream. The size table can include instruction block size entries so that an instruction block size can be looked up from the table in step 1325. Use of the encoded index and size table may enhance instruction packing density by affording more granularity in the available block sizes, thereby reducing the occurrence of no-ops, for example, when a block includes a relatively small number of instructions as a result of implementing a program branch.
FIG. 14 is a flowchart of an illustrative method 1400 that may be performed by an instruction-block-based microarchitecture. In step 1405, a control unit in a processor core applies policies for handling instruction blocks. In step 1410, the instruction blocks are allocated using the bulk allocation process described above, in which the instructions and all associated resources are fetched at the same time. In step 1415, the instruction blocks are mapped into the instruction window, where the mapping may be subject to various restrictions, such as restrictions on alignment and on the capacity of the instruction window to buffer instruction blocks, that are designated in the header of the instruction block, as described above.
In step 1420, a policy may be applied that includes tracking, by the control unit, the order of the instruction blocks in the instruction window. Blocks may, for example, be committed out of order in some situations, rather than using a circular buffer of instruction blocks in which blocks are handled based on their position in the instruction window. In step 1425, a policy may be applied that includes handling blocks based on priority (which may be designated by the compiler in some scenarios), so that heavily used or particularly important blocks are handled out of order, which can further increase processing efficiency.
In step 1430, a policy may be applied, in some cases, that includes explicitly tracking the ages of instruction blocks, and instruction blocks may be committed based on such explicitly tracked ages. In step 1435, a policy may be applied that includes mapping instruction blocks according to the availability of a suitably sized slot in the instruction window (or a segment of the window). In step 1440, a policy may be applied that includes mapping instruction blocks into the instruction window using a circular buffer.
In some implementations, various combinations of the policies can be utilized to further enhance processor core efficiency. For example, the control unit may dynamically toggle among the policies so as to apply a policy that provides more optimal operation for a given instruction block or group of instruction blocks. For example, in some circumstances it may be more efficient to use a circular buffering technique in which instruction blocks are handled in order in a contiguous manner. In other circumstances, out-of-order and age-based handling may provide more optimal operation.
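By way of example only, the following C++ sketch shows one way a control unit could toggle between two of the policies named above. The selection heuristic, which uses the rate of backward-branching blocks as a proxy for tight loops, is an assumption introduced for the illustration and is not prescribed by this description.

```cpp
#include <cstdint>

// Illustrative dynamic toggle between a circular-buffer policy and an
// age-based policy. The heuristic below is an assumption: it treats a
// high rate of blocks that branch back on themselves as a sign that
// refresh-friendly, age-based handling may be more efficient.
enum class Policy { CircularBuffer, AgeBased };

struct PolicyController {
    Policy   current = Policy::CircularBuffer;
    uint32_t blocks_seen = 0;
    uint32_t back_branches_seen = 0;

    // Observe each handled block and periodically re-evaluate which
    // policy looks more efficient for the recent mix of blocks.
    void observe(bool block_branched_back) {
        ++blocks_seen;
        if (block_branched_back) ++back_branches_seen;
        if (blocks_seen >= 64) {               // re-evaluate periodically
            bool loop_heavy = back_branches_seen * 2 > blocks_seen;
            current = loop_heavy ? Policy::AgeBased : Policy::CircularBuffer;
            blocks_seen = back_branches_seen = 0;
        }
    }
};
```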
FIG. 15 is a flowchart of an illustrative method 1500 that may be performed by a control unit disposed in a processor core. In step 1505, an instruction window is configured with multiple segments having two or more different sizes, in a manner similar to that described above. In step 1510, an instruction block is fetched, and all of its associated resources are fetched in step 1515.
In step 1520, the instruction block is placed in a suitable segment of the window so as to maximize instruction density in the window. For example, if the compiler produces a distribution of block sizes that includes a relatively large number of blocks with low instruction counts (e.g., to implement program branches and the like), then the instruction window may have a segment that is specifically sized for small instruction blocks. Similarly, if there is a relatively large number of high-instruction-count blocks (e.g., for scientific and similar applications), then a segment may be specifically sized for such larger instruction blocks. Thus, the instruction window segment sizing can be adjusted according to a particular size distribution or, in some cases, dynamically adjusted when the distribution changes. In block 1525, the instruction block may be subject to restrictions designated in the header of the instruction block, as discussed above.
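As an illustrative sketch under assumed structures, the following C++ fragment places each block into the smallest segment that can still hold it, which is one simple way of pursuing the density goal described above; the segment model and best-fit heuristic are assumptions made for the example.

```cpp
#include <cstdint>
#include <vector>

// Illustrative best-fit placement for a segmented window: put each
// block in the tightest segment that can hold it, so larger segments
// stay free for larger blocks.
struct Segment {
    uint16_t capacity;        // instructions the segment can hold
    uint16_t used = 0;
};

int place_best_fit(std::vector<Segment>& segments, uint16_t block_size) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(segments.size()); ++i) {
        uint16_t free_space = segments[i].capacity - segments[i].used;
        if (free_space >= block_size &&
            (best < 0 ||
             free_space < segments[best].capacity - segments[best].used)) {
            best = i;   // tightest segment that still fits the block
        }
    }
    if (best >= 0) segments[best].used += block_size;
    return best;        // -1 if no segment currently has room
}
```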
Various exemplary embodiments of the present bulk allocation of instruction blocks to a processor instruction window are now presented by way of illustration and not as an exhaustive list of all embodiments. An example includes a method for managing instruction blocks in an instruction window disposed in a processor, comprising: bulk allocating an instruction block so that resources for one or more instructions in the instruction block are fetched at once, in which the resources include control bits and operands associated with the one or more instructions; mapping the instruction block, including the one or more instructions, from an instruction cache into the instruction window, in which the instruction block includes a header; and applying one or more constraints when performing the mapping, in which the constraints are imposed by one of a mapping policy or restrictions designated in the header. In another example, the mapping policy is implemented using a control unit that handles instruction blocks based on one of age, size, position, or priority. In another example, the method further includes segmenting the instruction window into sub-windows, in which the segmented sub-windows share a common size or have different sizes. In another example, the sizes of the segmented sub-windows are dynamically determined according to a distribution of instruction block sizes. In another example, the designated restrictions include one of alignment restrictions or restrictions on the instruction block capacity of the instruction window. In another example, the instruction block size is indicated in the header using a pointer to a size table that is expressed using one of logic, register, memory, or code stream.
Another example includes an instruction-block-based microarchitecture, comprising: a control unit; one or more operand buffers; and an instruction window configured to store decoded instruction blocks under control of the control unit, in which the control includes operations to apply one or more of a plurality of policies for handling the instruction blocks, and to bulk allocate an instruction block, including fetching resources into the one or more operand buffers for all instructions in the instruction block so as to allow an instruction in the instruction block to send messages or operands to another instruction in the instruction block. In another example, the resources include one of control bits or operands that are buffered in the operand buffers. In another example, the policies include a configuration to map instruction blocks based on restrictions designated in a header of the instruction block, in which the designated restrictions include one of alignment restrictions or restrictions on the instruction block capacity of the instruction window. In another example, the policies include a configuration to track an order of the instruction blocks in the instruction window and to commit instruction blocks out of order. In another example, the policies include a configuration to explicitly track the ages of instruction blocks currently mapped in the instruction window and to commit an instruction block based on the explicitly tracked age. In another example, the policies include a configuration to map an instruction block into the instruction window when a slot that fits the instruction block is available in the instruction window. In another example, the policies include a configuration to map instruction blocks into the instruction window using a circular buffer. In another example, the policies include a configuration to map instruction blocks into the instruction window or to commit instruction blocks based on priority.
A further example includes a control unit disposed in a processor, the processor being arranged to perform a method for instruction block management comprising: configuring an instruction window with multiple segments, in which the segments have two or more different sizes; fetching an instruction block, including one or more instructions, from an instruction cache; fetching all resources associated with the instructions in the instruction block; and placing the instruction block into a segment of the instruction window so that instruction density in the instruction window is maximized. In another example, the control unit further includes inspecting a header of the instruction block for designated restrictions on the placement within the instruction window and performing the placement in accordance with the designated restrictions, in which the designated restrictions include one of alignment restrictions or instruction block capacity restrictions. In another example, the control unit further includes configuring the segmented instruction window as a logically segmented instruction window that is distributed across a plurality of processor cores. In another example, the control unit further includes maintaining state across the logically segmented instruction window using communications carried over an on-chip network. In another example, the control unit further includes performing the fetching of the instruction block and the resources as a bulk allocation. In another example, the control unit further includes selecting a segment for the placed instruction block based on an instruction block size that is encoded in the header, or based on an instruction block size that is indicated by a pointer in the header to a size table that is expressed using one of logic, register, memory, or code stream.
The subject matter described above is provided by way of illustration only and is not to be construed as limiting. Various modifications and changes may be made to the subject matter described herein without following the example embodiments and applications illustrated and described, and without departing from the true spirit and scope of the present disclosure, which is set forth in the following claims.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/752,685 US9720693B2 (en) | 2015-06-26 | 2015-06-26 | Bulk allocation of instruction blocks to a processor instruction window |
| US14/752,685 | 2015-06-26 | ||
| PCT/US2016/038851 WO2016210028A1 (en) | 2015-06-26 | 2016-06-23 | Bulk allocation of instruction blocks to a processor instruction window |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1246441A1 HK1246441A1 (en) | 2018-09-07 |
| HK1246441B true HK1246441B (en) | 2022-02-25 |