
CN1997973B - Processor, method, apparatus and device for dynamically caching engine instructions - Google Patents

Processor, method, apparatus and device for dynamically caching engine instructions

Info

Publication number
CN1997973B
Authority
CN
China
Prior art keywords
instruction
instructions
engine
program
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200480040023XA
Other languages
Chinese (zh)
Other versions
CN1997973A (en)
Inventor
Sridhar Lakshmanamurthy
Wilson Liao
Prashant Chandra
J-Y. Min
Y. Pan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp
Publication of CN1997973A
Application granted
Publication of CN1997973B
Anticipated expiration
Status: Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047Prefetch instructions; cache control instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/30Peripheral units, e.g. input or output ports
    • H04L49/3063Pipelined operation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Small-Scale Networks (AREA)
  • Microcomputers (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

In general, in one aspect, the disclosure describes a processor that includes an instruction store to store instructions of at least a portion of at least one program and a set of multiple engines coupled to the instruction store. The engines include an engine instruction cache and circuitry to request a subset of the at least the portion of the at least one program.

Description

Processor, method, apparatus and device for dynamically caching engine instructions

Related Applications: This application is related to the following applications, filed on the same date as this application:

a. Attorney Docket No. P16851 - "Service Engine Cache Request"

b. Attorney Docket No. P16852 - "Thread-Based Engine Cache Partitioning"

Background

Networks enable computers and other devices to communicate. For example, a network may carry data representing video, audio, e-mail, and so forth. Typically, data sent over a network is divided into smaller messages known as packets. In some ways, a packet is much like an envelope you drop in a mailbox. A packet typically includes a "payload" and a "header". The packet's payload is analogous to the letter inside the envelope. The packet's header is much like the information written on the envelope itself. The header can include information to help network devices handle the packet appropriately. For example, the header can include an address that identifies the packet's destination.

A given packet may "hop" across many different intermediate network devices (e.g., "routers", "bridges", and "switches") before reaching its destination. These intermediate devices often perform a variety of packet processing operations. For example, intermediate devices often perform operations to determine how to forward a packet further toward its destination, or to determine the quality of service to use in handling the packet.

As network connection speeds increase, the amount of time an intermediate device has to process a packet continues to shrink. To achieve fast packet processing, many devices feature dedicated, "hard-wired" designs such as Application-Specific Integrated Circuits (ASICs). However, these designs are often difficult to adapt to emerging networking technologies and communication protocols.

To combine flexibility with the speed often associated with an ASIC, some network devices feature programmable network processors. Network processors enable software engineers to quickly reprogram network processor operations.

Again due to ever-increasing network connection speeds, the time it takes to process a packet often greatly exceeds the interval between packet arrivals. Accordingly, some network processor architectures feature multiple processing engines that process packets concurrently. For example, while one engine determines how to forward one packet, another engine determines how to forward a different one. Although the time to process a given packet may remain the same, processing multiple packets at the same time enables a network processor to keep up with a large volume of arriving packets.

Brief Description of the Drawings

FIG. 1 is a diagram illustrating instruction caching by a network processor.

FIG. 2 is a diagram illustrating an instruction that fetches instructions into an engine's instruction cache.

FIG. 3 is a flowchart illustrating instruction processing performed by a network processor engine.

FIG. 4 is a flowchart illustrating instruction caching.

FIG. 5 is a diagram illustrating engine circuitry for searching cached instructions.

FIG. 6 is a map of instruction cache memory allocated to different threads of a network processor engine.

FIG. 7 is a diagram of a network processor engine.

FIG. 8 is a diagram of a network processor.

FIG. 9 is a diagram of a network device.

Detailed Description

FIG. 1 shows a network processor 100 that includes multiple processing engines 102. The engines 102 can be programmed to perform a variety of packet processing operations, such as determining a packet's next hop, applying quality of service (QoS), metering packet traffic, and so forth. In the architecture shown, an engine 102 executes program instructions 108 stored in the engine's high-speed local memory 104. Due to size and cost constraints, the amount of instruction memory 104 provided by an engine 102 is typically limited. To prevent the limited storage of the engine memory 104 from imposing overly strict constraints on the overall size and complexity of a program 108, FIG. 1 shows an embodiment of an instruction caching scheme that dynamically downloads segments (e.g., 108b) of a larger program 108 to an engine 102 as the engine's 102 execution of the program 108 proceeds.

In the embodiment shown in FIG. 1, each engine 102 includes an instruction cache 104 that stores a subset of the program's 108 instructions. For example, the instruction cache 104a of packet engine 102a holds segment 108b of program 108. The remainder of program 108 is stored in an instruction store 106 shared by the engines 102.

Eventually, engine 102a may need to access program segments other than segment 108b. For example, the program may branch or advance sequentially to a point within program 108 beyond segment 108b. So that engine 102a can continue executing program 108, the network processor 100 downloads the requested/needed segment into the cache 104a of engine 102a. Thus, the segments stored by the cache change dynamically as program execution proceeds.

As shown in FIG. 1, the multiple engines 102 receive the instructions to cache from the instruction store 106. The shared instruction store 106 may, in turn, cache instructions from a hierarchically higher instruction store 110, which may be internal or external to the processor. In other words, the instruction stores 104, 106, and 110 can form a cache hierarchy that includes the engines' L1 instruction caches 104 and an L2 instruction cache 106 shared by different engines.

Although FIG. 1 depicts the instruction store 106 as serving all of the engines 102, the network processor 100 may instead feature multiple shared stores 106 that serve different groups of engines 102. For example, one shared instruction store 106 may store program instructions for engines #1 to #4, while another stores program instructions for engines #5 to #8. Additionally, although FIG. 1 depicts the engine caches 104 and the instruction store 106 as storing the instructions of a single program 108, they may instead store sets of instructions belonging to different programs. For example, the shared instruction store 106 may store different program instructions for each engine 102, or even for different engine 102 threads.

FIG. 1 depicts the instructions 108 as source code to simplify the description. The actual instructions stored by the shared store 106 and distributed to the engines will typically be executable instructions expressed in the instruction set provided by the engines.

Potentially, the program segments an engine 102 needs to continue program execution could be supplied on an "as-needed" basis. That is, the engine 102 could continue to execute the instructions 108b stored in its instruction cache 104a until an instruction to be executed is not found in the cache 104a. When this occurs, the engine 102 could signal the shared store 106 to deliver the program segment that includes the next instruction to be executed. However, this "on-demand" scenario can introduce delays into the engine's 102 program execution. That is, in an "on-demand" sequence, the engine 102 (or an engine 102 thread) may remain idle until the needed instructions are loaded. This delay can be caused not only by the operations involved in downloading the needed instructions into the engine's 102 L1 cache 104, but also by contention among the engines 102b-102n for access to the shared store 106.

To potentially avoid this delay, FIG. 2 shows a portion of a program source code listing that includes a fetch instruction 122 that lets a program initiate "prefetching" of program instructions into an engine's cache 104 before those instructions are required to continue program execution. For example, as shown in FIG. 2, the fetch instruction 122 causes engine 102n to issue ("1") a request to the shared instruction store 106 for the next needed segment 108b before execution reaches a point within that segment 108b. While the engine 102n continues to process the instructions 124 that follow the fetch instruction 122, the requested segment is loaded ("2") into the instruction cache 104n of engine 102n. In other words, the time used to retrieve the program segment overlaps with the time between the engine's execution of the prefetch instruction 122 and the time the engine "runs out" of instructions to execute in the currently cached program segment.

In the embodiment shown in FIG. 2, the time used to fetch the program instructions is concealed by the time spent executing the instructions that follow the fetch instruction 122. Fetch latency can also be "hidden" by issuing the fetch instruction after an instruction 120 (e.g., a memory operation) that takes some time to complete.

The example fetch instruction shown in FIG. 2 has the syntax:

Prefetch(SegmentAddress, SegmentCount)[, optional_token]

where SegmentAddress identifies the starting address of the program instructions to fetch from the shared store 106 and SegmentCount identifies the number of subsequent segments to fetch. Potentially, SegmentAddress may be constrained to identify the starting address of a program segment.

The optional_token has the syntax:

optional_token = [ctx_swap[signal],] [sig_done[signal]]

The ctx_swap parameter instructs the engine 102 to switch to another execution thread until a signal indicates that the segment fetch operation has completed. The sig_done parameter also identifies a status signal to be set once the fetch operation completes, but does not instruct the engine 102 to switch contexts.

The instruction syntax shown in FIG. 2 is merely an example, and other instructions for fetching program instructions may feature different parameters, keywords, and options. Additionally, the instruction may exist at different levels. For example, the instruction may be part of an engine's instruction set. Alternatively, the instruction may be a source code instruction processed by a compiler to produce target instructions (e.g., engine-executable instructions) corresponding to the fetch. Such a compiler may perform other traditional compiler operations, such as lexical analysis that groups the text characters of the source code into "tokens", syntax analysis that groups the tokens into grammatical phrases, intermediate code generation that more abstractly represents the source code, optimization, and so forth.

Fetch instructions can be inserted manually by a programmer during code development. For example, based on an initial classification of a packet, the remaining program flow for the packet may be known. Thus, a fetch instruction can retrieve the segments needed to process the packet after classification. For example, a program written in a high-level language may include the instructions:

switch (classify(packet.header)) {
  case DropPacket: {
    prefetch(DropCounterInstructions);
  }
  case ForwardPacket: {
    prefetch(RoutingLookupInstructions);
    prefetch(PacketEnqueueInstructions);
  }
}

These instructions load the appropriate program segments into the engine's 102 instruction cache 104 based on the packet's classification.

Although a programmer can insert fetch instructions into code manually, fetch instructions can also be inserted into the code by software development tools such as compilers, analyzers, profilers, and/or preprocessors. For example, code flow analysis can identify when different program segments should be loaded. For instance, a compiler may insert a fetch instruction after a memory access instruction, or before a set of instructions that will take some time to execute.

FIG. 3 shows a flowchart illustrating the operation of an engine that fetches instructions both "on demand" and in response to "fetch" instructions. As shown in FIG. 3, a program counter 130 that identifies the next program instruction to execute is updated. For example, the program counter 130 may be incremented to advance to the next sequential instruction address, or the counter 130 may be set to some other instruction address in response to a branch instruction. As shown, the engine determines 132 whether the engine's instruction cache currently holds the instruction identified by the program counter. If not, the engine thread stalls 134 (e.g., the thread needing the instruction is swapped out of the engine) until a fetch operation 136 retrieves the missing instructions from the shared store.

Once the instruction to be executed is present in the engine's instruction cache, the engine can determine 140 whether the next instruction to execute is a fetch instruction. If so, the engine can initiate a fetch operation 142 for the requested program segments. If not, the engine can process 144 the instruction as usual.

FIG. 4 shows an example architecture of the shared instruction store 106. For example, during network processor startup, the instruction store 106 receives ("1") the instructions to be shared with the engines. Thereafter, the shared instruction store 106 distributes portions of the instructions 108 to the engines as needed and/or as requested.

As shown in the example architecture of FIG. 4, two different buses 150, 152 can connect the shared store 106 to the engines 102. Bus 150 carries ("2") fetch requests to the shared store 106. These requests can identify the program segment 108 to fetch and the engine making the request. A request can also identify whether it is a prefetch operation or an "on-demand" fetch operation. A high-bandwidth bus 152 carries ("4") the instructions of the requested program segments back to the requesting engines 102. The bandwidth of bus 152 may allow the shared store 106 to transfer requested instructions to multiple engines at the same time. For example, bus 152 may be divided into n lines that can be dynamically allocated to the engines. For instance, if 4 engines request segments, each may be allocated 25% of the bus bandwidth.

As shown, the shared store 106 can queue requests as they arrive (e.g., in a first-in-first-out (FIFO) queue 154) for subsequent servicing. However, as described above, a thread stalls when an instruction to be executed has not yet been loaded into the engine's instruction cache 104. Thus, servicing "on-demand" requests, which reflect an actual stall, takes on greater urgency than servicing "prefetch" requests, which may or may not lead to a stall. As shown, the shared store 106 includes an arbiter 156 that can grant demand requests priority over prefetch requests. The arbiter 156 may include dedicated circuitry or may be programmable.

The arbiter 156 can prioritize demand requests in a variety of ways. For example, the arbiter 156 may not add a demand request to the queue 154, but may instead mark the request for immediate service ("3"). To prioritize among multiple "demand" requests, the arbiter 156 can also maintain a separate "demand" FIFO queue that the arbiter 156 grants priority over the requests in FIFO queue 154. The arbiter 156 can also immediately suspend an in-progress instruction download operation to service a demand request. Additionally, the arbiter 156 can allocate a large portion (if not 100%) of the bus 152 bandwidth to deliver segment instructions to an engine that issued an "on-demand" request.

FIG. 5 shows an example architecture of an engine's instruction cache. As shown, cache storage is provided by a set of memory devices 166x that store instructions received from the shared instruction store 106 over bus 164. An individual memory element 166a may be sized to hold one program segment. As shown, each memory 166x is associated with an address decoder that receives, from the engine, the address of an instruction to be processed and determines whether that instruction is present in the associated memory 166. The different decoders operate on the address in parallel. That is, each decoder simultaneously searches its associated memory. If the instruction address is found in one of the memories 166x, that memory 166x outputs 168 the requested instruction for processing by the engine. If the instruction address is not found in any of the memories 166, a "miss" signal 168 is generated.

As described above, an engine can provide multiple threads of execution. In the course of execution, these different threads will load different program segments into the engine's instruction cache. When the cache is full, loading a segment into the cache requires that some other segment be removed ("victimized") from the cache. Without some safeguard, a thread could victimize a segment currently being used by another thread. When the other thread resumes processing, the just-victimized segment may be fetched from the shared store 106 once again. This inter-thread thrashing of the instruction cache 104 could repeat again and again, significantly degrading system performance, as a segment just loaded into the cache by one thread is prematurely victimized by another thread, only to be reloaded a short time later.

To combat such thrashing, a variety of mechanisms can impose limits on a thread's ability to victimize segments. For example, FIG. 6 shows a memory map of an engine's instruction cache 104 in which each engine thread is allocated an exclusive portion of the cache 104. For example, thread 0 172 is allocated memory for N program segments 172a, 172b, 172n. Instruction segments fetched for a thread can reside in the thread's allocated portion of the cache 104. To prevent thrashing, logic can restrict a thread from victimizing segments in the cache partitions allocated to other threads.

For fast access to cached segments, a control and status register (CSR) associated with a thread can store the starting address of the thread's allocated cache portion. This address can be computed, for example, based on the thread number (e.g., allocated space starting address = base address + (thread # × memory allocated per thread)). Each partition can be further divided into segments that correspond, for example, to the burst fetch size from the shared store 106, or to some other granularity of transfer from the shared store 106 to the engine cache. LRU (least recently used) information can be maintained for the different segments within a thread's allocated cache portion. Thus, in an LRU scheme, the least recently used segment in a given thread's partition may be victimized first.

In addition to the regions divided among the different threads, the map shown also includes a "lock-down" portion 170. Instructions in the lock-down region can be loaded at initialization and can be protected from victimization. All threads can access and execute the instructions stored in this region.

A memory allocation scheme such as the one shown in FIG. 6 can prevent inter-thread thrashing. However, other approaches may also be used. For example, an access count can be associated with the threads currently using a segment. When the count reaches zero, the segment can be victimized. Alternatively, a cache victimization scheme can apply different rules. For example, the scheme may attempt to avoid victimizing loaded segments that have not yet been accessed by any thread.

FIG. 7 shows an example engine 102 architecture. The engine 102 may be a Reduced Instruction Set Computing (RISC) processor tailored for packet processing. For example, the engine 102 may not provide floating point or integer division instructions commonly provided by the instruction sets of general-purpose processors.

The engine 102 can communicate with other network processor components (e.g., shared memory) via transfer registers 192a, 192b that buffer data to be sent to, and received from, the other components. The engine 102 can also communicate with other engines 102 via "neighbor" registers 194a, 194b hard-wired to the other engines.

The example engine 102 shown provides multiple threads of execution. To support the multiple threads, the engine 102 stores a program context 182 for each thread. The context 182 can include thread state data such as a program counter. A thread arbiter 180 selects the program context 182x of a thread to execute. The program counter for the selected context is fed to the instruction cache 104. When the instruction identified by the program counter is not currently cached (e.g., the segment is not in the locked-down cache region or in the region allocated to the currently executing thread), the cache 104 can initiate a program segment fetch operation. Otherwise, the cache 104 delivers the cached instruction to an instruction decode unit 186. Potentially, the decode unit 186 identifies the instruction as a "fetch" instruction and initiates a segment fetch operation. Otherwise, the decode unit 186 can feed the instruction to an execution unit (e.g., an ALU) for processing, or can initiate a request, via command queue 188, to a resource shared by the different engines (e.g., a memory controller).

A fetch control unit 184 handles fetching of program segments from the shared cache 106. For example, the fetch control unit 184 may negotiate access to a shared cache request bus, issue requests, and store returned instructions in the instruction cache 104. The fetch control unit 184 may also handle victimization of previously cached instructions.
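As a hedged illustration of the fetch control unit's duties, the sketch below models segment fetch plus victimization with a least-recently-used policy; the LRU choice and the class name are assumptions, since the text leaves the victim-selection policy open:

```python
from collections import OrderedDict

class FetchControl:
    """Toy fetch control: fills a bounded per-engine cache from a shared
    store and victimizes the least recently used segment when full."""

    def __init__(self, capacity, shared_cache):
        self.capacity = capacity
        self.shared = shared_cache   # shared (e.g., L2) instruction store
        self.cached = OrderedDict()  # segment id -> instructions, LRU order

    def fetch(self, seg):
        if seg in self.cached:
            self.cached.move_to_end(seg)     # hit: refresh LRU position
            return self.cached[seg]
        if len(self.cached) >= self.capacity:
            self.cached.popitem(last=False)  # victimize the LRU segment
        self.cached[seg] = self.shared[seg]  # issue request, store result
        return self.cached[seg]
```

With a capacity of two segments, fetching a third segment victimizes whichever cached segment was touched least recently.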

The instruction cache 104 and decoder 186 of the engine 102 form part of an instruction processing pipeline. That is, over the course of multiple clock cycles, an instruction may be loaded from the cache 104, decoded by the decoder 186, have its operands loaded (e.g., from general purpose registers 196, next neighbor registers 194a, transfer registers 192a, and local memory 198), and be executed by an execution data path 190. Finally, the result of the operation may be written (e.g., to the general purpose registers 196, local memory 198, next neighbor registers 194b, or transfer registers 192b). Many instructions may be in the pipeline at the same time. That is, while one instruction is being decoded, another may be being loaded from the L1 instruction cache 104.
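The overlap described above, one instruction being decoded while another is loaded, can be sketched as a simple staged-pipeline model; the four-stage split into load/decode/execute/write is an assumption drawn from the stages named in this paragraph:

```python
def run_pipeline(program):
    """Return, per cycle, which instruction occupies each pipeline stage."""
    stages = ["load", "decode", "execute", "write"]
    in_flight = [None] * len(stages)
    timeline = []
    for cycle in range(len(program) + len(stages) - 1):
        issued = program[cycle] if cycle < len(program) else None
        in_flight = [issued] + in_flight[:-1]  # each instruction advances one stage
        timeline.append(dict(zip(stages, in_flight)))
    return timeline
```

On cycle 1, instruction `i2` is being loaded while `i1` is being decoded, matching the overlap the paragraph describes.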

FIG. 8 illustrates an embodiment of a network processor 200. The network processor 200 shown is an Intel® Internet eXchange network Processor (IXP). Other network processors feature different designs.

The network processor 200 shown features a collection of packet engines 102 integrated on a single die. As described above, an individual packet engine 102 may provide multiple threads. The processor 200 may also include a core processor 210 (e.g., an Intel XScale® core) that is often programmed to perform "control plane" tasks involved in network operations. The core processor 210, however, may also handle "data plane" tasks and may provide additional packet processing threads.

As shown, the network processor 200 also features an interface 202 that can carry packets between the processor 200 and other network components. For example, the processor 200 can feature a switch fabric interface 202 (e.g., a Common Switch Interface (CSIX) interface) that enables the processor 200 to transmit packets to other processors or circuitry connected to the fabric. The processor 200 can also feature an interface 202 (e.g., a System Packet Interface (SPI) interface) that enables the processor 200 to communicate with physical layer (PHY) and/or link layer devices. The processor 200 may also include an interface 208 (e.g., a Peripheral Component Interconnect (PCI) bus interface) for communicating, for example, with a host. As shown, the processor 200 may also include other components shared by the engines, such as memory controllers 206, 212, a hash engine, and scratchpad memory.

The packet processing techniques described above may be implemented on a network processor, such as the IXP, in a wide variety of ways. For example, the core processor 210 may deliver program instructions to the shared instruction cache 106 during network processor boot. Additionally, instead of a "two-deep" instruction cache hierarchy, the processor 200 may feature an "N-deep" instruction cache hierarchy, for example, when the processor features a large number of engines.
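One way to picture an N-deep hierarchy is a lookup that falls through successive cache levels and fills the upper levels on the way back. This sketch is hypothetical (the patent does not specify a fill policy) and generalizes the two-deep arrangement to any number of levels:

```python
def lookup(levels, backing_store, seg):
    """levels is ordered nearest-first (L1, L2, ..., Ln); a miss at every
    level falls through to the backing store, and the fetched segment is
    filled into all levels above the point where it was found."""
    for i, level in enumerate(levels):
        if seg in level:
            for upper in levels[:i]:
                upper[seg] = level[seg]  # fill upper levels on the way back
            return level[seg]
    instructions = backing_store[seg]
    for level in levels:                 # miss everywhere: fill every level
        level[seg] = instructions
    return instructions
```

Setting `levels` to a two-element list recovers the two-deep L1/shared-L2 arrangement described earlier.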

FIG. 9 depicts a network device that can incorporate the techniques described above. As shown, the device features a collection of line cards ("blades") interconnected by a switch fabric 310 (e.g., a crossbar or shared memory switch fabric). The switch fabric may conform to CSIX or other fabric technologies such as HyperTransport, Infiniband, Peripheral Component Interconnect-X (PCI-X), and so forth.

An individual line card (e.g., 300a) may include one or more physical layer (PHY) devices 302 (e.g., optic, wire, and wireless PHYs) that handle communication over network connections. The PHYs translate between the physical signals carried by different network media and the bits (e.g., "0"-s and "1"-s) used by digital systems. The line cards 300 may also include framer devices 304 (e.g., Ethernet, Synchronous Optic Network (SONET), High-Level Data Link (HDLC) framers, or other "layer 2" devices) that can perform operations on frames such as error detection and/or correction. The line card 300 shown may also include one or more network processors 306 that use the instruction caching techniques described above. The network processors 306 are programmed to perform packet processing operations on packets received via the PHY(s) 302 and to direct the packets, via the switch fabric 310, to a line card providing the selected egress interface. Potentially, the network processor(s) 306 could perform "layer 2" duties instead of the framer devices 304.

While FIGS. 8 and 9 illustrated example architectures of an engine, a network processor, and a device incorporating network processors, the techniques may be implemented in other engine, network processor, and device designs. Additionally, the techniques may be used in a wide variety of network devices (e.g., routers, switches, bridges, hubs, traffic generators, and so forth).

The term circuitry as used herein includes hardwired circuitry, digital circuitry, analog circuitry, programmable circuitry, and so forth. The programmable circuitry may operate on computer programs.

Such computer programs may be coded in a high-level procedural or object-oriented programming language. However, the program(s) can be implemented in assembly or machine language if desired. The language may be compiled or interpreted. Additionally, these techniques may be used in a wide variety of networking environments.

Other embodiments are within the scope of the following claims.

Claims (18)

1. A processor, comprising:
an instruction store storing instructions of at least a portion of at least one program; and
a set of multiple multi-threaded engines coupled to the instruction store, an individual one of the multiple multi-threaded engines including an engine instruction cache and circuitry that, in response to a fetch instruction included in the at least one program executing on at least one of the multiple multi-threaded engines, requests a subset of the at least a portion of the instructions of the at least one program, wherein the fetch instruction includes a request to swap to a different thread, and
in response to a determination that the subset of the at least a portion of the instructions of the at least one program is not stored in the engine instruction cache, the circuitry causes an engine thread to stall until the subset of the at least a portion of the instructions of the at least one program is retrieved from the instruction store.
2. The processor of claim 1, wherein:
the engine instruction cache comprises an L1 cache; and
the instruction store comprises an L2 cache.
3. The processor of claim 1, further comprising a second instruction store coupled to a second set of multiple multi-threaded engines.
4. The processor of claim 1, wherein the fetch instruction identifies a signal associated with a status of the fetch operation.
5. The processor of claim 1, wherein the fetch instruction identifies an amount of the instruction store to cache.
6. The processor of claim 5, wherein the fetch instruction identifies the amount as a number of segments that group multiple instructions of the program.
7. The processor of claim 1, wherein the multi-threaded engine includes circuitry to select instructions to victimize from the engine instruction cache.
8. The processor of claim 1, further comprising at least one of:
an interface to a switch fabric, an interface to a media access controller, and an interface to a physical layer device.
9. A method of dynamically caching engine instructions, comprising:
in response to a fetch instruction included in at least one program executing on at least one of multiple multi-threaded engines, requesting a subset of instructions stored by an instruction store shared by the multiple multi-threaded engines integrated on a single die, wherein the fetch instruction includes a request to swap to a different thread;
receiving the subset of instructions at the one of the multiple multi-threaded engines requesting the subset;
storing the received subset of instructions in an instruction cache of the one of the multiple multi-threaded engines; and
in response to a determination that the subset of the instructions of the at least one program is not stored in the engine instruction cache, stalling an engine thread until the subset of the instructions of the at least one program is retrieved from the instruction store.
10. The method of claim 9,
wherein the instruction store comprises an L2 cache; and
wherein the instruction cache of the one of the multiple multi-threaded engines comprises an L1 cache.
11. The method of claim 9, wherein the instruction store is one of a set of instruction stores, different instruction stores of the set being shared by different sets of engines.
12. The method of claim 9, further comprising selecting instructions to victimize from the engine instruction cache.
13. The method of claim 10, further comprising executing the subset of instructions to process a packet received over a network.
14. An apparatus for dynamically caching engine instructions, comprising:
means for requesting, in response to a fetch instruction included in at least one program executing on at least one of multiple multi-threaded engines, a subset of instructions stored by an instruction store shared by the multiple multi-threaded engines integrated on a single die, wherein the fetch instruction includes a request to swap to a different thread;
means for receiving the subset of instructions at the one of the multiple multi-threaded engines requesting the subset;
means for storing the received subset of instructions in an instruction cache of the one of the multiple multi-threaded engines; and
means for, in response to a determination that the subset of the instructions of the at least one program is not stored in the engine instruction cache, stalling an engine thread until the subset of the instructions of the at least one program is retrieved from the instruction store.
15. A network forwarding device, comprising:
a switch fabric; and
a set of line cards interconnected by the switch fabric, at least one of the set of line cards including:
at least one physical layer device; and
at least one network processor, the network processor including:
an instruction store; and
a set of multi-threaded engines operationally coupled to the instruction store, an individual multi-threaded engine of the set of multi-threaded engines including:
a cache to store instructions to be executed by the engine; and
circuitry to request, in response to a fetch instruction included in at least one program executing on at least one of the set of multi-threaded engines, a subset of the instructions stored by the instruction store, the fetch instruction including a request to swap to a different thread, and to stall, in response to a determination that the subset of the instructions of the at least one program is not stored in the engine instruction cache, an engine thread until the subset of the instructions of the at least one program is retrieved from the instruction store.
16. The network forwarding device of claim 15, wherein the circuitry to request the subset of instructions comprises
circuitry invoked when an instruction to be executed is not found in the instruction cache of the multi-threaded engine.
17. The network forwarding device of claim 15, wherein the circuitry to request the subset of instructions comprises circuitry responsive to an instruction executed by the multi-threaded engine.
18. The network forwarding device of claim 15, wherein the network processor further comprises:
a second instruction store; and
a second set of multi-threaded engines operationally coupled to the second instruction store.
CN200480040023XA 2003-11-06 2004-10-29 Processor, method, device and apparatus for dynamically caching engine instructions Expired - Fee Related CN1997973B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US10/704,432 2003-11-06
US10/704,432 US20050102474A1 (en) 2003-11-06 2003-11-06 Dynamically caching engine instructions
PCT/US2004/035923 WO2005048113A2 (en) 2003-11-06 2004-10-29 Dynamically caching engine instructions for on demand program execution

Publications (2)

Publication Number Publication Date
CN1997973A CN1997973A (en) 2007-07-11
CN1997973B true CN1997973B (en) 2010-06-09

Family

ID=34552126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200480040023XA Expired - Fee Related CN1997973B (en) 2003-11-06 2004-10-29 Processor, method, apparatus and apparatus for dynamic cache engine instructions

Country Status (7)

Country Link
US (1) US20050102474A1 (en)
EP (1) EP1680743B1 (en)
JP (1) JP2007510989A (en)
CN (1) CN1997973B (en)
AT (1) ATE484027T1 (en)
DE (1) DE602004029485D1 (en)
WO (1) WO2005048113A2 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7376789B2 (en) * 2005-06-29 2008-05-20 Intel Corporation Wide-port context cache apparatus, systems, and methods
US20070005898A1 (en) * 2005-06-30 2007-01-04 William Halleck Method, apparatus and system for task context cache replacement
US7676604B2 (en) * 2005-11-22 2010-03-09 Intel Corporation Task context direct indexing in a protocol engine
US8209488B2 (en) * 2008-02-01 2012-06-26 International Business Machines Corporation Techniques for prediction-based indirect data prefetching
US8166277B2 (en) * 2008-02-01 2012-04-24 International Business Machines Corporation Data prefetching using indirect addressing
CN102571761B (en) * 2011-12-21 2014-08-20 四川长虹电器股份有限公司 Information transmission method of network interface
US10095847B2 (en) * 2012-05-25 2018-10-09 Koninklijke Philips N.V. Method, system and device for protection against reverse engineering and/or tampering with programs
CN102855213B (en) * 2012-07-06 2017-10-27 中兴通讯股份有限公司 A kind of instruction storage method of network processing unit instruction storage device and the device
US9223705B2 (en) * 2013-04-01 2015-12-29 Advanced Micro Devices, Inc. Cache access arbitration for prefetch requests
US9286258B2 (en) 2013-06-14 2016-03-15 National Instruments Corporation Opaque bridge for peripheral component interconnect express bus systems
WO2015015251A1 (en) * 2013-08-01 2015-02-05 Yogesh Chunilal Rathod Presenting plurality types of interfaces and functions for conducting various activities
US11397520B2 (en) 2013-08-01 2022-07-26 Yogesh Chunilal Rathod Application program interface or page processing method and device
CN113326020B (en) 2020-02-28 2025-04-25 昆仑芯(北京)科技有限公司 Cache device, cache, system, data processing method, device and medium
CN116151338A (en) * 2021-11-19 2023-05-23 平头哥(上海)半导体技术有限公司 Cache access method and correlative graph neural network system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6574712B1 (en) * 1999-11-08 2003-06-03 International Business Machines Corporation Software prefetch system and method for predetermining amount of streamed data
CN1307564C (en) * 1999-08-27 2007-03-28 国际商业机器公司 Network switch and components and method of operation

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60263238A (en) * 1984-06-11 1985-12-26 Nippon Telegr & Teleph Corp <Ntt> Information processor
JPH03268041A (en) * 1990-03-17 1991-11-28 Res Dev Corp Of Japan Cache operation clarifying computer
US6021471A (en) * 1994-11-15 2000-02-01 Advanced Micro Devices, Inc. Multiple level cache control system with address and data pipelines
JPH096633A (en) * 1995-06-07 1997-01-10 Internatl Business Mach Corp <Ibm> Method and system for operation of high-performance multiplelogical route in data-processing system
JPH09282223A (en) * 1996-04-12 1997-10-31 Ricoh Co Ltd Memory controller
US6446143B1 (en) * 1998-11-25 2002-09-03 Compaq Information Technologies Group, L.P. Methods and apparatus for minimizing the impact of excessive instruction retrieval
JP3420091B2 (en) * 1998-11-30 2003-06-23 Necエレクトロニクス株式会社 Microprocessor
US6393551B1 (en) * 1999-05-26 2002-05-21 Infineon Technologies North America Corp. Reducing instruction transactions in a microprocessor
US6668317B1 (en) * 1999-08-31 2003-12-23 Intel Corporation Microengine for parallel processor architecture
US6606704B1 (en) * 1999-08-31 2003-08-12 Intel Corporation Parallel multithreaded processor with plural microengines executing multiple threads each microengine having loadable microcode
JP3741945B2 (en) * 1999-09-30 2006-02-01 富士通株式会社 Instruction fetch control device
US6470427B1 (en) * 1999-11-09 2002-10-22 International Business Machines Corporation Programmable agent and method for managing prefetch queues
JP2001236221A (en) * 2000-02-21 2001-08-31 Keisuke Shindo Pipe line parallel processor using multi-thread
JP2001344153A (en) * 2000-05-30 2001-12-14 Nec Corp Cash memory controller for multiprocessor system
JP2002132702A (en) * 2000-10-30 2002-05-10 Nec Eng Ltd Memory control system
US20020181476A1 (en) * 2001-03-17 2002-12-05 Badamo Michael J. Network infrastructure device for data traffic to and from mobile units
US7487505B2 (en) * 2001-08-27 2009-02-03 Intel Corporation Multithreaded microprocessor with register allocation based on number of active threads
US7120755B2 (en) * 2002-01-02 2006-10-10 Intel Corporation Transfer of cache lines on-chip between processing cores in a multi-core system
US6915415B2 (en) * 2002-01-07 2005-07-05 International Business Machines Corporation Method and apparatus for mapping software prefetch instructions to hardware prefetch logic
US20050050306A1 (en) * 2003-08-26 2005-03-03 Sridhar Lakshmanamurthy Executing instructions on a processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1307564C (en) * 1999-08-27 2007-03-28 国际商业机器公司 Network switch and components and method of operation
US6574712B1 (en) * 1999-11-08 2003-06-03 International Business Machines Corporation Software prefetch system and method for predetermining amount of streamed data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Entire document.

Also Published As

Publication number Publication date
DE602004029485D1 (en) 2010-11-18
EP1680743B1 (en) 2010-10-06
WO2005048113A3 (en) 2006-11-23
ATE484027T1 (en) 2010-10-15
JP2007510989A (en) 2007-04-26
WO2005048113A2 (en) 2005-05-26
US20050102474A1 (en) 2005-05-12
CN1997973A (en) 2007-07-11
EP1680743A2 (en) 2006-07-19

Similar Documents

Publication Publication Date Title
US8087024B2 (en) Multiple multi-threaded processors having an L1 instruction cache and a shared L2 instruction cache
CN100533372C (en) Store instruction ordering for multi-core processors
EP1214661B1 (en) Sdram controller for parallel processor architecture
JP5197010B2 (en) Ordering stored instructions for multi-core processors
Adiletta et al. The Next Generation of Intel IXP Network Processors.
EP1236094B1 (en) Branch instruction for multithreaded processor
EP1214660B1 (en) Sram controller for parallel processor architecture including address and command queue and arbiter
JP6243935B2 (en) Context switching method and apparatus
CN100378655C (en) Multi-threaded execution in parallel processors
US9021237B2 (en) Low latency variable transfer network communicating variable written to source processing core variable register allocated to destination thread to destination processing core variable register allocated to source thread
KR101279473B1 (en) Advanced processor
CN1997973B (en) Processor, method, apparatus and apparatus for dynamic cache engine instructions
US7000048B2 (en) Apparatus and method for parallel processing of network data on a single processing thread
CN108292239A (en) Multi-core communication acceleration using hardware queue devices
US11409506B2 (en) Data plane semantics for software virtual switches
CN108257078B (en) Memory aware reordering source
US20060179277A1 (en) System and method for instruction line buffer holding a branch target buffer
US7325099B2 (en) Method and apparatus to enable DRAM to support low-latency access via vertical caching
US9244798B1 (en) Programmable micro-core processors for packet parsing with packet ordering
US20050108479A1 (en) Servicing engine cache requests
US9455598B1 (en) Programmable micro-core processors for packet parsing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100609

Termination date: 20151029

EXPY Termination of patent right or utility model