CN1997973B - Processor, method, apparatus and device for dynamically caching engine instructions - Google Patents
- Publication number
- CN1997973B CN1997973B CN200480040023XA CN200480040023A CN1997973B CN 1997973 B CN1997973 B CN 1997973B CN 200480040023X A CN200480040023X A CN 200480040023XA CN 200480040023 A CN200480040023 A CN 200480040023A CN 1997973 B CN1997973 B CN 1997973B
- Authority
- CN
- China
- Legal status: Expired - Fee Related (status inferred by Google Patents, not a legal conclusion)
Classifications
- G06F9/3802 — Instruction prefetching
- G06F12/0875 — Caches with dedicated cache, e.g. instruction or stack
- G06F9/30047 — Prefetch instructions; cache control instructions
- G06F9/3851 — Instruction issuing from multiple instruction streams, e.g. multistreaming
- H04L49/3063 — Packet switching elements; peripheral units; pipelined operation
Abstract
Description
RELATED APPLICATIONS: This application is related to the following applications, filed on the same date as this application:
a. Attorney Docket No. P16851 - "Service Engine Cache Request"
b. Attorney Docket No. P16852 - "Thread-Based Engine Cache Partitioning"
Background
Networks enable computers and other devices to communicate. For example, a network may carry data representing video, audio, e-mail, and so forth. Typically, data sent over a network is divided into smaller messages known as packets. By analogy, a packet is much like an envelope you drop in a mailbox. A packet typically includes a "payload" and a "header". The packet's payload is analogous to the letter inside the envelope. The packet's header is much like the information written on the envelope itself. The header can include information to help network devices handle the packet appropriately. For example, the header can include an address that identifies the packet's destination.
A given packet may "hop" across many different intermediate network devices (e.g., "routers", "bridges", and "switches") before reaching its destination. These intermediate devices often perform a variety of packet processing operations. For example, intermediate devices often perform operations that determine how to forward a packet further toward its destination, or that determine the quality of service to use in handling the packet.
As network connection speeds increase, the amount of time an intermediate device has to process a packet continues to shrink. To achieve fast packet processing, many devices feature dedicated, "hard-wired" designs such as Application Specific Integrated Circuits (ASICs). However, these designs are often difficult to adapt to emerging network technologies and communication protocols.
To combine flexibility with the speed often associated with an ASIC, some network devices feature programmable network processors. Network processors enable software engineers to quickly reprogram network processor operations.
Often, again due to ever-increasing network connection speeds, the time it takes to process a packet greatly exceeds the interval between packet arrivals. Thus, some network processor architectures feature multiple processing engines that process packets simultaneously. For example, while one engine determines how to forward one packet, another engine determines how to forward a different one. Though the time to process a given packet may remain the same, processing multiple packets at once enables the network processor to keep up with a large volume of arriving packets.
Brief description of the drawings
FIG. 1 is a diagram illustrating instruction caching by a network processor.
FIG. 2 is a diagram illustrating a fetch instruction used to load instructions into an engine's instruction cache.
FIG. 3 is a flowchart of instruction processing performed by a network processor engine.
FIG. 4 is a flowchart illustrating instruction caching.
FIG. 5 is a diagram of engine circuitry for searching cached instructions.
FIG. 6 is a map of instruction cache memory allocated to different threads of a network processor engine.
FIG. 7 is a diagram of a network processor engine.
FIG. 8 is a diagram of a network processor.
FIG. 9 is a diagram of a network device.
Detailed description
FIG. 1 depicts a network processor 100 that includes multiple processing engines 102. The engines 102 can be programmed to perform a wide variety of packet processing operations such as determining a packet's next hop, applying Quality of Service (QoS), metering packet traffic, and so forth. In the architecture shown, an engine 102 executes program instructions 108 stored in the engine's fast local memory 104. Due to size and cost constraints, the amount of instruction memory 104 provided by an engine 102 is typically limited. To prevent the limited storage of engine memory 104 from imposing too severe a constraint on the overall size and complexity of a program 108, FIG. 1 illustrates an instruction caching scheme that dynamically downloads segments (e.g., 108b) of a larger program 108 to an engine 102 as the engine's 102 execution of the program 108 proceeds.
In the implementation shown in FIG. 1, each engine 102 includes an instruction cache 104 that stores a subset of the program's 108 instructions. For example, the instruction cache 104a of packet engine 102a holds segment 108b of program 108. The remainder of program 108 is stored in an instruction store 106 shared by the engines 102.
Eventually, engine 102a may need to access program segments other than segment 108b. For example, the program may branch or sequentially advance to a point in program 108 beyond segment 108b. So that the engine 102 can continue executing the program 108, the network processor 100 downloads the requested/needed segment to the cache 104a of engine 102a. Thus, the segments stored by the cache change dynamically as program execution proceeds.
As shown in FIG. 1, multiple engines 102 receive instructions to cache from the instruction store 106. The shared instruction store 106 may, in turn, cache instructions from a hierarchically higher instruction store 110 located inside or outside the processor. In other words, instruction stores 104, 106, and 110 may form a cache hierarchy that includes the engines' L1 instruction caches 104 and an L2 instruction cache 106 shared by the different engines.
Though FIG. 1 depicts the instruction store 106 as serving all of the engines 102, the network processor 100 may instead feature multiple shared stores 106 that serve different groups of engines 102. For example, one shared instruction store 106 may store program instructions for engines #1 through #4 while another stores program instructions for engines #5 through #8. Additionally, though FIG. 1 depicts the engine caches 104 and instruction store 106 as storing instructions of a single program 108, they may instead store instruction sets belonging to different programs. For example, the shared instruction store 106 may store a different program's instructions for each engine 102, or even for different threads of an engine 102.
FIG. 1 depicts the instructions 108 as source code to simplify illustration. The actual instructions stored by the shared store 106 and distributed to the engines would typically be executable instructions expressed in the instruction set provided by the engines.
Potentially, the program segments needed by an engine 102 to continue program execution can be provided on an "on-demand" basis. That is, an engine 102 can continue executing the instructions 108b stored in instruction cache 104a until an instruction to execute is not found in the cache 104a. When this occurs, the engine 102 can signal the shared store 106 to deliver the program segment that includes the next instruction to be executed. This "on-demand" scenario, however, can introduce delay into the engine's 102 program execution. That is, in the "on-demand" sequence, the engine 102 (or an engine 102 thread) may remain idle until the needed instructions are loaded. This delay may be caused not only by the operations involved in downloading the needed instructions into the engine's 102 L1 cache 104, but also by contention among engines 102b-102n for access to the shared store 106.
To potentially avoid this delay, FIG. 2 shows a portion of a program source code listing that includes a fetch instruction 122 that allows a program to initiate "prefetching" of program instructions into the engine's cache 104 before the instructions are needed to continue program execution. For example, as shown in FIG. 2, the fetch instruction 122 causes engine 102n to issue ("1") a request to the shared instruction store 106 for the next needed segment 108b before execution reaches a point within segment 108b. While the engine 102 continues processing the instructions 124 that follow the fetch instruction 122, the requested segment is retrieved ("2") and loaded into the instruction cache 104n of engine 102n. In other words, the time used to retrieve the program segment overlaps with the time between the engine's execution of the prefetch instruction 122 and the engine "running out" of instructions to execute in the currently cached program segment.
In the scenario shown in FIG. 2, the time used to retrieve the program instructions is concealed by the time spent executing the instructions that follow the fetch instruction 122. The fetch delay can also be "hidden" by issuing the fetch instruction after an instruction 120 (e.g., a memory operation) that takes some time to complete.
The example fetch instruction shown in FIG. 2 has the syntax:
Prefetch(SegmentAddress, SegmentCount) [, optional_token]
where SegmentAddress identifies the starting address of the program instructions to fetch from the shared store 106 and SegmentCount identifies the number of subsequent segments to fetch. Potentially, SegmentAddress may be constrained to identify the starting address of a program segment.
The optional_token has the syntax:
optional_token = [ctx_swap[signal],] [sig_done[signal]]
The ctx_swap parameter instructs the engine 102 to swap to another engine execution thread until a signal indicates that the segment fetch operation has completed. The sig_done parameter also identifies a status signal to set once the fetch operation completes, but does not instruct the engine 102 to swap contexts.
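The ctx_swap/sig_done behavior described above can be sketched in software. This is a minimal illustrative model, not the patent's implementation: the names (thread_ctx, prefetch_issue, fetch_complete) and the fixed thread/signal counts are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_THREADS = 4 };

typedef struct {
    bool runnable;     /* thread may be scheduled by the thread arbiter */
    int  wait_signal;  /* signal number the thread waits on, or -1 */
} thread_ctx;

static thread_ctx ctx[NUM_THREADS];
static bool signals[16];   /* engine status signals */

/* Issue a segment fetch.  With ctx_swap the issuing thread is swapped out
 * until the signal arrives; with sig_done it keeps running and only the
 * status signal is armed. */
void prefetch_issue(int thread, int signal, bool ctx_swap)
{
    signals[signal] = false;           /* set when the fetch completes */
    ctx[thread].wait_signal = signal;
    if (ctx_swap)
        ctx[thread].runnable = false;  /* swap out until signal delivered */
}

/* Called when the shared store finishes delivering the segment. */
void fetch_complete(int signal)
{
    signals[signal] = true;
    for (int t = 0; t < NUM_THREADS; t++)
        if (ctx[t].wait_signal == signal)
            ctx[t].runnable = true;    /* waiting thread may resume */
}

/* Pick the lowest-numbered runnable thread, or -1 if none. */
int next_thread(void)
{
    for (int t = 0; t < NUM_THREADS; t++)
        if (ctx[t].runnable)
            return t;
    return -1;
}
```

In this model a ctx_swap prefetch hides the fetch latency by letting another thread run, while a sig_done prefetch overlaps the fetch with the issuing thread's own subsequent instructions.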
The instruction syntax shown in FIG. 2 is merely an example, and other instructions for fetching program instructions may feature different parameters and keywords and offer different options. Additionally, the instruction may exist at different levels. For example, the instruction may be part of an engine's instruction set. Alternatively, the instruction may be a source code instruction processed by a compiler to generate target instructions (e.g., engine-executable instructions) corresponding to the fetch instruction. Such a compiler may perform other conventional compiler operations such as lexical analysis to group the text characters of source code into "tokens", syntax analysis that groups the tokens into grammatical phrases, intermediate code generation that represents the source code more abstractly, optimization, and so forth.
A fetch instruction can be inserted manually by a programmer during code development. For example, based on an initial classification of a packet, the remaining program flow for the packet may be known. Thus, a fetch instruction can retrieve the segments needed to process the packet after classification. For example, a program written in a high-level language may include the instructions:
switch (classify(packet.header)) {
    case DropPacket: {
        prefetch(DropCounterInstructions);
    }
    case ForwardPacket: {
        prefetch(RoutingLookupInstructions);
        prefetch(PacketEnqueueInstructions);
    }
}
These instructions load the appropriate program segments into the engine's 102 instruction cache 104 based on the packet's classification.
Though a programmer may manually insert fetch instructions into code, fetch instructions may also be inserted into the code by software development tools such as compilers, analyzers, profilers, and/or preprocessors. For example, analysis of code flow can identify when different program segments should be loaded. For instance, a compiler may insert a fetch instruction after a memory access instruction or before a set of instructions that will take some time to execute.
FIG. 3 shows a flowchart illustrating operation of an engine that retrieves instructions both "on demand" and in response to a "fetch" instruction. As shown in FIG. 3, a program counter 130 identifying the next program instruction to execute is updated. For example, the program counter 130 may be incremented to advance to the next sequential instruction address, or the counter 130 may be set to some other instruction address in response to a branch instruction. As shown, the engine determines 132 whether the engine's instruction cache currently holds the instruction identified by the program counter. If not, the engine thread stalls 134 (e.g., the thread needing the instruction is swapped out of the engine) until a fetch operation 136 retrieves the missing instructions from the shared store.
Once the instruction to execute is present in the engine's instruction cache, the engine can determine 140 whether the next instruction to execute is a fetch instruction. If so, the engine can initiate a fetch 142 of the requested program segment. If not, the engine can process 144 the instruction as usual.
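The FIG. 3 decision flow can be sketched as a short loop body. This is an illustrative model only; the helper names (cache_holds, load_segment, step), the segment size, and the slot count are assumptions, not details from the patent.

```c
#include <assert.h>
#include <stdbool.h>

enum op { OP_ALU, OP_PREFETCH };

typedef struct { enum op kind; int target_segment; } instr;

enum { SEG_SIZE = 64, NUM_SLOTS = 4 };

static int cached_segments[NUM_SLOTS] = { -1, -1, -1, -1 };
static int next_slot;
static int stalls;   /* count of "on-demand" stalls, for illustration */

static bool cache_holds(int pc)
{
    int seg = pc / SEG_SIZE;
    for (int i = 0; i < NUM_SLOTS; i++)
        if (cached_segments[i] == seg)
            return true;
    return false;
}

static void load_segment(int seg)   /* install into the next slot */
{
    cached_segments[next_slot] = seg;
    next_slot = (next_slot + 1) % NUM_SLOTS;
}

/* One trip around the FIG. 3 loop for the instruction at pc. */
void step(int pc, const instr *i)
{
    if (!cache_holds(pc)) {           /* 132: miss -> thread stalls (134) */
        stalls++;
        load_segment(pc / SEG_SIZE);  /* 136: demand fetch                */
    }
    if (i->kind == OP_PREFETCH)       /* 140: is it a fetch instruction?  */
        load_segment(i->target_segment);  /* 142: start segment fetch     */
    /* else 144: process the instruction as usual (omitted)               */
}
```

Note the asymmetry the flowchart implies: a demand miss stalls the thread, while a prefetch instruction merely starts a fetch and lets execution continue.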
FIG. 4 shows an example architecture of the shared instruction store 106. For example, during network processor boot-up, the instruction store 106 receives ("1") the instructions to share with the engines. Thereafter, the shared instruction store 106 distributes portions of the instructions 108 to the engines as needed and/or as requested.
As shown in the example architecture of FIG. 4, two different buses 150, 152 may connect the shared store 106 to the engines 102. Bus 150 carries ("2") fetch requests to the shared store 106. These requests can identify the program segment 108 to retrieve and the requesting engine. A request can also identify whether it is a prefetch or an "on-demand" fetch. A high-bandwidth bus 152 carries ("4") the instructions of the requested program segments back to the requesting engines 102. The bandwidth of bus 152 can permit the shared store 106 to transmit requested instructions to multiple engines simultaneously. For example, bus 152 may be divided into n lines that can be dynamically allocated to the engines. For instance, if four engines request segments, each may be allocated 25% of the bus bandwidth.
As shown, the shared store 106 can queue requests as they arrive (e.g., in a first-in-first-out (FIFO) queue 154 for subsequent servicing). As described above, however, a thread stalls when an instruction to be executed has not yet been loaded into the engine's instruction cache 104. Thus, servicing an "on-demand" request that has caused an actual stall takes on greater urgency than servicing a "prefetch" request that may or may not lead to one. As shown, the shared store 106 includes an arbiter 156 that can grant demand requests priority over prefetch requests. The arbiter 156 may include dedicated circuitry or may be programmable.
The arbiter 156 can prioritize demand requests in a variety of ways. For example, instead of adding a demand request to queue 154, the arbiter 156 may flag the request for immediate servicing ("3"). To prioritize among multiple "demand" requests, the arbiter 156 can also maintain a separate "demand" FIFO queue that the arbiter 156 grants priority over the requests in FIFO queue 154. The arbiter 156 may also immediately suspend an in-progress instruction download operation to service a demand request. Additionally, the arbiter 156 may allocate a large portion (if not 100%) of the bandwidth of bus 152 to delivering segment instructions to the engine that issued the "on-demand" request.
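The two-queue policy described above can be sketched as follows. This is a software model of one of the arbitration options, under assumed names (fifo, request_segment, arbiter_next) and an assumed queue capacity; the patent's arbiter 156 may be dedicated circuitry.

```c
#include <assert.h>

enum { QCAP = 8 };

typedef struct { int requests[QCAP]; int head, tail; } fifo;

static void fifo_push(fifo *q, int seg) { q->requests[q->tail % QCAP] = seg; q->tail++; }
static int  fifo_empty(const fifo *q)   { return q->head == q->tail; }
static int  fifo_pop(fifo *q)           { return q->requests[q->head++ % QCAP]; }

static fifo demand_q, prefetch_q;

/* Enqueue a segment request, routed by request type (cf. bus 150). */
void request_segment(int seg, int is_demand)
{
    fifo_push(is_demand ? &demand_q : &prefetch_q, seg);
}

/* Pick the next request to service: demand requests always win,
 * since a demand request means some thread is actually stalled. */
int arbiter_next(void)
{
    if (!fifo_empty(&demand_q))
        return fifo_pop(&demand_q);
    if (!fifo_empty(&prefetch_q))
        return fifo_pop(&prefetch_q);
    return -1;   /* nothing pending */
}
```

The design choice mirrors the text: prefetch requests are speculative latency-hiding work, so letting a late-arriving demand request bypass them bounds the stall time of the requesting thread.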
FIG. 5 shows an example architecture of an engine's instruction cache. As shown, cache storage is provided by a collection of memory devices 166x that store instructions received from the shared instruction store 106 over bus 164. An individual memory element 166a may be sized to hold one program segment. As shown, each memory 166x is associated with an address decoder that receives the address of an instruction to be processed by the engine and determines whether the instruction resides in the associated memory 166. The different decoders operate on an address in parallel. That is, each decoder simultaneously searches its associated memory. If the instruction address is found in one of the memories 166x, that memory 166x unit outputs 168 the requested instruction for processing by the engine. If the instruction address is not found in any of the memories 166, a "miss" signal 168 is generated.
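The parallel decoder match in FIG. 5 can be modeled as below. The hardware checks all decoders simultaneously; the sequential loop, the base/size bookkeeping, and the names (seg_mem, cache_lookup) are assumptions made for this sketch.

```c
#include <assert.h>

enum { NUM_MEMS = 4, SEG_WORDS = 64 };

typedef struct {
    int valid;
    int base;                  /* first instruction address of the segment */
    unsigned insn[SEG_WORDS];  /* cached instruction words */
} seg_mem;

static seg_mem mems[NUM_MEMS];

/* Returns the instruction word for addr, or sets *miss when no
 * decoder matches (the "miss" signal 168 in FIG. 5). */
unsigned cache_lookup(int addr, int *miss)
{
    for (int m = 0; m < NUM_MEMS; m++) {  /* decoders compare in parallel */
        if (mems[m].valid &&
            addr >= mems[m].base && addr < mems[m].base + SEG_WORDS) {
            *miss = 0;
            return mems[m].insn[addr - mems[m].base];
        }
    }
    *miss = 1;   /* no segment holds addr: trigger a demand fetch */
    return 0;
}
```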
As described above, an engine may provide multiple threads of execution. In the course of execution, these different threads will load different program segments into the engine's instruction cache. When the cache is full, loading a segment into the cache requires some other segment to be removed ("victimized") from the cache. Without some safeguard, a thread could victimize a segment currently being used by another thread. When the other thread resumes processing, the recently victimized segment would be retrieved from the shared store 106 yet again. This inter-thread thrashing of the instruction cache 104 could repeat continually and significantly degrade system performance, as a segment just loaded into the cache by one thread is prematurely victimized by another thread and reloaded a short time later.
To combat such thrashing, a variety of mechanisms can impose limits on a thread's ability to victimize segments. For example, FIG. 6 depicts a memory map of an engine's instruction cache 104 in which each engine thread is allocated an exclusive portion of the cache 104. For example, thread 0 172 is allocated memory for N program segments 172a, 172b, 172n. Instruction segments fetched for a thread can reside in the allocated portion of the thread's cache 104. To prevent thrashing, logic can restrict a thread from victimizing segments in the cache partitions allocated to other threads.
For fast access to cached segments, a control and status register (CSR) associated with a thread can store the starting address of the thread's allocated cache portion. This address can be computed, for example, based on the thread number (e.g., allocated space starting address = base address + (thread # × memory allocated per thread)). Each partition can be further divided into segments that correspond, for example, to the burst fetch size from the shared store 106 or some other granularity of transfer from the shared store 106 to the engine cache. LRU (Least Recently Used) information can be maintained for the different segments within a thread's allocated cache portion. Thus, in an LRU scheme, the least recently used segment in a given thread's cache portion can be victimized first.
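The partition arithmetic and per-thread LRU choice above can be sketched directly. The sizes, the timestamp representation, and the names (partition_base, lru_victim) are illustrative assumptions, not values from the patent.

```c
#include <assert.h>

enum { BASE_ADDR = 0x000, SEGS_PER_THREAD = 4, SEG_BYTES = 256 };

/* allocated space starting address =
 *     base address + thread# * (memory allocated per thread) */
int partition_base(int thread)
{
    return BASE_ADDR + thread * SEGS_PER_THREAD * SEG_BYTES;
}

/* last_used[t][s]: timestamp of the last access to slot s of thread t. */
static int last_used[8][SEGS_PER_THREAD];

/* The victim is chosen only within the requesting thread's partition,
 * so one thread can never evict another thread's segments. */
int lru_victim(int thread)
{
    int victim = 0;
    for (int s = 1; s < SEGS_PER_THREAD; s++)
        if (last_used[thread][s] < last_used[thread][victim])
            victim = s;
    return victim;
}
```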
In addition to the regions divided among the different threads, the map shown also includes a "lock-down" portion 170. Instructions in the lock-down region can be loaded at initialization and can be protected from victimization. All threads can access and execute instructions stored in this region.
A memory allocation scheme such as the one shown in FIG. 6 can prevent inter-thread thrashing. However, other approaches may also be used. For example, an access count can be associated with the threads currently using a segment. When the count reaches zero, the segment can be victimized. Alternatively, a cache victimization scheme can apply different rules. For example, the scheme may attempt to avoid victimizing loaded segments that have not yet been accessed by any thread.
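The access-count alternative mentioned above amounts to simple reference counting on a segment. A minimal sketch, with assumed names (seg_acquire, seg_release, seg_victimizable):

```c
#include <assert.h>

typedef struct { int users; } segment;   /* per-segment access count */

void seg_acquire(segment *s) { s->users++; }   /* a thread starts using it */
void seg_release(segment *s) { s->users--; }   /* a thread is done with it */

/* A segment may only be victimized once no thread is using it. */
int seg_victimizable(const segment *s) { return s->users == 0; }
```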
FIG. 7 shows an example engine 102 architecture. The engine 102 may be a Reduced Instruction Set Computing (RISC) processor tailored for packet processing. For example, the engine 102 may not provide floating point or integer division instructions commonly provided by the instruction sets of general-purpose processors.
The engine 102 may communicate with other network processor components (e.g., shared memory) via transfer registers 192a, 192b that buffer data to send to, or received from, the other components. The engine 102 may also communicate with other engines 102 via "neighbor" registers 194a, 194b hard-wired to the other engines.
The example engine 102 shown provides multiple threads of execution. To support the multiple threads, the engine 102 stores a program context 182 for each thread. A context 182 can include thread state data such as a program counter. A thread arbiter 180 selects the program context 182x of a thread to execute. The program counter for the selected context is fed to the instruction cache 104. When the instruction identified by the program counter is not currently cached (e.g., the segment is neither in the locked-down cache region nor in the region allocated to the currently executing thread), the cache 104 can initiate a program segment fetch operation. Otherwise, the cache 104 sends the cached instruction to an instruction decode unit 186. Potentially, the decode unit 186 identifies the instruction as a "fetch" instruction and initiates a segment fetch operation. Otherwise, the decode unit 186 can feed the instruction to an execution unit (e.g., an ALU) for processing, or can initiate a request, via a command queue 188, for a resource shared by the different engines (e.g., a memory controller).
A fetch control unit 184 handles retrieval of program segments from the shared store 106. For example, the fetch control unit 184 can negotiate access to the shared store's request bus, issue requests, and store returned instructions in the instruction cache 104. The fetch control unit 184 can also handle victimization of previously cached instructions.
The instruction cache 104 and decoder 186 of the engine 102 form part of an instruction processing pipeline. That is, over the course of multiple clock cycles, an instruction may be loaded from the cache 104, decoded by decoder 186, have its operands loaded (e.g., from general purpose registers 196, next neighbor registers 194a, transfer registers 192a, and local memory 198), and be executed by the execution data path 190. Finally, the result of the operation can be written (e.g., to general purpose registers 196, local memory 198, next neighbor registers 194b, or transfer registers 192b). Many instructions may be in the pipeline at the same time. That is, while one instruction is being decoded, another is being loaded from the L1 instruction cache 104.
FIG. 8 shows an embodiment of a network processor 200. The network processor 200 shown is an Internet eXchange network Processor (IXP). Other network processors feature different designs.
The network processor 200 shown features a collection of packet engines 102 integrated on a single die. As described above, an individual packet engine 102 may provide multiple threads. The processor 200 may also include a core processor 210 (e.g., a StrongARM) that is typically programmed to perform "control plane" tasks involved in network operations. The core processor 210, however, may also handle "data plane" tasks and may provide additional packet processing threads.
As shown, the network processor 200 also features interfaces 202 that can carry packets between the processor 200 and other network components. For example, the processor 200 can feature a switch fabric interface 202 (e.g., a Common Switch Interface (CSIX) interface) that enables the processor 200 to transmit packets to other processors or circuitry connected to the fabric. The processor 200 can also feature an interface 202 (e.g., a System Packet Interface (SPI) interface) that enables the processor to communicate with physical layer (PHY) and/or link layer devices. The processor 200 may also include an interface 208 (e.g., a Peripheral Component Interconnect (PCI) bus interface) for communicating, for example, with a host. As shown, the processor 200 may also include other components shared by the engines, such as memory controllers 206, 212, a hash engine, and scratchpad memory.
The packet processing techniques described above can be implemented on a network processor, such as the IXP, in a wide variety of ways. For example, the core processor 210 can deliver program instructions to the shared instruction store 106 during network processor boot-up. Additionally, instead of a "two-deep" instruction cache hierarchy, the processor 200 may feature an N-deep instruction cache hierarchy, for example, when the processor features a large number of engines.
FIG. 9 shows a network device that incorporates the techniques described above. As shown, the device features a collection of line cards ("blades") interconnected by a switch fabric 310 (e.g., a crossbar or shared memory switch fabric). The switch fabric may, for example, conform to CSIX or other fabric technologies such as HyperTransport, Infiniband, PCI-X, and so forth.
Individual line cards (e.g., 300a) may include one or more physical layer (PHY) devices 302 (e.g., optical, wire, and wireless PHYs) that handle communication over network connections. The PHYs translate between the physical signals carried by different network media and the bits (e.g., "0"s and "1"s) used by digital systems. The line cards 300 may also include framer devices 304 (e.g., Ethernet, Synchronous Optical Network (SONET), High-Level Data Link Control (HDLC) framers, or other "layer 2" devices) that can perform operations on frames such as error detection and/or correction. The line cards 300 shown may also include one or more network processors 306 that use the instruction caching techniques described above. The network processors 306 are programmed to perform packet processing operations on packets received via the PHYs 302 and to direct the packets, via the switch fabric 310, to a line card providing the selected egress interface. Potentially, the network processors 306 may perform "layer 2" duties instead of the framer devices 304.
While FIGs. 8 and 9 describe example architectures of an engine, a network processor, and a device incorporating network processors, the techniques may be implemented in other engine, network processor, and device designs. Additionally, the techniques may be used in a variety of network devices (e.g., routers, switches, bridges, hubs, traffic generators, and so forth).
The term circuitry as used herein includes hard-wired circuitry, digital circuitry, analog circuitry, programmable circuitry, and so forth. The programmable circuitry may operate on a computer program.
Such computer programs may be coded in a high-level procedural or object-oriented programming language. However, the program(s) can be implemented in assembly or machine language if desired. The language may be compiled or interpreted. Additionally, these techniques may be used in a wide variety of networking environments.
Other embodiments are within the scope of the following claims.
Claims (18)
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US10/704,432 | 2003-11-06 | ||
| US10/704,432 US20050102474A1 (en) | 2003-11-06 | 2003-11-06 | Dynamically caching engine instructions |
| PCT/US2004/035923 WO2005048113A2 (en) | 2003-11-06 | 2004-10-29 | Dynamically caching engine instructions for on demand program execution |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN1997973A (en) | 2007-07-11 |
| CN1997973B (en) | 2010-06-09 |
Family
ID=34552126
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN200480040023XA Expired - Fee Related CN1997973B (en) | 2004-10-29 | Processor, method, apparatus and device for dynamically caching engine instructions |
Country Status (7)
| Country | Link |
|---|---|
| US (1) | US20050102474A1 (en) |
| EP (1) | EP1680743B1 (en) |
| JP (1) | JP2007510989A (en) |
| CN (1) | CN1997973B (en) |
| AT (1) | ATE484027T1 (en) |
| DE (1) | DE602004029485D1 (en) |
| WO (1) | WO2005048113A2 (en) |
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7376789B2 (en) * | 2005-06-29 | 2008-05-20 | Intel Corporation | Wide-port context cache apparatus, systems, and methods |
| US20070005898A1 (en) * | 2005-06-30 | 2007-01-04 | William Halleck | Method, apparatus and system for task context cache replacement |
| US7676604B2 (en) * | 2005-11-22 | 2010-03-09 | Intel Corporation | Task context direct indexing in a protocol engine |
| US8209488B2 (en) * | 2008-02-01 | 2012-06-26 | International Business Machines Corporation | Techniques for prediction-based indirect data prefetching |
| US8166277B2 (en) * | 2008-02-01 | 2012-04-24 | International Business Machines Corporation | Data prefetching using indirect addressing |
| CN102571761B (en) * | 2011-12-21 | 2014-08-20 | 四川长虹电器股份有限公司 | Information transmission method of network interface |
| US10095847B2 (en) * | 2012-05-25 | 2018-10-09 | Koninklijke Philips N.V. | Method, system and device for protection against reverse engineering and/or tampering with programs |
| CN102855213B (en) * | 2012-07-06 | 2017-10-27 | 中兴通讯股份有限公司 | A kind of instruction storage method of network processing unit instruction storage device and the device |
| US9223705B2 (en) * | 2013-04-01 | 2015-12-29 | Advanced Micro Devices, Inc. | Cache access arbitration for prefetch requests |
| US9286258B2 (en) | 2013-06-14 | 2016-03-15 | National Instruments Corporation | Opaque bridge for peripheral component interconnect express bus systems |
| WO2015015251A1 (en) * | 2013-08-01 | 2015-02-05 | Yogesh Chunilal Rathod | Presenting plurality types of interfaces and functions for conducting various activities |
| US11397520B2 (en) | 2013-08-01 | 2022-07-26 | Yogesh Chunilal Rathod | Application program interface or page processing method and device |
| CN113326020B (en) | 2020-02-28 | 2025-04-25 | 昆仑芯(北京)科技有限公司 | Cache device, cache, system, data processing method, device and medium |
| CN116151338A (en) * | 2021-11-19 | 2023-05-23 | 平头哥(上海)半导体技术有限公司 | Cache access method and related graph neural network system |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6574712B1 (en) * | 1999-11-08 | 2003-06-03 | International Business Machines Corporation | Software prefetch system and method for predetermining amount of streamed data |
| CN1307564C (en) * | 1999-08-27 | 2007-03-28 | International Business Machines Corporation | Network switch and components and method of operation |
Family Cites Families (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPS60263238A (en) * | 1984-06-11 | 1985-12-26 | Nippon Telegr & Teleph Corp <Ntt> | Information processor |
| JPH03268041A (en) * | 1990-03-17 | 1991-11-28 | Res Dev Corp Of Japan | Cache operation clarifying computer |
| US6021471A (en) * | 1994-11-15 | 2000-02-01 | Advanced Micro Devices, Inc. | Multiple level cache control system with address and data pipelines |
| JPH096633A (en) * | 1995-06-07 | 1997-01-10 | Internatl Business Mach Corp <Ibm> | Method and system for operation of high-performance multiplelogical route in data-processing system |
| JPH09282223A (en) * | 1996-04-12 | 1997-10-31 | Ricoh Co Ltd | Memory controller |
| US6446143B1 (en) * | 1998-11-25 | 2002-09-03 | Compaq Information Technologies Group, L.P. | Methods and apparatus for minimizing the impact of excessive instruction retrieval |
| JP3420091B2 (en) * | 1998-11-30 | 2003-06-23 | Necエレクトロニクス株式会社 | Microprocessor |
| US6393551B1 (en) * | 1999-05-26 | 2002-05-21 | Infineon Technologies North America Corp. | Reducing instruction transactions in a microprocessor |
| US6668317B1 (en) * | 1999-08-31 | 2003-12-23 | Intel Corporation | Microengine for parallel processor architecture |
| US6606704B1 (en) * | 1999-08-31 | 2003-08-12 | Intel Corporation | Parallel multithreaded processor with plural microengines executing multiple threads each microengine having loadable microcode |
| JP3741945B2 (en) * | 1999-09-30 | 2006-02-01 | 富士通株式会社 | Instruction fetch control device |
| US6470427B1 (en) * | 1999-11-09 | 2002-10-22 | International Business Machines Corporation | Programmable agent and method for managing prefetch queues |
| JP2001236221A (en) * | 2000-02-21 | 2001-08-31 | Keisuke Shindo | Pipe line parallel processor using multi-thread |
| JP2001344153A (en) * | 2000-05-30 | 2001-12-14 | Nec Corp | Cash memory controller for multiprocessor system |
| JP2002132702A (en) * | 2000-10-30 | 2002-05-10 | Nec Eng Ltd | Memory control system |
| US20020181476A1 (en) * | 2001-03-17 | 2002-12-05 | Badamo Michael J. | Network infrastructure device for data traffic to and from mobile units |
| US7487505B2 (en) * | 2001-08-27 | 2009-02-03 | Intel Corporation | Multithreaded microprocessor with register allocation based on number of active threads |
| US7120755B2 (en) * | 2002-01-02 | 2006-10-10 | Intel Corporation | Transfer of cache lines on-chip between processing cores in a multi-core system |
| US6915415B2 (en) * | 2002-01-07 | 2005-07-05 | International Business Machines Corporation | Method and apparatus for mapping software prefetch instructions to hardware prefetch logic |
| US20050050306A1 (en) * | 2003-08-26 | 2005-03-03 | Sridhar Lakshmanamurthy | Executing instructions on a processor |
- 2003
  - 2003-11-06 US US10/704,432 patent/US20050102474A1/en not_active Abandoned
- 2004
  - 2004-10-29 JP JP2006538286A patent/JP2007510989A/en active Pending
  - 2004-10-29 WO PCT/US2004/035923 patent/WO2005048113A2/en not_active Ceased
  - 2004-10-29 CN CN200480040023XA patent/CN1997973B/en not_active Expired - Fee Related
  - 2004-10-29 AT AT04796714T patent/ATE484027T1/en not_active IP Right Cessation
  - 2004-10-29 EP EP04796714A patent/EP1680743B1/en not_active Expired - Lifetime
  - 2004-10-29 DE DE602004029485T patent/DE602004029485D1/en not_active Expired - Lifetime
Non-Patent Citations (1)
| Title |
|---|
| Full text. |
Also Published As
| Publication number | Publication date |
|---|---|
| DE602004029485D1 (en) | 2010-11-18 |
| EP1680743B1 (en) | 2010-10-06 |
| WO2005048113A3 (en) | 2006-11-23 |
| ATE484027T1 (en) | 2010-10-15 |
| JP2007510989A (en) | 2007-04-26 |
| WO2005048113A2 (en) | 2005-05-26 |
| US20050102474A1 (en) | 2005-05-12 |
| CN1997973A (en) | 2007-07-11 |
| EP1680743A2 (en) | 2006-07-19 |
Similar Documents
| Publication | Title |
|---|---|
| US8087024B2 (en) | Multiple multi-threaded processors having an L1 instruction cache and a shared L2 instruction cache |
| CN100533372C (en) | Store instruction ordering for multi-core processors |
| EP1214661B1 (en) | Sdram controller for parallel processor architecture |
| JP5197010B2 (en) | Ordering stored instructions for multi-core processors |
| Adiletta et al. | The Next Generation of Intel IXP Network Processors |
| EP1236094B1 (en) | Branch instruction for multithreaded processor |
| EP1214660B1 (en) | Sram controller for parallel processor architecture including address and command queue and arbiter |
| JP6243935B2 (en) | Context switching method and apparatus |
| CN100378655C (en) | Multi-threaded execution in parallel processors |
| US9021237B2 (en) | Low latency variable transfer network communicating variable written to source processing core variable register allocated to destination thread to destination processing core variable register allocated to source thread |
| KR101279473B1 (en) | Advanced processor |
| CN1997973B (en) | Processor, method, apparatus and apparatus for dynamic cache engine instructions |
| US7000048B2 (en) | Apparatus and method for parallel processing of network data on a single processing thread |
| CN108292239A (en) | Multi-core communication acceleration using hardware queue devices |
| US11409506B2 (en) | Data plane semantics for software virtual switches |
| CN108257078B (en) | Memory aware reordering source |
| US20060179277A1 (en) | System and method for instruction line buffer holding a branch target buffer |
| US7325099B2 (en) | Method and apparatus to enable DRAM to support low-latency access via vertical caching |
| US9244798B1 (en) | Programmable micro-core processors for packet parsing with packet ordering |
| US20050108479A1 (en) | Servicing engine cache requests |
| US9455598B1 (en) | Programmable micro-core processors for packet parsing |
Legal Events
| Code | Title | Description |
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20100609; Termination date: 20151029 |
| EXPY | Termination of patent right or utility model | |