
CN1698031A - Methods to prefetch data/instructions associated with externally triggered events - Google Patents


Info

Publication number
CN1698031A
CN1698031A (application CNA038012367A / CN 03801236 A)
Authority
CN
China
Prior art keywords
data
processor
commands
scheduler
cache
Prior art date
Legal status
Granted
Application number
CNA038012367A
Other languages
Chinese (zh)
Other versions
CN100345103C (en)
Inventor
安德烈亚斯·多林
Current Assignee
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of CN1698031A publication Critical patent/CN1698031A/en
Application granted granted Critical
Publication of CN100345103C publication Critical patent/CN100345103C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3824 Operand accessing
    • G06F 9/383 Operand prefetching
    • G06F 9/3802 Instruction prefetching

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)

Abstract

A method of prefetching data/instructions related to externally triggered events in a system including an infrastructure (18) having an input interface (20) for receiving data/instructions to be handled by the infrastructure and an output interface (22) for transmitting data after they have been handled, a memory (14) for storing data/instructions when they are received by the input interface, a processor (10) for processing at least some of the data/instructions, the processor having a cache wherein the data/instructions are stored before being processed, and an external source (26) for assigning sequential tasks to the processor. The method comprises the following steps, which are performed while the processor is performing a previous task: determining the location in the memory of the data/instructions to be processed by the processor, indicating to the cache the addresses of these memory locations, fetching the contents of the memory locations and writing them into the cache, and assigning the task of processing the data/instructions to the processor.

Description

Methods to prefetch data/instructions associated with externally triggered events

Technical Field

The present invention relates generally to systems in which an external source, such as a scheduler in a network processor, can interrupt a processor to process a task whose data/instructions are unrelated to the previous task, and in particular to a method for prefetching the data/instructions associated with externally triggered events.

Background Art

The efficiency of modern microprocessors and microprocessor cores depends heavily on the efficiency of their cache memories, because instruction cycle times are much shorter than memory access times. A cache exploits the locality of memory accesses, that is, the fact that a memory access is likely to be close to previous accesses.

A cache contains a mechanism, the cache controller, for loading selected regions (cache lines) with new content; to do so, it makes room by discarding old entries. The cache controller can currently be activated by software through cache prefetch instructions (for example, Data Cache Block Touch on all PowerPC-compliant devices). There are also proposals for cache controllers that recognize regular access patterns of data structures, such as linear strides or linked lists. Unfortunately, the existing approaches do not cover externally triggered events, where the required memory contents are unrelated to the preceding processing. In such cases, the only component with knowledge of the required memory contents is the event source, such as an interrupt source, a task-assigning scheduler, or another processor.

In systems where an external source, such as a scheduler in a network processor, can interrupt the processor to handle data unrelated to the previously processed data, the processor generates a cache miss. This means the processor stalls until the data it needs is loaded from memory into the cache, which wastes considerable time. With current memory technology and a processor clock speed of 400 MHz, each cache miss costs about 36 processor clock cycles, corresponding to roughly 40 instructions. Moreover, current technology trends show processor instruction rates growing faster than memory latency improves, so the number of instructions lost per cache miss keeps increasing.
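As a quick sanity check on the figures quoted above, the miss cost can be worked out from the 400 MHz clock and the 36-cycle penalty; the implied issue rate of slightly more than one instruction per cycle is an inference, not a figure from the text:

```python
# Back-of-envelope cost of one cache miss, using the figures quoted above.
clock_hz = 400e6            # processor clock speed
miss_cycles = 36            # stall cycles per cache miss (quoted figure)
ipc = 40 / 36               # instructions per cycle implied by "about 40 instructions"

lost_instructions = miss_cycles * ipc
stall_ns = miss_cycles / clock_hz * 1e9

print(round(lost_instructions), round(stall_ns))  # about 40 instructions lost, 90 ns per miss
```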

Summary of the Invention

Therefore, the main object of the present invention is to provide a method for prefetching data/instructions associated with externally triggered events, so as to avoid cache misses for data whose addresses can easily be determined.

The present invention therefore relates to a method for prefetching data/instructions associated with externally triggered events in a system comprising: an infrastructure having an input interface for receiving the data/instructions to be handled by the infrastructure and an output interface for transmitting data after they have been processed; a memory for storing the data/instructions when they are received by the input interface; a processor for processing at least some of the data/instructions, the processor having a cache in which data/instructions are stored before being processed; and an external source for assigning sequential tasks to the processor. The method comprises the following steps, performed while the processor is executing a previous task: determining the locations in memory of the data/instructions to be processed by the processor; indicating the addresses of these memory locations to the cache; fetching the contents of the memory locations and writing them into the cache; and assigning the task of processing the data/instructions to the processor.
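The four claimed steps can be sketched as a small simulation. This is an illustrative model only; the class and function names (Cache, CacheController, schedule_task) are assumptions, not names from the patent:

```python
# Minimal sketch of the claimed prefetch sequence: while the processor is
# still busy with the previous task, the scheduler warms the cache and only
# then assigns the new task.

class Cache:
    def __init__(self):
        self.lines = {}                     # address -> cached contents

class CacheController:
    def __init__(self, memory, cache):
        self.memory, self.cache = memory, cache
    def prefetch(self, addresses):
        # Step 3: fetch the memory contents and write them into the cache.
        for a in addresses:
            self.cache.lines[a] = self.memory[a]

def schedule_task(task, controller, processor_queue):
    addresses = task["locations"]           # step 1: locations are known up front
    controller.prefetch(addresses)          # steps 2+3: indicate addresses, fetch
    processor_queue.append(task)            # step 4: assign the task last

memory = {0x100: "header", 0x200: "classifier-info"}
cache = Cache()
queue = []
schedule_task({"locations": [0x100, 0x200]}, CacheController(memory, cache), queue)
print(cache.lines)  # both locations are warm before the task is dispatched
```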

Brief Description of the Drawings

The above and other objects, features, and advantages of the present invention will be better understood by reading the following more detailed description in conjunction with the accompanying drawings, in which:

Figure 1 is a block diagram of a network processing system in which the method according to the invention is implemented.

Figure 2 is a flow chart representing the steps of the method according to the invention.

Detailed Description

Such a system includes a processor core 10, for example a PowerPC processor core equipped with a data/instruction cache. The system is built around a high-performance bus 12, such as a Processor Local Bus (PLB), which provides the connection to an external memory 14 (e.g., SDRAM) holding data and instructions, through the intermediary of a memory controller 16. The memory controller makes the bus structure independent of the memory by generating, for example, all the necessary timing and refresh signals.

Bus 12 and memory 14 are also used by an infrastructure 18, which handles the data packets received from the network on an input interface 20. Infrastructure 18 manages reception and transmission, including packet assembly, memory allocation and deallocation, and insertion into and removal from packet queues.

Some packets do not need to be processed and are sent directly over the network by an output interface 22. Other packets need to be processed by processor 10. A lookup and classification unit 24 determines whether a packet needs to be processed and what kind of processing must be performed. To process a data packet, processor 10 needs several pieces of information. In particular, it needs access to the packet's header and to additional information produced in infrastructure 18. For example, the infrastructure may have several ports on which packets can arrive, and the processor needs to know which port a packet came from.

A scheduler 26 keeps all data packets that need to be processed by the processor in one or several queues. These queues need not physically reside in the scheduler, but at least the front entry of each queue must be stored on-chip. The scheduler tracks processor activity: when the processor has finished processing a packet, it requests a new task from the scheduler. However, if several queues with different priorities are managed, scheduler 26 can also interrupt the processor's handling of a low-priority task in favor of a higher-priority one.

In any case, scheduler 26 knows the next task that processor 10 will handle, and the selected task determines which data will be accessed. In the network processor described here, the relationship between a task (a queue entry) and the addresses accessed first, namely the packet header and the additional information, is simple. The translation from a queue entry to a set of addresses is performed by an address calculation unit.
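A sketch of what such an address calculation unit computes. The base addresses, buffer sizes, and the queue-entry layout below are assumptions for illustration; the patent only states that the translation from queue entry to address set is simple:

```python
# Hedged sketch of the "address calculation unit": given a queue entry that
# identifies a packet buffer, derive the addresses the processor will touch
# first, i.e. the packet header and the per-packet additional information.

HEADER_BASE = 0x1000_0000     # region holding packet headers (assumed layout)
INFO_BASE   = 0x2000_0000     # region holding per-packet side information (assumed)
BUF_SIZE    = 2048            # fixed-size packet buffers (assumed)
INFO_SIZE   = 64              # fixed-size info records (assumed)

def addresses_for(queue_entry):
    """Translate a queue entry (a buffer index) into the set of addresses
    to prefetch before the task is assigned."""
    idx = queue_entry["buffer_index"]
    return [HEADER_BASE + idx * BUF_SIZE, INFO_BASE + idx * INFO_SIZE]

print([hex(a) for a in addresses_for({"buffer_index": 3})])
```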

When processor 10 processes a new packet and accesses data such as the packet header, it would, without the present invention, typically generate a cache miss. This means the processor would stall until the required data is loaded from external memory 14 into the cache, which, as noted above, wastes a great deal of time. The cache prefetching according to the invention avoids cache misses for data accesses that are certain to occur and whose addresses can easily be determined.

For this to work, the cache must be instructed to load the required data before the access takes place. This action is initiated by the scheduler, which uses a direct connection 28 to the cache. After determining the locations in memory where the packet header and the additional information have been stored, the scheduler issues the addresses to be fetched, and the cache controller fetches the corresponding data from memory into the cache. Once this write is complete, the scheduler either interrupts the processor and assigns the new packet for processing, when the new task has a higher priority than the previous one, or waits for the previous task to finish before delivering the new packet.

The method according to the invention is represented by the flow chart shown in Figure 2. First, the infrastructure waits to receive new data, in this example a data packet (step 30). The packet's header is used for classification, and the header together with the additional information resulting from the lookup and classification unit's processing is stored in external memory (step 32). The lookup and classification unit determines whether the packet needs to be processed by software and, if so, its priority (step 34). If the packet needs no processing, the flow loops back to waiting for new data (step 30).

When a data packet does need processing, the scheduler computes the memory addresses corresponding to the data the processor will access. In this example, these are the address of the packet header and the addresses of additional information such as the classifier result and the input port (step 36). These addresses are then passed to the processor's data cache controller (step 38), which writes the corresponding data into the data cache (step 40). This is done by interleaving these accesses with the memory accesses generated by the current packet processing.

At this stage, the flow depends on whether the packet that just arrived has a higher priority than the previous one (step 42). If so, the scheduler interrupts the previous task currently being executed by the processor (step 44) and assigns the new packet for processing; the processor starts processing and finds the relevant data already in the cache (step 46). If the new packet does not have a higher priority than the previous one, the processor must complete the previous processing (step 46) before handling the new packet (step 48).
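The dispatch decision of steps 42 to 48 can be sketched as follows. The dictionary-based scheduler state is an illustrative stand-in, not the patent's mechanism:

```python
# Sketch of the priority decision: after prefetching, either preempt the
# current task for a higher-priority packet or queue the new one behind it.

def dispatch(state, new_task):
    current = state.get("current")
    if current is None or new_task["priority"] > current["priority"]:
        state["interrupted"] = current       # step 44: preempt the previous task
        state["current"] = new_task          # step 46: start the new task at once
    else:
        state["pending"].append(new_task)    # step 48: wait for the previous task

state = {"current": {"name": "low", "priority": 1}, "pending": []}
dispatch(state, {"name": "high", "priority": 5})   # preempts "low"
dispatch(state, {"name": "low2", "priority": 0})   # queued behind "high"
print(state["current"]["name"], len(state["pending"]))  # high 1
```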

Note that when the packet has a higher priority, the scheduler must wait for the data cache fetch to complete before interrupting the processor. To this end, the scheduler can observe the activity on the bus and wait until all the issued accesses have completed. Alternatively, the scheduler can wait a fixed amount of time, or direct feedback from the cache controller to the scheduler can be used.
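The three completion-detection options above, sketched as predicates the scheduler could poll. The argument types are stand-ins (assumptions), not a real PLB interface:

```python
# Three ways for the scheduler to decide the prefetch has finished,
# mirroring the alternatives described above.

def fetch_done_by_bus(outstanding_reads):
    return outstanding_reads == 0        # option 1: watch bus activity

def fetch_done_by_timer(cycles_elapsed, worst_case=36):
    return cycles_elapsed >= worst_case  # option 2: wait a fixed time

def fetch_done_by_feedback(done_flag):
    return done_flag                     # option 3: direct cache-controller feedback

print(fetch_done_by_bus(0), fetch_done_by_timer(40), fetch_done_by_feedback(True))
```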

It must also be noted that when the processing of a first packet is interrupted in order to process a second, higher-priority packet as described above, the two packets should map to disjoint parts of the cache. Otherwise, prefetched data could be evicted before it is accessed. This can be achieved by exploiting the processor's virtual-to-real address mapping, since caches are usually indexed by virtual address.
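One way to realize the disjointness, sketched below: choose virtual addresses whose index bits place each in-flight packet in its own half of the cache (a form of page coloring). The cache geometry and the coloring scheme are assumptions for illustration:

```python
# Keep two in-flight packets in disjoint cache halves by picking virtual
# addresses whose high index bit differs per packet "color".

LINE = 32        # cache line size in bytes (assumed)
SETS = 128       # direct-mapped, 4 KiB cache (assumed geometry)

def cache_set(vaddr):
    return (vaddr // LINE) % SETS

def colored_vaddr(real_addr, color):
    # Map the buffer into the half of the index space chosen by color (0 or 1);
    # the low index bits preserve the intra-buffer layout.
    index = cache_set(real_addr) % (SETS // 2) + color * (SETS // 2)
    return index * LINE + real_addr % LINE

a = colored_vaddr(0x4000, 0)
b = colored_vaddr(0x4000, 1)
print(cache_set(a), cache_set(b))  # same low index bits, disjoint halves
```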

Although the method of the invention has been described in a network processor environment, it will be clear to those skilled in the art that the method can be used in any system where certain data accesses by the processor are sure to occur and their addresses can easily be determined. In all such cases, an external event is associated with some data to be processed. For example, in a robot that uses a camera for navigation, new images arrive at regular intervals: the arrival of an image is the event, and the image data itself is the associated data to be prefetched.

It must be noted that for a standard microprocessor, the address bus can serve as the external source, since it is already observed for cache coherency. In that case, only one additional external wire is needed to signal a prefetch request.

Claims (8)

1. A method of prefetching data/instructions associated with externally triggered events in a system, the system comprising: an infrastructure (18) having an input interface (20) for receiving the data/instructions to be handled by the infrastructure and an output interface (22) for transmitting data after they have been processed; a memory (14) for storing the data/instructions when they are received by the input interface; a processor (10) for processing at least some of the data/instructions, the processor having a cache memory in which data/instructions are stored before being processed; and an external source (26) for assigning sequential tasks to the processor;
the method being characterized in that it comprises the following steps, performed while the processor is executing a previous task:
determining the locations in the memory of the data/instructions to be processed by the processor;
indicating the addresses of these memory locations to the cache memory;
fetching the contents of the memory locations and writing them into the cache memory; and
assigning the task of processing the data/instructions to the processor.
2. The method of claim 1, wherein the processor (10) is a network processor and the data to be processed are in the header of a data packet received by the infrastructure (18).
3. The method of claim 2, wherein the external source is a scheduler (26) directly connected to the cache memory of the processor (10), the scheduler determining the locations in the memory (14) of the data/instructions to be processed and indicating the addresses directly to the cache memory.
4. The method of claim 3, wherein the scheduler (26) determines the locations of the data/instructions in the memory (14) by computing the addresses.
5. The method of any one of claims 2 to 4, wherein the step of assigning the task of processing the prefetched data/instructions comprises interrupting the processing of a previous packet in order to start processing a new packet having a higher priority than the previous packet.
6. The method of any one of claims 3 to 5, wherein the cache memory is associated with a cache controller responsible for fetching the contents of the memory locations whose addresses were determined by the scheduler (26) and for writing them into the cache memory.
7. The method of claim 5, wherein the processor (10) and the scheduler (26) share a processor local bus (PLB), and the scheduler interrupts the processor after determining that the data cache fetch has completed, completion being observed precisely by monitoring the bus for data returning from the memory (14).
8. A system comprising means adapted to carry out the steps of the method according to any one of claims 1 to 7.
CNB038012367A 2002-03-05 2003-02-27 Method of prefetching data/instructions related to externally triggered events Expired - Fee Related CN100345103C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP02368022 2002-03-05
EP02368022.6 2002-03-05

Publications (2)

Publication Number Publication Date
CN1698031A true CN1698031A (en) 2005-11-16
CN100345103C CN100345103C (en) 2007-10-24

Family

ID=27771964

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB038012367A Expired - Fee Related CN100345103C (en) 2002-03-05 2003-02-27 Method of prefetching data/instructions related to externally triggered events

Country Status (8)

Country Link
JP (1) JP2005519389A (en)
KR (1) KR20040101231A (en)
CN (1) CN100345103C (en)
AU (1) AU2003221510A1 (en)
BR (1) BR0308268A (en)
CA (1) CA2478007A1 (en)
MX (1) MXPA04008502A (en)
WO (1) WO2003075154A2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4837247B2 (en) 2003-09-24 2011-12-14 パナソニック株式会社 Processor
US8224937B2 (en) * 2004-03-04 2012-07-17 International Business Machines Corporation Event ownership assigner with failover for multiple event server system
CN101073051A (en) 2004-12-10 2007-11-14 皇家飞利浦电子股份有限公司 Data processing system and method for cache replacement
US7721071B2 (en) * 2006-02-28 2010-05-18 Mips Technologies, Inc. System and method for propagating operand availability prediction bits with instructions through a pipeline in an out-of-order processor

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5619663A (en) * 1994-09-16 1997-04-08 Philips Electronics North America Corp. Computer instruction prefetch system
US5854911A (en) * 1996-07-01 1998-12-29 Sun Microsystems, Inc. Data buffer prefetch apparatus and method
US5761506A (en) * 1996-09-20 1998-06-02 Bay Networks, Inc. Method and apparatus for handling cache misses in a computer system
US6092149A (en) * 1997-05-28 2000-07-18 Western Digital Corporation Disk drive cache system using a dynamic priority sequential stream of data segments continuously adapted according to prefetched sequential random, and repeating types of accesses
US6625654B1 (en) * 1999-12-28 2003-09-23 Intel Corporation Thread signaling in multi-threaded network processor

Also Published As

Publication number Publication date
MXPA04008502A (en) 2004-12-06
AU2003221510A8 (en) 2003-09-16
JP2005519389A (en) 2005-06-30
WO2003075154A3 (en) 2004-09-02
KR20040101231A (en) 2004-12-02
CA2478007A1 (en) 2003-09-12
BR0308268A (en) 2005-01-04
CN100345103C (en) 2007-10-24
WO2003075154A2 (en) 2003-09-12
AU2003221510A1 (en) 2003-09-16

Similar Documents

Publication Publication Date Title
US12321282B2 (en) Slot/sub-slot prefetch architecture for multiple memory requestors
US9535842B2 (en) System and method for performing message driven prefetching at the network interface
US9727469B2 (en) Performance-driven cache line memory access
US8196147B1 (en) Multiple-processor core optimization for producer-consumer communication
US7774522B2 (en) Cache stashing processor control messages
US20130103909A1 (en) System and method to provide non-coherent access to a coherent memory system
US20070220361A1 (en) Method and apparatus for guaranteeing memory bandwidth for trace data
JP2007207248A (en) Method for command list ordering after multiple cache misses
CN114218132B (en) Information prefetching method, processor and electronic equipment
US12411780B2 (en) Variable buffer size descriptor fetching for a multi-queue direct memory access system
CN103345429B (en) High concurrent memory access accelerated method, accelerator and CPU based on RAM on piece
US8880847B2 (en) Multistream prefetch buffer
US12259833B2 (en) Descriptor fetching for a multi-queue direct memory access system
CN112416437A (en) Information processing method, information processing apparatus, and electronic device
CN100385390C (en) Preload controller and method for controlling preload access to data
CN119201004A (en) Data reading and writing processing method, device, equipment and medium
CN115309453A (en) Cache access system supporting out-of-order processor data prefetching
CN112639749A (en) Method, apparatus and system for reducing pipeline stalls due to address translation misses
CN100345103C (en) Method of prefetching data/instructions related to externally triggered events
CN112416436A (en) Information processing method, information processing apparatus, and electronic device
CN114281715A (en) Cache synthesis prefetching method and device, processor and electronic equipment
US20060179173A1 (en) Method and system for cache utilization by prefetching for multiple DMA reads
CN100407171C (en) Microprocessor and method for setting cache line fill bus access priority
CN119917428A (en) Cache architecture and method for AI parameter access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071024