WO2017211240A1

WO2017211240A1 - Processor chip and method for prefetching instruction cache

Info

Publication number: WO2017211240A1
Application number: PCT/CN2017/087091
Authority: WO
Inventors: 沈亦翀; 方磊; 罗会斌
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2016-06-07
Filing date: 2017-06-02
Publication date: 2017-12-14
Anticipated expiration: 2018-12-07
Also published as: CN107479860A; CN107479860B

Abstract

A processor chip (200) comprises a central processing unit (CPU) core (202) and a cache memory (204). The cache memory (204) comprises an L1 instruction cache (L1 I-cache) (2042) and a cache controller (2044). The L1 I-cache (2042) comprises at least one cache line. Each cache line comprises a tag domain, data, a flag bit, and an extension bit for storing an offset information of an access address. The CPU core (202) is configured to obtain an access address of a first instruction and access the L1 I-cache (2042) according to the access address of the first instruction. The cache controller (2044) is configured to read offset information of an access address in an extension bit of a first cache line if the first cache line corresponding to the access address of the first instruction in the L1 I-cache is hit and perform a calculation according to the offset information of the access address and the access address of the first instruction to obtain an access address of a second instruction. The CPU core (202) is further configured to prefetch the second instruction according to the access address of the second instruction.

Description

Processor chip and prefetching method of instruction cache

Technical field

本发明涉及计算机技术领域，特别涉及一种处理器芯片以及指令缓存的预取方法。The present invention relates to the field of computer technologies, and in particular, to a processor chip and a prefetching method for an instruction cache.

Background technique

现代计算机的体系结构中，缓存(Cache)是计算机存储层次结构中位于较高层级的存储单元，主要作用在于作为低层级存储单元与中央处理器(Central Processing Unit，CPU)之间的桥梁，减少CPU直接从低层级存储单元(如主存或磁盘)中访问数据的时延。缓存有几个相互独立的缓存模块组成，包括指令缓存(Instruction Cache，I-Cache)，数据缓存(Data Cache，D-Cache)以及传输后备缓存(Translation Lookaside Buffer，TLB)。目前所有主流CPU大都具有一级缓存(L1 Cache)和二级缓存(L2 Cache)，少数高端处理器还集成了三级缓存。其中，一级缓存可分为一级指令缓存(L1 I-Cache)和一级数据缓存(L1 D-Cache)。In the architecture of modern computers, the cache is a storage unit at a higher level in the computer storage hierarchy, and its main function is to serve as a bridge between the low-level storage unit and the central processing unit (CPU). The latency of the data directly accessed by the CPU from a low-level storage unit such as main memory or disk. The cache consists of several independent cache modules, including Instruction Cache (I-Cache), Data Cache (D-Cache), and Translation Lookaside Buffer (TLB). At present, most mainstream CPUs have L1 Cache and L2 Cache, and a few high-end processors also integrate L3 cache. Among them, the level 1 cache can be divided into a level one instruction cache (L1 I-Cache) and a level one data cache (L1 D-Cache).

现代计算机CPU通过直接访问其L1 I-Cache来获取即将执行的指令，CPU访问L1 I-Cache通过访问地址来访问，当访问地址对应的指令存在于L1 I-Cache中时，称之为命中(cache hit)；而当访问地址对应的指令无法在L1 I-Cache中找到时，称之为缺失(cache mi ss)。当产生缺失时，CPU就会到下一层级的存储单元去寻找。但是访问更低层级的存储单元会直接导致访问时延显著增加，引起CPU指令执行的阻塞(stall)，从而影响计算机的性能。于是缓存预取(cache prefetch)技术被提出，其主要的思想是在CPU取指令之前将可能会被访问到的指令预先取到L1 I-Cache中，从而避免由于缺失造成的CPU阻塞。The modern computer CPU obtains the instruction to be executed by directly accessing its L1 I-Cache. The CPU accesses the L1 I-Cache by accessing the address. When the instruction corresponding to the access address exists in the L1 I-Cache, it is called a hit ( Cache hit); and when the instruction corresponding to the access address cannot be found in the L1 I-Cache, it is called a cache mi ss. When a defect occurs, the CPU goes to the next level of storage to find it. However, accessing lower-level storage units directly leads to a significant increase in access latency, causing stalls in the execution of CPU instructions, which affects the performance of the computer. Therefore, the cache prefetch technique is proposed. The main idea is to prefetch the instructions that may be accessed into the L1 I-Cache before the CPU fetches instructions, so as to avoid CPU blocking caused by the missing.

为了解决CPU指令在L1 I-Cache缺失时造成的取指令时延，一种典型的解决方案是在L1 I-Cache与L2 I-Cache之间插入一个被称为流缓冲区(stream buffer)的功能单元，该功能单元实际上是一个先入先出(First Input First Output，FIFO)队列。当连续地对L1 I-Cache中的指令进行访问，在L1 I-Cache中缺失但是在stream buffer中能够命中时，那么便直接从stream buffer中取用，大大地减小了由于去下级存储单元(譬如：L2 I-Cache)中取指令带来的时延。In order to solve the instruction fetch caused by the CPU instruction in the L1 I-Cache missing, a typical solution is to insert a stream buffer called L1 I-Cache and L2 I-Cache. A functional unit, which is actually a First Input First Output (FIFO) queue. When accessing the instructions in the L1 I-Cache continuously, missing in the L1 I-Cache but being able to hit in the stream buffer, then it is directly taken from the stream buffer, greatly reducing the loss of the lower-level storage unit. (For example: L2 I-Cache) The delay caused by fetching instructions.

上述解决方案中，由于stream buffer是一个FIFO队列，只在该stream buffer的头部(该FIFO队列中先出的一列)有一个比较器，当L1 I-Cache缺失时，如果在stream buffer的头数据(head entry)中通过比较器比对后同样找不到对应的指令，即使所需的指令就在原来的stream buffer之中(不位于头部)，整个stream buffer也都会被重置，然后重新从stream buffer中预取。这就导致了stream buffer的利用率不高，预取到的指令有很大可能不会被访问，因此指令缓存的预取准确率较低。In the above solution, since the stream buffer is a FIFO queue, there is only one comparator in the head of the stream buffer (the first column in the FIFO queue). When the L1 I-Cache is missing, if it is in the head of the stream buffer. In the data (head entry), the corresponding instruction can not be found by the comparator comparison. Even if the required instruction is in the original stream buffer (not at the head), the entire stream buffer will be reset, and then Re-fetch from the stream buffer. This results in a low utilization of the stream buffer, and the prefetched instructions are likely to be accessed, so the prefetch accuracy of the instruction cache is low.

发明内容Summary of the invention

本发明实施例提供了一种处理器芯片以及指令缓存的预取方法，能够提高指令缓存的预取准确率。 The embodiment of the invention provides a processor chip and a prefetching method of the instruction cache, which can improve the prefetching accuracy of the instruction cache.

本发明实施例第一方面提供了一种处理器芯片，该处理器芯片包括一个处理器核CPU core以及一个高速缓冲存储器Cache，该Cache包括一级指令缓存L1 I-Cache以及Cache控制器，为了提升CPU core的处理性能，该Cache还可以包括二级指令缓存L2 I-Cache。该Cache可以用高速的静态存储器芯片实现，或者集成到CPU芯片内部，存储CPU经常访问的指令或者操作数据。L1 I-Cache存储了CPU经常访问的指令，该L1 I-Cache包括至少一个缓存单元cache line，每个cache line拥有统一的数据结构，该cache line的数据结构可以包括标签tag域、数据以及标志位。A first aspect of the embodiments of the present invention provides a processor chip, which includes a processor core CPU core and a cache Cache, and the Cache includes a first-level instruction cache L1 I-Cache and a Cache controller. To improve the processing performance of the CPU core, the Cache may also include a secondary instruction cache L2 I-Cache. The Cache can be implemented with a high-speed static memory chip or integrated into the CPU chip to store instructions or operational data that the CPU frequently accesses. The L1 I-Cache stores instructions frequently accessed by the CPU. The L1 I-Cache includes at least one cache unit cache line. Each cache line has a unified data structure, and the data structure of the cache line may include a tag tag field, data, and a flag. Bit.

在此基础上，本发明实施例对cache line的数据结构进行扩展，因此，与传统的cache line的不同之处在于，新的cache line即对于以上介绍的每个cache line的数据结构中还包括用于保存访问地址的偏移信息的扩展位，该访问地址的偏移信息为CPU core向L1 I-Cache发出的相邻两次访问指令所包含的访问地址的地址变化量。如此，形成一种新的cache line的数据结构。On the basis of this, the embodiment of the present invention expands the data structure of the cache line. Therefore, the difference from the traditional cache line is that the new cache line includes the data structure of each cache line introduced above. An extension bit for storing offset information of the access address, the offset information of the access address being an address change amount of an access address included in two adjacent access instructions issued by the CPU core to the L1 I-Cache. In this way, a new cache line data structure is formed.

可选的，Cache控制器在CPU Core获取第一指令的访问地址之前，计算第一指令的访问地址和第二指令的访问地址之间的偏移信息，并将偏移信息写入第一指令的访问地址对应cache line的扩展位中。如此，完成对L1 I-Cache的预先配置，提高指令预取的连续性。Optionally, before the CPU Core obtains the access address of the first instruction, the Cache controller calculates offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information into the first instruction. The access address corresponds to the extension bit of the cache line. In this way, the pre-configuration of the L1 I-Cache is completed, and the continuity of instruction prefetching is improved.

CPU core在调用数据时，需要获取第一指令的访问地址，并根据所述第一指令的访问地址访问所述L1 I-Cache。When calling the data, the CPU core needs to obtain an access address of the first instruction, and access the L1 I-Cache according to the access address of the first instruction.

CPU core通过第一指令的访问地址访问L1 I-Cache，如果第一指令的访问地址存在于L1 I-Cache中，那么第一指令的访问地址对应的第一cache line在L1 I-Cache中会被命中。此时，Cache控制器读取第一cache line的扩展位中访问地址的偏移信息，并根据该访问地址的偏移信息和第一指令的访问地址计算得到第二指令的访问地址。The CPU core accesses the L1 I-Cache through the access address of the first instruction. If the access address of the first instruction exists in the L1 I-Cache, the first cache line corresponding to the access address of the first instruction is in the L1 I-Cache. Was hit. At this time, the Cache controller reads the offset information of the access address in the extension bit of the first cache line, and calculates the access address of the second instruction according to the offset information of the access address and the access address of the first instruction.

可选的，如果第一指令的访问地址不存在于L1 I-Cache中，那么第一指令的访问地址对应的第一cache line在L1 I-Cache中不会被命中，此时，Cache控制器可以根据第一指令的访问地址查找第一指令的位置，并将该第一指令预取到L1 I-Cache中。从而将未命中的cache line所对应的第一指令预取到L1 I-Cache中，提高指令预取后的被访问率。Optionally, if the access address of the first instruction does not exist in the L1 I-Cache, the first cache line corresponding to the access address of the first instruction is not hit in the L1 I-Cache, and at this time, the Cache controller The location of the first instruction may be searched according to the access address of the first instruction, and the first instruction is prefetched into the L1 I-Cache. Therefore, the first instruction corresponding to the missed cache line is prefetched into the L1 I-Cache to improve the access rate after the instruction prefetch.

在计算得到第二指令的访问地址后，CPU core可以通过该第二指令的访问地址查找第二指令的位置，对第二指令执行预取。After calculating the access address of the second instruction, the CPU core can search for the location of the second instruction by using the access address of the second instruction, and perform prefetching on the second instruction.

由于每个cache line还包括用于保存访问地址的偏移信息的扩展位，该访问地址的偏移信息为处理器核CPU core向L1 I-Cache发出的相邻两次访问指令所包含的访问地址的地址变化量；因此，Cache控制器确定在L1 I-Cache中和第一指令的访问地址对应的第一cache line被命中时，可以读取第一cache line的扩展位中访问地址的偏移信息，并根据访问地址的偏移信息和第一指令的访问地址计算得到第二指令的访问地址；CPU core则根据Cache控制器计算得到的第二指令的访问地址，执行对第二指令的预取。通过上述访问地址的偏移信息预取到的指令的被访问率较高，因此，能够提高指令缓存的预取准确率。Since each cache line further includes an extension bit for storing offset information of the access address, the offset information of the access address is an access included by two adjacent access instructions issued by the processor core CPU core to the L1 I-Cache. The address change amount of the address; therefore, the Cache controller determines that the first cache line corresponding to the access address of the first instruction in the L1 I-Cache is hit, and can read the offset of the access address in the extension bit of the first cache line Transmitting information, and calculating an access address of the second instruction according to the offset information of the access address and the access address of the first instruction; and the CPU core executing the access instruction of the second instruction calculated by the Cache controller to execute the second instruction Prefetching. The access rate of the instruction prefetched by the offset information of the access address is high, and therefore, the prefetch accuracy of the instruction cache can be improved.

可选的，Cache还可以包括二级指令缓存L2 I-Cache，当L1 I-Cache中存储空间已满时，CPU Core对该L1 I-Cache中的缓存空间执行驱逐操作，将L1 I-Cache中的第二 cache line驱逐到L2 I-cache中。Cache控制器为该第二cache line设置被驱逐权重，该被驱逐权重用于标记第二cache line在L2 I-cache中被驱逐的优先级。通过设置被驱逐权重，提高被驱逐的cache line的利用率。Optionally, the Cache may further include a second-level instruction cache L2 I-Cache. When the storage space in the L1 I-Cache is full, the CPU Core performs an eviction operation on the cache space in the L1 I-Cache, and the L1 I-Cache is used. Second in The cache line is deported to the L2 I-cache. The Cache controller sets the eviction weight for the second cache line, and the eviction weight is used to mark the priority of the second cache line being evicted in the L2 I-cache. Improve the utilization of the eviction cache line by setting the eviction weight.

可选的，对于被驱逐的cache line可以实现预取经验的继承和共享，在确定CPU Core执行指令的访问地址对应上述被驱逐至L2 I-cache中的第二cache line时，将被驱逐至L2 I-Cache中的第二cache line预取到L1 I-Cache中。如此，将被驱逐至L2 I-Cache中的第二cache line所保存的信息直接取用，节省配置资源，提高cache line的利用率。Optionally, for the eviction cache line, the inheritance and sharing of the prefetching experience may be implemented. When the access address of the CPU Core execution instruction is determined to correspond to the second cache line deported to the L2 I-cache, the eviction will be deported to The second cache line in the L2 I-Cache is prefetched into the L1 I-Cache. In this way, the information saved by the second cache line that is deported to the L2 I-Cache is directly taken, saving configuration resources and improving the utilization of the cache line.

本发明实施例第二方面提供了一种指令缓存的预取方法，应用于处理器芯片，处理器芯片包括一个处理器核CPU core以及一个高速缓冲存储器Cache，Cache包括一级指令缓存L1 I-Cache以及Cache控制器，L1 I-Cache包括至少一个缓存单元cache line，每个cache line包括标签tag域、数据以及标志位，其中，每个cache line还包括用于保存访问地址的偏移信息的扩展位，访问地址的偏移信息为处理器核CPU core向L1I-Cache发出的相邻两次访问指令所包含的访问地址的地址变化量；该方法包括：A second aspect of the embodiments of the present invention provides a prefetching method for an instruction cache, which is applied to a processor chip. The processor chip includes a processor core CPU core and a cache Cache, and the Cache includes a first-level instruction cache L1 I- The Cache and the Cache controller, the L1 I-Cache includes at least one cache unit cache line, each cache line includes a tag tag field, data, and a flag bit, wherein each cache line further includes offset information for storing the access address. The extension bit, the offset information of the access address is the address change amount of the access address included in the adjacent two access instructions sent by the processor core CPU core to the L1I-Cache; the method includes:

CPU core在调用数据时，需要获取第一指令的访问地址，并根据第一指令的访问地址访问L1 I-Cache。When the CPU core calls the data, it needs to obtain the access address of the first instruction, and access the L1 I-Cache according to the access address of the first instruction.

可选的，如果第一指令的访问地址不存在于L1 I-Cache中，那么第一指令的访问地址对应的第一cache line在L1 I-Cache中不会被命中，此时，Cache控制器可以根据第一指令的访问地址查找第一指令的位置，并将该第一指令预取到L1 I-Cache中。从而将未命中的cache line所对应的第一指令预取到L1 I-Cache中，提高指令预取后的被访问率。 Optionally, if the access address of the first instruction does not exist in the L1 I-Cache, the first cache line corresponding to the access address of the first instruction is not hit in the L1 I-Cache, and at this time, the Cache controller The location of the first instruction may be searched according to the access address of the first instruction, and the first instruction is prefetched into the L1 I-Cache. Therefore, the first instruction corresponding to the missed cache line is prefetched into the L1 I-Cache to improve the access rate after the instruction prefetch.

可选的，Cache还可以包括二级指令缓存L2 I-Cache，当L1 I-Cache中存储空间已满时，CPU Core对该L1 I-Cache中的缓存空间执行驱逐操作，将L1 I-Cache中的第二cache line驱逐到L2 I-cache中。Cache控制器为该第二cache line设置被驱逐权重，该被驱逐权重用于标记第二cache line在L2 I-cache中被驱逐的优先级。通过设置被驱逐权重，提高被驱逐的cache line的利用率。Optionally, the Cache may further include a second-level instruction cache L2 I-Cache. When the storage space in the L1 I-Cache is full, the CPU Core performs an eviction operation on the cache space in the L1 I-Cache, and the L1 I-Cache is used. The second cache line is deported to the L2 I-cache. The Cache controller sets the eviction weight for the second cache line, and the eviction weight is used to mark the priority of the second cache line being evicted in the L2 I-cache. Improve the utilization of the eviction cache line by setting the eviction weight.

可选的，Cache还可以包括二级指令缓存L2 I-Cache，当L1 I-Cache中存储空间已满时，CPU Core对该L1 I-Cache中的缓存空间执行驱逐操作，将L1 I-Cache中的第二cache line驱逐到L2 I-cache中。Cache控制器为该第二cache line设置被驱逐权重，该被驱逐权重用于标记第二cache line在L2 I-cache中被驱逐的优先级。如此，将被驱逐至L2 I-Cache中的第二cache line所保存的信息直接取用，节省配置资源，提高cache line的利用率。Optionally, the Cache may further include a second-level instruction cache L2 I-Cache. When the storage space in the L1 I-Cache is full, the CPU Core performs an eviction operation on the cache space in the L1 I-Cache, and the L1 I-Cache is used. The second cache line is deported to the L2 I-cache. The Cache controller sets the eviction weight for the second cache line, and the eviction weight is used to mark the priority of the second cache line being evicted in the L2 I-cache. In this way, the information saved by the second cache line that is deported to the L2 I-Cache is directly taken, saving configuration resources and improving the utilization of the cache line.

本发明实施例第三方面提供了一种缓存单元cache line的数据结构，该cache line包括标签tag域、数据以及标志位，此外，该cache line还包括用于保存访问地址的偏移信息的扩展位，访问地址的偏移信息为处理器核CPU Core向一级指令缓存L1 I-Cache发出的相邻两次访问指令所包含的访问地址的地址变化量。A third aspect of the embodiments of the present invention provides a data structure of a cache unit cache line, where the cache line includes a tag tag field, data, and a flag bit. In addition, the cache line further includes an extension for storing offset information of the access address. The offset information of the bit and the access address is the address change amount of the access address included in the adjacent two access instructions issued by the processor core CPU Core to the level 1 instruction cache L1 I-Cache.

本发明实施例第四方面提供了一种存储介质，该存储介质中存储了程序代码，该程序代码被处理器芯片运行时，执行第二方面或第二方面的任意一种实现方式提供的指令缓存的预取方法。该存储介质包括但不限于快闪存储器(flash memory)，硬盘(hard disk drive，HDD)或固态硬盘(solid state drive，SSD)。A fourth aspect of the embodiments of the present invention provides a storage medium, where the program code is stored, and when the program code is executed by the processor chip, the instruction provided by any one of the second aspect or the second aspect is executed. The prefetch method of the cache. The storage medium includes, but is not limited to, a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

DRAWINGS

图1为本发明提供的处理器芯片实施例的一个组织结构示意图；1 is a schematic structural diagram of an embodiment of a processor chip provided by the present invention;

图2为本发明提供的指令缓存的预取方法实施例的一个流程示意图；2 is a schematic flowchart of an embodiment of a prefetching method for an instruction cache provided by the present invention;

图3为本发明提供的缓存单元cache line实施例的一个结构示意图；3 is a schematic structural diagram of an embodiment of a cache unit cache line provided by the present invention;

图4为本发明提供的访问地址的偏移信息的配置方法实施例的一个流程示意图。FIG. 4 is a schematic flowchart diagram of an embodiment of a method for configuring offset information of an access address according to the present invention.

detailed description

本发明的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换，以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外，术语“包括”和“具有”以及他们的任何变形，意图在于覆盖不排他的包含，例如，包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元，而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。The terms "first", "second" and the like in the specification and claims of the present invention and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a particular order or order. It is to be understood that the data so used may be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than what is illustrated or described herein. In addition, the terms "comprises" and "comprises" and "the" and "the" are intended to cover a non-exclusive inclusion, for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to Those steps or units may include other steps or units not explicitly listed or inherent to such processes, methods, products or devices.

下面将结合本发明实施例中的附图，对本发明中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions in the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

贯穿本说明书，有效的指令缓存预取技术应该具备两个基本素质：较高的预测准确率以及足够小的软硬件开销。较高的预测准确率可以提高I-Cache的命中率，提高CPU执行指令的性能；同时引入足够小的软硬件开销以减小预取(prefetch)对Cache性能造成的影响。目前典型的计算机缓存预取技术从实现手段来看，可以大致划分为两类：基于软件的预取(software-based prefetching)和基于硬件的预取(hardware-based prefetching)。典型的预取机制包括：OBL(One Block Lookahead，块超前)，自适应顺序预取，stream buffer，基于历史的预取技术等。Throughout this specification, an effective instruction cache prefetch technique should have two basic qualities: a higher prediction accuracy rate. And enough hardware and software overhead. Higher prediction accuracy can improve the hit rate of I-Cache and improve the performance of CPU execution instructions. At the same time, it introduces enough hardware and software overhead to reduce the impact of prefetch on Cache performance. At present, the typical computer cache prefetching technology can be roughly divided into two categories: software-based prefetching and hardware-based prefetching. Typical prefetch mechanisms include: OBL (One Block Lookahead), adaptive sequential prefetching, stream buffer, history-based prefetching techniques, and more.

stream buffer是一种典型的指令预取技术，通过在L1 I-Cache与L2 I-Cache直接插入一个stream buffer来暂存预取到的指令数据，避免缓存污染(cache pollution)的同时可以有效地保存预取数据，以便在需要时低延时地被L1 cache取用。Stream buffer is a typical instruction prefetching technology. By inserting a stream buffer directly into L1 I-Cache and L2 I-Cache to temporarily store prefetched instruction data, it can effectively avoid cache pollution. Prefetch data is saved so that it can be accessed by the L1 cache with low latency when needed.

基于历史的预取(history guided prefetch)是另一种典型的指令预取机制，其基于指令执行的历史信息(包括指令执行历史、分支预测历史或cache miss历史)，通过一定的分析预测算法，计算出可能会被执行的指令，并对其进行预取。History guided prefetch is another typical instruction prefetching mechanism based on historical information of instruction execution (including instruction execution history, branch prediction history or cache miss history), through certain analysis and prediction algorithms. Compile instructions that may be executed and prefetch them.

现有技术主要存在两个问题：一是对于访问规则较为复杂的指令序列，现有方法无法保证可以进行有效的分析学习和准确的预测，做出达到一定准确率的预取预测所需软硬件开销取决于输入的指令序列本身的特征；二是对于使用历史信息进行预测的预取机制来说，一般都存在一段不可忽略的较长初始化时间来进行分析和学习，而这段时间较为费时，同时可能产生准确率较低的预取操作，损害CPU指令执行性能。There are two main problems in the prior art: First, for instruction sequences with complex access rules, the existing methods cannot guarantee effective analytical learning and accurate prediction, and the hardware and software required for pre-fetch prediction to achieve a certain accuracy rate. The overhead depends on the characteristics of the input instruction sequence itself. Secondly, for the prefetch mechanism that uses historical information for prediction, there is generally a long initialization time that cannot be ignored for analysis and learning, which is time consuming. At the same time, a prefetch operation with a lower accuracy rate may be generated, which impairs the execution performance of the CPU instruction.

本发明所解决的问题以及与现有技术的区别之处在于：本发明扩展I-Cache的缓存单元(cache line)中的位宽，在cache line中添加下一个指令的地址相对于本指令地址的偏移信息，并在此cache line命中时预取相应的指令。其硬件机制不同于以上提到的各种预取机制，并且能够适应各种不同访问规则的指令序列，并减小预取的训练(training)开销，有效地提升预取的准确率。The problem solved by the present invention and the difference from the prior art is that the present invention extends the bit width in the cache line of the I-Cache, and adds the address of the next instruction to the address of the instruction in the cache line. Offset information and prefetch the corresponding instruction when this cache line hits. The hardware mechanism is different from the various prefetching mechanisms mentioned above, and can adapt to the instruction sequences of various access rules, and reduce the training overhead of prefetching, and effectively improve the accuracy of prefetching.

如图1为本发明实施例所提供的处理器芯片200的组织结构示意图，该处理器芯片200包括一个处理器核CPU core 202以及一个高速缓冲存储器Cache 204，该Cache包括一级指令缓存L1 I-Cache 2042以及Cache控制器2044，为了提升CPU core 202的处理性能，该Cache 204还可以包括二级指令缓存L2 I-Cache 2046。该处理器芯片200还可以包括总线208和通信接口206。1 is a schematic structural diagram of a processor chip 200 according to an embodiment of the present invention. The processor chip 200 includes a processor core CPU core 202 and a cache Cache 204. The Cache includes a first-level instruction cache L1 I. The Cache 204 and the Cache Controller 2044, in order to improve the processing performance of the CPU core 202, the Cache 204 may further include a Level 2 Instruction Cache L2 I-Cache 2046. The processor chip 200 can also include a bus 208 and a communication interface 206.

其中，CPU core 202、Cache 204和通信接口206可以通过总线208实现彼此之间的通信连接，也可以通过无线传输等其他手段实现通信。The CPU core 202, the Cache 204, and the communication interface 206 can implement communication connection with each other through the bus 208, and can also implement communication by other means such as wireless transmission.

本发明实施例中的Cache 204可以是随机存取存储器(random-access memory，RAM)；具体可以是静态随机存取存储器(static random-access memory，SRAM)。在实际应用中，缓存(Cache)基本使用的是RAM，而SRAM是一种具有静态存取功能的存储器，不需要刷新电路即能保存它内部存储的数据，因此SRAM具有较高的性能。该Cache204可以用高速的静态存储器芯片实现，或者集成到CPU芯片内部，存储CPU经常访问的指令或者操作数据。按照数据读取顺序和与CPU结合的紧密程度，CPU缓存可以分为一级缓存，二级缓存，部分高端CPU还具有三级缓存。一般来说，一级缓存可以分为一级数据缓存和一级指令缓存。二者分别用来存放数据以及对执行这些数据的指令进行即时解码，而且两者可以同时被CPU访问，减少了争用Cache所造成的冲突，提高了处理器效能。The Cache 204 in the embodiment of the present invention may be a random-access memory (RAM); specifically, it may be a static random-access memory (SRAM). In practical applications, the Cache basically uses RAM, and the SRAM is a memory with static access function, which can save the data stored therein without refreshing the circuit, so the SRAM has high performance. The Cache 204 can be implemented with a high speed static memory chip or integrated into the CPU chip to store instructions or operational data that the CPU frequently accesses. According to the data reading order and the tightness combined with the CPU, the CPU cache can be divided into a level 1 cache, a level 2 cache, and some high-end CPUs also have a level 3 cache. In general, the level 1 cache can be divided into a level 1 data cache and a level 1 instruction cache. The two are used to store data and decode the instructions for executing the data, and the two can be accessed by the CPU at the same time, which reduces the conflict caused by the contention cache and improves the processor performance.

本发明实施例中的L1 I-Cache 2042包括至少一个缓存单元cache line(图1中示出的若干个缓存单元1～缓存单元n)，现有的每个cache line拥有统一的数据结构，该cache line的数据结构可包括标签tag域、数据以及标志位。在此基础上，图3所示的cache line的数据结构对传统的cache line的数据结构进行扩展，即增加了用于保存访问地址的偏移信息的扩展位(Extra bits)。该访问地址的偏移信息为处理器核CPU core 202向L1 I-Cache 2042发出的相邻两次访问指令所包含的访问地址的地址变化量。The L1 I-Cache 2042 in the embodiment of the present invention includes at least one cache unit cache line (shown in FIG. 1). A plurality of cache units 1 to cache units n), each of the existing cache lines has a unified data structure, and the data structure of the cache line may include a tag tag field, data, and flag bits. On this basis, the data structure of the cache line shown in FIG. 3 extends the data structure of the conventional cache line, that is, the extra bits for storing the offset information of the access address are added. The offset information of the access address is the address change amount of the access address included in the adjacent two access instructions issued by the processor core CPU core 202 to the L1 I-Cache 2042.

本发明实施例中的Cache控制器2044在CPU Core 202获取第一指令的访问地址之前，计算第一指令的访问地址和第二指令的访问地址之间的偏移信息，并将偏移信息写入第一指令的访问地址对应cache line的扩展位中。例如，参阅图4所示的计算访问地址的偏移信息的流程示意图，CPU Core 202访问L1 I-Cache 2042中的cache line，通过Cache控制器2044查询地址寄存器是否为空：如果为空，表示此cache line是访问序列中的第一个cache line，Cache控制器2044将当前的访问地址cur_addr写入到地址寄存器中，并等待下一次cache line访问；如果非空，Cache控制器2044将地址寄存器的值(例如将前次访问的cur_addr)赋予为last_addr，同时将当前的访问地址作为新的cur_addr写入地址寄存器中。The Cache controller 2044 in the embodiment of the present invention calculates the offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information, before the CPU Core 202 acquires the access address of the first instruction. The access address of the first instruction corresponds to the extension bit of the cache line. For example, referring to the flow diagram of calculating the offset information of the access address shown in FIG. 4, the CPU Core 202 accesses the cache line in the L1 I-Cache 2042, and queries the cache controller 2044 whether the address register is empty: if it is empty, it indicates The cache line is the first cache line in the access sequence. The Cache Controller 2044 writes the current access address cur_addr to the address register and waits for the next cache line access; if not, the Cache Controller 2044 will address the address register. The value (for example, the cur_addr of the previous access) is given as last_addr, and the current access address is written to the address register as the new cur_addr.

可以理解的是，地址寄存器为空则表示L1 I-Cache 2042中的cache line未被访问，因此，在被首次访问后，将第一指令的访问地址写入地址寄存器中，等待获取下一个指令的访问地址后再次访问L1 I-Cache 2042中的cache line。It can be understood that the address register is empty, indicating that the cache line in the L1 I-Cache 2042 is not accessed. Therefore, after being accessed for the first time, the access address of the first instruction is written into the address register, waiting for the next instruction. After accessing the address, access the cache line in L1 I-Cache 2042 again.

例如，上面在首次访问后将第一指令的访问地址(cur_addr)写入地址寄存器后，该地址寄存器则为非空，再次访问L1 I-Cache 2042中的cache line时，读取地址寄存器中写入的第一指令的访问地址(由于当前访问L1 I-Cache 2042的并不一定就是该第一指令的访问地址，此时可以将保存在地址寄存器中的第一指令的访问地址理解为last_addr)。Cache控制器2044将该地址寄存器中写入的访问地址(last_addr)更新为再次访问的第二指令的访问地址(此时，由于当前的访问地址是第二指令的访问地址，因此，此时可以将该第二指令的访问地址理解为cur_addr)。此情况下，该第二指令的访问地址即为CPU Core 202继首次访问L1 I-Cache 2042之后第二次访问L1 I-Cache 2042时获取的另一指令的访问地址。For example, after the first access address (cur_addr) is written to the address register after the first access, the address register is non-empty, and when the cache line in the L1 I-Cache 2042 is accessed again, the read address register is written. The access address of the first instruction that is entered (since the current access to the L1 I-Cache 2042 is not necessarily the access address of the first instruction, the access address of the first instruction stored in the address register can be understood as last_addr) . The Cache controller 2044 updates the access address (last_addr) written in the address register to the access address of the second instruction accessed again (at this time, since the current access address is the access address of the second instruction, therefore, The access address of the second instruction is understood as cur_addr). In this case, the access address of the second instruction is the access address of another instruction acquired by the CPU Core 202 when accessing the L1 I-Cache 2042 for the second time after the first access to the L1 I-Cache 2042.

Cache控制器2044则根据上面赋予的last_addr和当前访问地址(即新的cur_addr)，计算地址变化量Δ＝cur_addr-last_addr，并将地址变化量Δ写入上次访问(该场景下即第一指令的访问地址对应)的cache line的扩展位(Extra bits)中。The Cache controller 2044 calculates the address change amount Δ=cur_addr-last_addr according to the last_addr and the current access address (ie, the new cur_addr) given above, and writes the address change amount Δ to the last access (the first instruction in the scenario) The access address corresponds to the extension bits of the cache line.

本发明实施例中的CPU core 202在调用数据时，会读取执行该数据的指令进行即时解码，其中，每个指令对应都有一个访问地址，通过该访问地址可以找到该指令所存储的位置。因此在调用数据时，CPU core 202需要获取第一指令的访问地址，并根据第一指令的访问地址访问L1 I-Cache 2042；通过执行该第一指令，就能够调用对应的数据。When calling the data, the CPU core 202 in the embodiment of the present invention reads the instruction for executing the data for instant decoding, wherein each instruction has an access address, and the location where the instruction is stored can be found by the access address. . Therefore, when the data is called, the CPU core 202 needs to acquire the access address of the first instruction, and accesses the L1 I-Cache 2042 according to the access address of the first instruction; by executing the first instruction, the corresponding data can be called.

CPU core 202通过第一指令的访问地址访问L1 I-Cache 2042，如果第一指令的访问地址存在于L1 I-Cache 2042中，那么第一指令的访问地址对应的第一cache line在L1 I-Cache 2042中会被命中。此时，Cache控制器2044读取第一cache line的扩展位中访问地址的偏移信息，并根据该访问地址的偏移信息和第一指令的访问地址计算得到第二指令的访问地址。The CPU core 202 accesses the L1 I-Cache 2042 through the access address of the first instruction. If the access address of the first instruction exists in the L1 I-Cache 2042, the first cache line corresponding to the access address of the first instruction is at L1 I- Cache 2042 will be hit. At this time, the Cache controller 2044 reads the offset information of the access address in the extension bit of the first cache line, and calculates the access address of the second instruction according to the offset information of the access address and the access address of the first instruction.

如果第一指令的访问地址不存在于L1 I-Cache 2042中，那么第一指令的访问地址对应的第一cache line在L1 I-Cache 2042中不会被命中，此时，Cache控制器2044可以根据第一指令的访问地址查找第一指令的位置，并将该第一指令预取到L1 I-Cache 2042中。If the access address of the first instruction does not exist in the L1 I-Cache 2042, the first cache line corresponding to the access address of the first instruction is not hit in the L1 I-Cache 2042. At this time, the Cache controller 2044 may Finding the location of the first instruction according to the access address of the first instruction, and prefetching the first instruction to the L1 I-Cache 2042.

在计算得到第二指令的访问地址后，CPU core 202可以通过该第二指令的访问地址查找第二指令的位置，对第二指令执行预取，例如通过计算得到的第二指令的访问地址查询到第二指令在L2 I-Cache中，则将第二指令从L2 I-Cache中预取到L1 I-Cache中。After calculating the access address of the second instruction, the CPU core 202 can search for the location of the second instruction by using the access address of the second instruction, and perform prefetching on the second instruction, for example, by using the calculated access address of the second instruction. When the second instruction is in the L2 I-Cache, the second instruction is prefetched from the L2 I-Cache into the L1 I-Cache.

本发明实施例提供的cache line由于还包括用于保存访问地址的偏移信息的扩展位，该访问地址的偏移信息为处理器核CPU core 202向L1 I-Cache 2042发出的相邻两次访问指令所包含的访问地址的地址变化量；因此，Cache控制器2044确定在L1 I-Cache 2042中和第一指令的访问地址对应的第一cache line被命中时，可以读取第一cache line的扩展位中访问地址的偏移信息，并根据访问地址的偏移信息和第一指令的访问地址计算得到第二指令的访问地址；CPU core 202则根据Cache控制器2044计算得到的第二指令的访问地址，执行对所述第二指令的预取。通过上述访问地址的偏移信息预取到的指令的被访问率较高，因此，能够提高指令缓存的预取准确率。The cache line provided by the embodiment of the present invention further includes an extension bit for storing offset information of the access address, and the offset information of the access address is two adjacent times sent by the processor core CPU core 202 to the L1 I-Cache 2042. Accessing the address change amount of the access address included in the instruction; therefore, the Cache controller 2044 determines that the first cache line can be read when the first cache line corresponding to the access address of the first instruction is hit in the L1 I-Cache 2042 The offset information of the access address in the extension bit, and the access address of the second instruction is calculated according to the offset information of the access address and the access address of the first instruction; the second command calculated by the CPU core 202 according to the Cache controller 2044 The access address performs prefetching of the second instruction. The access rate of the instruction prefetched by the offset information of the access address is high, and therefore, the prefetch accuracy of the instruction cache can be improved.

该Cache 204还可以包括二级指令缓存L2 I-Cache 2046，当L1 I-Cache 2042中存储空间已满时，CPU Core 202对该L1 I-Cache 2042中的缓存空间执行驱逐操作。例如，将L1 I-Cache 2042中的第二cache line(为便于说明，将被驱逐的该cache line命名为第二cache line，下同)驱逐到L2 I-Cache 2046中。Cache控制器2044为该第二cache line设置被驱逐权重，该被驱逐权重用于标记第二cache line在L2 I-Cache 2046中被驱逐的优先级。The Cache 204 may further include a secondary instruction cache L2 I-Cache 2046. When the storage space in the L1 I-Cache 2042 is full, the CPU Core 202 performs a eviction operation on the cache space in the L1 I-Cache 2042. For example, the second cache line in the L1 I-Cache 2042 (for convenience of explanation, the cache line to be evicted is named as the second cache line, the same below) is deported into the L2 I-Cache 2046. The Cache Controller 2044 sets the eviction weight for the second cache line, the eviction weight being used to mark the priority of the second cache line being evicted in the L2 I-Cache 2046.

对于上述驱逐阶段，当处理器核CPU Core 202对L1 I-Cache 2042执行驱逐操作时，Cache控制器2044对即将进行驱逐的cache line的扩展位(Extra bits)进行分析：当扩展位(Extra bits)为空(null)时，可不做处理；当扩展位(Extra bits)为非空(非null)时，设置此cache line在L2 I-Cache 2046中的被驱逐权重为最低，使其尽可能地保存在L2 I-Cache2046中。其中，被驱逐权重根据使用的驱逐策略不同而不同，常见的驱逐策略包括最少使用页置换算法(Least Recently Used，LRU)、最不经常使用页置换算法(Least Frequently Used，LFU)和自适应缓存替换(Adaptive Replacement Cache，ARC)等。For the above eviction phase, when the processor core CPU Core 202 performs the eviction operation on the L1 I-Cache 2042, the Cache controller 2044 analyzes the extra bits of the cache line to be eviction: when the extension bits (Extra bits) When it is empty (null), it can be left unprocessed; when the extra bits are non-null (non-null), set the cache line's expulsion weight in L2 I-Cache 2046 to be the lowest, so that it is as far as possible The location is saved in the L2 I-Cache 2046. Among them, the eviction weight varies according to the eviction strategy used. Common eviction strategies include Least Recently Used (LRU), Least Frequently Used (LFU) and Adaptive Cache. Replace (Adaptive Replacement Cache, ARC) and so on.

以LRU为例，如果Cache使用的是LRU驱逐算法，那么在本发明实施例中，L2 I-Cache中记录并保存了每个cache line的被驱逐权重值“age bits”，每当有一个cache line被访问时，那么其他所有L2 I-Cache中cache line的“age bits”都加1。那么可以认为被驱逐权重值“age bits”最大的cache line在L2 I-cache中被驱逐的优先级最高(即最先被驱逐)。当带有扩展位(Extra bits)的cache line从L1 I-Cache中被驱逐出来时，那么只需将该带有扩展位(Extra bits)的cache line在L2 I-Cache中的“age bits”值设置为0，以使该cache line在L2 I-cache中被驱逐的优先级最低(即最后被驱逐)。因此，可以通过设置cache line的“age bits”值的大小实现该cache line在L2 I-cache中被驱逐的优先级(即被驱逐的先后顺序)。For example, in the embodiment of the present invention, the L2 I-Cache records and stores the eviction weight value “age bits” of each cache line, and whenever there is a cache, the LRU is used as an example. When the line is accessed, then the "age bits" of the cache line in all other L2 I-Caches are incremented by one. Then, it can be considered that the cache line with the largest eviction weight value "age bits" has the highest priority (ie, the first eviction) in the L2 I-cache. When the cache line with the extra bits is evicted from the L1 I-Cache, then only the "age bits" of the cache line with the extra bits in the L2 I-Cache are needed. The value is set to 0 so that the cache line is deported in the L2 I-cache with the lowest priority (ie, the last eviction). Therefore, the priority of the cache line being evicted in the L2 I-cache (ie, the order of being evicted) can be achieved by setting the size of the "age bits" value of the cache line.

以LFU为例，如果Cache使用的是LFU驱逐算法，那么在本发明实施例中，保存在L2 I-Cache中的每个cache line都设置了一个计数器，某个cache line被访问时，那么该cache line对应的计数器加1。那么可以认为计数器中的最小值对应的cache line在L2 I-cache中被驱逐的优先级最高(即最先被驱逐)。当带有扩展位(Extra bits)的cache line从L1 I-Cache中被驱逐出来时，那么只需将该带有扩展位(Extra bits)的cache line在L2 I-Cache中对应的计数器值设置为当前L2 I-Cache中所有计数器中的最大值，以使该cache line在 L2 I-Cache中被驱逐的优先级最低(即最后被驱逐)。因此，可以通过设置cache line对应的的计数器值的大小实现该cache line在L2 I-Cache中被驱逐的优先级(即被驱逐的先后顺序)。In the LFU, for example, if the Cache uses the LFU eviction algorithm, in the embodiment of the present invention, each cache line stored in the L2 I-Cache is set with a counter, and when a cache line is accessed, then the The counter corresponding to the cache line is incremented by 1. Then it can be considered that the cache line corresponding to the minimum value in the counter has the highest priority (ie, the first eviction) in the L2 I-cache. When the cache line with the extra bits is evicted from the L1 I-Cache, then only the corresponding counter value of the cache line with the extra bits in the L2 I-Cache is set. Is the maximum value of all counters in the current L2 I-Cache so that the cache line is The expelled in L2 I-Cache has the lowest priority (that is, the last eviction). Therefore, the priority of the cache line being evicted in the L2 I-Cache (ie, the order of being evicted) can be achieved by setting the size of the counter value corresponding to the cache line.

以ARC为例，ARC驱逐算法是LRU与LFU的一种折中和平衡，Cache中包含两个表T1和T2，T1保存使用LRU的cache line，T2保存使用LFU的cache line，T1与T2共同组成了整个Cache。ARC算法根据cache line从T1或T2中被驱逐的历史，来自适应调整每次的驱逐选择。例如，位于L2 I-Cache的中部，即同时远离T1与T2的待驱逐区域的cache line“在L2 I-cache中被驱逐的优先级最低(即最后被驱逐)。当带有扩展位(Extra bits)的cache line从L1 I-Cache中被驱逐出来时，那么只需将该带有扩展位(Extra bits)的cache line插入于L2 I-Cache的中部，即同时远离T1与T2的待驱逐区域的位置，以使该cache line在L2 I-Cache中被驱逐的优先级最低(即最后被驱逐)。因此，可以通过设置cache line在L2 I-Cache中的区域位置实现该cache line在L2 I-Cache中被驱逐的优先级(即被驱逐的先后顺序)。Taking ARC as an example, the ARC eviction algorithm is a compromise and balance between LRU and LFU. The Cache contains two tables T1 and T2. T1 stores the cache line using LRU, and T2 saves the cache line using LFU. T1 and T2 are common. Formed the entire Cache. The ARC algorithm adaptively adjusts each eviction selection based on the history of the cache line being evicted from T1 or T2. For example, the cache line located in the middle of the L2 I-Cache, that is, away from T1 and T2, is the lowest priority (ie, the last eviction) in the L2 I-cache. When there is an extension bit (Extra) When the cache line of bits) is evicted from the L1 I-Cache, then only the cache line with the extra bits is inserted in the middle of the L2 I-Cache, that is, away from T1 and T2. The location of the region so that the cache line is deported in the L2 I-Cache with the lowest priority (ie, the last eviction). Therefore, the cache line can be implemented in L2 by setting the location of the cache line in the L2 I-Cache. The priority of being evicted in the I-Cache (ie, the order in which they were evicted).

对于被驱逐的cache line可以实现预取经验的继承和共享，在确定CPU core 202执行指令的访问地址对应上述被驱逐至L2 I-Cache 2046中的第二cache line时，将被驱逐至L2 I-Cache 2046中的第二cache line预取到L1 I-Cache 2042中。For the eviction cache line, the inheritance and sharing of the prefetch experience can be implemented. When it is determined that the access address of the CPU core 202 execution instruction corresponds to the second cache line deported to the L2 I-Cache 2046, it will be deported to L2 I. The second cache line in the -Cache 2046 is prefetched into the L1 I-Cache 2042.

例如，在处理器核CPU core 202执行指令的访问地址对应上述被驱逐至L2 I-Cache 2046中的第二cache line，可以从L2 I-Cache中对此cache line进行预取，从而直接获取CPU core202产生的，保存有扩展位(Extra bits)的cache line，实现预取“经验”的继承。For example, the access address of the processor core CPU core 202 executing the instruction corresponds to the second cache line deported to the L2 I-Cache 2046, and the cache line can be prefetched from the L2 I-Cache to directly acquire the CPU. The cache line generated by core202 stores the extra bits, which realizes the inheritance of prefetching "experience".

基于图1所提供的处理器芯片200，本发明实施例中该处理器芯片200还可以集成多个处理器核CPU core，每个CPU core可以分别配置一个L1 I-Cache，因此每个CPU core拥有能够独立访问的L1 I-Cache；所有CPU core可以共同配置一个L2 I-Cache 2046，因此每个CPU core共享同一L2 I-Cache 2046，实现L2 I-Cache 2046的资源共享。例如，在该处理器芯片存在多个处理器核CPU core的情况下，对于被驱逐的cache line可以实现多核预取经验的共享，例如，当其他处理器核(如CPU core 2)执行指令的访问地址同样对应上述被驱逐至L2 I-Cache 2046中的第二cache line时，也可以从L2 I-Cache 2046中对此cache line进行预取，实现多核预取“经验”的共享。Based on the processor chip 200 provided in FIG. 1 , in the embodiment of the present invention, the processor chip 200 can also integrate multiple processor core CPU cores, and each CPU core can be configured with one L1 I-Cache, so each CPU core The L1 I-Cache 2046 can be configured independently. All CPU cores can be configured with one L2 I-Cache 2046. Therefore, each CPU core shares the same L2 I-Cache 2046 to implement resource sharing of the L2 I-Cache 2046. For example, in the case where the processor chip has multiple processor core CPU cores, sharing of multi-core prefetching experience can be implemented for the cached cache line, for example, when other processor cores (such as CPU core 2) execute instructions. When the access address is also corresponding to the second cache line deported to the L2 I-Cache 2046, the cache line may be prefetched from the L2 I-Cache 2046 to implement multi-core prefetch "experience" sharing.

本发明实施例还提供了一种指令缓存的预取方法，图1中的处理器芯片200运行时执行该方法，其流程示意图如图2所示。The embodiment of the present invention further provides a prefetching method of the instruction cache. The processor chip 200 in FIG. 1 executes the method during operation, and the flow chart is shown in FIG. 2 .

402、CPU core获取第一指令的访问地址，并根据第一指令的访问地址访问L1 I-Cache。402. The CPU core acquires an access address of the first instruction, and accesses the L1 I-Cache according to the access address of the first instruction.

本发明实施例中的CPU core在调用数据时，会读取执行该数据的指令进行即时解码，其中，每个指令对应都有一个访问地址，通过该访问地址可以找到该指令所存储的位置。因此在调用数据时，CPU core需要获取第一指令的访问地址，并根据第一指令的访问地址访问L1 I-Cache；通过执行该第一指令，就能够调用对应的数据。When the CPU core in the embodiment of the present invention calls the data, it reads the instruction to execute the data for instant decoding, wherein each instruction has an access address, and the location where the instruction is stored can be found through the access address. Therefore, when calling data, the CPU core needs to acquire the access address of the first instruction, and access the L1 I-Cache according to the access address of the first instruction; by executing the first instruction, the corresponding data can be called.

为了提高指令预取的连续性，可以对初始化的Cache预先进行配置。可选的，在CPU core获取第一指令的访问地址之前，该方法还包括：In order to improve the continuity of instruction prefetching, the initialized Cache can be configured in advance. Optionally, before the CPU core obtains the access address of the first instruction, the method further includes:

Cache控制器计算第一指令的访问地址和所述第二指令的访问地址之间的偏移信息，并将偏移信息写入第一指令的访问地址对应cache line的扩展位中。The Cache controller calculates offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information into the extension address of the cache line corresponding to the access address of the first instruction.

例如，参阅图4所示的计算访问地址的偏移信息的流程示意图，CPU Core访问L1 I-Cache 中的cache line，通过Cache控制器查询地址寄存器是否为空：如果为空，表示此cache line是访问序列中的第一个cache line，Cache控制器将当前的访问地址cur_addr写入到地址寄存器中，并等待下一次cache line访问；如果非空，Cache控制器将地址寄存器的值(例如将前次访问的cur_addr)赋予为last_addr，同时将当前的访问地址作为新的cur_addr写入地址寄存器中。相关说明参考装置部分，此处不再赘述。For example, referring to the flow diagram of calculating the offset information of the access address shown in FIG. 4, the CPU Core accesses the L1 I-Cache. In the cache line, the Cache controller queries whether the address register is empty: if it is empty, it indicates that the cache line is the first cache line in the access sequence, and the Cache controller writes the current access address cur_addr into the address register. And wait for the next cache line access; if not empty, the Cache controller assigns the value of the address register (for example, the cur_addr of the previous access) to last_addr, and writes the current access address as a new cur_addr to the address register. The related description refers to the device part and will not be described here.

404、Cache控制器确定在L1 I-Cache中和第一指令的访问地址对应的第一cache line被命中时，读取第一cache line的扩展位中访问地址的偏移信息，并根据访问地址的偏移信息和第一指令的访问地址计算得到第二指令的访问地址。404. The Cache controller determines, when the first cache line corresponding to the access address of the first instruction is hit in the L1 I-Cache, reads the offset information of the access address in the extension bit of the first cache line, and according to the access address. The offset information and the access address of the first instruction are calculated to obtain the access address of the second instruction.

如果第一指令的访问地址不存在于L1 I-Cache中，那么第一指令的访问地址对应的第一cache line在L1 I-Cache中不会被命中，此时，Cache控制器可以根据第一指令的访问地址查找第一指令的位置，并将该第一指令预取到L1 I-Cache中。If the access address of the first instruction does not exist in the L1 I-Cache, the first cache line corresponding to the access address of the first instruction is not hit in the L1 I-Cache. At this time, the Cache controller may be based on the first The access address of the instruction finds the location of the first instruction and prefetches the first instruction into the L1 I-Cache.

406、CPU core根据Cache控制器计算得到的第二指令的访问地址，执行对第二指令的预取。406. The CPU core performs prefetching of the second instruction according to the access address of the second instruction calculated by the Cache controller.

在计算得到第二指令的访问地址后，CPU core可以通过该第二指令的访问地址查找第二指令的位置，对第二指令执行预取，例如通过计算得到的第二指令的访问地址查询到第二指令在L2 I-Cache中，则将第二指令从L2 I-Cache中预取到L1 I-Cache中。After calculating the access address of the second instruction, the CPU core may search for the location of the second instruction by using the access address of the second instruction, and perform prefetching on the second instruction, for example, by using the calculated access address of the second instruction. The second instruction is in the L2 I-Cache, and the second instruction is prefetched from the L2 I-Cache into the L1 I-Cache.

例如，Cache控制器计算第一指令的访问地址和第二指令的访问地址之间的偏移信息，并将偏移信息写入第一指令的访问地址对应cache line的扩展位中后，缓存单元cache line中已写入扩展位(Extra bits)中的访问地址的地址变化量Δn如下表1：For example, the Cache controller calculates offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information into the extension address of the cache line corresponding to the access address of the first instruction, the cache unit The address change amount Δn of the access address in the extension line (Extra bits) in the cache line is as follows:

表1Table 1

Extra bitsExtra bits addressAddress hit/missHit/miss prefetch addrPrefetch addr Δ0Δ0 AA yesYes A+Δ0A+Δ0 Δ1Δ1 B＝A+Δ0B=A+Δ0 yesYes B+Δ1B+Δ1 Δ2Δ2 C＝B+Δ1C=B+Δ1 noNo noneNone Δ3Δ3 DD yesYes D+Δ3D+Δ3 Δ4Δ4 E＝D+Δ3E=D+Δ3 yesYes E+Δ4E+Δ4

表1中列举了5条cache line中已写入扩展位(Extra bits)中的访问地址的地址变化量Δn，分别为：Δ0、Δ1、Δ2、Δ3和Δ4。Table 1 lists the address change amounts Δn of the access addresses in the extended bits (extra bits) of the five cache lines, which are: Δ0, Δ1, Δ2, Δ3, and Δ4, respectively.

以表1中写入的访问地址的地址变化量Δn为例，CPU core需要访问地址为A的cache line，在L1 I-Cache中命中此cache line，读取其Extra bits域中所保存的访问地址的地址变化量Δ0，计算出下一条cache line的地址为A+Δ0＝B，将访问地址B对应的指令预取到L1 I-Cache中。Taking the address change amount Δn of the access address written in Table 1 as an example, the CPU core needs to access the cache line with the address A, hit the cache line in the L1 I-Cache, and read the access saved in the Extra bits field. The address change amount Δ0 of the address is calculated as the address of the next cache line as A+Δ0=B, and the instruction corresponding to the access address B is prefetched into the L1 I-Cache.

CPU core需要访问地址为B的cache line，在L1 I-Cache中命中此cache line，读取其Extra bits域中所保存的地址变化量Δ1，计算出下一条cache line的地址为B+Δ1＝C，将访问地址C对应的指令预取到L1 I-Cache中。 The CPU core needs to access the cache line with the address B, hit the cache line in the L1 I-Cache, read the address change amount Δ1 stored in the Extra bits field, and calculate the address of the next cache line as B+Δ1= C, prefetch the instruction corresponding to the access address C into the L1 I-Cache.

CPU core需要访问地址为X的cache line，在L1 I-Cache中缺失此cache line，不进行预取。The CPU core needs to access the cache line with address X. This cache line is missing in the L1 I-Cache and is not prefetched.

CPU core需要访问地址为D的cache line，在L1 I-Cache中命中此cache line，读取其Extra bits域中所保存的地址变化量Δ3，计算出下一条cache line的地址为D+Δ3＝E，将访问地址E对应的指令预取到L1 I-Cache中。The CPU core needs to access the cache line with the address D, hit the cache line in the L1 I-Cache, read the address change amount Δ3 stored in the Extra bits field, and calculate the address of the next cache line as D+Δ3= E. Prefetch the instruction corresponding to the access address E into the L1 I-Cache.

CPU core需要访问地址为E的cache line，在L1 I-Cache中命中此cache line，读取其Extra bits域中所保存的地址变化量Δ4，计算出下一条cache line的地址为E+Δ4，将访问地址E+Δ4对应的指令预取到L1 I-Cache中。The CPU core needs to access the cache line with the address E, hit the cache line in the L1 I-Cache, read the address change amount Δ4 stored in the Extra bits field, and calculate the address of the next cache line as E+Δ4. The instruction corresponding to the access address E+Δ4 is prefetched into the L1 I-Cache.

在预取阶段，可以设置合适的预取深度(prefetch_depth)。例如，访问当前cache line并通过该cache line中的访问地址的偏移信息计算得到的下一次访问数据可能执行的指令的访问地址。如果此指令的访问地址对应的cache line在L1 I-Cache被命中，通过该指令的访问地址将该指令预取到L1 I-Cache中。根据预取深度，对预取到的指令继续执行预取，通过读取预取到L1 I-Cache中的指令所在的cache line的访问地址的偏移信息，继续预取下一次访问数据可能执行的指令的访问地址(即执行循环预取)。其中，预取深度的确定可以是根据工作负载(workload)的类型不同而可能有所不同，不同的工作负载类型可能适用不同的预取深度，故可能需要现针对不同的典型应用进行测试，设置一些默认值。注意预取深度需要恰当地进行设置，太小影响预取效果，降低局部性；太大则可能会造成缓存污染(cache pollution)或预取到的数据被覆盖。In the prefetch phase, a suitable prefetch depth can be set (prefetch_depth). For example, accessing the current cache line and calculating the access address of the instruction that may be executed by the next access data is calculated by the offset information of the access address in the cache line. If the cache line corresponding to the access address of this instruction is hit in the L1 I-Cache, the instruction is prefetched into the L1 I-Cache through the access address of the instruction. According to the prefetching depth, the prefetching of the prefetched instruction is continued, and the offset information of the access address of the cache line in which the instruction prefetched into the L1 I-Cache is read, and the prefetching of the next access data may be performed. The access address of the instruction (ie, performing a loop prefetch). The pre-fetch depth may be determined according to the type of the workload. Different workload types may be applied to different pre-fetch depths. Therefore, it may be necessary to test and set different typical applications. Some default values. Note that the prefetch depth needs to be set appropriately. Too small affects the prefetching effect and reduces locality. If it is too large, it may cause cache pollution or prefetched data to be overwritten.

本发明实施例提供的cache line由于还包括用于保存访问地址的偏移信息的扩展位，该访问地址的偏移信息为所述处理器核向L1 I-Cache发出的相邻两次访问指令所包含的访问地址的地址变化量；因此，Cache控制器确定在L1 I-Cache中和第一指令的访问地址对应的第一cache line被命中时，可以读取第一cache line的扩展位中访问地址的偏移信息，并根据访问地址的偏移信息和第一指令的访问地址计算得到第二指令的访问地址；CPU core则根据Cache控制器计算得到的第二指令的访问地址，执行对第二指令的预取。通过上述访问地址的偏移信息预取到的指令的被访问率较高，因此，能够提高指令缓存的预取准确率。The cache line provided by the embodiment of the present invention further includes an extension bit for storing offset information of the access address, where the offset information of the access address is an adjacent two access instruction sent by the processor core to the L1 I-Cache. The address change amount of the included access address; therefore, the Cache controller determines that the first cache line corresponding to the access address of the first instruction is hit in the L1 I-Cache, and the extension bit of the first cache line can be read. Accessing the offset information of the address, and calculating the access address of the second instruction according to the offset information of the access address and the access address of the first instruction; the CPU core performs the pair according to the access address of the second instruction calculated by the Cache controller. Prefetching of the second instruction. The access rate of the instruction prefetched by the offset information of the access address is high, and therefore, the prefetch accuracy of the instruction cache can be improved.

该Cache204还可以包括二级指令缓存L2 I-Cache，当L1 I-Cache中存储空间已满时，CPU Core对该L1 I-Cache中的缓存空间执行驱逐操作。例如，将L1 I-Cache中的第二cache line(为便于说明，将被驱逐的该cache line命名为第二cache line，下同)驱逐到L2 I-cache中。Cache控制器为该第二cache line设置被驱逐权重，该被驱逐权重用于标记第二cache line在L2 I-cache中被驱逐的优先级。对于驱逐阶段设置被驱逐权重的相关说明参考装置部分，此处不再赘述。The Cache 204 may further include a secondary instruction cache L2 I-Cache. When the storage space in the L1 I-Cache is full, the CPU Core performs an eviction operation on the cache space in the L1 I-Cache. For example, the second cache line in the L1 I-Cache (for convenience of explanation, the cache line to be evicted is named as the second cache line, the same below) is deported to the L2 I-cache. The Cache controller sets the eviction weight for the second cache line, and the eviction weight is used to mark the priority of the second cache line being evicted in the L2 I-cache. For the description of the deportation weights in the expulsion phase, refer to the device section, which will not be described here.

对于被驱逐的cache line可以实现预取经验的继承和共享，在确定CPU Core执行指令的访问地址对应上述被驱逐至L2 I-cache中的第二cache line时，将被驱逐至L2 I-Cache中的第二cache line预取到L1 I-Cache中。相关说明参考装置部分，此处不再赘述。For the eviction cache line, the inheritance and sharing of prefetching experience can be implemented. When the access address of the CPU Core execution instruction is determined to correspond to the second cache line deported to the L2 I-cache, it will be deported to the L2 I-Cache. The second cache line is prefetched into the L1 I-Cache. The related description refers to the device part and will not be described here.

在该处理器芯片存在多个处理器核的情况下，对于被驱逐的cache line可以实现多核预取经验的共享，例如，当其他处理器核(如CPU Core 2)执行指令的访问地址同样对应上述被驱逐至L2 I-cache中的第二cache line时，也可以从L2 I-Cache中对此cache line进行预取，实现多核预取“经验”的共享。相关说明参考装置部分，此处不再赘述。In the case where there are multiple processor cores in the processor chip, the sharing of the multi-core prefetch experience can be implemented for the cached cache line, for example, when the access addresses of other processor cores (such as CPU Core 2) execute instructions are also corresponding. When the above is expelled to the second cache line in the L2 I-cache, the cache line can also be prefetched from the L2 I-Cache to realize the sharing of the multi-core prefetch "experience". The related description refers to the device part and will not be described here.

本发明实施例还提供一种计算机存储介质，其中，该计算机存储介质可存储有程序，该程序执行时包括上述方法实施例中记载的指令缓存的预取方法的部分或全部步骤。An embodiment of the present invention further provides a computer storage medium, wherein the computer storage medium can store a program, where The program execution includes some or all of the steps of the prefetching method of the instruction cache described in the above method embodiments.

所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统，装置和单元的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。A person skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the system, the device and the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

在本发明所提供的几个实施例中，应该理解到，所揭露的系统，装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of the unit is only a logical function division. In actual implementation, there may be another division manner, for example, multiple units or components may be combined or Can be integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

另外，在本发明各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-Only Memory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, which is essential or contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium. A number of instructions are included to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention. The foregoing storage medium includes: a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, and the like. .

以上所述，以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。 The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to be limiting; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that The technical solutions described in the embodiments are modified, or the equivalents of the technical features are replaced by the equivalents of the technical solutions of the embodiments of the present invention.

Claims

A processor chip, comprising: a processor core CPU core and a cache cache, the cache includes a first level instruction cache L1 I-Cache and a Cache controller, and the L1 I-Cache includes at least one a cache unit cache line, each of the cache lines includes a tag tag field, data, and a flag bit.

Each of the cache lines further includes an extension bit for storing offset information of the access address, where the offset information of the access address is an adjacent two access command sent by the CPU core to the L1 I-Cache. The amount of address change of the included access address;

The CPU core is configured to obtain an access address of the first instruction, and access the L1 I-Cache according to the access address of the first instruction;

The Cache controller is configured to: when the first cache line corresponding to the access address of the first instruction is hit in the L1 I-Cache, read an access address in an extension bit of the first cache line Offset information, and calculating an access address of the second instruction according to the offset information of the access address and the access address of the first instruction;

The CPU core is further configured to perform prefetching of the second instruction according to the access address of the second instruction calculated by the Cache controller.

The chip of claim 1 wherein:

The Cache controller is further configured to: when the first cache line corresponding to the access address of the first instruction is not hit in the L1 I-Cache, searching for the first address according to the access address of the first instruction The location of the first instruction and prefetching the first instruction into the L1 I-Cache.

The chip according to claim 1 or 2, wherein

The Cache controller is further configured to calculate offset information between an access address of the first instruction and an access address of the second instruction, and obtain the offset, before acquiring an access address of the first instruction The access address of the information written to the first instruction corresponds to the extension bit of the cache line.

The chip according to any one of claims 1 to 3, wherein the Cache further comprises a second level instruction cache L2 I-Cache,

The Cache controller is further configured to: when the second cache line in the L1 I-Cache performs eviction to the L2 I-Cache operation, set an eviction weight for the second cache line, where The eviction weight is used to mark the priority of the second cache line being evicted in the L2 I-Cache.

The chip according to claim 4, wherein

The Cache controller is further configured to prefetch the second cache line that is deported to the L2 I-Cache when determining that the access address of the CPU Core execution instruction corresponds to the second cache line In the L1 I-Cache.

A prefetching method for an instruction cache is applied to a processor chip, the processor chip including a processor a core CPU core and a cache Cache, the Cache includes a first level instruction cache L1 I-Cache and a Cache controller, the L1 I-Cache including at least one cache unit cache line, each of the cache lines including a tag tag Domain, data, and flag bits, characterized in that

Each of the cache lines further includes an extension bit for storing offset information of the access address, where the offset information of the access address is an adjacent two access command sent by the CPU core to the L1 I-Cache. The amount of address change of the included access address; the method includes:

The CPU core acquires an access address of the first instruction, and accesses the L1 I-Cache according to the access address of the first instruction;

Determining, by the Cache controller, the offset of the access address in the extension bit of the first cache line when the first cache line corresponding to the access address of the first instruction is hit in the L1 I-Cache Information, and calculating an access address of the second instruction according to the offset information of the access address and the access address of the first instruction;

The CPU core performs prefetching of the second instruction according to the access address of the second instruction calculated by the Cache controller.

The method of claim 6 wherein the method further comprises:

Determining, by the Cache controller, that the first cache line corresponding to the access address of the first instruction is not hit in the L1 I-Cache, searching for the first instruction according to the access address of the first instruction Positioning and prefetching the first instruction into the L1 I-Cache.

The method according to claim 6 or 7, wherein before the CPU core acquires an access address of the first instruction, the method further includes:

The Cache controller calculates offset information between the access address of the first instruction and the access address of the second instruction, and writes the offset information to an extension of the cache address corresponding to the access address of the first instruction In the bit.

The method according to any one of claims 6 to 8, wherein the Cache further comprises a secondary instruction cache L2 I-Cache, the method further comprising:

The Cache controller sets an eviction weight for the second cache line when performing eviction of the second cache line in the L1 I-Cache to the L2 I-Cache operation, and the eviction weight is used for Marking the priority of the second cache line being evicted in the L2 I-Cache.

The method of claim 9 wherein the method further comprises:

The Cache controller prefetches the second cache line that is deported to the L2 I-Cache to the L1 I when determining that the access address of the CPU Core execution instruction corresponds to the second cache line -Cache.