WO2026011589A1

WO2026011589A1 - Method for reducing bus load, cxl module, processing system, and processor chip

Info

Publication number: WO2026011589A1
Application number: PCT/CN2024/124497
Authority: WO
Inventors: 张凯; 戴瑾; 张云森; 孟凡辉
Original assignee: Beijing Superstring Academy of Memory Technology
Current assignee: Beijing Superstring Academy of Memory Technology
Priority date: 2024-07-10
Filing date: 2024-10-12
Publication date: 2026-01-15
Anticipated expiration: 2027-01-10
Also published as: CN118779280A; CN118779280B

Abstract

A method for reducing the bus load, a CXL module, a processing system, and a processor chip, applied to the field of computers. In the processing system comprising a plurality of processing units, each processing unit acquires configuration information of a coherence cache area by means of a configuration interface having cache coherence, and a partial area of a cache of a current processing unit is configured as a coherence cache area, so as to use the coherence cache area to cache data in shared memories of a plurality of processing units. The CXL module can acquire configuration information of the shared memories by means of a shared memory configuration instruction at configuration interfaces of the shared memories, and configure partial memory areas in memory mediums as the shared memories.

Description

Methods to reduce bus load, CXL module, processing system and processor chip

本申请要求于2024年7月10日提交的、申请号为202410925305.7的中国专利申请的优先权，其内容应理解为通过引用的方式并入本申请中。This application claims priority to Chinese patent application No. 202410925305.7, filed on July 10, 2024, the contents of which are to be understood as incorporated herein by reference.

Technical Field

本公开实施例涉及但不仅限于计算机领域，特别涉及一种降低总线负载的方法、CXL模组、处理系统和处理器芯片。This disclosure relates to, but is not limited to, the field of computers, and particularly to a method for reducing bus load, a CXL module, a processing system, and a processor chip.

Background Technology

在一些多处理器架构中，多个处理器芯片通过总线连接，处理器芯片中设置有处理器和存储器，基于高带宽接口与处理器连接的存储器可用作处理器的缓存，以提供高性能计算能力。多处理器架构可以使用缓存一致性协议，使得一个处理器所做的修改对其他所有处理器可见，而不能出现某个处理器修改了数据但其他处理器还看到旧数据的情况。同时需要保证不同处理器对共享数据的访问顺序符合设定的操作顺序，防止由于乱序执行或并发访问导致的数据不一致。这种机制可以提高系统性能和并发访问效率，同时保证数据的正确性。In some multiprocessor architectures, multiple processor chips are connected via a bus. Each processor chip contains a processor and memory. Memory connected to the processor via a high-bandwidth interface can be used as a cache to provide high-performance computing capabilities. Multiprocessor architectures can use cache coherency protocols to ensure that modifications made by one processor are visible to all other processors, preventing situations where one processor modifies data but other processors still see old data. Simultaneously, it's necessary to ensure that the order in which different processors access shared data conforms to a pre-defined operation order to prevent data inconsistency caused by out-of-order execution or concurrent access. This mechanism can improve system performance and concurrent access efficiency while guaranteeing data integrity.

发明内容Summary of the Invention

以下是对本公开详细描述的主题的概述。本概述并非是为了限制权利要求的保护范围。The following is an overview of the subject matter described in detail in this disclosure. This overview is not intended to limit the scope of the claims.

本公开实施例提供了一种CXL模组，所述CXL模组包括CXL控制器及与所述CXL控制器连接的内存介质，所述CXL控制器被配置为执行以下处理：This disclosure provides a CXL module, which includes a CXL controller and a memory medium connected to the CXL controller. The CXL controller is configured to perform the following processes:

接收共享内存配置指令并解析,得到共享内存的配置信息；其中，所述共享内存是与所述CXL模组连接的多个处理单元共享的内存区域；The system receives and parses shared memory configuration instructions to obtain shared memory configuration information; wherein, the shared memory is a memory region shared by multiple processing units connected to the CXL module.

根据所述共享内存的配置信息确定所述共享内存的地址范围；The address range of the shared memory is determined based on the configuration information of the shared memory;

根据所述共享内存的地址范围，将所述内存介质中的部分内存区域配置为所述共享内存，仅对所述共享内存中的数据进行缓存一致性维护。Based on the address range of the shared memory, a portion of the memory region in the memory medium is configured as the shared memory, and cache consistency maintenance is performed only on the data in the shared memory.

本公开实施例还提供了一种共享内存配置方法，应用于CXL模组，所述CXL模组包括CXL控制器及与所述CXL控制器连接的内存介质，所述方法包括：This disclosure also provides a shared memory configuration method applied to a CXL module, the CXL module including a CXL controller and a memory medium connected to the CXL controller, the method including:

接收共享内存配置指令并解析，得到共享内存的配置信息；其中，所述共享内存是与所述CXL模组连接的多个处理单元共享的内存区域；The system receives and parses shared memory configuration instructions to obtain shared memory configuration information; wherein, the shared memory is a memory region shared by multiple processing units connected to the CXL module.

本公开上述实施例对CXL协议进行拓展，增加共享内存配置的接口。CXL模组根据该接口定义的共享内存配置指令，解析得到共享内存的配置信息；再根据所述共享内存的配置信息确定所述共享内存的地址范围，从而将所述内存介质中的部分内存区域配置为所述共享内存，且仅对所述共享内存中的数据进行缓存一致性维护。即可以通过配置来限制CXL模组中的共享内存的大小，且CXL模组仅对共享内存中的数据进行缓存一致性维护。本公开实施例通过配置将CXL模组中的共享内存的大小限制在设定范围之内，可以防止维护缓存一致性的操作过于频繁，导致总线负载过大的问题。This disclosure extends the CXL protocol by adding a shared memory configuration interface. The CXL module parses the shared memory configuration information according to the shared memory configuration instructions defined by this interface; then, based on the shared memory configuration information, it determines the address range of the shared memory, thereby configuring a portion of the memory region in the memory medium as the shared memory, and only maintaining cache consistency for the data in the shared memory. That is, it can limit... The size of the shared memory in the CXL module is controlled, and the CXL module only performs cache consistency maintenance on the data in the shared memory. This embodiment of the disclosure, by configuring the size of the shared memory in the CXL module to be limited within a set range, can prevent excessively frequent cache consistency maintenance operations from causing excessive bus load.

本公开实施例还提供了一种处理系统，包括总线和多个处理单元，多个所述处理单元通过所述总线连接，其中，所述处理单元均设置有缓存且具有缓存一致性的配置接口，所述处理单元被配置为：This disclosure also provides a processing system, including a bus and multiple processing units connected via the bus. Each processing unit is equipped with a cache and has a cache consistency configuration interface. The processing unit is configured to:

通过所述缓存一致性的配置接口获取一致性缓存区域的配置信息，根据所述一致性缓存区域的配置信息确定所述一致性缓存区域的地址范围；The configuration information of the consistent cache region is obtained through the configuration interface of the cache consistency, and the address range of the consistent cache region is determined based on the configuration information of the consistent cache region.

根据所述一致性缓存区域的地址范围，将本处理单元的缓存的部分区域配置为一致性缓存区域，以使用所述一致性缓存区域缓存多个所述处理单元的共享内存中的数据。Based on the address range of the consistency cache region, a portion of the cache of this processing unit is configured as a consistency cache region to cache data in the shared memory of multiple processing units using the consistency cache region.

本公开实施例还提供了一种降低总线负载的方法，应用于包括多个处理单元的系统，多个所述处理单元之间通过总线连接，多个所述处理单元均设置有缓存且具有缓存一致性的配置接口，所述方法包括：This disclosure also provides a method for reducing bus load, applied to a system including multiple processing units connected via a bus, wherein each processing unit is configured with a cache and has a cache consistency configuration interface, the method comprising:

所述处理单元通过所述缓存一致性的配置接口获取一致性缓存区域的配置信息，根据所述一致性缓存区域的配置信息确定所述一致性缓存区域的地址范围；The processing unit obtains the configuration information of the consistency cache region through the configuration interface of the cache consistency, and determines the address range of the consistency cache region based on the configuration information of the consistency cache region;

所述处理单元根据所述一致性缓存区域的地址范围，将本处理单元的缓存的部分区域配置为一致性缓存区域，以使用所述一致性缓存区域缓存多个所述处理单元的共享内存中的数据。The processing unit configures a portion of its cache as a consistency cache region based on the address range of the consistency cache region, so as to use the consistency cache region to cache data in the shared memory of multiple processing units.

本公开实施例还提供了一种处理器芯片，包括处理器核和缓存，所述处理器核被配置为执行如本公开任一实施例所述的由处理单元执行的降低总线负载的方法。This disclosure also provides a processor chip including a processor core and a cache, the processor core being configured to perform a method for reducing bus load as described in any embodiment of this disclosure, executed by a processing unit.

本公开实施例还提供了一种非瞬态计算机存储介质，保存有计算机程序，所述计算机程序被处理器执行时，可实现本公开任一实施例所述的的降低总线负载的方法。This disclosure also provides a non-transient computer storage medium storing a computer program, which, when executed by a processor, can implement the method for reducing bus load described in any embodiment of this disclosure.

本公开实施例还提供了一种计算机产品，所述计算机产品被处理器执行时，可实现本公开任一实施例所述的降低总线负载的方法。This disclosure also provides a computer product that, when executed by a processor, can implement the method for reducing bus load described in any embodiment of this disclosure.

本公开的其它特征和优点将在随后的说明书中阐述，并且，部分地从说明书中变得显而易见，或者通过实施本公开而了解。本公开的其他优点可通过在说明书以及附图中所描述的方案来实现和获得。Other features and advantages of this disclosure will be set forth in the following description, and will be apparent in part from the description, or may be learned by practicing the disclosure. Other advantages of this disclosure may be realized and obtained by means of the methods described in the description and the accompanying drawings.

在阅读并理解了附图和详细描述后，可以明白其他方面。After reading and understanding the accompanying diagrams and detailed descriptions, the other aspects can be understood.

附图概述Overview of the attached figures

附图用来提供对本公开技术方案的理解，并且构成说明书的一部分，与本公开的实施例一起用于解释本公开的技术方案，并不构成对本公开技术方案的限制。The accompanying drawings are used to provide an understanding of the technical solutions of this disclosure and form part of the specification. They are used together with the embodiments of this disclosure to explain the technical solutions of this disclosure and do not constitute a limitation on the technical solutions of this disclosure.

图1是本公开一实施例多处理器架构的示意图；Figure 1 is a schematic diagram of a multiprocessor architecture according to an embodiment of the present disclosure;

图2是本公开一实施例降低总线负载的方法的流程图；Figure 2 is a flowchart of a method for reducing bus load according to an embodiment of the present disclosure;

图3是本公开一实施例处理单元的缓存区域划分的示意图；Figure 3 is a schematic diagram of the cache region division of a processing unit according to an embodiment of the present disclosure;

图4是本公开一实施例CXL模组中的内存区域划分的示意图；Figure 4 is a schematic diagram of the memory region division in a CXL module according to an embodiment of this disclosure;

图5是本公开一实施例处理器芯片中的一致性缓存区域与多个CXL模组中的共享内存之间的地址映射的示意图；Figure 5 is a schematic diagram of the address mapping between the coherence cache region in a processor chip and the shared memory in multiple CXL modules according to an embodiment of the present disclosure;

图6是本公开一实施例根据地址查找缓存数据的示意图； Figure 6 is a schematic diagram of a method for finding cached data based on an address according to an embodiment of the present disclosure;

图7是本公开一实施例共享内存配置方法的流程图。Figure 7 is a flowchart of a shared memory configuration method according to an embodiment of the present disclosure.

详述Detailed Explanation

本公开描述了多个实施例，但是该描述是示例性的，而不是限制性的，并且对于本领域的普通技术人员来说显而易见的是，在本公开所描述的实施例包含的范围内可以有更多的实施例和实现方案。尽管在附图中示出了许多可能的特征组合，并在具体实施方式中进行了讨论，但是所公开的特征的许多其它组合方式也是可能的。除非特意加以限制的情况以外，任何实施例的任何特征或元件可以与任何其它实施例中的任何其他特征或元件结合使用，或可以替代任何其它实施例中的任何其他特征或元件。This disclosure describes several embodiments, but these descriptions are exemplary and not limiting, and it will be apparent to those skilled in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are also possible. Unless specifically limited, any feature or element of any embodiment may be used in combination with, or may replace, any feature or element of any other embodiment.

本公开包括并设想了与本领域普通技术人员已知的特征和元件的组合。本公开已经公开的实施例、特征和元件也可以与任何常规特征或元件组合，以形成受本公开保护的发明方案。任何实施例的任何特征或元件也可以与来自其它发明方案的特征或元件组合，以形成另一种受本公开保护的发明方案。因此，应当理解，在本公开中示出和/或讨论的任何特征可以单独地或以任何适当的组合来实现。因此，除了根据所附权利要求及其等同替换所做的限制以外，实施例不受其它限制。此外，可以在所附权利要求的保护范围内进行各种修改和改变。This disclosure includes and contemplates combinations of features and elements known to those skilled in the art. The embodiments, features, and elements disclosed in this disclosure may also be combined with any conventional features or elements to form an inventive scheme protected by this disclosure. Any feature or element of any embodiment may also be combined with features or elements from other inventive schemes to form another inventive scheme protected by this disclosure. Therefore, it should be understood that any feature shown and/or discussed in this disclosure may be implemented individually or in any suitable combination. Therefore, the embodiments are not limited except by the limitations imposed by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.

此外，在描述具有代表性的实施例时，说明书可能已经将方法和/或过程呈现为特定的步骤序列。然而，在该方法或过程不依赖于本文所述步骤的特定顺序的程度上，该方法或过程不应限于所述的特定顺序的步骤。如本领域普通技术人员将理解的，其它的步骤顺序也是可能的。因此，说明书中阐述的步骤的特定顺序不应被解释为对权利要求的限制。此外，针对该方法和/或过程的权利要求不应限于按照所写顺序执行它们的步骤，本领域技术人员可以容易地理解，这些顺序可以变化，并且仍然保持在本公开实施例的精神和范围内。Furthermore, in describing representative embodiments, the specification may have presented methods and/or processes as a specific sequence of steps. However, the method or process should not be limited to the specific order of steps described herein, to the extent that the method or process does not depend on the specific order of steps described herein. As will be understood by those skilled in the art, other sequences of steps are also possible. Therefore, the specific order of steps set forth in the specification should not be construed as a limitation of the claims. Moreover, the claims relating to the method and/or process should not be limited to the steps performed in the order written, and those skilled in the art will readily understand that these orders can be varied and still remain within the spirit and scope of the embodiments disclosed herein.

图1所示是本公开一实施例的示例性的多处理器架构，该处理器构架包括通过总线相互连接的多个处理器，多个处理器分别位于不同的处理器芯片中，处理器芯片对外提供CXL接口且通过CXL总线相互连接。处理器芯片可以包括一个或多个处理器核(core)(图中只示了一个)。多处理器架构的处理器之间也可以通过其他类型的总线如高速串行计算机扩展总线(Peripheral Component Interconnect express，PCIe)总线连接。多个处理器可以通过互连网络连接，处理器之间的总线可以是共享的也可以是独立的，取决于系统设计，图中所示的总线连接仅为示意性的。Figure 1 illustrates an exemplary multiprocessor architecture according to an embodiment of this disclosure. This architecture includes multiple processors interconnected via a bus. The processors reside in different processor chips, which provide CXL interfaces and are interconnected via a CXL bus. Each processor chip may include one or more processor cores (only one is shown in the figure). The processors in the multiprocessor architecture can also be connected via other types of buses, such as the Peripheral Component Interconnect Express (PCIe) bus. Multiple processors can be connected via an interconnect network. The buses between the processors can be shared or independent, depending on the system design. The bus connections shown in the figure are merely illustrative.

本实施例的处理器可以是图形处理器(Graphics Processing Unit，GPU)、中央处理器(Central Processing Unit，CPU)或者其他类型的处理器，一个处理器芯片中也可以集成一个或多个处理器核，这些处理器核的类型可以相同也可以不同。处理器芯片中封装有内存介质和处理器核(core)，内存介质和处理器核基于高带宽接口连接，其容量和带宽远高于通用的双数据率同步动态随机存储器(Double Data Rate Synchronous Dynamic Random Access Memory，DDR)接口，该内存介质在图示的示例为集成在处理器芯片中的嵌入式动态随机存取存储器(Dynamic Random Access Memory，DRAM)，如高带宽内存(High Bandwidth Memory，HBM)、GDDR6X等类型的介质，这些内存介质可以用作处理器的缓存，以提供高性能计算能力。处理器芯片还可以包括图中未示出的其他内存接口(如DDR接口)来连接外部的内存设备，还包括时钟电路、内存管理等部分。The processor in this embodiment can be a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), or other types of processors. A single processor chip can also integrate one or more processor cores, which can be of the same or different types. The processor chip encapsulates memory media and processor cores, which are connected via a high-bandwidth interface. The capacity and bandwidth of this high-bandwidth interface are significantly higher than those of general-purpose Double Data Rate Synchronous Dynamic Random Access Memory (DDR). In the illustrated example, this memory media is embedded Dynamic Random Access Memory (DRAM) integrated into the processor chip, such as High Bandwidth Memory (HBM) or GDDR6X. This memory media can be used as a processor cache to provide high-performance computing capabilities. The processor chip may also include other memory interfaces (such as a DDR interface, not shown in the figure) for connecting external memory devices, as well as clock circuitry, memory management components, etc.

多处理器架构下，多个处理器可以访问同一内存区域，称为共享内存。多个处理器的共享内存可以设置有一个或多个，可以是处理器本地内存(Local memory)中的区域，或者是外接设备如CXL模组的内存中的区域，或者既包括处理器本地内存中的区域又包括外接设备的内存中的区域。多处理器架构下，处理器可以有自己的缓存，需要确保不同处理器之间对共享数据的访问是一致的，为此需要基于缓存一致性协议来维护共享内存中的数据在各处理器缓存中的一致性(简称缓存一致性)，以提高系统性能和并发访问效率。In a multiprocessor architecture, multiple processors can access the same memory region, known as shared memory. There can be one or more shared memory locations. These can be regions within the processor's local memory, regions within the memory of external devices such as the CXL module, or regions that include both processor-local memory and external device memory. In a multiprocessor architecture, each processor can have its own cache. It is necessary to ensure that access to shared data is consistent across different processors. To achieve this, a cache coherence protocol is needed to maintain the consistency of data in shared memory across the caches of each processor (referred to as cache coherence) to improve system performance and concurrent access efficiency.

多处理器架构与维护缓存一致性相关的操作可以包括：Operations related to maintaining cache coherency in multiprocessor architectures may include:

处理器需要时刻关注总线上关于缓存行(CacheLine)相关的操作，其他处理器对CacheLine的操作会触发总线广播事件，从而增加总线负载。The processor needs to constantly monitor operations related to cache lines on the bus. Operations on cache lines by other processors will trigger bus broadcast events, thereby increasing the bus load.

处理器接收到总线嗅探事件后，需要和本处理器对应的缓存区域内的缓存行进行比较，从而对是否修改当前缓存行的状态进行决策。After receiving a bus sniffing event, the processor needs to compare it with the cache line in the corresponding cache area of the processor to decide whether to modify the state of the current cache line.

此外，在处理器核(如x64)使用存储缓存(Store buffer)的情况下，在基于CXL接口将Store buffer的数据写回主内存时，时延会增加很多，这也会影响总线的通信开销，增加总线负载。Furthermore, when the processor core (such as x64) uses a store buffer, the latency will increase significantly when writing data from the store buffer back to main memory via the CXL interface. This will also affect the communication overhead of the bus and increase the bus load.

CXL基于PCIe 5.0发展而来，运行在PCIe物理层之上，依据访问内存的特点，CXL分为三种子协议：CXL.io、CXL.mem和CXL.cache。相应地，CXL设备中分为第一类型的CXL设备(Type1 CXL device，简称Type1设备)、第二类型的CXL设备(Type2 CXL device，简称Type2设备)和第三类型的CXL设备(Type3 CXL device，简称Type3设备)。其中Type2设备和Type3设备均设置有设备内存，这些设备内存可以是DDR、HBM等类型。以Type2设备为例，Type2设备可以是实际应用场景中的加速卡(如GPU)，由Host提供数据，加速卡负责计算。主机直接管理(Host Direct Management，HDM)模型允许主机直接管理和控制Type2设备上的存储空间，根据应用需求动态分配和配置设备内存并直接进行读写操作。Type3设备可以为Host提供大容量DRAM，也可以支持持久化介质(即非易失性介质，如包括闪存芯片)。CXL is based on PCIe 5.0 and runs on top of the PCIe physical layer. Based on the characteristics of memory access, CXL is divided into three sub-protocols: CXL.io, CXL.mem, and CXL.cache. Correspondingly, CXL devices are divided into three types: Type 1 CXL device, Type 2 CXL device, and Type 3 CXL device. Both Type 2 and Type 3 devices have dedicated device memory, which can be DDR, HBM, or other types. Taking Type 2 devices as an example, a Type 2 device can be an accelerator card (such as a GPU) in a real-world application scenario, with the host providing data and the accelerator card handling computation. The Host Direct Management (HDM) model allows the host to directly manage and control the storage space on the Type 2 device, dynamically allocating and configuring device memory according to application needs and directly performing read and write operations. Type 3 devices can provide the host with large-capacity DRAM and can also support persistent media (i.e., non-volatile media, such as flash memory chips).

可选地，如图1所示，多处理器架构中的多个处理器还可以通过CXL总线与CXL模组相连。本公开的CXL模组指包括内存介质的CXL设备，如Type2设备、Type3设备等。如图所示，CXL模组包括控制芯片及与控制芯片连接的一个或多个内存介质，其中内存介质在图示的示例中是以DRAM颗粒为例。该CXL模组中的内存可以作为与其相连的处理器的扩展内存，该扩展内存可以用于存储多处理器或多处理核共享的数据，如作为共享内存存储全局变量等类型的数据。Optionally, as shown in Figure 1, multiple processors in a multiprocessor architecture can also be connected to the CXL module via a CXL bus. The CXL module disclosed herein refers to a CXL device including memory media, such as a Type 2 device, a Type 3 device, etc. As shown in the figure, the CXL module includes a control chip and one or more memory media connected to the control chip, wherein the memory media in the illustrated example is a DRAM chip. The memory in this CXL module can serve as extended memory for the processors connected to it. This extended memory can be used to store data shared by multiple processors or multiple processing cores, such as storing global variables and other types of data as shared memory.

在本实施例的多处理器架构下，处理器芯片内部使用了大容量和高带宽的内存介质，因此可使用的缓存空间增大；处理器芯片之间基于CXL总线连接，内存使用范围可基于CXL模组进行扩展，因此可用的内存范围也会增大，但相应地会引入下述问题：In the multiprocessor architecture of this embodiment, the processor chip uses a large-capacity and high-bandwidth memory medium, thus increasing the available cache space; the processor chips are connected based on the CXL bus, and the memory usage range can be expanded based on the CXL module, thus increasing the available memory range as well. However, this also introduces the following problems:

1)每个处理器缓存的数据增多，导致总线上维护缓存一致性的负载开销增大；在极端情况下，如果限制仅维护一条Cacheline用于维护缓存一致性，则相关信令交互则会降到最少。1) The increase in data cached by each processor leads to an increase in the overhead of maintaining cache coherency on the bus; in extreme cases, if only one cacheline is maintained for maintaining cache coherency, the related signaling interactions will be minimized.

2)CXL的访问时延与本地DDR的访问时延相比较长，当前CPU在总线发出广播后，广播事件有可能触发其他处理器先访问CXL存储(如刷新当前处理器的缓存数据)，然后再给当前CPU做广播事件的反馈，会增大处理时延，导致总线负载增加；其中一种方式是主处理器(Host CPU)的本地写操作(Local Write)，如果此时CXL设备中同位置的缓存行(Cacheline)为修改(Modified)状态，需要将数据写回主内存并修改状态；同时将当前的主处理器缓存行的状态修改为Modified(CXL设备将数据写回主内存的时延较长)。数据的反馈时长较长导致控制信息交互慢，访问时延过长也会导致总线负载拥塞。每个Cacheline都会贡献总线交互的负载，当需要维护缓存一致性的缓存较大即Cacheline数目较多时，更是加重了这些问题。2) CXL access latency is longer compared to local DDR access latency. After the current CPU broadcasts an event on the bus, the broadcast event may trigger other processors to access CXL memory first (e.g., refresh the current processor's cache data) before feeding back the broadcast event to the current CPU, increasing processing latency and leading to increased bus load. One possible cause is a local write operation by the host CPU. If the cache line at the same location in the CXL device is in a modified state, the data needs to be written back to main memory and its state modified. Simultaneously, the state of the current host CPU's cache line is modified (the latency of writing data back to main memory by the CXL device is relatively long). The long data feedback time results in slow control information exchange, and excessive access latency also leads to increased bus load. Each cacheline contributes to the bus interaction load, and these problems are exacerbated when the number of caches that need to maintain cache consistency is large, i.e., when the number of cachelines is large.

3)CXL时延高是由于CXL路径决定的，而当缓存一致性信息到达CXL设备端，依据特定地址寻找特定Cacheline的算法也存在时延。缓存较大时，处理器芯片内部查找定位Cacheline的负载增加，Cacheline定位效率低，间接导致反馈时延增长，即时延会随Cacheline的数目增大而增大。Cacheline的定位在主机端(如处理器端)或者CXL设备端都会存在。3) The high latency of CXL is determined by the CXL path. When cache coherency information arrives at the CXL device, the algorithm for finding a specific cacheline based on a specific address also introduces latency. With a large cache, the load on the processor chip to locate the cacheline increases, leading to low cacheline location efficiency and indirectly increasing feedback latency. In other words, latency increases with the number of cachelines. Cacheline location can occur at either the host (e.g., processor) or the CXL device.

因此总线负载与用于维护一致性的Cacheline的数量，及一致性信令交互时延相关，后者又与设备端和主机端定位Cacheline的效率有关。Therefore, bus load is related to the number of cachelines used to maintain consistency and the latency of consistency signaling interactions, the latter of which is related to the efficiency of the device and host in locating cachelines.

以上虽然以多处理器架构的系统为例，但是其他一些包括多个处理单元的系统也存在以上问题。例如，在一些具有多核的芯片组中多个处理器核之间也存在共享内存和维护缓存一致性的问题，如处理器核设置的缓存如果过大，也会导致多个处理器核之间的总线负载过大。又如，包括多个处理器且需要进行缓存一致性维护的分布式系统或大规模并行计算的其他系统也存在类似的问题。While the above examples use multi-processor architectures, other systems with multiple processing units also exhibit these issues. For instance, in some multi-core chipsets, there are problems with shared memory and maintaining cache coherency among the multiple processor cores. If the cache size of a processor core is too large, it can lead to excessive bus load across multiple processor cores. Similarly, distributed systems or other systems involving large-scale parallel computing that include multiple processors and require cache coherency maintenance also face similar problems.

在实际应用场景中，多核的进程共享内存(Memory)的范围较小；例如在基于AI或者大语言模型的训练过程中，层间交互数据较少，从CPU或者GPU并行方面考虑，意味着大规模核间交互数据的场景较少，也就是说，在一些场景下，没有必要把所有的缓存和外部内存做缓存一致性维护。In practical applications, the scope of shared memory for multi-core processes is relatively small. For example, in the training process based on AI or large language models, there is less data interaction between layers. From the perspective of CPU or GPU parallelism, this means that there are fewer scenarios with large-scale data interaction between cores. In other words, in some scenarios, it is not necessary to maintain cache consistency for all caches and external memory.

针对具有较大缓存空间的处理单元(如处理器或处理器核)且需要进行缓存一致性维护的系统，在本公开的一些实施例中，通过增加缓存一致性的配置接口(如定义了相应的配置指令)，以配置处理单元的用于缓存一致性的缓存区域，减少用于缓存一致性的Cacheline的数目，进而减少系统维护一致性的总线开销。在本公开的一些实施例中，通过增加缓存一致性配置接口，以配置多个处理单元共享内存的区域，减少共享内存的容量，进而减少系统维护一致性的总线开销。在本公开的一些实施例中，通过改进缓存中Cacheline的搜索算法，提高Cacheline定位效率，使得反馈时延减少，进而减少系统维护一致性的总线开销。For systems with large cache spaces (such as processors or processor cores) that require cache coherency maintenance, some embodiments of this disclosure add a cache coherency configuration interface (e.g., defining corresponding configuration instructions) to configure the cache region for cache coherency in the processing unit, reducing the number of cachelines for cache coherency and thus reducing the bus overhead for maintaining system coherency. In some embodiments of this disclosure, a cache coherency configuration interface is added to configure a region of shared memory for multiple processing units, reducing the capacity of shared memory and thus reducing the bus overhead for maintaining system coherency. In some embodiments of this disclosure, the cacheline search algorithm in the cache is improved to increase cacheline location efficiency, reducing feedback latency and thus reducing the bus overhead for maintaining system coherency.

需要说明的是，虽然本公开实施例应用于具有较大缓存空间的处理单元时效果更为明显，但本公开实施例的应用并不以缓存空间大小界定，也可以应用于包括较小缓存空间的处理单元构成的需要维护缓存一致性的系统，也可以减少总线的负载。It should be noted that although the embodiments of this disclosure are more effective when applied to processing units with larger cache spaces, the application of the embodiments of this disclosure is not limited to the size of the cache space. They can also be applied to systems that require maintaining cache consistency, including processing units with smaller cache spaces, and can also reduce the load on the bus.

本公开一实施例提供了一种降低总线负载的方法，应用于包括多个处理单元的系统，多个所述处理单元之间通过总线连接，多个所述处理单元均设置有缓存且具有缓存一致性的配置接口，如图2所示，所述方法包括：One embodiment of this disclosure provides a method for reducing bus load, applied to a system including multiple processing units connected via a bus, wherein each processing unit is equipped with a cache and has a configuration interface for cache consistency, as shown in FIG2. The method includes:

步骤110，所述处理单元通过所述缓存一致性的配置接口获取一致性缓存区域的配置信息，根据所述一致性缓存区域的配置信息确定所述一致性缓存区域的地址范围；Step 110: The processing unit obtains the configuration information of the consistency cache region through the configuration interface of the cache consistency, and determines the address range of the consistency cache region based on the configuration information of the consistency cache region;

步骤120，所述处理单元根据所述一致性缓存区域的地址范围，将本处理单元的缓存的部分区域配置为一致性缓存区域，以使用所述一致性缓存区域缓存多个所述处理单元的共享内存中的数据。Step 120: The processing unit configures a portion of its cache as a consistency cache region based on the address range of the consistency cache region, so as to use the consistency cache region to cache data in the shared memory of multiple processing units.

通过本实施例方法，可以将处理单元的缓存划分为两部分，如图3所示，一部分为一致性缓存区域(Coherence Area)，另一部分为其他缓存区域，图中标记为非一致性缓存区域(Non Coherence Area)。处理单元使用一致性缓存区域缓存多个处理单元的共享内存中的数据，即表示一个处理单元对多个处理单元的共享内存中的数据进行缓存时，均缓存至为该处理单元配置的一致性缓存区域，不会缓存至非一致性缓存区域。本实施例可以为系统中的一个或多个处理单元设置缓存一致性的配置接口。为多个处理单元设置缓存一致性的配置接口时，多个处理单元的缓存一致性的配置接口可以相互独立，也可以使用一个统一的接口，不同处理单元的配置接口的类型可以相同或不同。The method in this embodiment divides the processing unit's cache into two parts, as shown in Figure 3: a coherence area and other cache areas, marked as non-coherence areas in the figure. The processing unit uses the coherence area to cache data in the shared memory of multiple processing units. This means that when a processing unit caches data in the shared memory of multiple processing units, it caches it in the coherence area configured for that processing unit and not in the non-coherence area. This embodiment can... Configure a cache consistency configuration interface for one or more processing units in the system. When configuring cache consistency configuration interfaces for multiple processing units, the cache consistency configuration interfaces for multiple processing units can be independent of each other, or a unified interface can be used. The configuration interface types for different processing units can be the same or different.

图3中的非一致性缓存区域也可以再细分为多个不同性质的区域以满足实际需要。在一示例中，可以将处理器芯片中的嵌入式DRAM(如HBM)作为处理单元的缓存，该缓存的容量远大于传统缓存，可以达到GB的级别。The non-uniform cache region in Figure 3 can be further subdivided into multiple regions of different natures to meet actual needs. In one example, embedded DRAM (such as HBM) in the processor chip can be used as a cache for the processing unit, and the capacity of this cache is much larger than that of traditional caches, reaching the GB level.

相关技术中并不存在对缓存进行上述区域划分的接口，本实施例通过新增缓存一致性的配置接口，可以在处理单元的缓存中划分出一部分区域作为该处理单元的一致性缓存区域，从而可以通过配置来限制该处理单元的一致性缓存区域的大小，而不是使用该处理单元的所有缓存空间来缓存多个处理单元共享内存中的数据。因为多个处理单元的共享内存的数据仅缓存至各个处理单元的一致性缓存区域，通过将一致性缓存区域的大小限制在设定范围之内，可以防止一致性缓存区域中的缓存行数量太多时，维护缓存一致性的操作导致总线负载过大的问题。The relevant technologies do not have an interface for partitioning the cache into the aforementioned regions. This embodiment adds a configuration interface for cache consistency, which allows a portion of the processing unit's cache to be designated as a consistency cache region for that processing unit. This allows for configuration to limit the size of the consistency cache region for that processing unit, instead of using all of the processing unit's cache space to cache data in the shared memory of multiple processing units. Because the data in the shared memory of multiple processing units is only cached in the consistency cache region of each individual processing unit, limiting the size of the consistency cache region within a set range prevents excessive bus load caused by maintaining cache consistency when the number of cache lines in the consistency cache region is too large.

本实施例的处理单元为处理器或处理器核。系统中多个处理单元的类型可以相同也可以不同。在一示例中，多个处理单元均为处理器，包括多个处理单元的系统可以为具有多处理器架构的系统；在另一示例中，多个处理单元均为处理器核，包括多个处理单元的系统可以包括多核架构的处理器；在又一示例中，多个处理单元中的部分为处理器，另一部分为处理器核，则系统即包括多核架构的处理器，又包括单核的处理器。其中的处理器核可以是CPU核但不局限于此。另外需要指出的是，本实施例方法所应用的系统中只要存在多个通过总线连接且设置有缓存的处理单元即可，并不要求系统中的所有处理单元之间均通过总线连接，也不要求所有处理单元均设置有自己的缓存。In this embodiment, the processing unit is a processor or a processor core. The types of multiple processing units in the system can be the same or different. In one example, all multiple processing units are processors, and the system including multiple processing units can be a system with a multi-processor architecture; in another example, all multiple processing units are processor cores, and the system including multiple processing units can include a multi-core processor architecture; in yet another example, some of the multiple processing units are processors, and others are processor cores, then the system includes both multi-core processors and single-core processors. The processor core can be a CPU core, but is not limited to this. It should also be noted that the system to which this method is applied only needs to have multiple processing units connected via a bus and equipped with caches; it is not required that all processing units in the system be connected via a bus, nor that all processing units have their own caches.

本公开一示例性的实施例中，至少一个所述处理单元的缓存的容量大于或等于128M，或者，大于或等于256M。所述缓存空间可以由集成在处理器芯片中的嵌入式高带宽存储器提供。但本公开并不局限于此，多个所述处理单元的缓存的容量均小于128M也是可以的。In one exemplary embodiment of this disclosure, the cache capacity of at least one of the processing units is greater than or equal to 128M, or greater than or equal to 256M. The cache space can be provided by an embedded high-bandwidth memory integrated into the processor chip. However, this disclosure is not limited to this; it is also possible for the cache capacity of multiple processing units to be less than 128M.

本公开一示例性的实施例中，所述一致性缓存区域的配置信息包括所述一致性缓存区域的大小，还可以包括所述一致性缓存区域的位置信息如起始位置等。In an exemplary embodiment of this disclosure, the configuration information of the consistency cache region includes the size of the consistency cache region, and may also include the location information of the consistency cache region, such as the starting position.

本公开一示例性的实施例中，所述处理单元通过所述缓存一致性的配置接口获取一致性缓存区域的配置信息，包括：所述处理单元接收所述系统的中央控制硬件单元下发的缓存区域配置指令，对所述缓存区域配置指令进行解析，得到所述一致性缓存区域的配置信息；或者，所述处理单元根据系统软件提供的编程接口获取所述一致性缓存区域的配置信息；其中，本实施例为多个所述处理单元中不同处理单元配置的一致性缓存区域的大小可以相同或不同。In an exemplary embodiment of this disclosure, the processing unit obtains configuration information of the consistency cache region through the cache consistency configuration interface, including: the processing unit receiving a cache region configuration instruction issued by the central control hardware unit of the system, parsing the cache region configuration instruction, and obtaining the configuration information of the consistency cache region; or, the processing unit obtains the configuration information of the consistency cache region according to the programming interface provided by the system software; wherein, in this embodiment, the size of the consistency cache region configured by different processing units among the multiple processing units may be the same or different.

在本实施例的一示例中，处理单元接收中央控制硬件单元下发的缓存区域配置指令，对所述缓存区域配置指令进行解析，得到所述一致性缓存区域的配置信息。在系统中包括多个处理单元的服务器时，该中央控制硬件单元例如可以是该服务器的基本输入输出系统(Basic Input/Output System，BIOS)，由所述BIOS启动一致性缓存区域的设置。在其中的一种情况下，所述一致性缓存区域的配置信息可以包括一致性缓存区域的大小和位置，处理单元根据一致性缓存区域的配置信息即可确定一致性缓存区域的地址范围。在另一种情况下，一致性缓存区域的配置信息包括一致性缓存区域的大小，处理单元结合本地存储的一致性缓存区域的默认位置或者通过其他方式获得的一致性缓存区域的位置，来确定一致性缓存区域的地址范围。本文中，多个处理单元的服务器是指系统中可以对多个处理单元进行配置管理的实体，如可以包括但不限于服务器的BIOS硬件单元或者服务器固件等。In one example of this embodiment, the processing unit receives a cache region configuration instruction from the central control hardware unit, parses the instruction, and obtains the configuration information of the consistency cache region. When the system includes a server with multiple processing units, the central control hardware unit may be, for example, the server's Basic Input/Output System (BIOS), which initiates the consistency cache region configuration. In one case, the consistency cache region configuration information may include the size and location of the consistency cache region, and the processing unit can determine the address range of the consistency cache region based on this configuration information. In another case, the consistency cache region configuration information includes the size of the consistency cache region, and the processing unit, in conjunction with the default location of the locally stored consistency cache region or the location obtained through other means, determines the address range of the consistency cache region. In this document, a server with multiple processing units refers to a system capable of processing multiple... An entity that performs configuration management for a processing unit, which may include, but is not limited to, the server's BIOS hardware unit or server firmware.

在本实施例的另一示例中，处理单元可以基于系统软件提供的编程接口获取所述一致性缓存区域的配置信息。操作系统负载维护管理计算机设备的所有硬件资源，处理单元可以基于操作系统提供的应用程序编程接口(Application Programming Interface，API)来获取一致性缓存区域的配置信息。例如，CPU可以提供图形界面，根据用户的输入确定一致性缓存区域的大小。例如，针对Cache提供两个可以设定大小的字段：Coherence和Non Coherence，将用户设定的Coherence字段的大小作为用户配置的一致性缓存区域的大小，将用户设定的Non Coherence字段的大小作为用户配置的非一致性缓存区域的大小。该图形界面还可以允许用户配置一致性缓存区域的位置，此时处理单元通过图形界面获取的一致性缓存区域的配置信息包括一致性缓存区域的大小和位置。在图形界面仅允许用户配置一致性缓存区域的大小的情况下，该一致性缓存区域的位置可以使用预先存储的默认位置或者其他方式获取，处理单元根据配置信息中一致性缓存区域的大小，再结合一致性缓存区域的位置可以确定一致性缓存区域在缓存中的地址范围。In another example of this embodiment, the processing unit can obtain the configuration information of the consistency cache region based on the programming interface provided by the system software. The operating system manages all hardware resources of the computer device, and the processing unit can obtain the configuration information of the consistency cache region based on the application programming interface (API) provided by the operating system. For example, the CPU can provide a graphical interface to determine the size of the consistency cache region based on user input. For example, for the cache, two fields with configurable sizes are provided: Coherence and Non-Coherence. The size of the Coherence field set by the user is used as the size of the user-configured consistency cache region, and the size of the Non-Coherence field set by the user is used as the size of the user-configured non-consistent cache region. The graphical interface can also allow the user to configure the location of the consistency cache region. In this case, the configuration information of the consistency cache region obtained by the processing unit through the graphical interface includes the size and location of the consistency cache region. If the graphical interface only allows the user to configure the size of the consistency cache region, the location of the consistency cache region can be obtained using a pre-stored default location or other methods. The processing unit can determine the address range of the consistency cache region in the cache based on the size of the consistency cache region in the configuration information and its location.

在本实施例的一示例中，系统中存在不同类型的处理单元，如GPU、CPU等，不同处理单元的缓存的大小不一定相同，为不同处理单元配置的一致性缓存区域的大小也可以相同或不同。In one example of this embodiment, there are different types of processing units in the system, such as GPUs and CPUs. The cache sizes of different processing units are not necessarily the same, and the sizes of the consistency cache regions configured for different processing units can also be the same or different.

本公开一示例性的实施例中，所述系统还包括通过总线与多个所述处理单元连接的外接设备，所述外接设备设置有内存介质；所述方法还包括：In an exemplary embodiment of this disclosure, the system further includes an external device connected to the plurality of processing units via a bus, the external device being provided with a memory medium; the method further includes:

所述外接设备接收共享内存配置指令并解析，得到所述共享内存的配置信息，并根据所述共享内存的配置信息确定所述共享内存的地址信息；The external device receives and parses the shared memory configuration command to obtain the configuration information of the shared memory, and determines the address information of the shared memory based on the configuration information of the shared memory;

所述外接设备根据所述共享内存的地址范围，将本设备中的部分内存区域配置为多个所述处理单元的共享内存。本文中，该共享内存也可以称为共享内存区域，即多个所述处理单元可以访问的内存区域。The external device configures a portion of its memory area as shared memory for multiple processing units based on the address range of the shared memory. In this document, this shared memory can also be referred to as a shared memory region, that is, a memory region accessible to the multiple processing units.

上述外接设备是指设备位于处理器芯片之外，而非指系统外的设备。虽然本实施例是将外接设备中内存介质的部分内存区域作为共享内存，但本公开不局限于此。在其他实施例中，也可以将嵌入处理器芯片中的内存介质的部分内存区域配置为共享内存。在此情况下，可以由处理单元根据共享内存配置指令完成共享内存的配置，即所述处理单元接收共享内存配置指令并解析，得到所述共享内存的配置信息，并根据所述共享内存的配置信息确定所述共享内存的地址信息；及，所述处理单元根据所述共享内存的地址范围，将本设备中的部分内存区域配置为多个所述处理单元的共享内存。The aforementioned external device refers to a device located outside the processor chip, not a device outside the system. Although this embodiment uses a portion of the memory medium in the external device as shared memory, this disclosure is not limited to this. In other embodiments, a portion of the memory medium embedded in the processor chip can also be configured as shared memory. In this case, the processing unit can complete the configuration of the shared memory according to the shared memory configuration instruction; that is, the processing unit receives and parses the shared memory configuration instruction to obtain the configuration information of the shared memory, and determines the address information of the shared memory according to the configuration information of the shared memory; and the processing unit configures a portion of the memory region in this device as shared memory for multiple processing units according to the address range of the shared memory.

在一示例中，共享内存配置指令携带的共享内存的配置信息包括共享内存的大小和位置(如起始位置)，外接设备根据对共享内存配置指令解析得到的配置信息即可确定共享内存的地址信息。在另一示例中，共享内存指令携带的共享内存的配置信息包括共享内存的大小，外接设备需要对共享内存配置指令进行解析得到共享内存的大小，再结合为共享内存预设的默认位置或通过其他方式得到的共享内存的位置，以得到共享内存的配置信息，进而确定共享内存的地址信息。In one example, the shared memory configuration instruction carries configuration information including the size and location (e.g., starting location) of the shared memory. The external device can determine the address information of the shared memory based on the configuration information obtained by parsing the shared memory configuration instruction. In another example, the shared memory instruction carries configuration information including the size of the shared memory. The external device needs to parse the shared memory configuration instruction to obtain the size of the shared memory, and then combine it with the default location preset for the shared memory or the location of the shared memory obtained through other means to obtain the configuration information of the shared memory, and then determine the address information of the shared memory.

本公开一示例性的实施例中，所述外接设备包括通过CXL接口与多个所述处理单元通信的CXL模组，所述CXL模组包括控制器(如控制芯片或控制芯片组)和与所述控制器连接的一组内存芯片；所述共享内存配置指令由所述系统的互联网络管理器通过CXL接口，将所述共享内存配置指令下发给所述CXL模组。互联网络管理器(Fabric Manager，FM)在本文中指可用于配置、管理CXL设备的实体，本实施例为CXL接口新增了共享内存配置功能，可以通过共享内存配置指令对CXL模组进行共享内存的配置；互联网络管理器可以集成在系统中的处理单元中，也可以独立于系统中的任何处理单元，与系统中的多个处理单元设置在不同的位置，如可以使用系统外的处理单元实现。In an exemplary embodiment of this disclosure, the external device includes a CXL module that communicates with multiple processing units via a CXL interface. The CXL module includes a controller (such as a control chip or control chipset) and a set of memory chips connected to the controller. The shared memory configuration command is issued by the system's Fabric Manager to the CXL module via the CXL interface. The Fabric Manager (FM) in this document refers to an entity that can be used to configure and manage CXL devices. This embodiment adds a shared memory configuration command to the CXL interface. The memory configuration function allows for the configuration of shared memory for CXL modules using shared memory configuration commands; the Internet Manager can be integrated into the system's processing unit or can be independent of any processing unit in the system, and can be located in a different position from multiple processing units in the system, such as using an external processing unit.

如图1和图4所示的示例中，CXL模组中包括3个DRAM颗粒，在进行共享内存的配置时，CXL模组中的控制芯片可以将其中的一个DRAM颗粒配置为多个处理单元的共享内存，其他内存区域即另外两个DRAM颗粒可以称为非共享内存或非共享内存区域。容易理解，图示的划分方式仅仅是示意性的，共享内存也可以是一个DRAM颗粒中的部分区域，也可以由多个DRAM颗粒中的区域组合而成。在由多个DRAM颗粒构成的模组中，共享内存和非共享内存对外以地址范围(逻辑概念)展现。In the examples shown in Figures 1 and 4, the CXL module includes three DRAM chips. When configuring shared memory, the control chip in the CXL module can configure one DRAM chip as shared memory for multiple processing units. The other memory areas, i.e., the other two DRAM chips, can be referred to as non-shared memory or non-shared memory regions. It's easy to understand that the diagrammatic division is merely illustrative; shared memory can be a portion of a single DRAM chip or a combination of regions from multiple DRAM chips. In a module composed of multiple DRAM chips, shared memory and non-shared memory are presented externally as address ranges (logical concepts).

虽然本实施例是以CXL模组作为外接设备的示例，但在其他实施例中，所述外接设备也可以是通过内存接口连接到处理器芯片的内存芯片如DDR芯片等。Although this embodiment uses the CXL module as an example of an external device, in other embodiments, the external device may also be a memory chip, such as a DDR chip, that is connected to the processor chip via a memory interface.

本公开一示例性的实施例中，所述互联网络管理器还将所述共享内存的配置信息分别发送给多个所述处理单元；本公开另一示例性的实施例中，所述互联网络管理器还将所述共享内存的配置信息发送给多个所述处理单元的服务器，由所述服务器将所述共享内存的配置信息分别发送给多个所述处理单元，以完成整个配置过程。在这种情况下，所述服务器可以结合所述共享内存的配置信息确定多个所述处理单元中的一致性缓存区域的配置信息，将多个所述一致性缓存区域的配置信息分别携带在多个缓存区域配置指令中，通过系统的中央控制硬件单元分别下发给多个所述处理单元。其中，所述共享内存的配置信息包括共享内存的大小，所述一致性缓存区域的配置信息包括一致性缓存区域的大小。服务器结合共享内存的配置信息来确定多个所述处理单元中的一致性缓存区域的大小，例如可以使配置的一致性缓存区域的大小与共享内存的大小正相关，从而协调共享内存和一致性缓存区域的大小，使得两者相适配而取得更好的效率(如果共享内存和一致性缓存区域的大小相差悬殊，如共享内存大而一致性缓存区域过小，会使得在缓存中搜索共享内存的数据时速度太慢，而如果共享内存小而一致性缓存区域过大，会造成资源的浪费)。完成配置后，所述服务器可以通过中央控制硬件单元，向多个所述处理单元分别下发缓存区域配置指令。所述中央控制硬件单元可以是所述服务器的一部分，或者所述中央控制硬件单元独立于所述服务器且可以与所述服务器通信。In one exemplary embodiment of this disclosure, the Internet Manager further sends the shared memory configuration information to multiple processing units respectively. In another exemplary embodiment of this disclosure, the Internet Manager further sends the shared memory configuration information to a server of the multiple processing units, and the server then sends the shared memory configuration information to the multiple processing units respectively to complete the entire configuration process. In this case, the server can determine the configuration information of the consistency cache region in the multiple processing units based on the shared memory configuration information, and carry the configuration information of the multiple consistency cache regions in multiple cache region configuration instructions, which are then issued to the multiple processing units respectively through the system's central control hardware unit. The shared memory configuration information includes the size of the shared memory, and the consistency cache region configuration information includes the size of the consistency cache region. The server determines the size of the consistency cache region in each of the multiple processing units by combining the shared memory configuration information. For example, the configured consistency cache region size can be positively correlated with the shared memory size, thereby coordinating the sizes of shared memory and the consistency cache region to achieve better efficiency. (If the sizes of shared memory and the consistency cache region differ significantly, such as a large shared memory and a small consistency cache region, searching for data in shared memory will be too slow; conversely, if the shared memory is small and the consistency cache region is too large, it will waste resources.) After configuration, the server can issue cache region configuration instructions to each of the multiple processing units through a central control hardware unit. The central control hardware unit can be part of the server, or it can be independent of the server but capable of communicating with it.

本公开一示例性的实施例中，所述方法还包括：所述处理单元根据所述共享内存的地址范围和所述一致性缓存区域的地址范围，建立所述共享内存和所述一致性缓存区域之间的第一地址映射关系。本实施例中，一个一致性缓存区域可以与一个或多个共享内存建立地址映射关系，而共享内存和一致性缓存区域之间的第一地址映射关系可以使用直接相连映射、组组相连映射或全相连等方法。处理单元根据一致性缓存区域的地址范围配置好一致性缓存区域之后，结合之前获取的共享内存的地址范围，即可采用三种映射方法中的任一种建立起一致性缓存区域和共享内存之间的第一地址映射关系。In an exemplary embodiment of this disclosure, the method further includes: the processing unit establishing a first address mapping relationship between the shared memory and the consistency cache region based on the address range of the shared memory and the address range of the consistency cache region. In this embodiment, a consistency cache region can establish an address mapping relationship with one or more shared memories, and the first address mapping relationship between the shared memory and the consistency cache region can use methods such as direct associative mapping, group-associative mapping, or full associative mapping. After configuring the consistency cache region according to its address range, the processing unit, combined with the previously obtained address range of the shared memory, can establish the first address mapping relationship between the consistency cache region and the shared memory using any of the three mapping methods.

如图5所示，示出了多处理器架构下的一个处理器芯片中缓存的一致性缓存区域与多个CXL模组中内存介质的共享内存之间的地址映射关系。Figure 5 illustrates the address mapping relationship between the coherent cache region in a processor chip and the shared memory of the memory media in multiple CXL modules under a multiprocessor architecture.

上述第一地址映射关系反映了共享内存中的数据在共享内存中的存储位置和在一致性缓存区域中的缓存位置之间的关系，处理单元据此可以生成访问所述数据的物理地址或虚拟地址，以在一致性缓存区域中查找共享内存中的数据。处理器中的内存管理单元(Memory Management Unit，MMU)可以参与地址管理如生成访问共享内存中数据的地址，生成的地址可以作为Cache的输入地址。MMU与缓存(Cache)的相对位置不同时，该输入地址可以是虚拟地址(Cache位于CPU和MMU之间)或者物理地址(MMU位于Cache和CPU之间)。 The aforementioned first address mapping reflects the relationship between the storage location of data in shared memory and its cache location in the coherent cache region. Based on this, the processing unit can generate a physical or virtual address to access the data in the shared memory, thereby locating the data in the coherent cache region. The processor's Memory Management Unit (MMU) can participate in address management, such as generating addresses for accessing data in shared memory. These generated addresses can serve as input addresses for the cache. When the relative positions of the MMU and the cache differ, the input address can be a virtual address (with the cache located between the CPU and the MMU) or a physical address (with the MMU located between the cache and the CPU).

本实施例的一个示例中，所述处理单元还可以根据所述共享内存的地址范围和所述一致性缓存区域的地址范围，建立本处理单元除所述一致性缓存区域外的其他缓存区域与所述外接设备中除所述共享内存外的其他内存区域之间的第二地址映射关系。参见图3和图4，该第二地址映射关系是建立起缓存中的非一致性缓存区域和CXL模组中的非共享内存之间的地址映射关系，该地址映射关系可以采用直接相连映射、组组相连映射或全相连的方式建立。本实施例的另一示例中，处理单元只建立本处理单元的部分非一致性缓存区域与外接设备的部分非共享内存之间的地址映射关系，也即外接设备中的部分内存区域不需要与缓存之间建立映射。In one example of this embodiment, the processing unit can also establish a second address mapping relationship between other cache regions of the processing unit (excluding the consistent cache region) and other memory regions of the external device (excluding the shared memory) based on the address range of the shared memory and the address range of the consistent cache region. Referring to Figures 3 and 4, this second address mapping relationship establishes an address mapping relationship between the non-consistent cache regions in the cache and the non-shared memory in the CXL module. This address mapping relationship can be established using direct associative mapping, group-associative mapping, or full associative mapping. In another example of this embodiment, the processing unit only establishes an address mapping relationship between a portion of the non-consistent cache regions of the processing unit and a portion of the non-shared memory of the external device; that is, a portion of the memory regions in the external device do not need to be mapped to the cache.

本公开一示例性的实施例中，所述一致性缓存区域划分为多个缓存组，每一缓存组包括多个缓存行；所述处理单元访问所述共享内存中的数据时使用的地址包括以下字段：Tagmeta、Tagdata和Offset；或者包括以下字段：Tagmeta、Indexmeta、Tagdata和Offset；In an exemplary embodiment of this disclosure, the consistency cache region is divided into multiple cache groups, each cache group including multiple cache lines; the address used by the processing unit when accessing data in the shared memory includes the following fields: Tagmeta, Tagdata, and Offset; or includes the following fields: Tagmeta, Indexmeta, Tagdata, and Offset;

其中：in:

Tagmeta为用于标识所述一致性缓存区域的字段；Tagmeta is a field used to identify the consistency cache region;

Indexmeta为用于标识所述一致性缓存区域中的缓存组的字段；Indexmeta is a field used to identify the cache group in the consistent cache region;

Tagdata为用于标识所述一致性缓存区域中缓存组的缓存行的字段；Tagdata is a field used to identify cache rows in the cache group of the consistent cache region;

Offset为用于标识数据偏移的字段。Offset is a field used to identify data offset.

在通过配置接口将缓存划分为一致性缓存区域和非一致性缓存区域的基础上，本实施例还对访问数据的地址进行改进，增加了用于区分一致性缓存区域和非一致性缓存区域的字段Tagmeta，例如可以使用1bit数据的值来标识，在Tagmeta为0时表示要访问的数据可能缓存在非一致性缓存区域，Tagmeta为1时表示要访问的数据可能缓存在一致性缓存区域。从而不需要在整个缓存范围中进行搜索定位，可以提高定位的效率。此外，还可以通过设置字段Indexmeta将搜索范围定位到一致性缓存区域中的缓存组，从而进一步提高定位的效率。缓存行定位效率的提高，可以减少访问共享内存中数据的时延，访问时延的减少有利于避免总线的拥塞。Building upon the existing configuration interface that divides the cache into consistent and inconsistent cache regions, this embodiment further improves the data access address by adding a Tagmeta field to distinguish between consistent and inconsistent cache regions. For example, a 1-bit value can be used: Tagmeta = 0 indicates that the data to be accessed may be cached in the inconsistent cache region, while Tagmeta = 1 indicates that the data to be accessed may be cached in the consistent cache region. This eliminates the need for searching across the entire cache range, improving location efficiency. Furthermore, the Indexmeta field can be set to target the search range to cache groups within the consistent cache region, further enhancing location efficiency. Improved cache line location efficiency reduces latency when accessing data in shared memory, which helps avoid bus congestion.

虽然本实施例地址包括的字段并不局限于包括Tagmeta、Tagdata和Offset，或者包括Tagmeta、Indexmeta、Tagdata和Offset，还可以包括其他的字段。如在地址包括字段Tagmeta、Indexmeta、Tagdata和Offset的情况下，还包括一个用于在Indexmeta标识的缓存组中定位缓存子组的字段Indexdata，Although the fields included in the address in this embodiment are not limited to Tagmeta, Tagdata, and Offset, or even include Tagmeta, Indexmeta, Tagdata, and Offset, other fields may also be included. For example, if the address includes the fields Tagmeta, Indexmeta, Tagdata, and Offset, it may also include a field Indexdata for locating a cache subgroup within the cache group identified by Indexmeta.

本公开一示例性的实施例中，所述系统中配置有多种共享内存，不同共享内存所属的内存介质的特性不同；所述一致性缓存区域划分的多种缓存组与所述多种共享内存一一对应；其中，所述内存介质的特性包括时延和带宽中的任意一种或多种。根据共享内存所属的内存介质的特性来划分一致性缓存区域中的缓存组，可以统一划分的标准，便于不同厂商设备之间的互连互通。本实施例的Indexmeta可以定义若干个CXL DRAM之间的共享，即将系统中的CXL DRAM分组，从而依据外部内存介质的时延、带宽等特性，划分不同的缓存组。例如，可以将一些缓存组映射到较慢的内存介质(例如CXL based shared SSD)，另一些缓存组映射到快速的内存介质(如CXL based DRAM)等。同一缓存组中的CXL模组的内存介质有相同的访问时延，对应的Indexmeta相同，便于处理单元采用相同的Cacheline访问策略进行访问，In an exemplary embodiment of this disclosure, the system is configured with multiple shared memories, each belonging to a memory medium with different characteristics. The coherence cache region is divided into multiple cache groups, each corresponding one-to-one with one of the shared memories. The characteristics of the memory medium include any one or more of latency and bandwidth. Dividing the cache groups in the coherence cache region according to the characteristics of the shared memory's memory medium allows for a unified standard, facilitating interconnection between devices from different manufacturers. In this embodiment, the Indexmeta can define sharing among several CXL DRAMs, grouping the CXL DRAMs in the system and dividing them into different cache groups based on the latency, bandwidth, and other characteristics of the external memory medium. For example, some cache groups can be mapped to slower memory media (e.g., CXL-based shared SSDs), while others can be mapped to faster memory media (e.g., CXL-based DRAMs). CXL modules within the same cache group have the same access latency and corresponding Indexmeta, facilitating the use of the same cacheline access strategy by the processing units.

本公开一示例性的实施例中，所述方法还包括：所述处理单元在缓存中查找所述共享缓存区域中的数据时，根据所述地址中的Tagmeta、Indexmeta确定所述一致性缓存区域中的缓存组，再根据所述地址中的Tagdata在确定的缓存组中查找缓存有该数据的缓存行。In an exemplary embodiment of this disclosure, the method further includes: when the processing unit searches for data in the shared cache region in the cache, determining the consistent cache region based on the Tagmeta and Indexmeta in the address. The cache group is then used to find the cache line containing the data in the specified cache group based on the Tagdata in the address.

在实施例的一示例中，Tagmeta具有多个取值，其中一个取值表示数据的缓存区域为所述一致性缓存区域；Indexmeta表示所述一致性缓存区域中的缓存组的索引；In one example of the embodiment, Tagmeta has multiple values, one of which indicates that the data cache area is the consistent cache area; Indexmeta represents the index of the cache group in the consistent cache area;

所述处理单元先根据所述地址中的Tagmeta、Indexmeta确定所述一致性缓存区域中的缓存组，再根据所述地址中的Tagdata在确定的缓存组中查找缓存有该数据的缓存行，包括：The processing unit first determines the cache group in the consistent cache region based on the Tagmeta and Indexmeta in the address, and then searches for the cache line containing the data in the determined cache group based on the Tagdata in the address, including:

根据所述地址中的Tagmeta的值查找所述一致性缓存区域的元数据表，所述元数据表中的每条记录包括以下字段：Tagmeta和缓存组偏移；其中，Tagmeta的取值表示数据的缓存区域为所述一致性缓存区域，缓存组偏移的值表示记录中的缓存组在所述一致性缓存区域中的位置偏移；The metadata table of the consistent cache region is searched based on the value of Tagmeta in the address. Each record in the metadata table includes the following fields: Tagmeta and cache group offset; wherein, the value of Tagmeta indicates that the cache region of the data is the consistent cache region, and the value of cache group offset indicates the position offset of the cache group in the record in the consistent cache region.

根据所述地址中的Indexmeta的值查找到所述元数据表中的相应记录，再根据相应记录中的缓存组偏移确定在所述一致性缓存区域中的查找范围；及，根据所述地址中的Tagdata的值在所述查找范围中查找相应的缓存行，判断是否命中缓存。The corresponding record in the metadata table is found based on the value of Indexmeta in the address, and the search range in the consistent cache area is determined based on the cache group offset in the corresponding record; and the corresponding cache line is searched in the search range based on the value of Tagdata in the address to determine whether a cache hit occurs.

本实施例的一示例中，一致性缓存区域中的缓存组的大小可以是固定设置的。在另一示例中，一致性缓存区域中的缓存组的大小是可以变化的，如根据各自对应的共享内存的大小结合预设算法计算得到。缓存组的大小可以用缓存组包括的缓存行的数据表示但不局限于此。在缓存组的大小可以变化的情况下，可以在元数据表中增加“缓存组大小”这一字段来记录，也可以根据其他元数据(如对应的共享内存大小)计算得到，或者根据下一记录中的缓存组偏移确定。在一致性缓存区域的元数据表中，还可以设置其他字段，如可以设置替换标志以表示缓存组中缓存行的替换机制。In one example of this embodiment, the size of the cache group in the consistent cache region can be fixed. In another example, the size of the cache group in the consistent cache region can vary, such as being calculated based on the size of its corresponding shared memory using a preset algorithm. The size of the cache group can be represented by the data of the cache lines included in the cache group, but is not limited to this. When the size of the cache group can vary, a "cache group size" field can be added to the metadata table to record it, or it can be calculated based on other metadata (such as the corresponding shared memory size), or determined based on the cache group offset in the next record. Other fields can also be set in the metadata table of the consistent cache region, such as a replacement flag to indicate the replacement mechanism of cache lines in the cache group.

上述一致性缓存区域的元数据表可以在设置缓存一致性区域的过程中建立，记录中的缓存组偏移等字段的值可以在配置过程中生成。The metadata table for the aforementioned consistent cache region can be created during the setup of the cache consistency region, and the values of fields such as cache group offset in the records can be generated during the configuration process.

如图6所示的示例中，处理器(CPU或GPU)生成的地址(以物理地址为例)依次包括4个字段：区域标志(Tagmeta)、缓存组标志(Indexmeta)、标签(Tagdata)和偏移量(Offset)。而一致性缓存区域的元数据表中，每条记录对应于一致性缓存区域的一个缓存组，包括以下字段：区域标志(Tagmeta)、替换标志(Ctrl flag)和缓存组偏移(Cacheline Area Offset)等字段。当访问的数据是共享内存中的数据时，访问缓存的输入地址中Tagmeta的值与所述元数据表中的Tagmeta的值相同，而Indexmeta可以视为一致性缓存区域的元数据表中记录的索引，因此根据地址中的Tagmeta和Indexmeta的值，可以定位到元数据表中的一条记录。查找该条记录的字段Ctrl flag和Cacheline Area Offset可以确定缓存组的替换机制，并找到对应缓存组在一致性缓存区域的位置偏移。一致性缓存区域的每一缓存行包括字段Tagdata和缓存的数据，在该缓存组的范围内查找是否有缓存行的Tagdata的值等于地址中的Tagdata的值，如果找到，则表示缓存命中，可以根据地址中的Offset的值提取所需要的数据；如果没有找到，表示缓存未命中，可以根据该地址访问共享内存以读取相应数据。As shown in the example in Figure 6, the address generated by the processor (CPU or GPU) (taking the physical address as an example) includes four fields: region flag (Tagmeta), cache group flag (Indexmeta), tag (Tagdata), and offset (Offset). In the metadata table of the consistent cache region, each record corresponds to a cache group within the consistent cache region and includes the following fields: region flag (Tagmeta), replacement flag (Ctrl flag), and cache group offset (Cacheline Area Offset). When accessing data in shared memory, the value of Tagmeta in the input address of the cache access is the same as the value of Tagmeta in the metadata table. Since Indexmeta can be considered an index of the record in the metadata table of the consistent cache region, the values of Tagmeta and Indexmeta in the address can locate a record in the metadata table. Searching for the Ctrl flag and Cacheline Area Offset fields of this record determines the cache group replacement mechanism and finds the corresponding cache group's offset within the consistent cache region. Each cache line in the consistent cache region includes a Tagdata field and the cached data. Within the scope of this cache group, it is searched to see if the value of Tagdata in the cache line is equal to the value of Tagdata at the address. If found, it means a cache hit, and the required data can be retrieved based on the Offset value at the address. If not found, it means a cache miss, and the corresponding data can be read by accessing the shared memory based on the address.

上述的元数据表等与缓存一致性维护相关的信息，可以利用系统中一个指定的处理器如主处理器进行信息维护。该主处理器可以对系统中所有处理单元的一致性缓存区域，或对系统中所有处理单元的一致性缓存区域和系统的共享内存进行管理、控制，包括对缓存行的控制权(ownership)进行控制。The aforementioned metadata tables and other information related to cache consistency maintenance can be maintained using a designated processor in the system, such as the main processor. This main processor can manage and control the consistency cache regions of all processing units in the system, or the consistency cache regions of all processing units in the system and the system's shared memory, including controlling the ownership of cache lines.

本实施例可以为了提高CacheLine的定位效率，在Coherence Area增加Meta区域，用于描述相应Cacheline的位置信息，依据处理器生成的物理地址，定义元数据的相关字段，如TagMeta，Indexmeta；通过引入Meta相关的字段，可将一致性缓存区域的Cacheline 进一步分组(如按照外部内存特性分组)，降低组内Cacheline数目，当查找特定Cacheline时，减少组内Cacheline比对次数。现有方案是依据tag值，和特定组内(用index标识)所有的Cacheline比较，在Cache空间较小时适用；Cache空间较大时，效率低。本实施例可以提高Cacheline定位的效率。This embodiment can improve the location efficiency of CacheLines by adding a Meta region to the Coherence Area to describe the location information of the corresponding CacheLine. Based on the physical address generated by the processor, relevant metadata fields, such as TagMeta and IndexMeta, are defined. By introducing Meta-related fields, the CacheLines in the Coherence Area can be identified. Further grouping (e.g., grouping by external memory characteristics) reduces the number of cachelines within a group, thus reducing the number of cacheline comparisons within the group when searching for a specific cacheline. Existing solutions compare cachelines within a specific group (identified by an index) based on tag values, which is suitable when the cache space is small; however, it becomes inefficient when the cache space is large. This embodiment can improve the efficiency of cacheline location.

CPU内部的缓存(Cache)的大小通常取决于CPU制造商的实现，针对缓存，CPU仅是提供静态信息描述，没有提供可更改和设置的接口。针对这种情况，本公开上述实施例提出用于设置一致性缓存区域的配置接口，还提出了用于设置共享内存的配置接口，可以对CPU内部的Cache和共享内存进行配置，设定一致性缓存区域的大小和共享内存的大小，解决处理器芯片内部缓存空间过大时，因多处理器或多处理器核之间因一致性维护导致的总线负载较高的问题，也可以提高缓存行的定位效率。The size of the CPU's internal cache typically depends on the CPU manufacturer's implementation. The CPU only provides static information about the cache, without offering an interface for modification or configuration. To address this, the embodiments described above propose configuration interfaces for setting coherent cache regions and shared memory. These interfaces allow configuration of the CPU's internal cache and shared memory, setting the size of the coherent cache region and the shared memory. This solves the problem of high bus load caused by coherence maintenance between multiple processors or cores when the processor chip's internal cache space is too large, and also improves cache line location efficiency.

本公开一实施例还提供了一种处理系统，参见图1和图5，该处理系统包括总线和多个处理单元，多个所述处理单元通过所述总线连接，所述处理单元均设置有缓存且具有缓存一致性的配置接口，所述处理单元被配置为：An embodiment of this disclosure also provides a processing system, referring to Figures 1 and 5. The processing system includes a bus and multiple processing units connected via the bus. Each processing unit is equipped with a cache and has a cache consistency configuration interface. The processing unit is configured to:

本公开一示例性的实施例中，所述处理单元为处理器或处理器核；至少一个所述处理单元的缓存的容量大于或等于128M，或者，大于或等于256M；所述一致性缓存区域的配置信息包括所述一致性缓存区域的大小。In an exemplary embodiment of this disclosure, the processing unit is a processor or a processor core; the cache capacity of at least one of the processing units is greater than or equal to 128M, or greater than or equal to 256M; the configuration information of the consistency cache region includes the size of the consistency cache region.

本公开一示例性的实施例中，In one exemplary embodiment of this disclosure,

所述处理系统还包括：中央控制硬件单元，被配置为向所述处理单元下发缓存区域配置指令，携带所述一致性缓存区域的配置信息；The processing system further includes: a central control hardware unit, configured to issue a cache region configuration instruction to the processing unit, carrying the configuration information of the consistent cache region;

所述处理单元通过所述缓存一致性的配置接口获取一致性缓存区域的配置信息，包括：接收所述中央控制硬件单元下发的缓存区域配置指令，对所述缓存区域配置指令进行解析，得到所述一致性缓存区域的配置信息；The processing unit obtains the configuration information of the consistent cache region through the cache consistency configuration interface, including: receiving the cache region configuration instruction issued by the central control hardware unit, parsing the cache region configuration instruction, and obtaining the configuration information of the consistent cache region;

其中，为不同处理单元配置的一致性缓存区域的大小相同或不同。The size of the consistency cache area configured for different processing units may be the same or different.

本公开一示例性的实施例中，所述处理单元通过所述缓存一致性的配置接口获取一致性缓存区域的配置信息，包括：所述处理单元基于系统软件提供的编程接口获取所述一致性缓存区域的配置信息。In an exemplary embodiment of this disclosure, the processing unit obtains configuration information of the consistency cache region through the cache consistency configuration interface, including: the processing unit obtains the configuration information of the consistency cache region based on the programming interface provided by the system software.

所述系统还包括通过总线与多个所述处理单元连接的外接设备，所述外接设备设置有内存介质；The system also includes an external device connected to the plurality of processing units via a bus, the external device being provided with a memory medium;

所述外接设备被配置为：The external device is configured as follows:

接收共享内存配置指令并解析，得到所述共享内存的配置信息，并根据所述共享内存的配置信息确定所述共享内存的地址信息；Receive and parse the shared memory configuration instruction to obtain the configuration information of the shared memory, and determine the address information of the shared memory based on the configuration information of the shared memory;

根据所述共享内存的地址范围，将本设备中的部分内存区域配置为多个所述处理单元的共享内存。Based on the address range of the shared memory, a portion of the memory region in this device is configured as shared memory for multiple processing units.

本公开一示例性的实施例中，所述外接设备包括通过CXL接口与多个所述处理单元通信的CXL模组，所述CXL模组包括控制器和与所述控制器连接的一组内存芯片；所述处理系统还包括：互联网络管理器，被配置为通过CXL接口，将所述共享内存配置指令下发给所述CXL模组。In an exemplary embodiment of this disclosure, the external device includes a CXL module that communicates with a plurality of the processing units via a CXL interface. The CXL module includes a controller and a set of memory chips connected to the controller. The processing system also includes an Internet Manager configured to send the shared memory configuration instructions to the CXL module via the CXL interface.

本公开一示例性的实施例中，所述互联网络管理器还被配置为将所述共享内存的配置信息分别发送给多个所述处理单元；其中，所述共享内存的配置信息包括共享内存的大小，所述一致性缓存区域的配置信息包括一致性缓存区域的大小。In an exemplary embodiment of this disclosure, the Internet Manager is further configured to send the configuration information of the shared memory to a plurality of the processing units respectively; wherein the configuration information of the shared memory includes the size of the shared memory, and the configuration information of the consistency cache region includes the size of the consistency cache region.

所述处理系统还包括多个所述处理单元的服务器和中央控制硬件单元；The processing system also includes servers and a central control hardware unit for multiple processing units;

所述互联网络管理器还被配置为将所述共享内存的配置信息发送给所述服务器；The network manager is also configured to send the configuration information of the shared memory to the server;

所述服务器被配置为将所述共享内存的配置信息分别发送给多个所述处理单元；及结合所述共享内存的配置信息确定多个所述处理单元中的一致性缓存区域的配置信息，将多个所述一致性缓存区域的配置信息分别携带在多个缓存区域配置指令中，通过所述中央控制硬件单元分别下发给多个所述处理单元；The server is configured to send the configuration information of the shared memory to multiple processing units respectively; and to determine the configuration information of the consistency cache region in the multiple processing units based on the configuration information of the shared memory, and to carry the configuration information of the multiple consistency cache regions in multiple cache region configuration instructions respectively, and to send them to the multiple processing units respectively through the central control hardware unit.

其中，所述共享内存的配置信息包括共享内存的大小，所述一致性缓存区域的配置信息包括一致性缓存区域的大小。The configuration information of the shared memory includes the size of the shared memory, and the configuration information of the consistency cache region includes the size of the consistency cache region.

上述互联网络管理器、服务器和中央控制硬件单元在实体上可以集成在一起，也可以单独设置。The aforementioned network manager, server, and central control hardware unit can be physically integrated together or set up separately.

本公开一示例性的实施例中，所述处理单元还被配置为：根据所述共享内存的地址范围和所述一致性缓存区域的地址范围，建立所述共享内存和所述一致性缓存区域之间的第一地址映射关系。In an exemplary embodiment of this disclosure, the processing unit is further configured to: establish a first address mapping relationship between the shared memory and the consistency cache region based on the address range of the shared memory and the address range of the consistency cache region.

本公开一示例性的实施例中，所述一致性缓存区域划分为多个缓存组，每一缓存组包括多个缓存行；In one exemplary embodiment of this disclosure, the consistency cache region is divided into multiple cache groups, and each cache group includes multiple cache lines;

所述处理单元访问所述共享内存中的数据时使用的地址包括以下字段：Tagmeta、Tagdata和Offset；或者包括以下字段：Tagmeta、Indexmeta、Tagdata和Offset；其中：The address used by the processing unit when accessing data in the shared memory includes the following fields: Tagmeta, Tagdata, and Offset; or includes the following fields: Tagmeta, Indexmeta, Tagdata, and Offset; wherein:

本公开一示例性的实施例中，所述系统中配置有多种共享内存，不同共享内存所属的内存介质的特性不同；所述一致性缓存区域划分的多种缓存组与所述多种共享内存一一对应；其中，所述内存介质的特性包括时延和带宽中的任意一种或多种。In an exemplary embodiment of this disclosure, the system is configured with multiple shared memories, and the memory media to which the different shared memories belong have different characteristics; the multiple cache groups divided by the consistency cache region correspond one-to-one with the multiple shared memories; wherein, the characteristics of the memory media include any one or more of latency and bandwidth.

本公开一示例性的实施例中，所述方法还包括：所述处理单元在缓存中查找所述共享缓存区域中的数据时，先根据所述地址中的Tagmeta、Indexmeta确定所述一致性缓存区域中的缓存组，再根据所述地址中的Tagdata在确定的缓存组中查找缓存有该数据的缓存行。In an exemplary embodiment of this disclosure, the method further includes: when the processing unit searches for data in the shared cache region in the cache, it first determines the cache group in the consistent cache region based on Tagmeta and Indexmeta in the address, and then searches for the cache line containing the data in the determined cache group based on Tagdata in the address.

Tagmeta具有多个取值，其中一个取值表示数据的缓存区域为所述一致性缓存区域；Indexmeta表示所述一致性缓存区域中的缓存组的索引；Tagmeta has multiple values, one of which indicates that the data cache area is the consistent cache area; Indexmeta represents the index of the cache group in the consistent cache area;

所述处理单元根据所述地址中的Tagmeta、Indexmeta确定所述一致性缓存区域中的缓存组，再根据所述地址中的Tagdata在确定的缓存组中查找缓存有该数据的缓存行，包括：The processing unit determines the cache group in the consistent cache region based on the Tagmeta and Indexmeta in the address, and then searches for the cache line containing the data in the determined cache group based on the Tagdata in the address. include:

本公开一实施例还提供了一种处理器芯片，包括处理器和缓存，参见图1，所述处理器被配置为执行本公开任一实施例所述的由处理器执行的降低总线负载的方法。An embodiment of this disclosure also provides a processor chip, including a processor and a cache, as shown in FIG1, wherein the processor is configured to perform a method for reducing bus load as described in any embodiment of this disclosure.

本公开一实施例还提供了一种非瞬态计算机存储介质，保存有计算机程序，所述计算机程序被处理器执行时，可实现本公开任一实施例所述的降低总线负载的方法或共享内存配置方法。An embodiment of this disclosure also provides a non-transient computer storage medium storing a computer program, which, when executed by a processor, can implement the bus load reduction method or shared memory configuration method described in any embodiment of this disclosure.

本公开一实施例还提供了一种CXL模组，可参见图1，所述CXL模组包括CXL控制器(图中的示例为控制芯片)及与所述CXL控制器连接的内存介质(图中的示例为DRAM颗粒)，所述CXL控制器被配置为执行以下处理：An embodiment of this disclosure also provides a CXL module, as shown in FIG1. The CXL module includes a CXL controller (in the example of a control chip in the figure) and a memory medium (in the example of a DRAM chip in the figure) connected to the CXL controller. The CXL controller is configured to perform the following processes:

在本公开一示例性的实施例中，所述控制器接收共享内存配置指令，包括：所述控制器接收由系统中的互联网络管理器通过共享内存的配置接口下发的共享内存配置指令。互联网络管理器指包括CXL设备的系统中可用于配置、管理CXL设备的实体。In an exemplary embodiment of this disclosure, the controller receiving a shared memory configuration instruction includes: the controller receiving a shared memory configuration instruction issued by the Internet Manager in the system through a shared memory configuration interface. The Internet Manager refers to an entity in the system including the CXL device that can be used to configure and manage the CXL device.

本公开一实施例还提供了一种共享内存配置方法，应用于CXL模组，所述CXL模组包括CXL控制器及与所述CXL控制器连接的内存介质，所述方法包括：This disclosure also provides a shared memory configuration method in an embodiment, applied to a CXL module, the CXL module including a CXL controller and a memory medium connected to the CXL controller, the method including:

步骤210，接收共享内存配置指令并解析，得到共享内存的配置信息；其中，所述共享内存是与所述CXL模组连接的多个处理单元共享的内存区域；Step 210: Receive and parse the shared memory configuration instruction to obtain the configuration information of the shared memory; wherein, the shared memory is a memory area shared by multiple processing units connected to the CXL module;

步骤220，根据所述共享内存的配置信息确定所述共享内存的地址范围；Step 220: Determine the address range of the shared memory based on the configuration information of the shared memory;

步骤230，根据所述共享内存的地址范围，将所述内存介质中的部分内存区域配置为所述共享内存，仅对所述共享内存中的数据进行缓存一致性维护。Step 230: Based on the address range of the shared memory, configure a portion of the memory region in the memory medium as the shared memory, and only maintain cache consistency for the data in the shared memory.

本实施例中，主机发送memory读写请求时，可以使用设备共享内存的元数据表，如果主机发送的请求包含落在共享内存地址范围内的地址，则进行一致性维护；反之，落在非共享内存的地址范围，则不需要进行一致性维护。共享内存的一致性维护可以以下两种方式实现：第一种，通过设备端的硬件维护(HDM-DB)，多个主机首先申请Cacheline的控制权(ownership)，然后开始访问；第二种，设备以HDM-H的方式暴露内容，由主机维护，如由中央处理单元处理多主机之间关于缓存行的cacheline的控制权(ownership)。In this embodiment, when a host sends a memory read/write request, it can use the metadata table of the device's shared memory. If the request sent by the host contains an address within the shared memory address range, consistency maintenance is performed; otherwise, if the address falls within the non-shared memory address range, consistency maintenance is not required. Shared memory consistency maintenance can be implemented in two ways: First, through device-side hardware maintenance (HDM-DB), multiple hosts first request cacheline ownership and then begin access; second, the device exposes content in an HDM-H manner, maintained by the host, such as the central processing unit handling cacheline ownership among multiple hosts.

本公开上述实施例对CXL协议进行拓展，增加共享内存配置的接口。CXL模组根据该接口定义的共享内存配置指令，解析得到共享内存的配置信息；再根据所述共享内存的配置信息确定所述共享内存的地址范围，从而将所述内存介质中的部分内存区域配置为所述共享内存，且仅对所述共享内存中的数据进行缓存一致性维护。即可以通过配置来限制CXL模组中的共享内存的大小，且CXL模组仅对共享内存中的数据进行缓存一致性维护。本公开实施例通过配置将CXL模组中的共享内存的大小限制在设定范围之内，可以防止维护缓存一致性的操作过于频繁，导致总线负载过大的问题。The embodiments disclosed above extend the CXL protocol by adding an interface for shared memory configuration. The CXL module parses the shared memory configuration information according to the shared memory configuration instructions defined by this interface; then, based on the shared memory configuration information, it determines the address range of the shared memory, thereby configuring a portion of the memory region in the memory medium as... The shared memory is used, and cache consistency maintenance is performed only on the data within the shared memory. That is, the size of the shared memory in the CXL module can be limited by configuration, and the CXL module only performs cache consistency maintenance on the data within the shared memory. This embodiment of the disclosure, by configuring the size of the shared memory in the CXL module to be within a set range, can prevent excessively frequent cache consistency maintenance operations from causing excessive bus load.

本领域普通技术人员可以理解，上文中所公开方法中的全部或某些步骤、系统、装置中的功能模块/单元可以被实施为软件、固件、硬件及其适当的组合。在硬件实施方式中，在以上描述中提及的功能模块/单元之间的划分不一定对应于物理组件的划分；例如，一个物理组件可以具有多个功能，或者一个功能或步骤可以由若干物理组件合作执行。某些组件或所有组件可以被实施为由处理器，如数字信号处理器或微处理器执行的软件，或者被实施为硬件，或者被实施为集成电路，如专用集成电路。这样的软件可以分布在计算机可读介质上，计算机可读介质可以包括计算机存储介质(或非暂时性介质)和通信介质(或暂时性介质)。如本领域普通技术人员公知的，术语计算机存储介质包括在用于存储信息(诸如计算机可读指令、数据结构、程序模块或其他数据)的任何方法或技术中实施的易失性和非易失性、可移除和不可移除介质。计算机存储介质包括但不限于RAM、ROM、EEPROM、闪存或其他存储器技术、CD-ROM、数字多功能盘(DVD)或其他光盘存储、磁盒、磁带、磁盘存储或其他磁存储装置、或者可以用于存储期望的信息并且可以被计算机访问的任何其他的介质。此外，本领域普通技术人员公知的是，通信介质通常包含计算机可读指令、数据结构、程序模块或者诸如载波或其他传输机制之类的调制数据信号中的其他数据，并且可包括任何信息递送介质。 It will be understood by those skilled in the art that all or some of the steps, systems, or apparatuses disclosed above, and their functional modules/units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules/units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software may be distributed on a computer-readable medium, which may include computer storage media (or non-transitory media) and communication media (or transient media). As is known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. Furthermore, it is well known to those skilled in the art that communication media typically contain computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.

Claims

A CXL module includes a CXL controller and a memory medium connected to the CXL controller, the CXL controller being configured to perform the following processes:

The system receives and parses shared memory configuration instructions to obtain shared memory configuration information; wherein, the shared memory is a memory region shared by multiple processing units connected to the CXL module.

The address range of the shared memory is determined based on the configuration information of the shared memory;

Based on the address range of the shared memory, a portion of the memory region in the memory medium is configured as the shared memory, and cache consistency maintenance is performed only on the data in the shared memory.

The CXL module as described in claim 1, wherein:

The controller receiving shared memory configuration instructions includes: the controller receiving shared memory configuration instructions issued by the Internet Manager in the system through the shared memory configuration interface.

A shared memory configuration method is applied to a CXL module, the CXL module including a CXL controller and a memory medium connected to the CXL controller, the method comprising:

A processing system includes a bus and multiple processing units connected via the bus. Each processing unit is equipped with a cache and has a cache consistency configuration interface. The processing units are configured to:

The configuration information of the consistent cache region is obtained through the configuration interface of the cache consistency, and the address range of the consistent cache region is determined based on the configuration information of the consistent cache region.

Based on the address range of the consistency cache region, a portion of the cache of this processing unit is configured as a consistency cache region to cache data in the shared memory of multiple processing units using the consistency cache region.

The processing system of claim 4, wherein:

The processing unit is a processor or a processor core; the cache capacity of at least one of the processing units is greater than or equal to 128M, or greater than or equal to 256M; the configuration information of the consistency cache region includes the size of the consistency cache region.

The processing system of claim 4, wherein:

The processing system further includes: a central control hardware unit, configured to issue a cache region configuration instruction to the processing unit, carrying the configuration information of the consistent cache region;

The processing unit obtains the configuration information of the consistent cache region through the cache consistency configuration interface, including: receiving the cache region configuration instruction issued by the central control hardware unit, parsing the cache region configuration instruction, and obtaining the configuration information of the consistent cache region;

The size of the consistency cache area configured for different processing units may be the same or different.

The processing system of claim 4, wherein:

The processing unit obtains the configuration information of the consistency cache region through the configuration interface of the cache consistency, including: the processing unit obtains the configuration information of the consistency cache region based on the programming interface provided by the system software.

The processing system of claim 4, wherein:

The system also includes an external device connected to the plurality of processing units via a bus, the external device being provided with a memory medium;

The external device is configured as follows:

Receive and parse the shared memory configuration instruction to obtain the configuration information of the shared memory, and determine the address information of the shared memory based on the configuration information of the shared memory;

Based on the address range of the shared memory, a portion of the memory region in this device is configured as shared memory for multiple processing units.

The processing system of claim 8, wherein:

The external device includes a CXL module that communicates with multiple processing units via a CXL interface. The CXL module includes a controller and a set of memory chips connected to the controller.

The processing system further includes an Internet Manager configured to send the shared memory configuration command to the CXL module via the CXL interface.

The processing system of claim 9, wherein:

The network manager is further configured to send the configuration information of the shared memory to multiple processing units respectively; wherein the configuration information of the shared memory includes the size of the shared memory, and the configuration information of the consistency cache region includes the size of the consistency cache region.

The processing system of claim 9, wherein:

The processing system also includes servers and a central control hardware unit for multiple processing units;

The network manager is also configured to send the configuration information of the shared memory to the server;

The server is configured to send the configuration information of the shared memory to multiple processing units respectively; and to determine the configuration information of the consistency cache region in the multiple processing units based on the configuration information of the shared memory, and to carry the configuration information of the multiple consistency cache regions in multiple cache region configuration instructions respectively, and to send them to the multiple processing units respectively through the central control hardware unit.

The configuration information of the shared memory includes the size of the shared memory, and the configuration information of the consistency cache region includes the size of the consistency cache region.

The processing system of claim 8, wherein:

The processing unit is further configured to: establish a first address mapping relationship between the shared memory and the consistency cache region based on the address range of the shared memory and the address range of the consistency cache region.

The processing system of claim 4, wherein:

The consistency cache region is divided into multiple cache groups, and each cache group includes multiple cache lines;

The address used by the processing unit when accessing data in the shared memory includes the following fields: Tagmeta, Tagdata, and Offset; or includes the following fields: Tagmeta, Indexmeta, Tagdata, and Offset; wherein:

Tagmeta is a field used to identify the consistency cache region;

Indexmeta is a field used to identify the cache group in the consistent cache region;

Tagdata is a field used to identify cache rows in the cache group of the consistent cache region;

Offset is a field used to identify data offset.

The processing system of claim 13, wherein:

The system is configured with multiple types of shared memory, and the memory media of different shared memory have different characteristics;

The various cache groups divided by the consistency cache region correspond one-to-one with the various shared memory types; wherein, the characteristics of the memory medium include any one or more of latency and bandwidth.

A method for reducing bus load, applied to a system comprising multiple processing units connected via a bus, wherein each processing unit is configured with a cache and has a cache consistency configuration interface, the method comprising:

The processing unit obtains the configuration information of the consistency cache region through the configuration interface of the cache consistency, and determines the address range of the consistency cache region based on the configuration information of the consistency cache region;

The processing unit configures a portion of its cache as a consistency cache region based on the address range of the consistency cache region, so as to use the consistency cache region to cache data in the shared memory of multiple processing units.

The method of claim 15, wherein:

The processing unit is a processor or a processor core;

The configuration information of the consistency cache region includes the size of the consistency cache region; the cache capacity of at least one of the processing units is greater than or equal to 128M, or greater than or equal to 256M.

The method of claim 15, wherein:

The processing unit obtains the configuration information of the consistent cache region through the cache consistency configuration interface, including: the processing unit receives a cache region configuration instruction issued by the central control hardware unit of the system, parses the cache region configuration instruction, and obtains the configuration information of the consistent cache region; or, the processing unit obtains the configuration information of the consistent cache region based on the programming interface provided by the system software.

The method of claim 15, wherein:

The system further includes an external device connected to the plurality of processing units via a bus, the external device being provided with a memory medium; the method further includes:

The external device receives and parses the shared memory configuration command to obtain the configuration information of the shared memory, and determines the address information of the shared memory based on the configuration information of the shared memory;

The external device configures a portion of the memory region in this device as shared memory for multiple processing units based on the address range of the shared memory.

The method of claim 18, wherein:

The shared memory configuration command is sent from the system's Internet Manager to the CXL module via the CXL interface.

The method of claim 19, wherein:

The network manager also sends the configuration information of the shared memory to multiple processing units respectively; or,

The network manager also sends the shared memory configuration information to the servers of the multiple processing units, and the servers send the shared memory configuration information to the multiple processing units respectively; and the servers also determine the configuration of the consistency cache region in the multiple processing units based on the shared memory configuration information. The information carries the configuration information of the multiple consistent cache regions in multiple cache region configuration instructions, and sends them to the multiple processing units through the central control hardware unit of the system.

The method of claim 18, wherein:

The method further includes: the processing unit establishing a first address mapping relationship between the shared memory and the consistency cache region based on the address range of the shared memory and the address range of the consistency cache region.

The method of claim 15, wherein:

The consistent cache region is divided into multiple cache groups, each cache group including multiple cache lines; the address used by the processing unit when accessing data in the shared memory includes the following fields: Tagmeta, Tagdata, and Offset; or includes the following fields: Tagmeta, Indexmeta, Tagdata, and Offset; wherein:

Tagmeta is a field used to identify the consistency cache region;

Offset is a field used to identify data offset.

The method of claim 22, wherein:

The system is configured with multiple types of shared memory, and the memory media to which the different shared memory belong have different characteristics; the multiple cache groups divided by the consistency cache region correspond one-to-one with the multiple types of shared memory; wherein, the characteristics of the memory media include any one or more of latency and bandwidth.

The method of claim 22, wherein:

The method further includes: when the processing unit searches for data in the shared cache region in the cache, it first determines the cache group in the consistent cache region based on the Tagmeta and Indexmeta in the address, and then searches for the cache line containing the data in the determined cache group based on the Tagdata in the address.

The method of claim 24, wherein:

Tagmeta has multiple values, one of which indicates that the data cache area is the consistent cache area; Indexmeta represents the index of the cache group in the consistent cache area;

The processing unit determines the cache group in the consistent cache region based on the Tagmeta and Indexmeta in the address, and then searches for the cache line containing the data in the determined cache group based on the Tagdata in the address, including:

The metadata table of the consistent cache region is searched based on the value of Tagmeta in the address. Each record in the metadata table includes the following fields: Tagmeta and cache group offset; wherein, the value of Tagmeta indicates that the cache region of the data is the consistent cache region, and the value of cache group offset indicates the position offset of the cache group in the record in the consistent cache region.

The corresponding record in the metadata table is found based on the value of Indexmeta in the address, and the search range in the consistent cache area is determined based on the cache group offset in the corresponding record; and the corresponding cache line is searched in the search range based on the value of Tagdata in the address to determine whether a cache hit occurs.

A processor chip includes a processor core and a cache, the processor core being configured to perform the method as described in any one of claims 15 to 17, 21 to 25.

A non-transient computer storage medium storing a computer program, which is executed by a processor. At that time, the method described in any one of claims 15 to 25 can be implemented.