
CN106201939A - Multi-core directory coherence device for GPDSP architecture - Google Patents


Info

Publication number
CN106201939A
CN106201939A
Authority
CN
China
Prior art keywords
request
data
directory
last
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610503703.5A
Other languages
Chinese (zh)
Other versions
CN106201939B (en)
Inventor
刘胜
李昭然
陈海燕
许邦建
鲁建壮
陈俊杰
孔宪停
康子扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201610503703.5A priority Critical patent/CN106201939B/en
Publication of CN106201939A publication Critical patent/CN106201939A/en
Application granted granted Critical
Publication of CN106201939B publication Critical patent/CN106201939B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F13/30Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal with priority control
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/28DMA
    • G06F2213/2806Space or buffer allocation for DMA transfers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A multi-core directory coherence device for a GPDSP architecture, including: a core, containing a DMA controller and an L1D, where the L1D is the level-one data Cache; the DMA moves data between peripherals and the core; the L1D contains two parallel processing units, Normal Deal and Monitor Deal, where the Normal Deal unit handles load and store instructions and the Monitor Deal unit responds to snoop requests arriving at any time, its processing unaffected by the Normal Deal unit; an on-chip last-level Cache, attached to the on-chip interconnection network in a distributed fashion; off-chip DDR storage, whose data is cached in the L1D and the on-chip last-level Cache; and an on-chip interconnection network that receives network requests, first decodes each request to determine the destination node and destination device, and then forwards the request to the corresponding location. The invention has the advantages of a simple principle, convenient operation, high flexibility, and a wide application range.

Description

Multi-core directory coherence device for GPDSP architecture

Technical Field

The invention relates generally to the field of processor microarchitecture design, and in particular to a design suited to the multi-core memory path of a General-Purpose Digital Signal Processor (GPDSP).

Background

As one of the three mainstays of the processor field (CPU, DSP, and GPU), the DSP is widely used in embedded systems thanks to its high performance-per-watt, good programmability, and low power consumption. Unlike a CPU, a DSP has the following characteristics: 1) strong computing power, with an emphasis on real-time computation rather than on control and transaction processing; 2) dedicated hardware support for typical signal processing, such as multiply-accumulate operations and circular addressing; 3) the common traits of embedded microprocessors, such as address and instruction paths of no more than 32 bits, imprecise interrupts, and a program workflow of short-term offline debugging followed by long-term online resident execution (rather than the debug-and-run approach of a general-purpose CPU); 4) integrated peripheral interfaces dominated by fast peripherals, which is especially suited to online transmission and reception of AD/DA signals and also supports high-speed direct connection between DSPs.

Given the DSP's high performance-per-watt and strong computing power, how to upgrade and improve the traditional DSP architecture so that it suits high-performance scientific computing is currently a research hotspot both in China and abroad.

For example, a Chinese patent application (General-Purpose Computing Digital Signal Processor, application number 201310725118.6) proposes a multi-core microprocessor, the GPDSP, that retains the basic characteristics of a DSP and its advantages of high performance and low power consumption while efficiently supporting general scientific computing. The architecture has the following features: 1) direct representation of double-precision floating-point and 64-bit fixed-point data, with general-purpose registers, data buses, and instruction widths of 64 bits or more and an address bus width of 40 bits or more; 2) tightly coupled heterogeneous CPU and DSP cores, where the CPU core supports a complete operating system and the scalar unit of the DSP core supports an operating-system microkernel; 3) a unified programming model spanning the CPU core, the DSP core, and the vector array structure inside the DSP core; 4) retention of cross-machine simulation debugging alongside a local CPU host debugging mode; 5) retention of the basic characteristics of an ordinary DSP apart from the bit width.

As another example, a Chinese patent application (GPDSP multi-level cooperative and shared storage-device-level memory access method, application number 20150135194.0) proposes a multi-level cooperative and efficient storage architecture for the application requirements of the GPDSP. Each DSP core contains globally addressed, programmer-programmable local large-capacity scalar and vector array memories, and multiple DSP cores share a large-capacity on-chip global Cache through the network-on-chip; the direct memory access controller (DMA) and the network-on-chip provide high-bandwidth data transfer between inter-core and intra-core memories and the global Cache, realizing cooperation and sharing of data access within a single core and across multiple cores.

In the GPDSP, the in-core on-chip memory and the DMA background-transfer mechanism are retained, which makes data coherence difficult to implement. A Chinese patent application (An explicit active multi-core Cache coherence management method for streaming applications, application number 201310166383.5) adopts a software-programmable data coherence management method that hands the work of maintaining data coherence to the programmer, who must actively manage the correctness of the data production and consumption process. If a hardware-managed coherence protocol were adopted in the GPDSP, it would reduce programming complexity for programmers and thereby improve the generality of the GPDSP.

Summary of the Invention

The technical problem to be solved by the invention is this: in view of the problems of the prior art, the invention provides a multi-core directory coherence device for the GPDSP architecture that is simple in principle, convenient to operate, highly flexible, and widely applicable.

To solve the above technical problem, the invention adopts the following technical solution:

A multi-core directory coherence device for a GPDSP architecture, including:

a core, containing a DMA controller and an L1D, where the L1D is the level-one data Cache; the DMA moves data between peripherals and the core; the L1D contains two parallel processing units, Normal Deal and Monitor Deal, where the Normal Deal unit handles load and store instructions and the Monitor Deal unit responds to snoop requests arriving at any time, its processing unaffected by the Normal Deal unit;

an on-chip last-level Cache, attached to the on-chip interconnection network in a distributed fashion;

off-chip DDR storage, whose data is cached in the L1D and the on-chip last-level Cache;

an on-chip interconnection network that receives network requests; after receiving a network request it first decodes it, and once the destination node and destination device are decoded it forwards the request to the corresponding location.

As a further improvement of the invention: the on-chip last-level Cache is divided into several banks, each of which consists of an input buffer unit (IBUF), a pipeline unit (PipeLine), an output buffer unit (OBUF), and a return-ring processing logic unit (Rtn NAC). The input buffer unit buffers requests entering the last-level Cache from the network-on-chip; the pipeline unit pipelines the requests from the input buffer that access the DDR storage space; the output buffer unit buffers requests from the last-level Cache to the DDR; and the return-ring processing logic unit arbitrates among the various types of requests entering the network-on-chip.

As a further improvement of the invention: the device also includes an MSI directory protocol unit that maintains coherence for requests issued by the L1D. The MSI directory protocol unit has three directory states, M, S, and I: the M state indicates that the data is owned exclusively by one DSP core and is dirty; the S state indicates that the data is shared by one or more DSP cores and is clean; the I state indicates that no DSP core holds a copy of the data.
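As an illustrative sketch only (not part of the claimed apparatus), the three directory states above can be modeled as a small Python enumeration; the helper function is hypothetical and simply records the one fact each state implies about dirtiness:

```python
from enum import Enum

class DirState(Enum):
    """The three MSI directory states described above."""
    M = "modified"  # one DSP core owns the only copy, and it is dirty
    S = "shared"    # one or more DSP cores hold clean copies
    I = "invalid"   # no DSP core holds a copy

def core_copy_is_dirty(state: DirState) -> bool:
    # Only M implies a core holds data newer than the last-level Cache.
    return state is DirState.M
```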

As a further improvement of the invention: the device also includes directory controllers, which verify the correctness of the scheme at the protocol level and handle the processing of a single request in each directory state, conflict handling among multiple related requests, and the responses of data blocks in a transient ("intermediate") directory state to related requests. The directory controllers fall into two classes: one is placed in the L1D and the other in the on-chip last-level Cache, so directory operations are carried out in both the L1D and the on-chip last-level Cache.

As a further improvement of the invention: the on-chip last-level Cache stores a full-map directory structure that allocates a directory entry to every data block cached in the on-chip last-level Cache. A directory entry consists of two parts, the directory state and the sharer list; the sharer list allocates one bit per DSP core to indicate whether that core holds a copy of the data.
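A minimal sketch of such a full-map directory entry, assuming a hypothetical core count of 8 for illustration; the class name and methods are mine, but the encoding (a state plus one sharer bit per DSP core, falling back to I when the sharer list empties) follows the text:

```python
N_CORES = 8  # assumed core count, for illustration only

class DirEntry:
    """Full-map directory entry: a directory state plus a sharer
    bit vector with one bit per DSP core."""
    def __init__(self):
        self.state = "I"   # initially no core holds a copy
        self.sharers = 0   # bit i set => core i holds a copy

    def add_sharer(self, core: int) -> None:
        self.sharers |= 1 << core
        if self.state == "I":
            self.state = "S"

    def remove_sharer(self, core: int) -> None:
        self.sharers &= ~(1 << core)
        if self.sharers == 0:
            self.state = "I"  # sharer list all zeros: back to I

    def sharer_list(self):
        return [c for c in range(N_CORES) if (self.sharers >> c) & 1]
```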

As a further improvement of the invention: the L1D adopts a pipeline structure to carry out the pipelined process of instruction decoding, address calculation, reading tags and status bits, hit determination, data array access, and data return.

As a further improvement of the invention: the L1D pipeline stages are DC1, DC2, EX1, EX2, EX3, EX4, and EX5. When the L1D receives a load or store instruction, it first performs two stages of decoding, DC1 and DC2, to determine the operation type of the instruction and the function to be realized; after decoding, the address is calculated in EX1; then the L1D Cache access pipeline completes, with reads taking three cycles and writes taking two. The read pipeline occupies the EX2, EX3, and EX4 positions of the SMAC memory-access main pipeline, performing hit/miss determination, access/miss handling, and access output respectively; the write pipeline occupies the EX2 and EX3 positions, performing miss determination and access/miss handling respectively.
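As a hypothetical paraphrase (stage names are from the text; the one-line summaries and the table itself are mine), the read and write paths through the L1D pipeline can be tabulated as follows:

```python
# Read path: two decode stages, address calculation, then a
# three-stage cache access (EX2-EX4).
L1D_READ_PIPE = [
    ("DC1", "instruction decode, stage 1"),
    ("DC2", "instruction decode, stage 2"),
    ("EX1", "address calculation"),
    ("EX2", "hit/miss determination"),
    ("EX3", "access / miss handling"),
    ("EX4", "access output"),
]

# Per the text, a write finishes its cache access in EX3 (two stages
# against the read path's three), so it uses the first five stages.
L1D_WRITE_PIPE = L1D_READ_PIPE[:5]
```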

As a further improvement of the invention: the on-chip last-level Cache uses a pipeline structure to implement the pipelined process of reading tags and status bits, reading directory entries, hit determination, snoop handling, data array access, and data return.

As a further improvement of the invention: the pipeline structure of the on-chip last-level Cache includes:

Pipeline stage one, Req_Arb: performs round-robin arbitration with Flush requests and sends the winning request to the next pipeline stage; at the same time, it reads the valid bit, dirty bit, and Tag information of the requested data block;

Pipeline stage two, Tag_Wait: determines whether the request is a directory request, and reads the directory information;

Pipeline stage three, Tag_Judge: first determines whether the request hits; on a miss, it further checks whether the request's address matches one in the MBUF. A matching miss request is sent to the MBUF; a non-matching miss request is sent to both the MBUF and the OBUF. A hit request is processed according to whether it is a directory request: a non-directory request generates a data-array access enable, while a directory request consults the directory entry, with handling that differs according to the directory state and the sharer list. Directory requests are handled in three ways: the first kind operates directly and generates the data-array access enable; the second kind waits for data to return from an L1D and generates the access enable once the data arrives; the third kind waits for Inv-Ack requests and generates the access enable once all invalidation acknowledgements have arrived;

Pipeline stage four, Data_Acc: first determines the kind of data-array operation the request performs. For a read, the requested data block is read out of the data array and latched for one cycle; for a write, the write data is first encoded and the data array is then updated. A request performing a data-array read finally sends the read-out data to the next pipeline stage;

Pipeline stage five, Data_Dec: decodes the request data read out by the previous stage and sends the read-return-data request to the up-ring arbitration module for processing.
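The five last-level-Cache pipeline stages can be summarized in one table; the stage names come from the text, while the one-line summaries are paraphrases of mine:

```python
LLC_PIPELINE = [
    ("Req_Arb",   "round-robin arbitration with Flush; read valid/dirty/Tag"),
    ("Tag_Wait",  "detect directory requests; read directory information"),
    ("Tag_Judge", "hit/miss and MBUF address check; directory-state actions"),
    ("Data_Acc",  "read the data array, or encode and write it"),
    ("Data_Dec",  "decode read data; hand off to up-ring arbitration"),
]
```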

Compared with the prior art, the invention has the following advantages:

1. The multi-core directory coherence device for the GPDSP architecture of the invention is simple in principle, convenient to operate, highly flexible, and widely applicable; besides supporting data coherence maintenance for L1D requests, it equally supports data coherence maintenance for DMA requests, broadening the field of application.

2. The multi-core directory coherence device for the GPDSP architecture of the invention adopts directory controllers; the directory-controller mechanism can verify the correctness of the multi-core data coherence scheme at the protocol level, greatly shortening the design and verification cycle of the multi-core directory coherence device.

Brief Description of the Drawings

Fig. 1 is a schematic structural diagram of the invention in a concrete application example.

Fig. 2 is a schematic diagram of instruction handling in a concrete application example of the invention, where (a) is instruction-handling diagram (1), (b) is instruction-handling diagram (2), (c) is instruction-handling diagram (3), and (d) is a diagram of level-one data Cache replacement handling.

Fig. 3 is a schematic diagram of DMA handling in a concrete application example of the invention, where (a) is a diagram of DMA read-request handling and (b) is a diagram of DMA write-request handling.

Fig. 4 is a diagram of the directory structure of one set in the last-level Cache directory storage bank in a concrete application example of the invention.

Fig. 5 is a schematic diagram of the overall structure of the L1D pipeline in a concrete application example of the invention.

Fig. 6 is a schematic diagram of the overall structure of the last-level Cache pipeline in a concrete application example of the invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and concrete embodiments.

In a concrete application example, the GPDSP architecture contains N cores; a GetX request denotes a GetS request or a GetM request; a Fwd-GetX request denotes a Fwd-GetS request or a Fwd-GetM request; a PutX request denotes a PutS request or a PutM+data request; directory state X denotes directory state S or directory state M; directory state Y denotes directory state S or directory state I.

As shown in Fig. 1, the multi-core directory coherence device for the GPDSP architecture of the invention includes:

a core, containing a DMA controller and a level-one data Cache (L1D). The DMA moves data between peripherals and the core and must be configured by the programmer before it is started. The L1D has two parallel processing units, Normal Deal and Monitor Deal. The Normal Deal unit handles load and store instructions; if an instruction misses in the L1D, a replacement may also be required. The Monitor Deal unit can respond to snoop requests arriving at any time, and its processing is unaffected by the Normal Deal unit.

an on-chip last-level Cache, attached to the on-chip interconnection network in a distributed fashion and divisible into N banks. Each bank consists of an input buffer unit (IBUF), a pipeline unit (PipeLine), an output buffer unit (OBUF), and a return-ring processing logic unit (Rtn NAC). The input buffer unit buffers requests entering the last-level Cache from the network-on-chip; the pipeline unit pipelines the requests from the input buffer that access the DDR storage space; the output buffer unit buffers requests from the last-level Cache to the DDR; and the return-ring processing logic unit arbitrates among the various types of requests entering the network-on-chip.

off-chip DDR storage, also called main memory, whose data is cached in the L1D and the last-level Cache.

an on-chip interconnection network that receives network requests; after receiving a network request it first decodes it, and once the destination node and destination device are decoded it forwards the request to the corresponding location. In addition, the on-chip interconnection network can also process requests coming from the last-level Cache, on a principle similar to the previous case. A network request is any request other than a local request, where a local request is one issued by the core corresponding to a given last-level-Cache bank.

Table 1 below describes all directory requests in detail; it contains every request type that may occur during directory operation, to ease the introduction of the directory coherence mechanism.

Table 1

In concrete application, the invention further adopts an extended MSI directory protocol. The basic directory protocol is first described in detail; the basic directory protocol maintains coherence only for the L1D.

After the L1D receives an instruction, it operates on it immediately. The processing differs according to whether the data block accessed by the instruction is cached in the L1D and according to its dirty-bit information. In the case of an L1D read hit or a dirty write hit, the instruction is processed directly. In all other cases, the L1D sends a request to the last-level Cache to trigger the next step.
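This decision rule can be sketched as a small (hypothetical) helper: only a read hit or a dirty write hit completes inside the L1D, and every other case sends a request to the last-level Cache:

```python
def l1d_handled_locally(op: str, hit: bool, dirty: bool) -> bool:
    """Return True if the L1D completes the instruction itself,
    False if a request must go to the last-level Cache."""
    if op == "load":
        return hit                 # any read hit is local
    if op == "store":
        return hit and dirty       # only a dirty write hit is local
    raise ValueError(f"unknown op: {op}")
```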

In concrete application, as shown in Fig. 2, the handling can be divided into four cases according to the complexity of the operation, shown in diagrams (a), (b), (c), and (d) respectively.

Diagram (a) is instruction-handling diagram (1). As the figure shows, the instruction completes in two steps. Depending on the instruction type, there are two cases. After an L1D read miss, a GetS request is generated and sent to the last-level Cache. On receiving the GetS request, the last-level Cache first consults the directory entry; since the directory state is I or S, the latest data is cached in the last-level Cache, so the last-level Cache reads the data out directly and returns it to the request source. After an L1D write miss, a GetM request is generated and sent to the last-level Cache. On receiving the GetM request, the last-level Cache first consults the directory entry; since the directory state is I, the latest data is cached in the last-level Cache, so the last-level Cache reads the data out directly and returns it to the request source.
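A sketch (hypothetical helper, names mine) of these two-hop cases: the last-level Cache answers directly exactly when it holds the latest data, and returns None when a forward or invalidate flow is needed instead:

```python
def llc_two_hop_reply(req: str, dir_state: str):
    """Two-hop cases of diagram (a): return the LLC's direct action,
    or None when a three-hop (forward/invalidate) flow is required."""
    if req == "GetS" and dir_state in ("I", "S"):
        return "data-to-requester"
    if req == "GetM" and dir_state == "I":
        return "data-to-requester"
    return None  # GetS in M, GetM in S or M: not a two-hop case
```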

Diagram (b) is instruction-handling diagram (2). As the figure shows, the instruction completes in three steps. Depending on the instruction type, there are two cases. After an L1D read miss, a GetS request is generated and sent to the last-level Cache. On receiving the GetS request, the last-level Cache first consults the directory entry; since the directory state is M, the latest data is not in the last-level Cache, so the last-level Cache sends a Fwd-GetS request to the L1D holding the latest copy of the data. On receiving the Fwd-GetS request, that L1D sends read-return-data requests to the last-level Cache and the request source simultaneously, and the local state changes to S. After an L1D write miss, a GetM request is generated and sent to the last-level Cache. On receiving the GetM request, the last-level Cache first consults the directory entry; since the directory state is M, the latest data is not in the last-level Cache, so the last-level Cache sends a Fwd-GetM request to the L1D holding the latest copy of the data. On receiving the Fwd-GetM request, that L1D sends a read-return-data request only to the request source, and the local state changes to I.

Diagram (c) is instruction-handling diagram (3). As the figure shows, the instruction completes in three steps. Depending on whether the instruction hits and whether the data is dirty, there are two cases. After an L1D write miss, a GetM request is generated and sent to the last-level Cache. On receiving the GetM request, the last-level Cache first consults the directory entry; since the directory state is S, the last-level Cache sends Inv-L requests to every L1D holding a copy of the data according to the sharer-list information, and at the same time reads the data out and returns it to the requester together with an Ack count. On receiving an Inv-L request, an L1D invalidates its local data block and sends an Inv-Ack request to the requester. The L1D with the write miss performs the write only after it has received both the data and all the invalidation acknowledgements. A clean write hit in the L1D also generates a GetM request; since its handling is identical to the write-miss case, it is not described again here.
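The requester side of this flow can be sketched as a small (hypothetical) bookkeeping class: the write proceeds only once the data, which carries the expected Ack count, and every Inv-Ack have arrived:

```python
class GetMWaiter:
    """Tracks the S-state GetM flow at the requesting L1D."""
    def __init__(self):
        self.expected_acks = None  # unknown until the data arrives
        self.acks_seen = 0
        self.have_data = False

    def on_data(self, ack_count: int) -> None:
        # Data from the LLC carries the number of Inv-Acks to expect.
        self.have_data = True
        self.expected_acks = ack_count

    def on_inv_ack(self) -> None:
        self.acks_seen += 1

    def may_write(self) -> bool:
        # Write only after the data AND all invalidation acks arrive.
        return self.have_data and self.acks_seen == self.expected_acks
```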

Diagram (d) is a schematic of L1D replacement handling. As the figure shows, the operation completes in two steps. Depending on whether the data is dirty, there are two cases. When the replaced line is clean, the L1D sends a PutS request to the last-level Cache. On receiving the PutS request, the last-level Cache updates the sharer-list information; if the updated sharer list is all zeros, it must also change the directory state to I. After the operation completes, the last-level Cache sends a Put-Ack request to the requester. When the replaced line is dirty, the L1D sends a PutM+data request to the last-level Cache. On receiving the PutM+data request, the last-level Cache updates the data array and the directory information. After the operation completes, the last-level Cache sends a Put-Ack request to the requester.
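A sketch (hypothetical helper) of how the last-level Cache could process the two replacement requests; the I-fallback on an emptied sharer list and the Put-Ack reply follow the description, while treating PutM+data as leaving the former owner with no copy is an assumption of this model:

```python
def llc_on_put(state: str, sharers: set, req: str, core: int):
    """Return (new_state, new_sharers, reply) for a replacement request."""
    remaining = set(sharers) - {core}  # requester drops its copy
    if req == "PutS":
        # Clean replacement: directory falls back to I when no sharer remains.
        new_state = "I" if not remaining else state
        return new_state, remaining, "Put-Ack"
    if req == "PutM+data":
        # Dirty replacement: data array updated, former owner gone (assumed).
        return "I", remaining, "Put-Ack"
    raise ValueError(f"unknown request: {req}")
```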

In a DSP, DMA must move large amounts of data between peripherals and cores. If DMA traffic is left out of coherence maintenance, multi-core data inconsistency is inevitable; if coherence for DMA is maintained through the software/hardware cooperative mechanism of a synchronization unit, programmers must monitor the state of the memory space in real time, which poses a considerable challenge. The present invention therefore extends the basic directory protocol so that it also performs coherence maintenance for DMA requests. DMA accesses the last-level Cache directly through the interconnection network. According to the memory-bank operation performed, DMA requests fall into two classes, which are the cases shown in FIG. 3.

Figure (a) is the DMA read request processing schematic. On receiving a DMA read request, the last-level Cache checks the directory information and acts accordingly. If the directory state is I or S, the latest data resides in the last-level Cache, so the data is read out and returned directly to the DMA. If the directory state is M, the last-level Cache does not hold the latest data, so it sends a Fwd-Rd request to the L1D holding the latest copy. On receiving the Fwd-Rd request, that L1D sends read-return data to both the last-level Cache and the DMA, and the directory state becomes S.
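A minimal sketch of this decision, assuming the same hypothetical data structures as above (names are invented for illustration):

```python
def llc_handle_dma_read(entry, send):
    """Model DMA read handling at the last-level Cache."""
    if entry['state'] in ('I', 'S'):
        # Latest data is already here: return it to the DMA directly.
        send('DMA', ('Data', entry['data']))
    else:
        # State M: forward the read to the owner, which then sends
        # read-return data to both the last-level Cache and the DMA.
        owner = next(iter(entry['sharers']))
        send(owner, 'Fwd-Rd')
        entry['state'] = 'S'   # directory state becomes S
```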

Figure (b) is the DMA write request processing schematic. On receiving a DMA write request, the last-level Cache checks the directory information and acts accordingly. If the directory state is I, the latest data is in the last-level Cache, so the DMA write updates the data array and a response signal is returned to the DMA once the operation completes. If the state is M, the last-level Cache lacks the latest data and sends a Fwd-Wrt request to the L1D holding the latest copy; on receiving the Fwd-Wrt request, that L1D sends read-return data only to the last-level Cache and invalidates its local data block. After the latest data arrives, the last-level Cache merges it with the data carried by the DMA write request, then updates the data array and directory information; once the operation completes, it returns a response signal to the DMA. If the state is S, the last-level Cache sends Inv-DE requests to all L1Ds holding copies, according to the sharer-list information; each such L1D invalidates its local data block and sends an Inv-Ack to the last-level Cache. Only after all invalidation acknowledgments arrive does the last-level Cache perform the write, after which it returns a response signal to the DMA.
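The three directory states lead to three distinct actions, which can be sketched as follows (hypothetical Python model; the merge of the returned line with the DMA data in the M case is left abstract as a `merge` callback, an assumption of this example):

```python
def llc_handle_dma_write(entry, dma_data, send, merge=lambda old, new: new):
    """Model DMA write handling; returns the number of Inv-Acks to await."""
    if entry['state'] == 'I':
        # Latest data is here: update the data array and answer the DMA.
        entry['data'] = merge(entry['data'], dma_data)
        send('DMA', 'Ack')
        return 0
    if entry['state'] == 'M':
        # Fetch the dirty copy from the owner, which also invalidates
        # its local block; the write completes after that data returns
        # and is merged with the DMA data.
        owner = next(iter(entry['sharers']))
        send(owner, 'Fwd-Wrt')
        return 0
    # State S: invalidate every sharer; each Inv-Acks the last-level
    # Cache, which writes (and answers the DMA) only after all acks.
    for shr in entry['sharers']:
        send(shr, 'Inv-DE')
    return len(entry['sharers'])
```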

Directory operations are atomic, yet many conflict situations arise during design and implementation. For example, from the chip-wide perspective, a conflict occurs when a subsequent related directory request arrives before the previous directory request has finished processing. The present invention resolves these conflicts with a directory-controller mechanism. Directory controllers are divided by location into two kinds: the L1D directory controller and the last-level Cache directory controller.

Tables 2.1, 2.2, and 2.3 below describe the L1D directory controller in detail. As the tables show, besides the three stable states M, S, and I, a data block's directory state can take many "transient" states. For example, from an L1D read miss until the missing data returns, the block accessed by the read request remains in the IS^D state. A block in a transient state can respond to some related snoop requests; snoop requests it cannot respond to are simply stalled. This preserves data coherence while also improving system performance. For example, after an L1D dirty replacement the victim line's directory state becomes MI^A. If, before the corresponding acknowledgment from the last-level Cache arrives, this L1D receives a related Fwd-Rd request, it responds immediately and changes the directory state to SI^A when the operation finishes.
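The shape of such a controller table can be illustrated by a small lookup fragment (a hypothetical subset of the transitions just described; state and action names are assumptions, not the full Tables 2.1 to 2.3):

```python
# (current_state, event) -> (action, next_state). 'stall' marks a
# snoop request that the transient state cannot yet answer.
L1D_TRANSITIONS = {
    ('I',    'load'):    ('send GetS',              'IS_D'),
    ('IS_D', 'data'):    ('fill line',              'S'),
    ('IS_D', 'Inv-L'):   ('stall',                  'IS_D'),
    ('M',    'replace'): ('send PutM+data',         'MI_A'),
    ('MI_A', 'Fwd-Rd'):  ('send data to requester', 'SI_A'),
    ('MI_A', 'Put-Ack'): ('free line',              'I'),
}

def l1d_step(state, event):
    """Look up the action and next state for one controller event."""
    action, nxt = L1D_TRANSITIONS[(state, event)]
    return action, nxt
```

For instance, the dirty-replacement example above is the `('MI_A', 'Fwd-Rd')` row: the line answers the forwarded read immediately and moves to SI_A.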

Table 2.1

Table 2.2

Table 2.3

The L1D clean-write-hit operation in the directory protocol designed by the present invention is somewhat special. Normally, a request that hits needs no data fetch from the next cache level. To reduce the complexity of the directory protocol, it is handled in the same class as the L1D write miss.

Invalidation requests received by L1D come in two kinds: Inv-L requests and Inv-DE requests. The directory protocol designed by the present invention adds coherence maintenance for DMA and operates in a write-invalidate fashion. Therefore, when a DMA write request operates on the last-level Cache, the Inv-DE requests it sends to every L1D holding a copy must return their acknowledgment signals to the last-level Cache. An L1D executing a store instruction may also trigger an invalidation (by sending Inv-L requests), but in that case the invalidation acknowledgments are returned to the requesting L1D. Because the two kinds of invalidation request return their acknowledgments to different devices, the present invention distinguishes them.

When an invalidation acknowledgment (Inv-Ack) request arrives at L1D, it is handled according to circumstance. If it is not the last outstanding Inv-Ack, the directory state of the corresponding data block does not change; otherwise, the directory state of the corresponding data block changes from IM^A or SM^A to M.
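This last-ack condition amounts to a simple countdown per line, which can be sketched as follows (hypothetical Python bookkeeping, with invented names):

```python
class L1DLine:
    """Per-line bookkeeping for pending invalidation acknowledgments."""

    def __init__(self, state, acks_needed):
        self.state = state            # 'IM_A' or 'SM_A' while acks pending
        self.acks_needed = acks_needed

    def on_inv_ack(self):
        """Consume one Inv-Ack; on the last one, the state becomes M."""
        self.acks_needed -= 1
        if self.acks_needed == 0:
            self.state = 'M'          # write may now proceed
        return self.state
```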

Read-return data requests arriving at L1D have two sources: the last-level Cache and other L1Ds. Because some read-return requests must carry Ack information (recording the number of invalidation acknowledgments to expect), they must be distinguished in processing. As the first row of Table 2.3 shows, a read-return data request falls into five cases: from the owner L1D without Ack; from the owner L1D with Ack; from the last-level Cache without Ack; from the last-level Cache with Ack equal to 0; and from the last-level Cache with Ack greater than 0.

Tables 3.1, 3.2, and 3.3 below describe the last-level Cache directory controller in detail. As in the L1D directory controller, "transient" states also exist in the last-level Cache directory controller. For example, after an L1D read miss a GetS request is sent to the last-level Cache; if the data block accessed by the request is in state M in the last-level Cache, the directory state becomes S^D until the latest data returns. Invalidation acknowledgment (Inv-Ack) handling is similar to the L1D directory controller and is not described again here.

Table 3.1

Table 3.2

Table 3.3

L1D replacement is divided into clean-line replacement and dirty-line replacement.

A clean-line replacement sends a PutS request to the last-level Cache. In practice, several cores may share one data block; in that case, PutS requests can be classified by their arrival order at the last-level Cache as not-last (PutS-NotLast) or last (PutS-Last). Processing a PutS-NotLast request leaves the block's directory state unchanged and only updates the corresponding sharer-list information. Processing a PutS-Last request changes the block's directory state from S to I and clears the corresponding sharer list.
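Over a sharer bitmap with one bit per core (the representation used by the directory entries described below; the function name is an assumption), the not-last/last distinction reduces to whether the bitmap becomes zero:

```python
def llc_handle_puts(state, sharers, core_id):
    """Model PutS over a sharer bitmap (one bit per DSP Core).

    Returns the updated (state, sharers) pair.
    """
    sharers &= ~(1 << core_id)   # drop the replacing core's bit
    if sharers == 0:             # PutS-Last: no copy remains on chip
        state = 'I'
    return state, sharers        # PutS-NotLast leaves the state alone
```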

A dirty-line replacement sends a PutM+data request to the last-level Cache. The directory entry of the data block accessed by the request may already have changed before the request is processed, so the two cases must be handled differently. If the block's directory state is M and the L1D indicated by the sharer list is exactly the L1D performing this dirty-line replacement, the request is called a PutM+data from Owner request; in this case, the dirty replacement data updates the last-level Cache data array and an acknowledgment is returned. Otherwise, the request is called a PutM+data from Non-Owner request; in this case, only an acknowledgment is returned.
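The owner check and its two outcomes can be sketched as (hypothetical Python model, same assumed entry structure as the earlier sketches):

```python
def llc_handle_putm(entry, requester, wb_data):
    """Model PutM+data: only the recorded owner's writeback data counts.

    Returns ('Put-Ack', from_owner) so a caller can see which case ran.
    """
    from_owner = entry['state'] == 'M' and entry['sharers'] == {requester}
    if from_owner:
        entry['data'] = wb_data    # dirty line updates the data array
        entry['state'] = 'I'
        entry['sharers'] = set()
    # A Non-Owner PutM (directory changed meanwhile) is only acknowledged.
    return ('Put-Ack', from_owner)
```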

The present invention designs its directory coherence mechanism based on the last-level Cache; FIG. 4 shows the directory structure of one set in the last-level Cache directory array. As the figure shows, the last-level Cache uses an 8-way set-associative mapping and allocates one directory entry per way. A directory entry consists of two parts: the directory state and the sharer list. The directory state indicates whether the cached data block holds the latest data in the last-level Cache and whether it is dirty; the sharer list indicates which first-level caches hold copies of the block. Combining the directory state and sharer-list information fully describes the on-chip status of a cached data block, which makes coherence maintenance operations on it straightforward.
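One way to picture such an entry is as a small packed word, state bits above a per-core sharer bitmap (an illustrative encoding; the core count, field widths, and bit layout here are assumptions, not the patented format):

```python
NUM_CORES = 16                     # assumed: one sharer bit per DSP Core
STATE_BITS = {'I': 0, 'S': 1, 'M': 2}

def pack_entry(state, sharers):
    """Pack a directory entry: 2 state bits above the sharer bitmap."""
    bitmap = 0
    for core in sharers:
        bitmap |= 1 << core
    return (STATE_BITS[state] << NUM_CORES) | bitmap

def unpack_entry(word):
    """Recover (state, sharer set) from a packed directory entry."""
    state = {v: k for k, v in STATE_BITS.items()}[word >> NUM_CORES]
    sharers = {c for c in range(NUM_CORES) if word & (1 << c)}
    return state, sharers
```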

The implementation of the directory mechanism varies with the pipeline. By way of example, the present invention briefly describes its pipeline implementation in L1D and in the last-level Cache.

FIG. 5 is a schematic diagram of the overall structure of the L1D pipeline in a concrete application example. The pipeline consists of seven stages: DC1, DC2, EX1, EX2, EX3, EX4, and EX5. After L1D receives a load or store instruction, it first decodes it over two pipeline stages (DC1 and DC2), determining the operation type of the instruction and the function to perform. Address calculation follows decoding (EX1). Then comes function execution. The present invention designs an L1D Cache access pipeline with a 3-cycle read and a 2-cycle write. The read pipeline occupies positions EX2, EX3, and EX4 of the SMAC main access pipeline, performing hit/miss judgment, access/miss handling, and access output, respectively; the write pipeline occupies positions EX2 and EX3, performing miss judgment and access/miss handling, respectively. As part of the scalar memory-access main pipeline, the L1D Cache pipeline is also controlled by the core's global stall signal (Stall) and the pipeline-flush signal.

FIG. 6 is a schematic diagram of the overall structure of the last-level Cache pipeline in a concrete application example. As the figure shows, requests entering the last-level Cache take one of two paths through the pipeline. L1D read-return requests and invalidation acknowledgments returned by L1D need not be buffered in the input buffer; they bypass it and pass directly to the pipeline's Tag_Judge stage, which is the first path. Data requests accessing the DDR memory space must first be buffered in the input buffer and then flow through the pipeline from its first stage, which is the second path. The pipeline consists of five stages: Req_Arb, Tag_Wait, Tag_Judge, Data_Acc, and Data_Dec.

The function performed by each pipeline stage is described in detail below.

Pipeline stage one (Req_Arb): at this stage, requests undergo round-robin arbitration with Flush requests, and the arbitrated request is sent to the next stage. At the same time, the valid bit, dirty bit, and Tag information of the requested data block are read.

Pipeline stage two (Tag_Wait): this stage only determines whether the request is a directory request and reads the directory information.

Pipeline stage three (Tag_Judge): first judge whether the request hits; on a miss, further judge whether its address matches an entry in MBUF. A related miss request is sent to MBUF; an unrelated miss request is sent to both MBUF and OBUF. A hit request is processed according to whether it is a directory request. A non-directory request generates the data-array access enable. A directory request consults the directory-entry information, and its handling differs with the directory state and the sharer list. Directory requests are handled in three ways: the first kind operates directly and generates the data-array access enable; the second kind, because the latest data is not in the last-level Cache, must wait for L1D data to return and generates the data-array access enable only after the data arrives; the third kind must wait for Inv-Ack requests and generates the data-array access enable only after all invalidation acknowledgments have arrived.
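The routing decision of this stage can be condensed into one function (an illustrative Python sketch; the flag names, the `dir_class` encoding of the three directory-request kinds, and the action strings are all assumptions):

```python
def tag_judge(hit, mbuf_match, is_dir_req, dir_class):
    """Model the Tag_Judge stage's routing decision.

    dir_class: 1 = operate directly, 2 = wait for L1D data return,
               3 = wait for all Inv-Acks (meaningful only for
               directory-request hits).
    """
    if not hit:
        # Misses go to MBUF; unrelated misses also go out via OBUF.
        return 'to MBUF' if mbuf_match else 'to MBUF and OBUF'
    if not is_dir_req:
        return 'enable data-array access'
    return {1: 'enable data-array access',
            2: 'wait for L1D data, then enable',
            3: 'wait for Inv-Acks, then enable'}[dir_class]
```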

Pipeline stage four (Data_Acc): first determine the class of the request's data-array operation. For a read operation, the requested data block is read from the data array and latched for one cycle; for a write operation, the write data is first encoded and the data array is then updated. A request performing a data-array read finally sends the read data to the next stage.

Pipeline stage five (Data_Dec): decode the request data read by the previous stage, and send the read-return data request to the up-ring arbitration processing module for handling.

The above are only preferred embodiments of the present invention; its scope of protection is not limited to the above embodiments, and all technical solutions under the concept of the present invention fall within its scope of protection. It should be pointed out that, for those of ordinary skill in the art, several improvements and refinements made without departing from the principle of the present invention shall also be regarded as within the scope of protection of the present invention.

Claims (9)

1. A multi-core directory coherence device oriented to a GPDSP architecture, characterized by comprising:
a kernel comprising a DMA and an L1D, the L1D being a level-one data Cache; the DMA is used for moving data between peripherals and cores; the L1D comprises two parallel processing units, Normal Deal and Monitor Deal, wherein the Normal Deal processing unit completes the processing of load and store instructions, and the Monitor Deal processing unit responds to snoop requests arriving at any time without its processing being affected by the Normal Deal processing unit;
a distributed on-chip last-level Cache, connected to the on-chip interconnection network;
an off-chip memory DDR, whose data is cached in the L1D and the on-chip last-level Cache;
an on-chip interconnection network, used to receive network requests; on receiving a network request, it first performs decoding, resolves the destination node and destination device, and then delivers the request to the corresponding location.
2. The multi-core directory coherence device oriented to a GPDSP architecture according to claim 1, characterized in that the on-chip last-level Cache is divided into several banks, each bank consisting of an input buffer unit IBUF, a pipeline unit PipeLine, an output buffer unit OBUF, and a return-network processing logic unit Rtn NAC; the input buffer unit is responsible for caching requests entering the last-level Cache from the network-on-chip; the pipeline unit performs pipelined processing of requests from the input buffer that access the DDR memory space; the output buffer unit is responsible for caching requests from the last-level Cache to the DDR; the return-network processing logic unit is responsible for arbitrating among the various kinds of requests entering the network-on-chip.
3. The multi-core directory coherence device oriented to a GPDSP architecture according to claim 1 or 2, characterized by further comprising an MSI directory protocol unit for performing coherence maintenance on requests sent by the L1D; the MSI directory protocol unit consists of three directory states M, S, and I; state M indicates that the data is exclusively owned by one DSP Core and is dirty; state S indicates that the data is shared by one or more DSP Cores and is clean; state I indicates that no DSP Core holds a copy of the data.
4. The multi-core directory coherence device oriented to a GPDSP architecture according to claim 3, characterized by further comprising directory controllers, which guarantee the correctness of the scheme at the protocol level and are used to complete the processing of a single request under different directory states, the conflict handling of multiple related requests, and the responses of data blocks in directory "transient" states to related requests; the directory controllers are divided into two classes, one placed in the L1D and the other placed in the on-chip last-level Cache, so that directory operations are performed in both the L1D and the on-chip last-level Cache.
5. The multi-core directory coherence device oriented to a GPDSP architecture according to claim 3, characterized in that the on-chip last-level Cache stores a full-directory structure, the directory structure allocating one directory entry for each data block cached in the on-chip last-level Cache; each directory entry comprises a directory state and a sharer list, the sharer list allocating one bit per DSP Core to indicate whether the data has a copy in the corresponding DSP Core.
6. The multi-core directory coherence device oriented to a GPDSP architecture according to claim 1 or 2, characterized in that a pipeline structure is adopted in the L1D to complete the pipelined operations of instruction decoding, address calculation, tag and state-bit reading, hit judgment, data-array access, and data return.
7. The multi-core directory coherence device oriented to a GPDSP architecture according to claim 6, characterized in that the pipeline stages of the L1D are DC1, DC2, EX1, EX2, EX3, EX4, and EX5; after the L1D receives a load or store instruction, it first decodes it over two pipeline stages, namely DC1 and DC2, judging the operation type of the instruction and the function to be performed; address calculation follows decoding, namely EX1, completing function realization; the L1D Cache access pipeline then completes a 3-cycle read and a 2-cycle write, wherein the read pipeline occupies positions EX2, EX3, and EX4 of the SMAC main access pipeline, performing hit/miss judgment, access/miss handling, and access output, respectively; the write pipeline occupies positions EX2 and EX3, performing miss judgment and access/miss handling, respectively.
8. The multi-core directory coherence device oriented to a GPDSP architecture according to claim 1 or 2, characterized in that a pipeline structure is adopted in the on-chip last-level Cache to realize the pipelined operations of tag and state-bit reading, directory-entry reading, hit judgment, snoop processing, data-array access, and data return.
9. The multi-core directory coherence device oriented to a GPDSP architecture according to claim 8, characterized in that the pipeline structure of the on-chip last-level Cache comprises:
pipeline stage one, Req_Arb: perform round-robin arbitration with Flush requests, and send the arbitrated request to the next stage; at the same time, read the valid bit, dirty bit, and Tag information of the requested data block;
pipeline stage two, Tag_Wait: determine whether the request is a directory request, and read the directory information;
pipeline stage three, Tag_Judge: first judge whether the request hits; on a miss, further judge whether it is related to an address in MBUF; a related miss request is sent to MBUF; an unrelated miss request is sent to MBUF and OBUF; a hit request is processed according to whether it is a directory request; a non-directory request generates the data-array access enable; a directory request consults the directory-entry information, its handling differing with the directory state and the sharer list; directory requests are handled in three ways: the first kind operates directly and generates the data-array access enable; the second kind waits for L1D data to return and generates the data-array access enable after the data arrives; the third kind waits for Inv-Ack requests and generates the data-array access enable only after all invalidation acknowledgments have arrived;
pipeline stage four, Data_Acc: first determine the class of the request's data-array operation; for a read operation, the requested data block is read from the data array and latched for one cycle; for a write operation, the write data is first encoded and the data array is then updated; a request performing a data-array read finally sends the read data to the next stage;
pipeline stage five, Data_Dec: decode the request data read by the previous stage, and send the read-return data request to the ring arbitration processing module for handling.
CN201610503703.5A 2016-06-30 2016-06-30 Multi-core directory coherence device oriented to GPDSP architecture Active CN106201939B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610503703.5A CN106201939B (en) Multi-core directory coherence device oriented to GPDSP architecture


Publications (2)

Publication Number Publication Date
CN106201939A true CN106201939A (en) 2016-12-07
CN106201939B CN106201939B (en) 2019-04-05

Family

ID=57463707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610503703.5A Active CN106201939B (en) 2016-06-30 2016-06-30 Multicore catalogue consistency device towards GPDSP framework

Country Status (1)

Country Link
CN (1) CN106201939B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117396A (en) * 2018-08-30 2019-01-01 山东经安纬固消防科技有限公司 memory access method and system
CN110704343A (en) * 2019-09-10 2020-01-17 无锡江南计算技术研究所 Data transmission method and device for memory access and on-chip communication of many-core processor
CN113435153A (en) * 2021-06-04 2021-09-24 上海天数智芯半导体有限公司 Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems
CN116028418A (en) * 2023-02-13 2023-04-28 中国人民解放军国防科技大学 GPDSP-based extensible multi-core processor, acceleration card and computer
CN119669109A (en) * 2025-02-21 2025-03-21 山东云海国创云计算装备产业创新中心有限公司 Many-core cache consistency system, method, electronic device, storage medium and product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279428A (en) * 2013-05-08 2013-09-04 中国人民解放军国防科学技术大学 Explicit multi-core Cache consistency active management method facing flow application
CN103714039A (en) * 2013-12-25 2014-04-09 中国人民解放军国防科学技术大学 Universal computing digital signal processor
CN104679689A (en) * 2015-01-22 2015-06-03 中国人民解放军国防科学技术大学 Multi-core DMA (direct memory access) subsection data transmission method used for GPDSP (general purpose digital signal processor) and adopting slave counting
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN105389277A (en) * 2015-10-29 2016-03-09 中国人民解放军国防科学技术大学 Scientific computation-oriented high performance DMA (Direct Memory Access) part in GPDSP (General-Purpose Digital Signal Processor)
CN105718242A (en) * 2016-01-15 2016-06-29 中国人民解放军国防科学技术大学 Processing method and system for supporting software and hardware data consistency in multi-core DSP (Digital Signal Processing)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279428A (en) * 2013-05-08 2013-09-04 中国人民解放军国防科学技术大学 Explicit multi-core Cache consistency active management method facing flow application
CN103714039A (en) * 2013-12-25 2014-04-09 中国人民解放军国防科学技术大学 Universal computing digital signal processor
CN104679689A (en) * 2015-01-22 2015-06-03 中国人民解放军国防科学技术大学 Multi-core DMA (direct memory access) subsection data transmission method used for GPDSP (general purpose digital signal processor) and adopting slave counting
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN105389277A (en) * 2015-10-29 2016-03-09 中国人民解放军国防科学技术大学 Scientific computation-oriented high performance DMA (Direct Memory Access) part in GPDSP (General-Purpose Digital Signal Processor)
CN105718242A (en) * 2016-01-15 2016-06-29 中国人民解放军国防科学技术大学 Processing method and system for supporting software and hardware data consistency in multi-core DSP (Digital Signal Processing)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李明: ""X-DSP一级数据Cache的设计与实现"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117396A (en) * 2018-08-30 2019-01-01 山东经安纬固消防科技有限公司 memory access method and system
CN110704343A (en) * 2019-09-10 2020-01-17 无锡江南计算技术研究所 Data transmission method and device for memory access and on-chip communication of many-core processor
CN113435153A (en) * 2021-06-04 2021-09-24 上海天数智芯半导体有限公司 Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems
CN113435153B (en) * 2021-06-04 2022-07-22 上海天数智芯半导体有限公司 Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems
CN116028418A (en) * 2023-02-13 2023-04-28 中国人民解放军国防科技大学 GPDSP-based extensible multi-core processor, acceleration card and computer
CN119669109A (en) * 2025-02-21 2025-03-21 山东云海国创云计算装备产业创新中心有限公司 Many-core cache consistency system, method, electronic device, storage medium and product

Also Published As

Publication number Publication date
CN106201939B (en) 2019-04-05

Similar Documents

Publication Publication Date Title
US9513904B2 (en) Computer processor employing cache memory with per-byte valid bits
CN104699631B (en) Multi-level cooperative and shared storage device and memory access method in GPDSP
US8180981B2 (en) Cache coherent support for flash in a memory hierarchy
US6622217B2 (en) Cache coherence protocol engine system and method for processing memory transaction in distinct address subsets during interleaved time periods in a multiprocessor system
US6636949B2 (en) System for handling coherence protocol races in a scalable shared memory system based on chip multiprocessing
US6697919B2 (en) System and method for limited fanout daisy chaining of cache invalidation requests in a shared-memory multiprocessor system
US9740617B2 (en) Hardware apparatuses and methods to control cache line coherence
US20170185515A1 (en) Cpu remote snoop filtering mechanism for field programmable gate array
US9361233B2 (en) Method and apparatus for shared line unified cache
GB2349721A (en) Multi-processor data processing system
US7000078B1 (en) System and method for maintaining cache coherency in a shared memory system
US20160092354A1 (en) Hardware apparatuses and methods to control cache line coherency
WO2018229702A1 (en) Reducing cache transfer overhead in a system
CN108710582A (en) System, apparatus and method for selectively enabling instruction processing based on locality
US6622218B2 (en) Cache coherence protocol engine and method for efficient processing of interleaved memory transactions in a multiprocessor system
CN106201939B (en) Multi-core directory coherence device for GPDSP architecture
CN111913891A (en) Hybrid directory and snoop based coherency for reducing directory update overhead in a two-tiered memory
CN109661656A (en) Method and apparatus for the intelligent storage operation using the request of condition ownership
US6412047B2 (en) Coherency protocol
US9436605B2 (en) Cache coherency apparatus and method minimizing memory writeback operations
US20240419599A1 (en) Direct cache transfer with shared cache lines
US20240419595A1 (en) Coherent hierarchical cache line tracking
US20070073977A1 (en) Early global observation point for a uniprocessor system
van den Brand et al. Streaming consistency: a model for efficient MPSoC design
JP2001043133A (en) Method and system for maintaining cache coherency for write-through-store operation in multiprocessor system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant