
CN120821687A - Heterogeneous computing processing core implementation method based on embedded architecture

Heterogeneous computing processing core implementation method based on embedded architecture

Info

Publication number
CN120821687A
Authority
CN
China
Prior art keywords
processing core
instruction
module
write
main processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510866202.2A
Other languages
Chinese (zh)
Inventor
周凯
颜佳宁
于帆
李乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202510866202.2A
Publication of CN120821687A
Legal status: Pending

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract


The present invention discloses a method for implementing a heterogeneous computing processing core based on an embedded architecture, relating to the technical field of heterogeneous computing and parallel processing. To address the difficulty existing heterogeneous computing processing cores have in overcoming limited data-transmission efficiency and complex software scheduling, the following scheme is adopted: in the embedded heterogeneous computing processing core paradigm, N independent processing cores are divided into one main processing core and N-1 co-processing cores. The main processing core is responsible for interacting with the heterogeneous programming software framework and undertakes all computing tasks of the four modules: cache, instruction fetch and decode, issue, and execute. Each co-processing core retains only the physical circuit of its execution module; its cache, fetch/decode, and issue operations are remotely controlled by the corresponding modules of the main processing core over the on-chip bus. By optimizing the heterogeneous computing processing core paradigm at the hardware level, the invention saves hardware resources and simplifies software execution.

Description

Heterogeneous computing processing core implementation method based on embedded architecture
Technical Field
The invention relates to the technical field of heterogeneous computing and parallel processing, and in particular to a method for implementing a heterogeneous computing processing core based on an embedded architecture.
Background
As computing tasks grow more complex and the scale of computing power expands, traditional homogeneous processing cores find it increasingly difficult to meet computing requirements, which has given rise to heterogeneous computing. Heterogeneous computing combines multiple processing cores of different architectures and different computational suitability to build highly adaptable, high-performance computing products, providing an efficient solution for modern computing demands.
However, while existing heterogeneous computing approaches offer excellent advantages in computing power supply, their heterogeneous nature also introduces performance bottlenecks. On the one hand, frequent data interaction is required among multiple independent processing cores, and data-transmission efficiency is limited by the number of interaction interfaces of the cores, which is difficult to improve further. On the other hand, a heterogeneous computing product requires a purpose-built heterogeneous programming software framework that constructs a separate execution file for each independent processing core and schedules and coordinates the execution across the cores, which increases the complexity of software programming.
Disclosure of Invention
To address the difficulty existing heterogeneous computing processing cores have in overcoming limited data-transmission efficiency and complex software scheduling, the invention provides a method for implementing a heterogeneous computing processing core based on an embedded architecture; by optimizing the heterogeneous computing processing core paradigm at the hardware level, it saves hardware resources and simplifies software execution.
To solve the above technical problems, the method adopts the following technical scheme:
In the embedded heterogeneous computing processing core paradigm, N independent processing cores are divided into one main processing core and N-1 co-processing cores. The main processing core is responsible for interacting with the heterogeneous programming software framework and undertakes all computing tasks of the four modules: cache, instruction fetch and decode, issue, and execute. Each co-processing core retains only the physical circuit of its execution module; its cache, fetch/decode, and issue operations are completed under remote control by the corresponding modules of the main processing core over the on-chip bus.
Optionally, the method proceeds as follows:
S1. The heterogeneous programming software framework divides the overall computing task into main-core tasks and co-core tasks according to the computing suitability of each processing core and tags them accordingly, then sends the overall computing task and required data to the instruction cache and data cache of the main processing core and sends a start signal to the main processing core;
S2. The instruction fetch and decode module of the main processing core starts working, reads instructions from the instruction cache, and decodes them;
S3. Based on the decoding result, the main processing core enters the scoreboard logic to check for register conflicts; if none exist, it reads operands from the register file and enters the issue module;
S4. After the execution modules of the main and co-processing cores complete their computations, the results are written back to the register file or data cache of the main processing core through the write-back module, with the co-cores' write-back operations uniformly controlled by the main core's write-back module.
Further optionally, the issue module serves as the task-scheduling hub of the main and co-processing cores and executes differentiated dispatch logic based on the task identifier output by the fetch/decode module:
main-core tasks are pushed directly to the local execution module to start computation;
for co-core tasks, task data and instruction parameters are packaged and sent over the on-chip high-speed bus to the execution module of the corresponding co-core;
during this process, the issue module also completes task address mapping and permission verification for the co-core to ensure the correctness and security of data transmission, and confirms the task-reception state with the co-core through handshake signals to avoid task loss caused by bus congestion.
Optionally, the method constructs a dual-pointer instruction queue management mechanism that decouples the sequential sending and receiving of instructions from the concurrent execution of computation, implemented as follows:
1.1) The main processing core maintains an issue instruction pointer and a write-back instruction pointer, both initialized to 0 before the overall computing task begins execution;
1.2) Check whether the instruction queue in the scoreboard is full:
1.2a) if full, suspend the fetch/decode module until a free slot appears in the queue;
1.2b) if not full, the fetch/decode module fetches and decodes an instruction and sends it to the issue module, then proceed to step 1.3);
1.3) The issue module determines the instruction's target execution module from the decoding result and checks whether that module is available:
1.3a) if unavailable, place the instruction in a waiting queue;
1.3b) if available, send the instruction to the target execution module, increment the issue pointer by 1, and leave the write-back pointer unchanged; then loop back to step 1.2) until the scoreboard is full or no instructions remain to issue;
1.4) The main processing core monitors the instruction execution state of the target execution modules through the write-back pointer:
1.4a) if the target execution module has not completed the instruction, keep waiting;
1.4b) if it has, perform the write-back operation, store the write-back data into a register or the data cache, increment the write-back pointer by 1, and leave the issue pointer unchanged; then continue the loop to check the state of the next instruction.
Optionally, the method constructs a memory access barrier management mechanism, configuring memory access units in the execution modules of the main and co-processing cores to fetch data from the data cache, implemented as follows:
2.1) When the fetch/decode module of the main core finishes decoding, if the current instruction is a memory access operation, the access address and the type of processing core performing the operation are determined from the operands, and the core-type identifier is then stored in the access barrier table under that access address:
2.1a) if the address already exists in the barrier table, the core-type identifier is appended to the end of that address entry, forming an ordered queue;
2.1b) if the address entry does not yet exist in the barrier table, a new address entry row is created and the core-type identifier is written into its first column;
2.2) When a memory access unit of the main or a co-processing core completes its access computation and prepares to write back to the cache, it first enters the barrier table for access-address arbitration:
2.2a) if the cores currently performing the access operation are all at the first column of their address entries, those cores perform write-back concurrently;
2.2b) if a core is not at the first column of its address entry, its write-back is suspended until the core at the first column of that entry completes write-back;
2.3) When the first-column core of an address in the barrier table completes write-back, all core identifiers under that entry shift left by one position, the vacated first column being filled by the following identifier; if no core-type identifier remains under the entry, the entry is cleared.
Further optionally, the processing core types involved include a main processing core and a co-processing core.
Further optionally, in step 2.2a), if the cores currently performing the access operation are all at the first column of their address entries, a main core at the first column is controlled by its local write-back module, while a co-core at the first column is uniformly controlled by the main core's write-back module.
Optionally, when the scoreboard detects register conflicts, it maintains a dynamic dependency table recording the read/write state of each register:
when a RAW conflict is detected, the instruction enters a waiting queue until the dependency is released;
for WAR/WAW conflicts, physical registers are dynamically allocated through register renaming to eliminate the conflicts.
Optionally, in the embedded heterogeneous computing processing core paradigm, one main processing core and N-1 co-processing cores are connected through an on-chip bus to form a collaborative computing architecture;
the main processing core is a CPU (central processing unit) used for task scheduling, instruction distribution, and global control, while each co-processing core is any one of a GPU (graphics processing unit), an FPGA (field-programmable gate array), or an ASIC (application-specific integrated circuit), selected according to computing requirements.
Compared with the prior art, the method for implementing a heterogeneous computing processing core based on an embedded architecture has the following beneficial effects:
1. The invention organizes the multiple independent processing cores of the heterogeneous computing paradigm into a one-main/multiple-co-core mode and embeds the co-cores' pipeline modules, such as cache, instruction fetch and decode, and issue, into the main processing core. This reduces logic resources and turns the interaction links among the processing cores from external IO interfaces into internal layout and routing, giving higher transmission efficiency;
2. By constructing the dual-pointer instruction queue management mechanism and the memory access barrier management mechanism, heterogeneous computing performance and memory access performance are improved, so that the main core's execution module and the many co-core execution modules can execute concurrently without breaking the ordering of the overall computing task.
Drawings
FIG. 1 is a schematic diagram of an overall architecture of an embedded heterogeneous computing processing core according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of a conventional heterogeneous computing processing core;
FIG. 3 is a schematic diagram of the dual-pointer instruction queue management mechanism according to the second embodiment of the present invention;
FIG. 4 is a schematic diagram of the memory access barrier management mechanism according to the third embodiment of the present invention.
Detailed Description
To make the technical scheme of the invention, the technical problems it solves, and its technical effects clearer, the scheme is described below clearly and completely in combination with specific embodiments.
Embodiment one:
Referring to FIG. 1, this embodiment proposes a method for implementing a heterogeneous computing processing core based on an embedded architecture. In the embedded heterogeneous computing processing core paradigm, N independent processing cores are divided into one main processing core and N-1 co-processing cores. The main processing core is responsible for interacting with the heterogeneous programming software framework and undertakes all computing tasks of the four modules: cache, instruction fetch and decode, issue, and execute. Each co-processing core retains only the physical circuit of its execution module; its cache, fetch/decode, and issue operations are completed under remote control by the corresponding modules of the main processing core over the on-chip bus.
The method of this embodiment proceeds as follows:
S1. The heterogeneous programming software framework divides the overall computing task into main-core tasks and co-core tasks according to the computing suitability of each processing core and tags them accordingly, then sends the overall computing task and required data to the instruction cache and data cache of the main processing core and sends a start signal to the main processing core;
S2. The instruction fetch and decode module of the main processing core starts working, reads instructions from the instruction cache, and decodes them, identifying co-core tasks by the processing-core type tagged in the computing task;
S3. Based on the decoding result, the main processing core enters the scoreboard logic to check for register conflicts; if none exist, it reads operands from the register file and enters the issue module;
S4. After the execution modules of the main and co-processing cores complete their computations, the results are written back to the register file or data cache of the main processing core through the write-back module, with the co-cores' write-back operations uniformly controlled by the main core's write-back module.
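To make the S1 hand-off concrete, here is a minimal Python sketch written for this description; the names (CoreType, TaggedInstruction, MainCore, dispatch_program) are hypothetical illustrations, not the patent's implementation. It shows only the framework-side behavior: tag each task with the core suited to it, load the main core's instruction and data caches, and raise the start signal.

```python
from dataclasses import dataclass, field
from enum import Enum

class CoreType(Enum):
    MAIN = 0  # the CPU main processing core
    CO = 1    # a co-processing core (GPU/FPGA/ASIC execution module)

@dataclass
class TaggedInstruction:
    opcode: str
    core: CoreType  # S1 tag: which kind of core should execute this task

@dataclass
class MainCore:
    instr_cache: list = field(default_factory=list)
    data_cache: dict = field(default_factory=dict)
    started: bool = False

def dispatch_program(main: MainCore, program: list, data: dict) -> None:
    """S1: load the tagged program and its data into the main core's caches,
    then raise the start signal; the framework does not monitor cores after this."""
    main.instr_cache.extend(program)
    main.data_cache.update(data)
    main.started = True

prog = [TaggedInstruction("vadd", CoreType.MAIN),
        TaggedInstruction("gemm", CoreType.CO)]
main = MainCore()
dispatch_program(main, prog, {"x": [1, 2, 3]})
assert main.started and len(main.instr_cache) == 2
```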
In this embodiment, the issue module serves as the task-scheduling hub of the main and co-processing cores and executes differentiated dispatch logic based on the task identifier output by the fetch/decode module:
main-core tasks are pushed directly to the local execution module to start computation;
for co-core tasks, task data and instruction parameters are packaged and sent over the on-chip high-speed bus to the execution module of the corresponding co-core;
during this process, the issue module also completes task address mapping and permission verification for the co-core to ensure the correctness and security of data transmission, and confirms the task-reception state with the co-core through handshake signals to avoid task loss caused by bus congestion.
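A minimal sketch of this dispatch logic follows; IssueModule, its bounded-queue handshake model, and all other names are assumptions made for illustration, not the patent's circuit. Main-core work is pushed to a local queue, co-core work passes an address-map/permission check and a handshake before delivery, and unacknowledged packets are held for retry rather than dropped.

```python
from collections import deque

MAIN, CO = 0, 1  # core-type tags produced by the fetch/decode module

class IssueModule:
    """Issue stage as the scheduling hub (hypothetical software model)."""
    def __init__(self, co_core_ids, queue_depth=8):
        self.main_queue = deque()                   # local execution module
        self.co_queues = {cid: deque() for cid in co_core_ids}
        self.retry_queue = deque()                  # held back on congestion
        self.queue_depth = queue_depth

    def dispatch(self, instr):
        core_type, core_id, payload = instr
        if core_type == MAIN:
            self.main_queue.append(payload)         # direct local push
            return
        # Co-core path: address mapping and permission check on the on-chip bus.
        if core_id not in self.co_queues:
            raise PermissionError(f"co-core {core_id} is not mapped")
        if self._handshake(core_id):                # confirm task receipt
            self.co_queues[core_id].append(payload)
        else:
            self.retry_queue.append(instr)          # avoid loss under congestion

    def _handshake(self, core_id):
        # Congestion modeled as a bounded receive queue at the co-core.
        return len(self.co_queues[core_id]) < self.queue_depth

issue = IssueModule(co_core_ids=[1, 2, 3])
issue.dispatch((MAIN, 0, "vadd r1, r2, r3"))   # stays on the main core
issue.dispatch((CO, 2, "gemm tile"))           # packaged and sent to co-core 2
```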
It should be noted that, in the embedded heterogeneous computing processing core paradigm, one main processing core and N-1 co-processing cores are connected through an on-chip bus to form a collaborative computing architecture. The main processing core is a CPU used for task scheduling, instruction distribution, and global control, while each co-processing core is any one of a GPU, an FPGA, or an ASIC, selected according to computing requirements.
With reference to FIGS. 1 and 2, it can be seen from the foregoing execution process that, in the embedded heterogeneous computing processing core paradigm of this embodiment, the heterogeneous programming software framework only needs to tag computing tasks with processing-core identifiers at compile time and to start the main processing core when a task is issued; it does not need to monitor the running state of the processing cores in real time, which greatly reduces software running overhead.
Embodiment two:
Building on the first embodiment and referring to FIG. 3, to further improve heterogeneous computing performance and memory access performance, so that the main core's execution module and the many co-core execution modules can execute concurrently without breaking the ordering of the overall computing task, this embodiment constructs a dual-pointer instruction queue management mechanism that decouples the sequential sending and receiving of instructions from the concurrent execution of computation. It is implemented as follows:
1.1) The main processing core maintains an issue instruction pointer and a write-back instruction pointer, both initialized to 0 before the overall computing task begins execution;
1.2) Check whether the instruction queue in the scoreboard is full:
1.2a) if full, suspend the fetch/decode module until a free slot appears in the queue;
1.2b) if not full, the fetch/decode module fetches and decodes an instruction and sends it to the issue module, then proceed to step 1.3);
1.3) The issue module determines the instruction's target execution module from the decoding result and checks whether that module is available:
1.3a) if unavailable, place the instruction in a waiting queue;
1.3b) if available, send the instruction to the target execution module, increment the issue pointer by 1, and leave the write-back pointer unchanged; then loop back to step 1.2) until the scoreboard is full or no instructions remain to issue;
1.4) The main processing core monitors the instruction execution state of the target execution modules through the write-back pointer:
1.4a) if the target execution module has not completed the instruction, keep waiting;
1.4b) if it has, perform the write-back operation, store the write-back data into a register or the data cache, increment the write-back pointer by 1, and leave the issue pointer unchanged; then continue the loop to check the state of the next instruction.
In this process, by maintaining the issue instruction pointer, the main processing core ensures that the execution modules of all processing cores can compute concurrently, rather than only one execution module computing at a time; and by maintaining the write-back instruction pointer, it allows multiple execution modules across cores to assert their write-back valid signals simultaneously, while the actual write-back of valid data strictly follows the original instruction order, undisturbed by the concurrently executing modules.
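The dual-pointer behavior can be illustrated with the hypothetical Python model below (DualPointerQueue and its methods are assumptions, not the patent's hardware): completions may arrive in any order from the concurrent execution modules, but retirement advances only at the write-back pointer, strictly in program order.

```python
class DualPointerQueue:
    """Sketch of the dual-pointer instruction queue (assumed software model)."""
    def __init__(self, depth=8):
        self.depth = depth
        self.slots = []        # in-flight instructions, kept in program order
        self.issue_ptr = 0     # count of issued instructions (step 1.3b)
        self.wb_ptr = 0        # next instruction allowed to write back (step 1.4)

    def try_issue(self, instr):
        if len(self.slots) >= self.depth:
            return False       # 1.2a: queue full, fetch/decode stalls
        self.slots.append({"instr": instr, "done": False})
        self.issue_ptr += 1    # 1.3b: issue pointer +1, write-back pointer unchanged
        return True

    def complete(self, index):
        self.slots[index]["done"] = True   # execution finishes in any order

    def drain_writebacks(self):
        """1.4: retire strictly in order; a finished younger instruction
        waits behind an unfinished older one."""
        retired = []
        while self.wb_ptr < len(self.slots) and self.slots[self.wb_ptr]["done"]:
            retired.append(self.slots[self.wb_ptr]["instr"])
            self.wb_ptr += 1   # 1.4b: write-back pointer +1, issue pointer unchanged
        return retired

q = DualPointerQueue()
for i in range(4):
    q.try_issue(f"op{i}")
q.complete(2)                               # op2 finishes first, concurrently
assert q.drain_writebacks() == []           # but op0/op1 must retire before it
q.complete(0); q.complete(1)
assert q.drain_writebacks() == ["op0", "op1", "op2"]
```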
Embodiment three:
Building on the first or second embodiment and referring to FIG. 4, to improve the memory access efficiency of the main and co-processing cores, this embodiment further constructs a memory access barrier management mechanism, configuring memory access units in the execution modules of the main and co-processing cores to fetch data from the data cache. It is implemented as follows:
2.1) When the fetch/decode module of the main core finishes decoding, if the current instruction is a memory access operation, the access address and the type of processing core performing the operation (main or co-processing core) are determined from the operands, and the core-type identifier is then stored in the access barrier table under that access address:
2.1a) if the address already exists in the barrier table, the core-type identifier is appended to the end of that address entry, forming an ordered queue;
2.1b) if the address entry does not yet exist in the barrier table, a new address entry row is created and the core-type identifier is written into its first column. It should be added that if the cores currently performing the access operation are all at the first column of their address entries, a main core at the first column is controlled by its local write-back module, while a co-core at the first column is uniformly controlled by the main core's write-back module;
2.2) When a memory access unit of the main or a co-processing core completes its access computation and prepares to write back to the cache, it first enters the barrier table for access-address arbitration:
2.2a) if the cores currently performing the access operation are all at the first column of their address entries, those cores perform write-back concurrently;
2.2b) if a core is not at the first column of its address entry, its write-back is suspended until the core at the first column of that entry completes write-back;
2.3) When the first-column core of an address in the barrier table completes write-back, all core identifiers under that entry shift left by one position, the vacated first column being filled by the following identifier; if no core-type identifier remains under the entry, the entry is cleared.
Because the main processing core maintains the access barrier table, the multiple processing cores of the heterogeneous computing paradigm can interact with the cache concurrently, improving data-transmission efficiency, while the required read/write ordering of the data is preserved, guaranteeing the validity of the cached data.
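A minimal sketch of such a barrier table, with hypothetical names (AccessBarrierTable, register, may_write_back, retire) standing in for the patent's hardware structures: each address keeps an ordered queue of core identifiers, only the head of a queue may write back, and retiring the head shifts the queue left.

```python
from collections import defaultdict, deque

class AccessBarrierTable:
    """Per-address ordered queues of core identifiers (assumed software model)."""
    def __init__(self):
        self.table = defaultdict(deque)   # access address -> queue of core ids

    def register(self, addr, core_id):
        # 2.1: at decode time, append the core to the address entry,
        # creating the entry if it does not yet exist.
        self.table[addr].append(core_id)

    def may_write_back(self, addr, core_id):
        # 2.2: arbitration -- only the head of the queue may write back;
        # any other core must suspend its write-back.
        q = self.table.get(addr)
        return bool(q) and q[0] == core_id

    def retire(self, addr):
        # 2.3: the head finished its write-back; shift left one position
        # and clear the entry once it is empty.
        q = self.table[addr]
        q.popleft()
        if not q:
            del self.table[addr]

bar = AccessBarrierTable()
bar.register(0x1000, "main")                    # main core's store decoded first
bar.register(0x1000, "co1")                     # co-core store to the same address
assert not bar.may_write_back(0x1000, "co1")    # co1 stalls behind main
assert bar.may_write_back(0x1000, "main")
bar.retire(0x1000)                              # main wrote back; co1 is now head
assert bar.may_write_back(0x1000, "co1")
```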
For the first, second, and third embodiments, it should be noted that when the scoreboard detects register conflicts, it maintains a dynamic dependency table recording the read/write state of each register:
when a RAW conflict is detected, the instruction enters a waiting queue until the dependency is released;
for WAR/WAW conflicts, physical registers are dynamically allocated through register renaming to eliminate the conflicts.
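As an illustration only, the sketch below models this policy (Scoreboard and its free-list renaming are hypothetical simplifications; reclaiming old physical registers at retirement is omitted): RAW hazards stall into a wait queue, while WAR/WAW hazards disappear because each new write is renamed to a fresh physical register.

```python
class Scoreboard:
    """Dependency tracking with renaming (assumed software model)."""
    def __init__(self, num_arch=32, num_phys=64):
        self.free = list(range(num_arch, num_phys))   # free physical registers
        self.map = {r: r for r in range(num_arch)}    # architectural -> physical
        self.pending_writes = set()                   # phys regs awaiting write-back
        self.wait_queue = []

    def try_issue(self, dst, srcs):
        # RAW: a source whose producer has not written back yet forces a stall.
        phys_srcs = [self.map[r] for r in srcs]
        if any(p in self.pending_writes for p in phys_srcs):
            self.wait_queue.append((dst, srcs))
            return None
        # WAR/WAW: rename dst to a fresh physical register instead of stalling.
        new_phys = self.free.pop()
        self.map[dst] = new_phys
        self.pending_writes.add(new_phys)
        return new_phys, phys_srcs

    def write_back(self, phys):
        self.pending_writes.discard(phys)             # dependency released

sb = Scoreboard()
first = sb.try_issue(dst=1, srcs=[2, 3])      # write to r1 gets a fresh phys reg
assert first is not None
assert sb.try_issue(dst=4, srcs=[1]) is None  # RAW on r1: stalls in wait queue
assert sb.try_issue(dst=1, srcs=[2])          # WAW on r1: renamed, issues anyway
```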
In summary, with the described method for implementing a heterogeneous computing processing core based on an embedded architecture, the multiple independent processing cores of the heterogeneous computing paradigm are organized into a one-main/multiple-co-core mode: the co-cores' pipeline modules, such as cache, instruction fetch and decode, and issue, are embedded into the main processing core, reducing logic resources, and inter-core interaction moves from external IO interfaces to on-chip layout and routing, yielding higher transmission efficiency. The approach also simplifies the heterogeneous programming software framework: when building the execution file for the main processing core, the framework only needs to add the corresponding co-core identifiers according to the co-cores' supported functions; it no longer needs to build a separate execution file for each processing core or to schedule the execution processes of the multiple cores.
The foregoing describes the principles and embodiments of the present invention to aid understanding of the method and its core idea. Any improvements and modifications made by those skilled in the art on the basis of the above embodiments, without departing from the principles of the invention, shall fall within the scope of the invention.

Claims (9)

1. A method for implementing a heterogeneous computing processing core based on an embedded architecture, characterized in that, in the embedded heterogeneous computing processing core paradigm, N independent processing cores are divided into one main processing core and N-1 co-processing cores, wherein the main processing core is responsible for interacting with the heterogeneous programming software framework and undertakes all computing tasks of the four modules: cache, instruction fetch and decode, issue, and execute; each co-processing core retains only the physical circuit of its execution module, and its cache, fetch/decode, and issue operations are remotely controlled by the corresponding modules of the main processing core via an on-chip bus.

2. The method according to claim 1, characterized in that the method executes as follows:
S1. The heterogeneous programming software framework divides the overall computing task into main-core tasks and co-core tasks according to the computing suitability of each processing core and tags them; it then sends the overall computing task and required data to the instruction cache and data cache of the main processing core and sends a start signal to the main processing core;
S2. The instruction fetch and decode module of the main processing core starts working, reads instructions from the instruction cache, and decodes them, identifying co-core tasks by the processing-core type tagged in the computing task;
S3. Based on the decoding result, the main processing core enters the scoreboard logic to check for register conflicts; if none exist, it reads operands from the register file and enters the issue module; based on the decoding result, the issue module sends main-core tasks to the main core's execution module for computation and sends co-core tasks over the on-chip bus to the corresponding co-core execution modules;
S4. After the execution modules of the main and co-processing cores complete their computations, the results are written back to the register file or data cache of the main processing core through the write-back module, with the co-cores' write-back operations uniformly controlled by the main core's write-back module.

3. The method according to claim 2, characterized in that the issue module serves as the task-scheduling hub of the main and co-processing cores and executes differentiated dispatch logic based on the task identifier output by the fetch/decode module:
main-core tasks are pushed directly to the local execution module to start computation;
for co-core tasks, task data and instruction parameters are packaged and sent over the on-chip high-speed bus to the execution module of the corresponding co-core;
during this process, the issue module also completes task address mapping and permission verification for the co-core to ensure the correctness and security of data transmission, and confirms the task-reception state with the co-core through handshake signals to avoid task loss caused by bus congestion.

4. The method according to claim 2, characterized in that the method constructs a dual-pointer instruction queue management mechanism that decouples the sequential sending and receiving of instructions from the concurrent execution of computation, implemented as follows:
1.1) the main processing core maintains an issue instruction pointer and a write-back instruction pointer, both initialized to 0 before the overall computing task begins execution;
1.2) check whether the instruction queue in the scoreboard is full:
1.2a) if full, suspend the fetch/decode module until a free slot appears in the queue;
1.2b) if not full, the fetch/decode module fetches and decodes an instruction and sends it to the issue module, then proceed to step 1.3);
1.3) the issue module determines the instruction's target execution module from the decoding result and checks whether that module is available:
1.3a) if unavailable, place the instruction in a waiting queue;
1.3b) if available, send the instruction to the target execution module, increment the issue pointer by 1, and leave the write-back pointer unchanged; then loop back to step 1.2) until the scoreboard is full or no instructions remain to issue;
1.4) the main processing core monitors the instruction execution state of the target execution modules through the write-back pointer:
1.4a) if the target execution module has not completed the instruction, keep waiting;
1.4b) if it has, perform the write-back operation, store the write-back data into a register or the data cache, increment the write-back pointer by 1, and leave the issue pointer unchanged; then continue the loop to check the state of the next instruction.

5. The method according to claim 2, characterized in that the method constructs a memory access barrier management mechanism, configuring memory access units in the execution modules of the main and co-processing cores to fetch data from the data cache, implemented as follows:
2.1) when the fetch/decode module of the main core finishes decoding, if the current instruction is a memory access operation, the access address and the type of processing core performing the operation are determined from the operands, and the core-type identifier is then stored in the access barrier table under that access address:
2.1a) if the address already exists in the barrier table, the core-type identifier is appended to the end of that address entry, forming an ordered queue;
2.1b) if the address entry does not yet exist in the barrier table, a new address entry row is created and the core-type identifier is written into its first column;
2.2) when a memory access unit of the main or a co-processing core completes its access computation and prepares to write back to the cache, it first enters the barrier table for access-address arbitration:
2.2a) if the cores currently performing the access operation are all at the first column of their address entries, those cores perform write-back concurrently;
2.2b) if a core is not at the first column of its address entry, its write-back is suspended until the core at the first column of that entry completes write-back;
2.3) when the first-column core of an address in the barrier table completes write-back, all core identifiers under that entry shift left by one position, the vacated first column being filled by the following identifier; if no core-type identifier remains under the entry, the entry is cleared.

6. The method according to claim 5, characterized in that the processing core types include main processing core and co-processing core.

7. The method according to claim 6, characterized in that, in step 2.2a), if the cores currently performing the access operation are all at the first column of their address entries, a main core at the first column is controlled by its local write-back module, while a co-core at the first column is uniformly controlled by the main core's write-back module.

8. The method according to claim 2, characterized in that, when detecting register conflicts, the scoreboard maintains a dynamic dependency table recording the read/write state of each register:
when a RAW conflict is detected, the instruction enters a waiting queue until the dependency is released;
for WAR/WAW conflicts, physical registers are dynamically allocated through register renaming to eliminate the conflicts.

9. The method according to claim 1, characterized in that, in the embedded heterogeneous computing processing core paradigm, one main processing core and N-1 co-processing cores are connected through an on-chip bus to form a collaborative computing architecture; the main processing core is a CPU used for task scheduling, instruction distribution, and global control, and each co-processing core is any one of a GPU, an FPGA, or an ASIC, selected according to computing requirements.
CN202510866202.2A 2025-06-26 2025-06-26 Heterogeneous computing processing core implementation method based on embedded architecture Pending CN120821687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510866202.2A CN120821687A (en) 2025-06-26 2025-06-26 Heterogeneous computing processing core implementation method based on embedded architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510866202.2A CN120821687A (en) 2025-06-26 2025-06-26 Heterogeneous computing processing core implementation method based on embedded architecture

Publications (1)

Publication Number Publication Date
CN120821687A 2025-10-21

Family

ID=97366960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510866202.2A Pending CN120821687A (en) 2025-06-26 2025-06-26 Heterogeneous computing processing core implementation method based on embedded architecture

Country Status (1)

Country Link
CN (1) CN120821687A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination