Heterogeneous computing processing core implementation method based on embedded architecture
Technical Field
The invention relates to the technical field of heterogeneous computing and parallel processing, and in particular to a method for implementing a heterogeneous computing processing core based on an embedded architecture.
Background
As computing tasks grow more complex and the required scale of computing power keeps increasing, traditional homogeneous processing cores find it ever harder to meet computing demands, which has given rise to heterogeneous computing. The heterogeneous computing mode combines multiple processing cores of different architectures and different computational affinities to build products of high adaptability and high computing power, providing an efficient solution to modern computing power requirements.
However, while existing heterogeneous computing modes offer excellent performance in terms of computing power supply, their heterogeneous nature also introduces performance bottlenecks. On the one hand, data must be exchanged frequently among the multiple independent processing cores, and the data transfer efficiency is limited by the number of inter-core interaction interfaces and is difficult to improve further. On the other hand, a heterogeneous computing product requires a purpose-built heterogeneous programming software framework that constructs a separate execution file for each independent processing core and schedules and coordinates the execution of the multiple cores, which increases the complexity of software programming.
Disclosure of Invention
Aiming at the problems that existing heterogeneous computing processing cores struggle to improve data transfer efficiency and suffer from complex software scheduling, the invention provides a method for implementing a heterogeneous computing processing core based on an embedded architecture, which optimizes the heterogeneous processing core paradigm at the hardware level to save hardware resources and simplify software execution.
To solve the above technical problems, the method provided by the invention adopts the following technical scheme:
In the embedded heterogeneous computing processing core paradigm, N independent processing cores are divided into one main processing core and N-1 co-processing cores. The main processing core is responsible for interacting with the heterogeneous programming software framework and carries all four pipeline modules: caching, instruction fetch/decode, issue, and execute. Each co-processing core retains only the physical circuitry of its execution module; its caching, fetch/decode, and issue operations are completed by the corresponding modules of the main processing core over the on-chip bus.
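For concreteness, the paradigm can be modeled by the following minimal C sketch. All type and field names (MainCore, FrontEnd, ExecUnit, and so on) are illustrative assumptions, not terms of the disclosure; the point it captures is that only the main processing core owns the front-end modules, while every co-processing core is reduced to a bus-reachable execution unit.

```c
/* Minimal structural sketch of the embedded paradigm; all names are assumptions. */
#include <stddef.h>

#define N_CORES 4                      /* 1 main core + 3 co-cores, for illustration */

typedef struct {                       /* execute stage: all that a co-core retains */
    int busy;
    void (*run)(const void *op);       /* compute kernel of this core */
} ExecUnit;

typedef struct {                       /* front-end modules owned solely by the main core */
    unsigned char *i_cache, *d_cache;  /* instruction and data caches */
    void *fetch_decode;                /* fetch/decode module */
    void *issue;                       /* issue module */
    void *write_back;                  /* write-back module, shared by all cores */
} FrontEnd;

typedef struct {
    FrontEnd fe;                       /* caching, fetch/decode, issue, write-back */
    ExecUnit exec;                     /* local execution module */
    ExecUnit *co_exec[N_CORES - 1];    /* co-core execution modules, reached via the on-chip bus */
    volatile int start;                /* start signal written by the software framework */
} MainCore;
```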
Optionally, the specific implementation procedure of the method is as follows (step S1 is sketched in code after the list):
S1, the heterogeneous programming software framework divides the overall computing task into main-processing-core tasks and co-processing-core tasks according to the computational affinity of each processing core and marks them accordingly, loads the overall computing task and the required data into the instruction cache and data cache of the main processing core, and sends a start signal to the main processing core;
S2, the fetch/decode module of the main processing core starts working, reads instructions from the instruction cache, and decodes them;
S3, based on the decoding result, the main processing core enters the scoreboard logic and checks for register conflicts; if there is no conflict, it reads the operands from the register file and enters the issue module;
S4, after the execution modules of the main processing core and the co-processing cores complete their computations, the results are written back to the register file or the data cache of the main processing core through the write-back module, where the write-back operations of the co-processing cores are uniformly controlled by the write-back module of the main processing core.
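The following C fragment, continuing the sketch above, illustrates step S1 under assumed names (CoreId, TaggedInsn, and launch_task are hypothetical): the framework tags each instruction with its target core, loads the single combined stream and its data into the main processing core's caches, and raises one start signal, after which it has no further runtime duties.

```c
/* Sketch of step S1; names and layouts are illustrative assumptions. */
#include <string.h>

typedef enum { CORE_MAIN = 0, CORE_CO_1, CORE_CO_2, CORE_CO_3 } CoreId;

typedef struct {
    unsigned opcode;
    unsigned operands[3];
    CoreId   target;          /* S1 marking: which core this instruction is suited to */
} TaggedInsn;

void launch_task(MainCore *mc, const TaggedInsn *stream, size_t n,
                 const void *data, size_t data_len)
{
    /* The whole tagged task and its data go only to the main core's caches... */
    memcpy(mc->fe.i_cache, stream, n * sizeof *stream);
    memcpy(mc->fe.d_cache, data, data_len);
    mc->start = 1;            /* ...followed by the single start signal */
}
```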
Further optionally, the issue module serves as the task scheduling hub for the main processing core and the co-processing cores, and executes differentiated dispatch logic based on the task identifier output by the fetch/decode module, as sketched in the code after this list:
for a main-processing-core task, the task is pushed directly to the local execution module to start computation;
for a co-processing-core task, the task data and instruction parameters are packed and sent over the on-chip high-speed bus to the execution module of the corresponding co-processing core;
in this process, the issue module also performs task address mapping and permission checking for the co-processing core to ensure correct and secure data transfer, and confirms the task reception state with the co-processing core through handshake signals to avoid task loss caused by bus congestion.
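A minimal sketch of this differentiated dispatch follows, continuing the structures above. The bus packet layout and the four helper stubs are assumptions for illustration; the disclosure specifies only that co-core tasks are packed onto the on-chip bus with address mapping, permission checking, and a handshake.

```c
/* Sketch of the issue module's dispatch; packet layout and helpers are assumed. */
typedef struct {
    CoreId   target;
    unsigned addr_mapped;      /* task address after main-core -> co-core mapping */
    unsigned payload[4];       /* packed task data and instruction parameters */
} BusPacket;

/* Stubs standing in for bus and address-mapping hardware; all four are assumptions. */
static int  map_task_address(const TaggedInsn *i, unsigned *out) { *out = i->operands[0]; return 1; }
static int  check_permission(const TaggedInsn *i) { (void)i; return 1; }
static void send_on_chip_bus(const BusPacket *p) { (void)p; }
static int  wait_for_handshake(CoreId c)         { (void)c; return 1; }

int dispatch(MainCore *mc, const TaggedInsn *insn)
{
    if (insn->target == CORE_MAIN) {               /* main-core task: push to local execute */
        mc->exec.run(insn);
        return 0;
    }
    BusPacket pkt = { .target = insn->target };
    if (!map_task_address(insn, &pkt.addr_mapped)) /* address mapping for the co-core */
        return -1;
    if (!check_permission(insn))                   /* permission check before any transfer */
        return -1;
    send_on_chip_bus(&pkt);                        /* packed transfer over the on-chip bus */
    return wait_for_handshake(pkt.target) ? 0 : -1;/* handshake guards against bus congestion */
}
```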
Optionally, the method constructs a dual-pointer instruction queue management mechanism to decouple the in-order issue and write-back of instructions from the concurrent execution of computation, implemented as follows (a sketch follows the list):
1.1) the main processing core maintains an issue instruction pointer and a write-back instruction pointer, both initialized to 0 before the overall computing task starts executing;
1.2) it is judged whether the instruction queue in the scoreboard is full:
1.2 a) if the queue is full, the fetch/decode module is stalled until the queue has a free slot;
1.2 b) if not, the fetch/decode module fetches and decodes an instruction and sends it to the issue module, and step 1.3) is then performed;
1.3) the issue module determines the target execution module of the instruction from the decoding result and further judges whether that module is available:
1.3 a) if the target execution module is not available, the instruction is placed in a waiting queue;
1.3 b) if the target execution module is available, the instruction is sent to it, the issue instruction pointer is incremented by 1, and the write-back instruction pointer remains unchanged;
1.4) the main processing core monitors the instruction execution state of the target execution module through the write-back instruction pointer:
1.4 a) if the target execution module has not finished executing the instruction, the core keeps waiting;
1.4 b) if the target execution module has finished executing the instruction, a write-back operation is performed, the write-back data is stored in a register or the data cache, the write-back instruction pointer is incremented by 1 while the issue instruction pointer remains unchanged, and the loop then continues by checking the state of the next instruction.
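The dual-pointer mechanism of steps 1.1) to 1.4 b) can be sketched as follows, continuing the earlier fragments; Q_DEPTH, Slot, and the two stubs are illustrative assumptions. The issue pointer runs ahead so that several execution modules compute at once, while the write-back pointer retires results strictly in program order.

```c
/* Sketch of the dual-pointer instruction queue; sizes and stubs are assumed. */
#define Q_DEPTH 16

typedef struct {
    TaggedInsn insn;
    int        done;                 /* set by the execution module on completion */
} Slot;

typedef struct {
    Slot     q[Q_DEPTH];
    unsigned head;                   /* next free slot, filled by fetch/decode */
    unsigned issue_ptr, wb_ptr;      /* 1.1): both initialized to 0 */
} InsnQueue;

static int  exec_unit_available(MainCore *mc, CoreId c) { (void)mc; (void)c; return 1; } /* stub */
static void commit_result(const TaggedInsn *i)          { (void)i; }                     /* stub */

int queue_full(const InsnQueue *iq)                /* 1.2): fetch/decode stalls while full */
{
    return iq->head - iq->wb_ptr >= Q_DEPTH;
}

void try_issue(InsnQueue *iq, MainCore *mc)
{
    if (iq->issue_ptr == iq->head)                 /* nothing decoded and queued yet */
        return;
    Slot *s = &iq->q[iq->issue_ptr % Q_DEPTH];
    if (!exec_unit_available(mc, s->insn.target))  /* 1.3 a): target busy, keep waiting */
        return;
    dispatch(mc, &s->insn);                        /* 1.3 b): send to the target execution module */
    iq->issue_ptr++;                               /* issue pointer advances; wb_ptr untouched */
}

void try_write_back(InsnQueue *iq)
{
    if (iq->wb_ptr == iq->issue_ptr)               /* nothing outstanding */
        return;
    Slot *s = &iq->q[iq->wb_ptr % Q_DEPTH];
    if (!s->done)                                  /* 1.4 a): oldest instruction not finished */
        return;
    commit_result(&s->insn);                       /* 1.4 b): write back in program order */
    iq->wb_ptr++;                                  /* write-back pointer advances */
}
```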
Optionally, the method constructs a memory access barrier management mechanism, with memory access units configured in the execution modules of the main processing core and the co-processing cores to fetch data from the data cache. It is specifically implemented as follows (a sketch follows the list):
2.1) after the fetch/decode module of the main processing core finishes decoding, if the current instruction is judged to be a memory access operation, the memory access address and the type of processing core executing the operation are determined from the operands, and the processing core type identifier is then stored in the memory access barrier table under that memory access address;
2.1 a) if the memory access address is already present in the barrier table, the processing core type identifier is appended to the end of that address entry, forming an ordered queue;
2.1 b) if the barrier table has no entry for the memory access address, a new address entry row is created and the processing core type identifier is written into its first column;
2.2) when a memory access unit of the main processing core or a co-processing core completes its memory access computation and prepares to write back to the cache, it first enters the barrier table for memory address arbitration:
2.2 a) if the processing cores currently performing memory access operations are all in the first column of their address entries, these processing cores perform their write-back operations concurrently;
2.2 b) if a processing core is not in the first column of its address entry, its write-back operation is suspended until the core in the first column of that address entry completes its write-back;
2.3) when the first processing core for a given memory access address in the barrier table completes its write-back operation, all processing core identifiers under that address entry are shifted left by one position so that the next identifier fills the first column; if no processing core type identifier remains under the address entry, the entry is cleared.
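A sketch of the memory access barrier table of steps 2.1) to 2.3) follows, with illustrative sizes (MAX_ROWS, MAX_PENDING); it reuses the per-core CoreId of the earlier fragments where the disclosure stores a main/co type identifier. Each valid row pairs one memory address with an ordered queue of core identifiers, only the head of a row may write back, and retiring the head shifts the queue left and clears an emptied row.

```c
/* Sketch of the memory access barrier table; sizes and the CoreId reuse are assumptions. */
#define MAX_ROWS    32
#define MAX_PENDING 8

typedef struct {
    unsigned addr;
    int      valid;
    CoreId   order[MAX_PENDING];     /* ordered queue of cores, head in column 0 */
    int      count;
} BarrierRow;

static BarrierRow table[MAX_ROWS];

void barrier_record(unsigned addr, CoreId who)        /* 2.1): at decode time */
{
    for (int i = 0; i < MAX_ROWS; i++)
        if (table[i].valid && table[i].addr == addr) {
            table[i].order[table[i].count++] = who;   /* 2.1 a): append to the existing row */
            return;
        }
    for (int i = 0; i < MAX_ROWS; i++)
        if (!table[i].valid) {                        /* 2.1 b): create a new row */
            table[i] = (BarrierRow){ .addr = addr, .valid = 1, .count = 1 };
            table[i].order[0] = who;
            return;
        }
}

int barrier_may_write_back(unsigned addr, CoreId who) /* 2.2): write-back arbitration */
{
    for (int i = 0; i < MAX_ROWS; i++)
        if (table[i].valid && table[i].addr == addr)
            return table[i].order[0] == who;          /* 2.2 a) head proceeds, 2.2 b) others stall */
    return 1;                                         /* untracked address: no ordering needed */
}

void barrier_retire(unsigned addr)                    /* 2.3): after a head write-back completes */
{
    for (int i = 0; i < MAX_ROWS; i++)
        if (table[i].valid && table[i].addr == addr) {
            for (int j = 1; j < table[i].count; j++)
                table[i].order[j - 1] = table[i].order[j];  /* shift the queue left by one */
            if (--table[i].count == 0)
                table[i].valid = 0;                   /* clear the row once empty */
            return;
        }
}
```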
Further optionally, the processing core types involved include a main processing core and a co-processing core.
Further optionally, in step 2.2 a), when the processing cores currently performing memory access operations are all in the first column of their address entries, a main processing core at the head of an entry is controlled by its local write-back module, while a co-processing core at the head of an entry is uniformly controlled by the write-back module of the main processing core.
Optionally, when the scoreboard detects a register conflict, a dynamic dependency table is maintained to record the read/write state of each register, handled as follows (a sketch follows the list):
when a RAW (read-after-write) conflict is detected, the instruction enters a waiting queue until the dependency is released;
for WAR/WAW (write-after-read/write-after-write) conflicts, physical registers are dynamically allocated through register renaming to eliminate the conflicts.
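A minimal sketch of this conflict handling follows, assuming a small architectural register file and a naive physical-register allocator (a real design would track a proper free list): RAW hazards send the instruction to the waiting queue, while WAR/WAW hazards are eliminated by renaming the destination register.

```c
/* Sketch of scoreboard hazard handling; sizes and the allocator are assumptions.
 * Assumes map[] is initialized to an identity mapping at reset. */
#define N_ARCH 32
#define N_PHYS 64

typedef struct {
    int map[N_ARCH];         /* architectural -> physical register mapping */
    int busy[N_PHYS];        /* physical register still awaits its producer's write */
    int next_free;
} RenameTable;

typedef enum { OK_TO_ISSUE, WAIT_RAW } HazardResult;

HazardResult resolve_hazards(RenameTable *rt, int src1, int src2, int dst)
{
    /* RAW: a source operand is still being produced -> enter the waiting queue */
    if (rt->busy[rt->map[src1]] || rt->busy[rt->map[src2]])
        return WAIT_RAW;

    /* WAR/WAW: rename dst to a fresh physical register so earlier readers and
     * the earlier writer keep their old copy, and the conflict disappears */
    int phys = rt->next_free++ % N_PHYS;   /* naive allocation, for illustration only */
    rt->map[dst] = phys;
    rt->busy[phys] = 1;                    /* cleared again at write-back */
    return OK_TO_ISSUE;
}
```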
Optionally, in the embedded heterogeneous computing processing core paradigm, the main processing core and the N-1 co-processing cores are connected through an on-chip bus to form a cooperative computing architecture;
the main processing core adopts a CPU (central processing unit) for task scheduling, instruction dispatch, and global control, while each co-processing core is any one of a GPU (graphics processing unit), an FPGA (field-programmable gate array), or an ASIC (application-specific integrated circuit), selected according to the computing requirements.
Compared with the prior art, the method for implementing a heterogeneous computing processing core based on an embedded architecture has the following beneficial effects:
1. The invention organizes the multiple independent processing cores of the heterogeneous computing paradigm into a one-main/multi-co-processing-core pattern and embeds the caching, fetch/decode, and issue pipeline modules of the co-processing cores in the main processing core, which reduces logic resources and turns the interconnect among the processing cores from external IO interfaces into internal layout and routing, giving higher transfer efficiency;
2. By constructing the dual-pointer instruction queue management mechanism and the memory access barrier management mechanism, heterogeneous computing performance and memory access performance are improved, so that the execution module of the main processing core and the execution modules of the many co-processing cores can execute concurrently without breaking the ordering of the overall computing task.
Drawings
FIG. 1 is a schematic diagram of an overall architecture of an embedded heterogeneous computing processing core according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of a conventional heterogeneous computing processing core;
FIG. 3 is a schematic diagram of a dual-pointer instruction queue management mechanism according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a memory access barrier management mechanism according to a third embodiment of the present invention.
Detailed Description
In order to make the technical solution of the invention, the technical problems it solves, and its technical effects clearer, the technical solution is described clearly and completely below in combination with specific embodiments.
Embodiment One:
Referring to FIG. 1, this embodiment proposes a method for implementing a heterogeneous computing processing core based on an embedded architecture. In the embedded heterogeneous computing processing core paradigm, N independent processing cores are divided into one main processing core and N-1 co-processing cores. The main processing core is responsible for interacting with the heterogeneous programming software framework and carries all four pipeline modules: caching, instruction fetch/decode, issue, and execute. Each co-processing core retains only the physical circuitry of its execution module; its caching, fetch/decode, and issue operations are completed remotely by the corresponding modules of the main processing core over the on-chip bus.
The method of this embodiment is specifically implemented as follows:
S1, the heterogeneous programming software framework divides the overall computing task into main-processing-core tasks and co-processing-core tasks according to the computational affinity of each processing core and marks them accordingly, loads the overall computing task and the required data into the instruction cache and data cache of the main processing core, and sends a start signal to the main processing core;
S2, the fetch/decode module of the main processing core starts working, reads instructions from the instruction cache, and decodes them;
S3, based on the decoding result, the main processing core enters the scoreboard logic and checks for register conflicts; if there is no conflict, it reads the operands from the register file and enters the issue module;
S4, after the execution modules of the main processing core and the co-processing cores complete their computations, the results are written back to the register file or the data cache of the main processing core through the write-back module, where the write-back operations of the co-processing cores are uniformly controlled by the write-back module of the main processing core.
In this embodiment, the issue module serves as the task scheduling hub for the main processing core and the co-processing cores, and executes differentiated dispatch logic based on the task identifier output by the fetch/decode module:
for a main-processing-core task, the task is pushed directly to the local execution module to start computation;
for a co-processing-core task, the task data and instruction parameters are packed and sent over the on-chip high-speed bus to the execution module of the corresponding co-processing core;
in this process, the issue module also performs task address mapping and permission checking for the co-processing core to ensure correct and secure data transfer, and confirms the task reception state with the co-processing core through handshake signals to avoid task loss caused by bus congestion.
It should be noted that, in the embedded heterogeneous computing processing core paradigm, the main processing core and the N-1 co-processing cores are connected through an on-chip bus to form a cooperative computing architecture. The main processing core adopts a CPU for task scheduling, instruction dispatch, and global control, while each co-processing core is a GPU, an FPGA, or an ASIC, selected according to the computing requirements.
As can be seen from the foregoing execution process with reference to FIG. 1 and FIG. 2, in the embedded heterogeneous computing processing core paradigm of this embodiment, the heterogeneous programming software framework only needs to mark the computing task with processing core identifiers at compile time and to start the main processing core when the task is issued; it does not need to monitor the running state of the processing cores in real time, which greatly reduces the software's runtime overhead.
Embodiment Two:
On the basis of Embodiment One and referring to FIG. 3, in order to further improve heterogeneous computing performance and memory access performance, so that the execution module of the main processing core and the execution modules of the many co-processing cores can execute concurrently without breaking the ordering of the overall computing task, this embodiment constructs a dual-pointer instruction queue management mechanism to decouple the in-order issue and write-back of instructions from the concurrent execution of computation, specifically implemented as follows:
1.1) the main processing core maintains an issue instruction pointer and a write-back instruction pointer, both initialized to 0 before the overall computing task starts executing;
1.2) it is judged whether the instruction queue in the scoreboard is full:
1.2 a) if the queue is full, the fetch/decode module is stalled until the queue has a free slot;
1.2 b) if not, the fetch/decode module fetches and decodes an instruction and sends it to the issue module, and step 1.3) is then performed;
1.3) the issue module determines the target execution module of the instruction from the decoding result and further judges whether that module is available:
1.3 a) if the target execution module is not available, the instruction is placed in a waiting queue;
1.3 b) if the target execution module is available, the instruction is sent to it, the issue instruction pointer is incremented by 1, and the write-back instruction pointer remains unchanged;
1.4) the main processing core monitors the instruction execution state of the target execution module through the write-back instruction pointer:
1.4 a) if the target execution module has not finished executing the instruction, the core keeps waiting;
1.4 b) if the target execution module has finished executing the instruction, a write-back operation is performed, the write-back data is stored in a register or the data cache, the write-back instruction pointer is incremented by 1 while the issue instruction pointer remains unchanged, and the loop then continues by checking the state of the next instruction.
In this execution process, by maintaining the issue instruction pointer, the main processing core ensures that the execution modules of all processing cores can compute concurrently rather than only one execution module computing at a time; by maintaining the write-back instruction pointer, it allows the execution modules of multiple processing cores to assert their write-back valid signals simultaneously, while the write-back of valid data still strictly follows the original instruction order, undisturbed by the concurrently executing modules. A worked trace is sketched below.
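For illustration, here is an assumed trace under the queue sketch given earlier: three instructions i0 to i2, where i1 targets a co-processing core whose execution module is slow. Issue runs ahead while write-back preserves program order.

```c
/* Assumed trace: i0 and i2 target the main core, i1 targets a slow co-core.
 *
 *  cycle  event                                      issue_ptr  wb_ptr
 *    0    i0 issued to main-core execution module        1         0
 *    1    i1 issued to co-core execution module          2         0
 *    2    i2 issued to main-core execution module        3         0
 *    3    i0 done -> written back                        3         1
 *    4    i2 done, but i1 (older) still running          3         1   (i2 must wait)
 *    5    i1 done -> written back                        3         2
 *    6    i2 written back                                3         3
 */
```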
Embodiment Three:
On the basis of Embodiment One or Embodiment Two and referring to FIG. 4, in order to improve the memory access efficiency of the main and co-processing cores, this embodiment further constructs a memory access barrier management mechanism, with memory access units configured in the execution modules of the main processing core and the co-processing cores to fetch data from the data cache, specifically implemented as follows:
2.1) after the fetch/decode module of the main processing core finishes decoding, if the current instruction is judged to be a memory access operation, the memory access address and the type of processing core executing the operation (main processing core or co-processing core) are determined from the operands, and the processing core type identifier is then stored in the memory access barrier table under that memory access address;
2.1 a) if the memory access address is already present in the barrier table, the processing core type identifier is appended to the end of that address entry, forming an ordered queue;
2.1 b) if the barrier table has no entry for the memory access address, a new address entry row is created and the processing core type identifier is written into its first column. It should be added that when the processing cores currently performing memory access operations are all in the first column of their address entries, a main processing core at the head of an entry is controlled by its local write-back module, while a co-processing core at the head of an entry is uniformly controlled by the write-back module of the main processing core;
2.2) when a memory access unit of the main processing core or a co-processing core completes its memory access computation and prepares to write back to the cache, it first enters the barrier table for memory address arbitration:
2.2 a) if the processing cores currently performing memory access operations are all in the first column of their address entries, these processing cores perform their write-back operations concurrently;
2.2 b) if a processing core is not in the first column of its address entry, its write-back operation is suspended until the core in the first column of that address entry completes its write-back;
2.3) when the first processing core for a given memory access address in the barrier table completes its write-back operation, all processing core identifiers under that address entry are shifted left by one position so that the next identifier fills the first column; if no processing core type identifier remains under the address entry, the entry is cleared.
By maintaining the memory access barrier table in the main processing core, the multiple processing cores of the heterogeneous computing paradigm can, on the one hand, exchange data with the cache concurrently, improving data transfer efficiency, while on the other hand the required read/write ordering of the data is guaranteed, ensuring the validity of the cached data. A worked trace of the table is sketched below.
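As a worked example under the barrier-table sketch given earlier, consider one assumed address 0x100 accessed in program order by the main core and then two co-processing cores:

```c
/* barrier_record(0x100, CORE_MAIN);          row for 0x100: [MAIN]
 * barrier_record(0x100, CORE_CO_1);          row for 0x100: [MAIN, CO_1]
 * barrier_record(0x100, CORE_CO_2);          row for 0x100: [MAIN, CO_1, CO_2]
 *
 * barrier_may_write_back(0x100, CORE_CO_1);  -> 0: not at the head, write-back stalls
 * barrier_may_write_back(0x100, CORE_MAIN);  -> 1: head of the row, write-back proceeds
 * barrier_retire(0x100);                     row for 0x100: [CO_1, CO_2]
 * barrier_may_write_back(0x100, CORE_CO_1);  -> 1: CO_1 is now the head
 */
```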
For Embodiments One to Three, it should be noted that when the scoreboard detects a register conflict, a dynamic dependency table is maintained to record the read/write state of each register:
when a RAW conflict is detected, the instruction enters a waiting queue until the dependency is released;
for WAR/WAW conflicts, physical registers are dynamically allocated through register renaming to eliminate the conflicts.
In summary, with the method for implementing a heterogeneous computing processing core based on an embedded architecture, the multiple independent processing cores of the heterogeneous computing paradigm are organized into a one-main/multi-co-processing-core pattern, and the caching, fetch/decode, and issue pipeline modules of the co-processing cores are embedded in the main processing core. This reduces logic resources and turns the interconnect among the processing cores from external IO interfaces into internal layout and routing, giving higher transfer efficiency. It also simplifies the heterogeneous programming software framework: when building the execution file for the main processing core, the framework only needs to add the corresponding co-processing core identifiers according to the functions the co-processing cores support; it no longer needs to build an independent execution file for each processing core, nor to schedule the execution of the multiple processing cores.
The foregoing describes the principles and embodiments of the invention through specific examples so that the invention may be better understood. Based on the above embodiments, any improvements and modifications made by those skilled in the art without departing from the principles of the invention shall fall within the scope of protection of the invention.