Heterogeneous computing processing core implementation method based on embedded architecture
Technical Field
The invention relates to the technical field of heterogeneous computing and parallel processing, and in particular to a method for implementing a heterogeneous computing processing core based on an embedded architecture.
Background
As computing tasks grow more complex and the required scale of computing power keeps increasing, traditional homogeneous processing cores find it ever harder to meet computing demands, which has given rise to heterogeneous computing. The heterogeneous computing mode combines multiple processing cores of different architectures and different computational affinities to build products of high adaptability and high computing power, providing an efficient solution to modern computing power requirements.
However, while existing heterogeneous computing modes offer excellent performance in terms of computing power supply, their heterogeneous nature also introduces performance bottlenecks. On the one hand, data must be exchanged frequently among the multiple independent processing cores, and the data transfer efficiency is limited by the number of inter-core interaction interfaces and is difficult to improve further. On the other hand, a heterogeneous computing product requires a purpose-built heterogeneous programming software framework that constructs a separate execution file for each independent processing core and schedules and coordinates the execution of the multiple cores, which increases the complexity of software programming.
Disclosure of Invention
Aiming at the problems that existing heterogeneous computing processing cores struggle to improve data transfer efficiency and suffer from complex software scheduling, the invention provides a method for implementing a heterogeneous computing processing core based on an embedded architecture, which optimizes the heterogeneous processing core paradigm at the hardware level to save hardware resources and simplify software execution.
To solve the above technical problems, the method provided by the invention adopts the following technical scheme:
In the embedded heterogeneous computing processing core paradigm, N independent processing cores are divided into one main processing core and N-1 co-processing cores. The main processing core is responsible for interacting with the heterogeneous programming software framework and carries all four pipeline modules: caching, instruction fetch/decode, issue, and execute. Each co-processing core retains only the physical circuitry of its execution module; its caching, fetch/decode, and issue operations are completed by the corresponding modules of the main processing core over the on-chip bus.
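For concreteness, the paradigm can be modeled by the following minimal C sketch. All type and field names (MainCore, FrontEnd, ExecUnit, and so on) are illustrative assumptions, not terms of the disclosure; the point it captures is that only the main processing core owns the front-end modules, while every co-processing core is reduced to a bus-reachable execution unit.

```c
/* Minimal structural sketch of the embedded paradigm; all names are assumptions. */
#include <stddef.h>

#define N_CORES 4                      /* 1 main core + 3 co-cores, for illustration */

typedef struct {                       /* execute stage: all that a co-core retains */
    int busy;
    void (*run)(const void *op);       /* compute kernel of this core */
} ExecUnit;

typedef struct {                       /* front-end modules owned solely by the main core */
    unsigned char *i_cache, *d_cache;  /* instruction and data caches */
    void *fetch_decode;                /* fetch/decode module */
    void *issue;                       /* issue module */
    void *write_back;                  /* write-back module, shared by all cores */
} FrontEnd;

typedef struct {
    FrontEnd fe;                       /* caching, fetch/decode, issue, write-back */
    ExecUnit exec;                     /* local execution module */
    ExecUnit *co_exec[N_CORES - 1];    /* co-core execution modules, reached via the on-chip bus */
    volatile int start;                /* start signal written by the software framework */
} MainCore;
```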
Optionally, the specific implementation procedure of the method is as follows (step S1 is sketched in code after the list):
S1, the heterogeneous programming software framework divides the overall computing task into main-processing-core tasks and co-processing-core tasks according to the computational affinity of each processing core and marks them accordingly, loads the overall computing task and the required data into the instruction cache and data cache of the main processing core, and sends a start signal to the main processing core;
S2, the fetch/decode module of the main processing core starts working, reads instructions from the instruction cache, and decodes them;
S3, based on the decoding result, the main processing core enters the scoreboard logic and checks for register conflicts; if there is no conflict, it reads the operands from the register file and enters the issue module;
S4, after the execution modules of the main processing core and the co-processing cores complete their computations, the results are written back to the register file or the data cache of the main processing core through the write-back module, where the write-back operations of the co-processing cores are uniformly controlled by the write-back module of the main processing core.
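The following C fragment, continuing the sketch above, illustrates step S1 under assumed names (CoreId, TaggedInsn, and launch_task are hypothetical): the framework tags each instruction with its target core, loads the single combined stream and its data into the main processing core's caches, and raises one start signal, after which it has no further runtime duties.

```c
/* Sketch of step S1; names and layouts are illustrative assumptions. */
#include <string.h>

typedef enum { CORE_MAIN = 0, CORE_CO_1, CORE_CO_2, CORE_CO_3 } CoreId;

typedef struct {
    unsigned opcode;
    unsigned operands[3];
    CoreId   target;          /* S1 marking: which core this instruction is suited to */
} TaggedInsn;

void launch_task(MainCore *mc, const TaggedInsn *stream, size_t n,
                 const void *data, size_t data_len)
{
    /* The whole tagged task and its data go only to the main core's caches... */
    memcpy(mc->fe.i_cache, stream, n * sizeof *stream);
    memcpy(mc->fe.d_cache, data, data_len);
    mc->start = 1;            /* ...followed by the single start signal */
}
```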
Further optionally, the issue module serves as the task scheduling hub for the main processing core and the co-processing cores, and executes differentiated dispatch logic based on the task identifier output by the fetch/decode module, as sketched in the code after this list:
for a main-processing-core task, the task is pushed directly to the local execution module to start computation;
for a co-processing-core task, the task data and instruction parameters are packed and sent over the on-chip high-speed bus to the execution module of the corresponding co-processing core;
in this process, the issue module also performs task address mapping and permission checking for the co-processing core to ensure correct and secure data transfer, and confirms the task reception state with the co-processing core through handshake signals to avoid task loss caused by bus congestion.
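A minimal sketch of this differentiated dispatch follows, continuing the structures above. The bus packet layout and the four helper stubs are assumptions for illustration; the disclosure specifies only that co-core tasks are packed onto the on-chip bus with address mapping, permission checking, and a handshake.

```c
/* Sketch of the issue module's dispatch; packet layout and helpers are assumed. */
typedef struct {
    CoreId   target;
    unsigned addr_mapped;      /* task address after main-core -> co-core mapping */
    unsigned payload[4];       /* packed task data and instruction parameters */
} BusPacket;

/* Stubs standing in for bus and address-mapping hardware; all four are assumptions. */
static int  map_task_address(const TaggedInsn *i, unsigned *out) { *out = i->operands[0]; return 1; }
static int  check_permission(const TaggedInsn *i) { (void)i; return 1; }
static void send_on_chip_bus(const BusPacket *p) { (void)p; }
static int  wait_for_handshake(CoreId c)         { (void)c; return 1; }

int dispatch(MainCore *mc, const TaggedInsn *insn)
{
    if (insn->target == CORE_MAIN) {               /* main-core task: push to local execute */
        mc->exec.run(insn);
        return 0;
    }
    BusPacket pkt = { .target = insn->target };
    if (!map_task_address(insn, &pkt.addr_mapped)) /* address mapping for the co-core */
        return -1;
    if (!check_permission(insn))                   /* permission check before any transfer */
        return -1;
    send_on_chip_bus(&pkt);                        /* packed transfer over the on-chip bus */
    return wait_for_handshake(pkt.target) ? 0 : -1;/* handshake guards against bus congestion */
}
```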
Optionally, the method constructs a dual-pointer instruction queue management mechanism to decouple the in-order issue and write-back of instructions from the concurrent execution of computation, implemented as follows (a sketch follows the list):
1.1) the main processing core maintains an issue instruction pointer and a write-back instruction pointer, both initialized to 0 before the overall computing task starts executing;
1.2) it is judged whether the instruction queue in the scoreboard is full:
1.2 a) if the queue is full, the fetch/decode module is stalled until the queue has a free slot;
1.2 b) if not, the fetch/decode module fetches and decodes an instruction and sends it to the issue module, and step 1.3) is then performed;
1.3) the issue module determines the target execution module of the instruction from the decoding result and further judges whether that module is available:
1.3 a) if the target execution module is not available, the instruction is placed in a waiting queue;
1.3 b) if the target execution module is available, the instruction is sent to it, the issue instruction pointer is incremented by 1, and the write-back instruction pointer remains unchanged;
1.4) the main processing core monitors the instruction execution state of the target execution module through the write-back instruction pointer:
1.4 a) if the target execution module has not finished executing the instruction, the core keeps waiting;
1.4 b) if the target execution module has finished executing the instruction, a write-back operation is performed, the write-back data is stored in a register or the data cache, the write-back instruction pointer is incremented by 1 while the issue instruction pointer remains unchanged, and the loop then continues by checking the state of the next instruction.
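The dual-pointer mechanism of steps 1.1) to 1.4 b) can be sketched as follows, continuing the earlier fragments; Q_DEPTH, Slot, and the two stubs are illustrative assumptions. The issue pointer runs ahead so that several execution modules compute at once, while the write-back pointer retires results strictly in program order.

```c
/* Sketch of the dual-pointer instruction queue; sizes and stubs are assumed. */
#define Q_DEPTH 16

typedef struct {
    TaggedInsn insn;
    int        done;                 /* set by the execution module on completion */
} Slot;

typedef struct {
    Slot     q[Q_DEPTH];
    unsigned head;                   /* next free slot, filled by fetch/decode */
    unsigned issue_ptr, wb_ptr;      /* 1.1): both initialized to 0 */
} InsnQueue;

static int  exec_unit_available(MainCore *mc, CoreId c) { (void)mc; (void)c; return 1; } /* stub */
static void commit_result(const TaggedInsn *i)          { (void)i; }                     /* stub */

int queue_full(const InsnQueue *iq)                /* 1.2): fetch/decode stalls while full */
{
    return iq->head - iq->wb_ptr >= Q_DEPTH;
}

void try_issue(InsnQueue *iq, MainCore *mc)
{
    if (iq->issue_ptr == iq->head)                 /* nothing decoded and queued yet */
        return;
    Slot *s = &iq->q[iq->issue_ptr % Q_DEPTH];
    if (!exec_unit_available(mc, s->insn.target))  /* 1.3 a): target busy, keep waiting */
        return;
    dispatch(mc, &s->insn);                        /* 1.3 b): send to the target execution module */
    iq->issue_ptr++;                               /* issue pointer advances; wb_ptr untouched */
}

void try_write_back(InsnQueue *iq)
{
    if (iq->wb_ptr == iq->issue_ptr)               /* nothing outstanding */
        return;
    Slot *s = &iq->q[iq->wb_ptr % Q_DEPTH];
    if (!s->done)                                  /* 1.4 a): oldest instruction not finished */
        return;
    commit_result(&s->insn);                       /* 1.4 b): write back in program order */
    iq->wb_ptr++;                                  /* write-back pointer advances */
}
```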
Optionally, the method constructs a memory access barrier management mechanism, with memory access units configured in the execution modules of the main processing core and the co-processing cores to fetch data from the data cache. It is specifically implemented as follows (a sketch follows the list):
2.1) after the fetch/decode module of the main processing core finishes decoding, if the current instruction is judged to be a memory access operation, the memory access address and the type of processing core executing the operation are determined from the operands, and the processing core type identifier is then stored in the memory access barrier table under that memory access address;
2.1 a) if the memory access address is already present in the barrier table, the processing core type identifier is appended to the end of that address entry, forming an ordered queue;
2.1 b) if the barrier table has no entry for the memory access address, a new address entry row is created and the processing core type identifier is written into its first column;
2.2) when a memory access unit of the main processing core or a co-processing core completes its memory access computation and prepares to write back to the cache, it first enters the barrier table for memory address arbitration:
2.2 a) if the processing cores currently performing memory access operations are all in the first column of their address entries, these processing cores perform their write-back operations concurrently;
2.2 b) if a processing core is not in the first column of its address entry, its write-back operation is suspended until the core in the first column of that address entry completes its write-back;
2.3) when the first processing core for a given memory access address in the barrier table completes its write-back operation, all processing core identifiers under that address entry are shifted left by one position so that the next identifier fills the first column; if no processing core type identifier remains under the address entry, the entry is cleared.
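A sketch of the memory access barrier table of steps 2.1) to 2.3) follows, with illustrative sizes (MAX_ROWS, MAX_PENDING); it reuses the per-core CoreId of the earlier fragments where the disclosure stores a main/co type identifier. Each valid row pairs one memory address with an ordered queue of core identifiers, only the head of a row may write back, and retiring the head shifts the queue left and clears an emptied row.

```c
/* Sketch of the memory access barrier table; sizes and the CoreId reuse are assumptions. */
#define MAX_ROWS    32
#define MAX_PENDING 8

typedef struct {
    unsigned addr;
    int      valid;
    CoreId   order[MAX_PENDING];     /* ordered queue of cores, head in column 0 */
    int      count;
} BarrierRow;

static BarrierRow table[MAX_ROWS];

void barrier_record(unsigned addr, CoreId who)        /* 2.1): at decode time */
{
    for (int i = 0; i < MAX_ROWS; i++)
        if (table[i].valid && table[i].addr == addr) {
            table[i].order[table[i].count++] = who;   /* 2.1 a): append to the existing row */
            return;
        }
    for (int i = 0; i < MAX_ROWS; i++)
        if (!table[i].valid) {                        /* 2.1 b): create a new row */
            table[i] = (BarrierRow){ .addr = addr, .valid = 1, .count = 1 };
            table[i].order[0] = who;
            return;
        }
}

int barrier_may_write_back(unsigned addr, CoreId who) /* 2.2): write-back arbitration */
{
    for (int i = 0; i < MAX_ROWS; i++)
        if (table[i].valid && table[i].addr == addr)
            return table[i].order[0] == who;          /* 2.2 a) head proceeds, 2.2 b) others stall */
    return 1;                                         /* untracked address: no ordering needed */
}

void barrier_retire(unsigned addr)                    /* 2.3): after a head write-back completes */
{
    for (int i = 0; i < MAX_ROWS; i++)
        if (table[i].valid && table[i].addr == addr) {
            for (int j = 1; j < table[i].count; j++)
                table[i].order[j - 1] = table[i].order[j];  /* shift the queue left by one */
            if (--table[i].count == 0)
                table[i].valid = 0;                   /* clear the row once empty */
            return;
        }
}
```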
Further optionally, the processing core types involved include a main processing core and a co-processing core.
Further optionally, in step 2.2 a), when the processing cores currently performing memory access operations are all in the first column of their address entries, a main processing core at the head of an entry is controlled by its local write-back module, while a co-processing core at the head of an entry is uniformly controlled by the write-back module of the main processing core.
Optionally, when the scoreboard detects a register conflict, a dynamic dependency table is maintained to record the read/write state of each register, handled as follows (a sketch follows the list):
when a RAW (read-after-write) conflict is detected, the instruction enters a waiting queue until the dependency is released;
for WAR/WAW (write-after-read/write-after-write) conflicts, physical registers are dynamically allocated through register renaming to eliminate the conflicts.
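A minimal sketch of this conflict handling follows, assuming a small architectural register file and a naive physical-register allocator (a real design would track a proper free list): RAW hazards send the instruction to the waiting queue, while WAR/WAW hazards are eliminated by renaming the destination register.

```c
/* Sketch of scoreboard hazard handling; sizes and the allocator are assumptions.
 * Assumes map[] is initialized to an identity mapping at reset. */
#define N_ARCH 32
#define N_PHYS 64

typedef struct {
    int map[N_ARCH];         /* architectural -> physical register mapping */
    int busy[N_PHYS];        /* physical register still awaits its producer's write */
    int next_free;
} RenameTable;

typedef enum { OK_TO_ISSUE, WAIT_RAW } HazardResult;

HazardResult resolve_hazards(RenameTable *rt, int src1, int src2, int dst)
{
    /* RAW: a source operand is still being produced -> enter the waiting queue */
    if (rt->busy[rt->map[src1]] || rt->busy[rt->map[src2]])
        return WAIT_RAW;

    /* WAR/WAW: rename dst to a fresh physical register so earlier readers and
     * the earlier writer keep their old copy, and the conflict disappears */
    int phys = rt->next_free++ % N_PHYS;   /* naive allocation, for illustration only */
    rt->map[dst] = phys;
    rt->busy[phys] = 1;                    /* cleared again at write-back */
    return OK_TO_ISSUE;
}
```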
Optionally, in the embedded heterogeneous computing processing core paradigm, the main processing core and the N-1 co-processing cores are connected through an on-chip bus to form a cooperative computing architecture;
the main processing core adopts a CPU (central processing unit) for task scheduling, instruction dispatch, and global control, while each co-processing core is any one of a GPU (graphics processing unit), an FPGA (field-programmable gate array), or an ASIC (application-specific integrated circuit), selected according to the computing requirements.
Compared with the prior art, the method for implementing a heterogeneous computing processing core based on an embedded architecture has the following beneficial effects:
1. The invention organizes the multiple independent processing cores of the heterogeneous computing paradigm into a one-main/multi-co-processing-core pattern and embeds the caching, fetch/decode, and issue pipeline modules of the co-processing cores in the main processing core, which reduces logic resources and turns the interconnect among the processing cores from external IO interfaces into internal layout and routing, giving higher transfer efficiency;
2. By constructing the dual-pointer instruction queue management mechanism and the memory access barrier management mechanism, heterogeneous computing performance and memory access performance are improved, so that the execution module of the main processing core and the execution modules of the many co-processing cores can execute concurrently without breaking the ordering of the overall computing task.
Drawings
FIG. 1 is a schematic diagram of an overall architecture of an embedded heterogeneous computing processing core according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of a conventional heterogeneous computing processing core;
FIG. 3 is a schematic diagram of a dual-pointer instruction queue management mechanism according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a memory access barrier management mechanism according to a third embodiment of the present invention.
Detailed Description
In order to make the technical solution of the invention, the technical problems it solves, and its technical effects clearer, the technical solution is described clearly and completely below in combination with specific embodiments.
Embodiment One:
Referring to FIG. 1, this embodiment proposes a method for implementing a heterogeneous computing processing core based on an embedded architecture. In the embedded heterogeneous computing processing core paradigm, N independent processing cores are divided into one main processing core and N-1 co-processing cores. The main processing core is responsible for interacting with the heterogeneous programming software framework and carries all four pipeline modules: caching, instruction fetch/decode, issue, and execute. Each co-processing core retains only the physical circuitry of its execution module; its caching, fetch/decode, and issue operations are completed remotely by the corresponding modules of the main processing core over the on-chip bus.
The method of this embodiment is specifically implemented as follows:
S1, the heterogeneous programming software framework divides the overall computing task into main-processing-core tasks and co-processing-core tasks according to the computational affinity of each processing core and marks them accordingly, loads the overall computing task and the required data into the instruction cache and data cache of the main processing core, and sends a start signal to the main processing core;
S2, the fetch/decode module of the main processing core starts working, reads instructions from the instruction cache, and decodes them;
S3, based on the decoding result, the main processing core enters the scoreboard logic and checks for register conflicts; if there is no conflict, it reads the operands from the register file and enters the issue module;
S4, after the execution modules of the main processing core and the co-processing cores complete their computations, the results are written back to the register file or the data cache of the main processing core through the write-back module, where the write-back operations of the co-processing cores are uniformly controlled by the write-back module of the main processing core.
In this embodiment, the issue module serves as the task scheduling hub for the main processing core and the co-processing cores, and executes differentiated dispatch logic based on the task identifier output by the fetch/decode module:
for a main-processing-core task, the task is pushed directly to the local execution module to start computation;
for a co-processing-core task, the task data and instruction parameters are packed and sent over the on-chip high-speed bus to the execution module of the corresponding co-processing core;
in this process, the issue module also performs task address mapping and permission checking for the co-processing core to ensure correct and secure data transfer, and confirms the task reception state with the co-processing core through handshake signals to avoid task loss caused by bus congestion.
It should be noted that, in the embedded heterogeneous computing processing core paradigm, the main processing core and the N-1 co-processing cores are connected through an on-chip bus to form a cooperative computing architecture. The main processing core adopts a CPU for task scheduling, instruction dispatch, and global control, while each co-processing core is a GPU, an FPGA, or an ASIC, selected according to the computing requirements.
As can be seen from the foregoing execution process with reference to FIG. 1 and FIG. 2, in the embedded heterogeneous computing processing core paradigm of this embodiment, the heterogeneous programming software framework only needs to mark the computing task with processing core identifiers at compile time and to start the main processing core when the task is issued; it does not need to monitor the running state of the processing cores in real time, which greatly reduces the software's runtime overhead.
Embodiment Two:
On the basis of Embodiment One and referring to FIG. 3, in order to further improve heterogeneous computing performance and memory access performance, so that the execution module of the main processing core and the execution modules of the many co-processing cores can execute concurrently without breaking the ordering of the overall computing task, this embodiment constructs a dual-pointer instruction queue management mechanism to decouple the in-order issue and write-back of instructions from the concurrent execution of computation, specifically implemented as follows:
1.1) the main processing core maintains an issue instruction pointer and a write-back instruction pointer, both initialized to 0 before the overall computing task starts executing;
1.2) it is judged whether the instruction queue in the scoreboard is full:
1.2 a) if the queue is full, the fetch/decode module is stalled until the queue has a free slot;
1.2 b) if not, the fetch/decode module fetches and decodes an instruction and sends it to the issue module, and step 1.3) is then performed;
1.3) the issue module determines the target execution module of the instruction from the decoding result and further judges whether that module is available:
1.3 a) if the target execution module is not available, the instruction is placed in a waiting queue;
1.3 b) if the target execution module is available, the instruction is sent to it, the issue instruction pointer is incremented by 1, and the write-back instruction pointer remains unchanged;
1.4) the main processing core monitors the instruction execution state of the target execution module through the write-back instruction pointer:
1.4 a) if the target execution module has not finished executing the instruction, the core keeps waiting;
1.4 b) if the target execution module has finished executing the instruction, a write-back operation is performed, the write-back data is stored in a register or the data cache, the write-back instruction pointer is incremented by 1 while the issue instruction pointer remains unchanged, and the loop then continues by checking the state of the next instruction.
In this execution process, by maintaining the issue instruction pointer, the main processing core ensures that the execution modules of all processing cores can compute concurrently rather than only one execution module computing at a time; by maintaining the write-back instruction pointer, it allows the execution modules of multiple processing cores to assert their write-back valid signals simultaneously, while the write-back of valid data still strictly follows the original instruction order, undisturbed by the concurrently executing modules. A worked trace is sketched below.
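For illustration, here is an assumed trace under the queue sketch given earlier: three instructions i0 to i2, where i1 targets a co-processing core whose execution module is slow. Issue runs ahead while write-back preserves program order.

```c
/* Assumed trace: i0 and i2 target the main core, i1 targets a slow co-core.
 *
 *  cycle  event                                      issue_ptr  wb_ptr
 *    0    i0 issued to main-core execution module        1         0
 *    1    i1 issued to co-core execution module          2         0
 *    2    i2 issued to main-core execution module        3         0
 *    3    i0 done -> written back                        3         1
 *    4    i2 done, but i1 (older) still running          3         1   (i2 must wait)
 *    5    i1 done -> written back                        3         2
 *    6    i2 written back                                3         3
 */
```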
Embodiment Three:
On the basis of Embodiment One or Embodiment Two and referring to FIG. 4, in order to improve the memory access efficiency of the main and co-processing cores, this embodiment further constructs a memory access barrier management mechanism, with memory access units configured in the execution modules of the main processing core and the co-processing cores to fetch data from the data cache, specifically implemented as follows:
2.1) after the fetch/decode module of the main processing core finishes decoding, if the current instruction is judged to be a memory access operation, the memory access address and the type of processing core executing the operation (main processing core or co-processing core) are determined from the operands, and the processing core type identifier is then stored in the memory access barrier table under that memory access address;
2.1 a) if the memory access address is already present in the barrier table, the processing core type identifier is appended to the end of that address entry, forming an ordered queue;
2.1 b) if the barrier table has no entry for the memory access address, a new address entry row is created and the processing core type identifier is written into its first column. It should be added that when the processing cores currently performing memory access operations are all in the first column of their address entries, a main processing core at the head of an entry is controlled by its local write-back module, while a co-processing core at the head of an entry is uniformly controlled by the write-back module of the main processing core;
2.2) when a memory access unit of the main processing core or a co-processing core completes its memory access computation and prepares to write back to the cache, it first enters the barrier table for memory address arbitration:
2.2 a) if the processing cores currently performing memory access operations are all in the first column of their address entries, these processing cores perform their write-back operations concurrently;
2.2 b) if a processing core is not in the first column of its address entry, its write-back operation is suspended until the core in the first column of that address entry completes its write-back;
2.3) when the first processing core for a given memory access address in the barrier table completes its write-back operation, all processing core identifiers under that address entry are shifted left by one position so that the next identifier fills the first column; if no processing core type identifier remains under the address entry, the entry is cleared.
By maintaining the memory access barrier table in the main processing core, the multiple processing cores of the heterogeneous computing paradigm can, on the one hand, exchange data with the cache concurrently, improving data transfer efficiency, while on the other hand the required read/write ordering of the data is guaranteed, ensuring the validity of the cached data. A worked trace of the table is sketched below.
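As a worked example under the barrier-table sketch given earlier, consider one assumed address 0x100 accessed in program order by the main core and then two co-processing cores:

```c
/* barrier_record(0x100, CORE_MAIN);          row for 0x100: [MAIN]
 * barrier_record(0x100, CORE_CO_1);          row for 0x100: [MAIN, CO_1]
 * barrier_record(0x100, CORE_CO_2);          row for 0x100: [MAIN, CO_1, CO_2]
 *
 * barrier_may_write_back(0x100, CORE_CO_1);  -> 0: not at the head, write-back stalls
 * barrier_may_write_back(0x100, CORE_MAIN);  -> 1: head of the row, write-back proceeds
 * barrier_retire(0x100);                     row for 0x100: [CO_1, CO_2]
 * barrier_may_write_back(0x100, CORE_CO_1);  -> 1: CO_1 is now the head
 */
```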
For Embodiments One to Three, it should be noted that when the scoreboard detects a register conflict, a dynamic dependency table is maintained to record the read/write state of each register:
when a RAW conflict is detected, the instruction enters a waiting queue until the dependency is released;
for WAR/WAW conflicts, physical registers are dynamically allocated through register renaming to eliminate the conflicts.
In summary, with the method for implementing a heterogeneous computing processing core based on an embedded architecture, the multiple independent processing cores of the heterogeneous computing paradigm are organized into a one-main/multi-co-processing-core pattern, and the caching, fetch/decode, and issue pipeline modules of the co-processing cores are embedded in the main processing core. This reduces logic resources and turns the interconnect among the processing cores from external IO interfaces into internal layout and routing, giving higher transfer efficiency. It also simplifies the heterogeneous programming software framework: when building the execution file for the main processing core, the framework only needs to add the corresponding co-processing core identifiers according to the functions the co-processing cores support; it no longer needs to build an independent execution file for each processing core, nor to schedule the execution of the multiple processing cores.
The foregoing describes the principles and embodiments of the invention through specific examples so that the invention may be better understood. Based on the above embodiments, any improvements and modifications made by those skilled in the art without departing from the principles of the invention shall fall within the scope of protection of the invention.