CN120564009A - Data processing task execution method, device, equipment and medium - Google Patents
Info
- Publication number
- CN120564009A (application number CN202510694059.3A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- target
- processor
- vector
- executor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/96—Management of image or video recognition tasks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a method, an apparatus, a device and a medium for executing a data processing task, applied to a target hardware accelerator arranged independently outside a processor. The method comprises: performing primary decoding on each processing instruction in a current data processing task issued by the processor to screen out target processing instructions of a target type, wherein the current data processing task is a task in an image recognition model constructed based on a neural network, and the target type comprises a vector type and a matrix type; determining the execution order of each re-decoded target processing instruction; and scheduling a target executor among the operation executors to execute, according to the execution order, the operation corresponding to each re-decoded target processing instruction so as to complete the current data processing task. This scheme avoids both the computing-power bottleneck a hardware accelerator faces when executing data processing tasks and the poor scalability of such accelerators.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for executing a data processing task.
Background
In recent years, driven by emerging application scenarios such as big data, 5G communication and large language models, artificial intelligence technology has penetrated every aspect of human life. Meanwhile, with the rapid rise of generative artificial intelligence (AIGC, Artificial Intelligence Generated Content), enterprises increasingly adopt AIGC, natural language processing and neural networks to extend functionality and enhance user experience. The largest neural network models today are roughly 150,000 to 300,000 times the size of earlier models, and a traditional CPU (Central Processing Unit) can no longer meet the ever-growing computing-power demands of data task processing.
To break through the computational bottleneck of traditional computing systems, researchers have begun to focus on new hardware architectures. To balance application cost and flexibility, many mainstream processors implement instruction set extensions for single instruction, multiple data (Single Instruction Multiple Data, SIMD); SIMD instructions perform the same operation on multiple groups of data simultaneously, saving dynamic instruction bandwidth and program space.
The conventional vector instruction set extension structure based on the RISC-V architecture uses a VPU (Vector Processing Unit) scheme in which the hardware accelerator is built into the CPU pipeline; it faces a bottleneck in meeting the growing computing-power demands of artificial intelligence data processing tasks and suffers from poor scalability.
It can be seen that how to avoid the computing-power bottleneck of a hardware accelerator when executing data processing tasks, and its poor scalability, are problems to be solved by those skilled in the art.
Disclosure of Invention
The embodiments of the present invention aim to provide a method, an apparatus, a device and a medium for executing a data processing task that avoid the computing-power bottleneck of a hardware accelerator when executing data processing tasks and the poor scalability of such accelerators. The specific scheme is as follows:
in a first aspect, the present invention discloses a data processing task execution method, applied to a target hardware accelerator arranged independently outside a processor, the method comprising:
performing primary decoding on each processing instruction in a current data processing task issued by the processor to screen out target processing instructions of a target type, wherein the current data processing task is a task in an image recognition model constructed based on a neural network, and the target type comprises a vector type and a matrix type;
determining the execution order of each re-decoded target processing instruction;
and scheduling a target executor among the operation executors to execute, according to the execution order, the operation corresponding to each re-decoded target processing instruction so as to complete the current data processing task.
Optionally, the performing primary decoding on each processing instruction in the current data processing task issued by the processor includes:
Acquiring a current data processing task issued by the processor through a preset standardized interface between the processor and the target hardware accelerator;
and performing primary decoding on the current data processing task.
Optionally, the target hardware accelerator includes an instruction decoder, the instruction decoder including a first-stage instruction decoder and a second-stage instruction decoder;
the primary decoding of each processing instruction in the current data processing task issued by the processor comprises the following steps:
performing primary decoding on each processing instruction in the current data processing task issued by the processor by utilizing the first-stage instruction decoder;
Correspondingly, the determining the execution sequence of the target processing instruction after each re-decoding includes:
and re-decoding each target processing instruction by using the second-stage instruction decoder to obtain the micro-operations corresponding to each target processing instruction, and determining the execution order of the micro-operations.
Optionally, the target hardware accelerator includes an instruction scheduler, and the determining the execution sequence of the target processing instructions after each re-decoding includes:
Re-decoding each target processing instruction to obtain micro-operations and required operands corresponding to each target processing instruction, and acquiring the required operands from the processor;
Analyzing the dependency relationship and conflict relationship between the micro-operations by using the instruction scheduler, and determining the execution sequence of the micro-operations based on the dependency relationship, the conflict relationship and the computing resources of the operation executors;
Correspondingly, the scheduling, according to the execution sequence, the target executor in each operation executor to execute the operation corresponding to each re-decoded target processing instruction includes:
And scheduling a target executor in each operation executor according to the execution sequence so that the target executor executes each micro-operation by using the required operand.
Optionally, each operation executor includes a vector matrix operator, a vector mask controller, a cross-channel instruction processor and a vector access manager, and the target executor is any one or more executors in each operation executor.
Optionally, the vector matrix arithmetic unit comprises a vector register, a matrix register, a first operator for vector arithmetic instruction operations, a second operator for floating-point and multiply-divide operations, a third operator for matrix arithmetic instruction operations, and a fourth operator for matrix multiply and accumulate operations, wherein each channel of the vector register comprises a plurality of single-port units for parallel storage of data blocks, the matrix register comprises a plurality of two-dimensional matrix registers, and the number of rows and columns of each two-dimensional matrix register is determined based on the row length of the two-dimensional matrix registers.
Optionally, the scheduling, according to the execution order, the target executor in each operation executor to execute the operation corresponding to each re-decoded target processing instruction includes:
Scheduling target executors in all operation executors to be current executors in sequence according to the execution sequence;
If the vector mask controller is the current executor, the vector mask controller performs a bit-level operation on the target element corresponding to the current instruction to complete the masking operation of the current instruction, wherein the bit-level operation is any one or more of a bit-AND operation, a bit-OR operation and a bit-XOR operation;
If the cross-channel instruction processor is the current executor, the cross-channel instruction processor performs any one or more operations of data fusion operation, data extraction operation, reduction calculation operation, data rearrangement operation and data displacement operation on the current instruction;
if the vector memory manager is the current executor, the vector memory manager obtains the memory-access operation type of the current instruction, performs address calculation for the current instruction according to the memory-access modes supported by the scalable vector instruction set to obtain memory requests containing memory addresses, merges the memory requests meeting preset conditions, and, in response to the merged requests, performs the memory operation of that type on the data at the memory addresses in the target storage area through the Advanced eXtensible Interface;
the current instruction is any one instruction of the target processing instructions after being decoded again.
In a second aspect, the present invention discloses a data processing task execution device applied to a target hardware accelerator independently disposed outside a processor, the device comprising:
a primary decoding module, configured to perform primary decoding on each processing instruction in a current data processing task issued by the processor so as to screen out target processing instructions of a target type, wherein the current data processing task is a task in an image recognition model constructed based on a neural network, and the target type comprises a vector type and a matrix type;
the sequence determining module is used for determining the execution sequence of each re-decoded target processing instruction;
And the operation scheduling module is used for scheduling a target executor in each operation executor to execute the operation corresponding to each re-decoded target processing instruction according to the execution sequence so as to complete the current data processing task.
In a third aspect, the present invention discloses an electronic device, comprising:
A memory for storing a computer program;
A processor for executing a computer program to perform the steps of the previously disclosed data processing task execution method.
In a fourth aspect, the present invention discloses a computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the previously disclosed data processing task execution method.
In the present invention, primary decoding is performed on each processing instruction in a current data processing task issued by a processor to screen out target processing instructions of a target type, where the current data processing task is a task in an image recognition model constructed based on a neural network and the target type comprises a vector type and a matrix type; the execution order of each re-decoded target processing instruction is determined; and, according to the execution order, a target executor among the operation executors is scheduled to execute the operation corresponding to each re-decoded target processing instruction so as to complete the current data processing task.
The beneficial effects are as follows. The conventional scheme of building the hardware accelerator into the processor pipeline is abandoned: the method is applied to a target hardware accelerator arranged independently outside the processor. This layout decouples the accelerator from the processor, reduces dependence on the processor pipeline, improves the scalability of the accelerator, avoids the computing-power bottleneck caused by the limitation of the processor's internal resources, and leaves more room for raising computing power. Two-stage decoding is adopted: the processing instructions of the data processing task are first decoded only to simply distinguish and screen out the vector-type and matrix-type target processing instructions, and the target processing instructions are then decoded again, that is, completely decoded. Dividing decoding into two well-defined stages improves instruction processing efficiency and accelerates data processing. Further, the execution order of the re-decoded target processing instructions is determined, and the target executors are scheduled according to that order, so that different target executors reasonably execute different operations; with each executor performing its own duty, the data processing task is executed efficiently and the overall computing power is improved.
Drawings
For a clearer description of the embodiments of the present invention, the drawings required by the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention; other drawings may be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a block diagram of a specific vector instruction set extension based on a RISC-V architecture according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for executing a data processing task according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a vector register according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a specific matrix register according to an embodiment of the present invention;
FIG. 5 is a flowchart of a specific method for executing a data processing task according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a specific target hardware accelerator according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a specific hardware accelerator data flow according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a data processing task execution device according to an embodiment of the present invention;
Fig. 9 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of the present invention.
In recent years, driven by emerging application scenarios such as big data, 5G communication and large language models, artificial intelligence technology has penetrated every aspect of human life. Meanwhile, with the rapid rise of generative artificial intelligence, enterprises increasingly adopt AIGC, natural language processing and neural networks to extend functionality and enhance user experience. The largest neural network models today are roughly 150,000 to 300,000 times the size of earlier models, and the traditional CPU can no longer meet the ever-growing computing-power demands of data task processing.
To break through the computational bottleneck of traditional computing systems, researchers have begun to focus on new hardware architectures. To balance application cost and flexibility, many mainstream processors implement instruction set extensions for single instruction, multiple data; SIMD instructions perform the same operation on multiple groups of data simultaneously, saving dynamic instruction bandwidth and program space.
For example, as shown in fig. 1, the conventional vector instruction set extension structure based on the RISC-V architecture uses a VPU (Vector Processing Unit) scheme in which the hardware accelerator is built into the CPU pipeline; it faces a bottleneck in meeting the growing computing-power demands of artificial intelligence data processing tasks and suffers from poor scalability.
The terms "comprising" and "having" in the description of the invention and in the above-described figures, as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description.
Next, the data processing task execution scheme provided by the embodiment of the present invention is described in detail. Fig. 2 is a flowchart of a data processing task execution method, which is applied to a target hardware accelerator arranged independently outside a processor and includes:
Step S11: performing primary decoding on each processing instruction in the current data processing task issued by the processor to screen out target processing instructions of a target type, wherein the current data processing task is a task in an image recognition model constructed based on a neural network, and the target type comprises a vector type and a matrix type.
In this embodiment, the performing primary decoding on each processing instruction in the current data processing task issued by the processor includes acquiring the current data processing task issued by the processor through a preset standardized interface between the processor and the target hardware accelerator, and performing primary decoding on the current data processing task.
The target hardware accelerator is independently arranged outside the processor, and a preset standardized interface exists between the processor and the target hardware accelerator, wherein the preset standardized interface is used for transmitting a current data processing task issued by the processor to the target hardware accelerator, and the current data processing task is a task in an image recognition model constructed based on a neural network, so that the target hardware accelerator performs primary decoding on the current data processing task.
In this embodiment, the target hardware accelerator includes an instruction decoder, where the instruction decoder includes a first-stage instruction decoder and a second instruction decoder, and the performing primary decoding on each processing instruction in the current data processing task issued by the processor includes performing primary decoding on each processing instruction in the current data processing task issued by the processor by using the first-stage instruction decoder.
The target hardware accelerator comprises an instruction decoder, which consists of a first-stage instruction decoder and a second-stage instruction decoder. The first-stage instruction decoder is arranged at the ID stage (instruction decoding stage) at the front end of the processor; that is, in this embodiment, two-stage decoding is performed on the current data processing task. In the primary decoding process, only target processing instructions of the target type need to be screened out, where the target type comprises the vector type and the matrix type: the first-stage decoder, placed at the front-end ID stage of the central processor, is used only to make the simple distinction of whether the current instruction is a vector instruction or a matrix instruction, and instructions that are not of the target type continue to be processed by the processor.
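As an illustrative sketch of this screening step (not the patent's implementation): a first-stage decoder only needs to inspect the major opcode to tell vector and matrix instructions apart from scalar ones. The RISC-V OP-V major opcode used below is standard; the matrix opcode value is a placeholder, since matrix extensions are not standardized.

```python
# First-stage decode sketch: screen out vector/matrix ("target") instructions.
OPV = 0b1010111            # RISC-V "V" extension major opcode (OP-V)
MATRIX_OPCODE = 0b1110111  # placeholder value; matrix opcodes are not standardized

def is_target_instruction(insn: int) -> bool:
    """Return True if the 32-bit instruction is vector- or matrix-typed."""
    opcode = insn & 0x7F   # the major opcode occupies bits [6:0]
    return opcode in (OPV, MATRIX_OPCODE)

def screen(task_instructions):
    """Split a task's instructions into offloaded (target) and scalar parts."""
    targets = [i for i in task_instructions if is_target_instruction(i)]
    scalars = [i for i in task_instructions if not is_target_instruction(i)]
    return targets, scalars
```

Only the target instructions are forwarded to the accelerator's second-stage decoder; everything else stays in the processor pipeline.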
Step S12: determining the execution order of each re-decoded target processing instruction.
In this embodiment, determining the execution order of the target processing instructions after the re-decoding includes re-decoding each target processing instruction by using the second instruction decoder to obtain micro-operations corresponding to each target processing instruction, and determining the execution order of each micro-operation.
The second instruction decoder needs to decode each target processing instruction again to determine the micro-operations corresponding to each target processing instruction. It should be understood that one data processing task may include a plurality of processing instructions, each processing instruction may correspond to a plurality of micro-operations, and dependency relationships, conflict relationships and the like may exist between different micro-operations, so the execution order needs to be determined based on the relationships between the micro-operations.
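Determining the execution order from these relationships amounts to scheduling the micro-operations so that every dependency is satisfied before an operation issues. A minimal sketch, assuming dependencies are given as a map from each micro-op index to its prerequisite indices (Kahn's topological sort; the data shapes are illustrative):

```python
from collections import deque

def schedule(micro_ops, deps):
    """Return one valid execution order for micro_ops.

    deps[i] is the set of micro-op indices that must complete before op i.
    Ops whose dependencies are all satisfied become "ready" and are issued
    first, mirroring a dependency-aware instruction scheduler.
    """
    indeg = {i: len(deps.get(i, ())) for i in range(len(micro_ops))}
    users = {i: [] for i in range(len(micro_ops))}
    for i, ds in deps.items():
        for d in ds:
            users[d].append(i)          # op d unblocks op i on completion
    ready = deque(i for i, n in indeg.items() if n == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for u in users[i]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return order
```

A real scheduler would also weigh conflict relationships and executor resource availability when several micro-ops are ready at once; this sketch only enforces dependencies.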
Step S13: scheduling, according to the execution order, a target executor among the operation executors to execute the operation corresponding to each re-decoded target processing instruction so as to complete the current data processing task.
In this embodiment, each of the operation executors includes a vector matrix operator, a vector mask controller, a cross-channel instruction processor, and a vector access manager, and the target executor is any one or more executors of each of the operation executors.
Each operation executor includes a vector matrix operator, a vector mask controller, a cross-channel instruction processor and a vector memory-access manager. Scheduling the target executors among the operation executors to execute, according to the execution order, the operations corresponding to each re-decoded target processing instruction means that each target executor in turn serves as the current executor and executes the current micro-operation in the execution order, so that each target processing instruction, and in turn the current data processing task, is completed. The target executor is any one or more of the operation executors; that is, all the operation executors may be target executors, or only the vector matrix operator may be a target executor, and so on. The micro-operations to be executed by the current target executor may be one or several; the current target executor executes its corresponding micro-operation and, after obtaining the result, proceeds to the next micro-operation.
Among them, the vector mask controller is used to process vector mask instructions. These instructions are useful when bit-level operations on the elements of a vector are required, since they operate directly on mask registers, which are typically used to control the flow of vector computation, for example for conditional execution or selective updating of vector elements.
The cross-lane instruction processor is responsible for processing cross-lane instructions: inserting a scalar operand into a vector operand; extracting a scalar operand from a vector operand; reduce instructions, which take the elements of a vector register group together with a scalar placed in element 0 of a vector register, obtain a scalar value through some reduction operation, and place that value back in element 0 of a vector register; shuffle instructions, which rearrange the elements of a vector according to a specified pattern; and slide instructions, which move data elements up or down in the vector register group, useful for sliding windows, overlapping windows, or other scenarios requiring relative displacement between elements.
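The cross-lane operations just listed can be sketched with plain lists standing in for vector registers; the function names below are illustrative, not RVV mnemonics:

```python
def reduce_sum(vec, scalar_init):
    """Reduce-style operation: sum the vector elements together with the
    scalar held in element 0; the scalar result lands in element 0 of the
    destination, with the remaining elements left as zero here."""
    return [scalar_init + sum(vec)] + [0] * (len(vec) - 1)

def shuffle(vec, pattern):
    """Shuffle-style operation: rearrange elements per an index pattern."""
    return [vec[p] for p in pattern]

def slide_down(vec, n, fill=0):
    """Slide-style operation: move elements down by n positions, filling
    vacated slots; useful for sliding/overlapping window computations."""
    return vec[n:] + [fill] * n
```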
The vector memory manager is responsible for processing vector memory-access instructions and has only one external memory interface, specifically an AXI (Advanced eXtensible Interface) interface whose bit width is 2B/DP-FLOP, a choice made to balance computing power against bandwidth. The vector memory manager contains an Address Generator (AGU) supporting the memory-access modes of the RVV instruction set, including the unit-stride mode, the strided mode and the indexed addressing mode. The vector memory manager can merge the access requests of the AGU into burst accesses and access external storage through the AXI interface.
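A minimal sketch of the address generation and request merging just described, assuming byte addressing and 8-byte elements to match the 64-bit lane width mentioned below; the merging rule (coalesce contiguous requests into one burst) is an illustrative simplification of real AXI burst formation:

```python
def generate_addresses(base, n, mode, stride=1, elem_bytes=8, index=None):
    """AGU sketch covering the three RVV access modes named in the text."""
    if mode == "unit-stride":
        return [base + i * elem_bytes for i in range(n)]   # contiguous
    if mode == "strided":
        return [base + i * stride for i in range(n)]       # fixed byte stride
    if mode == "indexed":
        return [base + off for off in index]               # gather/scatter
    raise ValueError(f"unknown access mode: {mode}")

def merge_to_bursts(addrs, elem_bytes=8):
    """Coalesce contiguous element requests into (start, byte_length) bursts
    suitable for issuing over an AXI-style interface."""
    bursts = []
    for a in sorted(addrs):
        if bursts and a == bursts[-1][0] + bursts[-1][1]:
            bursts[-1] = (bursts[-1][0], bursts[-1][1] + elem_bytes)
        else:
            bursts.append((a, elem_bytes))
    return bursts
```

Unit-stride accesses collapse into a single burst, while widely strided or indexed accesses degrade into many small requests, which is why the merging step matters for bandwidth.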
In this embodiment, the vector matrix arithmetic unit includes a vector register, a matrix register, a first arithmetic unit for vector arithmetic instruction operation, a second arithmetic unit for floating point and multiply-divide operation, a third arithmetic unit for matrix arithmetic instruction operation, and a fourth arithmetic unit for matrix multiply-accumulate operation, wherein each channel of the vector register includes a plurality of single-ports for parallel storage of data blocks, and the matrix register includes a plurality of two-dimensional matrix registers, and the number of rows and columns of each of the two-dimensional matrix registers are determined based on the row length of the two-dimensional matrix registers.
Further, the number of vector matrix operators is expandable and can be configured from 2 to 16. Each vector matrix operator is divided into registers and operators and has its own lane sequencer (operation-channel instruction sequencer) responsible for tracking 8 parallel instructions. The registers include the Vector Register File (VRF) and the Matrix Register File (MRF). As shown in the schematic diagram of a vector register in fig. 3, the VRF is implemented with a set of single-port (1RW) memory banks (data blocks); each bank is 64 bits wide, the same as the data-path width of each channel, and each channel has eight single-port memory data blocks. That is, each channel of the vector register contains a plurality of single-port banks for parallel storage of data blocks, and the width of the data blocks equals the path width of the channel. As shown in the schematic diagram of a matrix register in fig. 4, the MRF contains a plurality of two-dimensional matrix registers, specifically 8 registers, namely M0, M1, M2, M3, M4, M5, M6 and M7. RLEN denotes the row length of each register in bits and is a constant in any given implementation, for example RLEN=256; the number of rows and columns of each two-dimensional matrix register is then determined by RLEN and the width of the matrix elements being stored.
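As a hedged illustration of how the register geometry may follow from RLEN and the element width (the square-tile rule below, rows equal to columns, is an assumption for illustration, not a rule stated by the text):

```python
def matrix_register_shape(rlen_bits, elem_bits):
    """Derive one plausible (rows, cols) geometry of a 2-D matrix register
    from the constant row length RLEN and the element width in bits.
    Assumes square tiles sized for matrix multiply-accumulate."""
    cols = rlen_bits // elem_bits   # elements that fit in one RLEN-bit row
    rows = cols                     # illustrative square-tile assumption
    return rows, cols
```

Under this assumption, with RLEN=256 a register of 32-bit elements forms an 8x8 tile, while 16-bit elements form a 16x16 tile.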
The first operator (Vector Arithmetic Logic Unit, VALU) is used for vector arithmetic instruction operations, the second operator (Vector Multi-Floating-Point Unit, VMFPU) is used for floating-point and multiply-divide operations, the third operator (Matrix Arithmetic Logic Unit, MALU) is used for matrix arithmetic instruction operations, and the fourth operator (Multiply-Accumulate Unit, MAC) is used for matrix multiply and accumulate operations.
In this embodiment, scheduling the target executor among the operation executors to execute, according to the execution order, the operation corresponding to each re-decoded target processing instruction comprises: scheduling the target executors among the operation executors to be the current executor in sequence according to the execution order; if the vector mask controller is the current executor, performing, by the vector mask controller, a bit-level operation on the target element corresponding to the current instruction to complete the mask operation of the current instruction, the bit-level operation being any one or more of a bit-AND operation, a bit-OR operation and a bit-XOR operation; if the cross-channel instruction processor is the current executor, performing, by the cross-channel instruction processor, any one or more of a data fusion operation, a data extraction operation, a reduction calculation operation, a data rearrangement operation and a data displacement operation on the current instruction; and if the vector memory manager is the current executor, obtaining, by the vector memory manager, the memory-access operation type of the current instruction, performing address calculation for the current instruction according to the memory-access modes supported by the scalable vector instruction set to obtain memory requests containing memory addresses, merging the memory requests that meet preset conditions, and performing, through the Advanced eXtensible Interface, the memory operation of that type on the data at the memory addresses in the target storage area in response to the merged requests; the current instruction is any one of the re-decoded target processing instructions.
When the system schedules according to a preset execution sequence, the target executors in the operation executors are sequentially designated as current executors:
If the current executor is the vector mask controller, the controller performs bit-level logic operations on the target elements corresponding to the current instruction; the supported bit-level operations include basic logic operations such as bit AND, bit OR and bit XOR, through which the mask control function of the current instruction is completed.
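The mask controller's bit-level operations and the selective update they enable can be sketched as follows, with Python integers standing in for mask registers (illustrative only):

```python
def mask_op(op, m1, m2):
    """Bit-level mask operations the vector mask controller supports."""
    ops = {"and": lambda a, b: a & b,
           "or":  lambda a, b: a | b,
           "xor": lambda a, b: a ^ b}
    return ops[op](m1, m2)

def apply_mask(dest, src, mask):
    """Selectively update dest: element i takes src[i] only where mask bit i
    is set, which is how masks gate conditional vector execution."""
    return [s if (mask >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dest, src))]
```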
If the current executor is a cross-lane instruction processor, the cross-lane instruction processor will perform cross-lane data processing operations on the current instruction, the types of operations supported include data fusion (e.g., scalar insert vector), data extraction (e.g., scalar fetch from vector), reduction computation (e.g., vector sum reduction), data reordering (e.g., element reordering), and data displacement (e.g., sliding window operation).
If the current executor is the vector memory-access manager, it first identifies the memory operation type (load/store) of the current instruction and performs address calculation according to the memory-access modes supported by the scalable vector instruction set to generate memory requests; the supported modes include unit-stride, strided and indexed. Multiple memory requests meeting the merging conditions are combined and optimized, the actual memory operation is executed through the Advanced eXtensible Interface (AXI), the data at the specified addresses in the target storage area is read or written, and the merged memory requests are finally answered.
It should be noted that the current instruction refers to any target processing instruction after being decoded by the instruction decoder for the second time, and the whole scheduling process is unified and coordinated by the instruction scheduler, so that each executor can execute the instruction correctly in sequence.
In this embodiment, a power regulator may be further disposed in the hardware accelerator, to determine the current target executor and the non-current target executor, and to turn on the power supply path of the current target executor and turn off the power supply path of the non-current target executor by using the power regulator, so as to supply power to the current target executor, but not to supply power to the non-current target executor.
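The power regulator's gating behavior can be sketched in a few lines of Python; the class and executor names are hypothetical stand-ins for the hardware supply paths.

```python
class PowerRegulator:
    """Illustrative model: only the currently scheduled executor is powered."""

    def __init__(self, executors):
        # All supply paths start closed (no executor powered).
        self.powered = {name: False for name in executors}

    def select(self, current):
        """Open the supply path of the current executor, close all others."""
        for name in self.powered:
            self.powered[name] = (name == current)
```

Each time the scheduler advances to a new current executor, a single `select` call flips the supply paths accordingly.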
The method has the following advantages. The conventional scheme of arranging the hardware accelerator inside the processor pipeline is abandoned: the method is applied to a target hardware accelerator arranged independently outside the processor. This layout makes the hardware accelerator independent of the processor, reduces dependence on the processor pipeline, improves the expandability of the hardware accelerator, avoids the computing-power bottleneck caused by the limitation of internal processor resources, and provides larger room for improving computing power. Two-stage decoding is adopted: the processing instructions of the data processing task are first decoded simply to screen out the vector-type and matrix-type target processing instructions, and the target processing instructions are then decoded again, i.e., completely decoded. Dividing decoding into two definite stages improves instruction processing efficiency and accelerates data processing. Further, the execution sequence of the re-decoded target processing instructions is determined, and the target executors are scheduled according to this sequence, so that different target executors reasonably execute different operations; each executor performs its own duty, the division of labor is efficient, data processing tasks are handled effectively, and overall computing power is improved.
Referring to fig. 5, an embodiment of the present invention discloses a specific method for executing a data processing task, and compared with the previous embodiment, the present embodiment further describes and optimizes a technical solution. The method is applied to a target hardware accelerator which is independently arranged outside the processor, wherein the target hardware accelerator comprises an instruction scheduler, and the method comprises the following steps:
And S21, performing primary decoding on each processing instruction in the current data processing task issued by the processor to screen out target processing instructions of a target type, wherein the current data processing task is a task in an image recognition model constructed based on a neural network, and the target type comprises a vector type and a matrix type.
The image recognition task is processed by utilizing an image recognition model constructed based on a neural network, for example, a face image is recognized in real time, the face image acquired by a camera is input into the image recognition model, the image is recognized by the model, a data processing task in the recognition process is issued to a target hardware accelerator from a processor, the data processing task comprises various processing instructions, wherein the processing instructions of vector type and matrix type are target processing instructions, for example, affine transformation matrix calculation instructions, depth separable convolution instructions, vector dot product instructions and the like, and the target hardware accelerator screens out the target processing instructions.
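The first-stage screening step described above amounts to classifying each instruction by type and forwarding only the vector/matrix ones to the accelerator. A minimal sketch, with hypothetical opcode names (the real decoder would inspect opcode bit fields, not strings):

```python
# Hypothetical opcode classes for illustration only.
VECTOR_OPS = {"vadd", "vdot", "vload"}
MATRIX_OPS = {"mmul", "mmac", "affine"}

def first_stage_decode(instructions):
    """Screen out vector/matrix (target-type) instructions;
    scalar instructions stay in the CPU pipeline."""
    targets, scalars = [], []
    for ins in instructions:
        if ins["op"] in VECTOR_OPS or ins["op"] in MATRIX_OPS:
            targets.append(ins)   # forwarded to the target hardware accelerator
        else:
            scalars.append(ins)   # executed by the processor as usual
    return targets, scalars
```

Only the instructions in `targets` proceed to the second, complete decoding stage inside the accelerator.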
And S22, re-decoding each target processing instruction to obtain micro-operation and required operands corresponding to each target processing instruction, and acquiring the required operands from the processor.
Each target processing instruction is decoded again to obtain the micro-operation and required operand corresponding to each target processing instruction, and the required operand is obtained from the processor. The required operand is specifically a scalar operand, i.e., a scalar source operand or a scalar destination operand (an operand in the scalar X or F register file); for example, the required operand is read from a register in the processor.
And S23, analyzing the dependency relationship and the conflict relationship among the micro-operations by using the instruction scheduler, and determining the execution sequence of the micro-operations based on the dependency relationship, the conflict relationship and the computing resources of each operation executor.
The instruction scheduler is the global scheduler of the hardware accelerator. Its main function is to manage and control the instruction execution sequence, dependency relationships and conflict relationships in parallel vector calculation, and to ensure coordinated work with the interfaces of the functional units in the hardware accelerator. By recording and analyzing instruction states and conflict conditions, it optimizes the allocation of computing resources and improves execution efficiency. Meanwhile, this module implements a waiting and processing mechanism for unfinished instructions to ensure correctness in complex data paths.
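Determining an execution order that respects micro-operation dependencies is, at its core, a topological sort of the dependency graph. A minimal sketch using Python's standard library (the patent does not specify the scheduling algorithm; this is one common way to realize it):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def schedule_micro_ops(deps):
    """deps maps each micro-op to the set of micro-ops it depends on.
    Returns one legal execution order in which every micro-op runs
    only after all of its dependencies have run."""
    return list(TopologicalSorter(deps).static_order())
```

For a multiply that consumes two loads and feeds a store, any returned order places both loads before the multiply and the store last; a real scheduler would additionally weigh conflict relationships and per-executor resource availability when several legal orders exist.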
And step S24, scheduling a target executor in each operation executor according to the execution sequence so that the target executor executes each micro-operation by utilizing the required operand to complete the current data processing task.
Therefore, the hardware accelerator adopts an external independent design, avoiding the performance bottleneck of the traditional built-in processor scheme. Efficient instruction scheduling is realized through the instruction decoder and the instruction scheduler: vector and matrix instructions are distributed to dedicated functional modules, namely the operation executors, for parallel processing, supporting diversified tasks such as vector operation, matrix multiply-accumulate, data recombination and conditional mask control. In addition, the accelerator is compatible with the RISC-V RVV instruction set through a two-stage decoding mechanism, ensuring ecological compatibility, while the memory-access optimization of the VLSU (such as AXI interface burst transmission) and the cross-channel data processing of the SLDU (such as reduction and shuffle) further reduce delay and power consumption. Each channel of the vector register comprises a plurality of single ports for storing data blocks in parallel; this design can improve the performance of the CPU in large-scale data-parallel operation and achieve the real-time computing goals of high performance and low power consumption.
The following describes the present invention by taking a specific target hardware accelerator structure shown in fig. 6 as an example. The target hardware accelerator comprises an instruction decoder (Dispatcher), an instruction scheduler (Sequencer) and an operation executor, and the specific functions are as follows:
1) An instruction decoder (Dispatcher) is responsible for interfacing the CPU's requests with the hardware accelerator. The instruction decoder comprises a first-stage instruction decoder and a second-stage instruction decoder, i.e., two-stage decoding of the instruction. The first decoding stage is placed in the ID stage of the CPU front end and simply distinguishes whether the current instruction is a vector instruction or a matrix instruction; the second decoding stage is placed in the Dispatcher at the front end of the hardware accelerator and completely decodes the vector instruction, determining whether it needs a scalar source operand or produces a scalar destination operand.
2) The instruction scheduler (Sequencer) is the global scheduler of the hardware accelerator. Its main function is to manage and control the instruction execution sequence and dependency relationships in parallel vector calculation, and to ensure coordinated work with the interfaces of all functional units in the hardware accelerator. By recording and analyzing instruction states and conflict conditions, it optimizes the allocation of computing resources and improves execution efficiency. Meanwhile, this module implements a waiting and processing mechanism for unfinished instructions to ensure correctness in complex data paths.
3) The operation executor comprises a vector matrix operator, a vector mask controller, a cross-channel instruction processor and a vector memory manager, and is specifically as follows:
3.1 The vector matrix operator comprises a vector register, a matrix register, a first operator for vector arithmetic instruction operation, a second operator for floating point and multiply-divide operation, a third operator for matrix arithmetic instruction operation and a fourth operator for matrix multiplication and accumulation operation, wherein each channel of the vector register comprises a plurality of single-port used for parallel storage of data blocks;
3.2 A vector mask controller (i.e., MASKU) performs bit-level operations on a target element corresponding to the current instruction to complete the masking operation of the current instruction, the bit-level operations being any one or more of bit-and operations, bit-or operations, and bit-exclusive-or operations;
3.3 A cross-channel instruction processor (i.e., SLDU) performs any one or more of data fusion operation, data extraction operation, reduction calculation operation, data rearrangement operation, and data displacement operation on the current instruction;
3.4 The vector memory manager (VLSU) obtains the memory operation type of the current instruction, performs address calculation according to the memory mode supported by the expandable vector instruction set to obtain memory requests containing memory addresses, merges the memory requests that meet preset conditions, and performs the memory operation corresponding to the memory operation type on the data at the memory addresses in the target storage area through the advanced extensible interface, thereby responding to the merged memory requests. Externally, the VLSU has only one memory interface (i.e., an AXI interface) with a bit width of 2B/DP-FLOP; this choice ensures a balance between computing power and bandwidth. The vector memory manager is provided with an Address Generation Unit (AGU) to support the memory modes in the RVV instruction set, including a unit-stride mode, a strided mode and an indexed addressing mode; the VLSU can merge the memory requests of the AGU into burst accesses and access external memory through the AXI interface.
For example, a specific hardware accelerator data flow diagram shown in fig. 7, the current data processing task issued by the processor is obtained through a preset standardized interface between the processor and the target hardware accelerator, and the data flow is specifically:
1) The first-stage instruction decoder performs primary decoding on each processing instruction in the current data processing task issued by the processor;
2) The second instruction decoder decodes each target processing instruction again to obtain micro-operation and required operands corresponding to each target processing instruction, and acquires the required operands from the processor;
3) The instruction scheduler analyzes the dependency relationship and the conflict relationship among the micro-operations, determines the execution sequence of the micro-operations based on the dependency relationship, the conflict relationship and the computing resources of the operation executors, and schedules the target executors in the operation executors according to the execution sequence;
4) The target executor is any one or more executors of a vector matrix arithmetic unit, a vector mask controller, a cross-channel instruction processor and a vector access manager, the target executor utilizes the required operand to execute each micro-operation, each target executor is a current executor in turn according to the execution sequence, and the current instruction is any one instruction of the target processing instructions after each re-decoding;
4.1 If the vector mask controller is the current executor, the vector mask controller performs bit level operation on a target element corresponding to the current instruction to complete the mask operation on the current instruction, wherein the bit level operation is any one or more operations of bit and operation, bit or operation and bit exclusive or operation;
4.2 If the cross-channel instruction processor is the current executor, the cross-channel instruction processor performs any one or more operations of data fusion operation, data extraction operation, reduction calculation operation, data rearrangement operation and data displacement operation on the current instruction;
4.3 If the vector memory manager is the current executor, the vector memory manager obtains the memory operation type of the current instruction, performs address calculation on the current instruction according to the memory mode supported by the expandable vector instruction set to obtain memory requests containing memory addresses, merges the memory requests meeting preset conditions, and performs memory operation corresponding to the memory operation type on data on the memory address in a target storage area through an advanced expandable interface to respond to the merged memory requests;
4.4 The number of lanes in the vector matrix arithmetic unit is dynamically configurable (2-16), and the unit includes a vector register, a matrix register, a first arithmetic unit (VALU) for vector arithmetic instruction operations, a second arithmetic unit (VMFPU) for floating-point and multiply-divide operations, a third arithmetic unit (MALU) for matrix arithmetic instruction operations, and a fourth arithmetic unit (MAC) for matrix multiply-and-accumulate operations.
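The multiply-and-accumulate step carried out by the fourth arithmetic unit (MAC) can be stated in plain Python for reference; this is the mathematical operation `acc += A @ B`, not a model of the hardware datapath.

```python
def matrix_mac(acc, a, b):
    """Matrix multiply-and-accumulate: acc[i][j] += sum_p a[i][p] * b[p][j].
    acc is modified in place and also returned."""
    n, k, m = len(a), len(b), len(b[0])
    for i in range(n):
        for j in range(m):
            for p in range(k):
                acc[i][j] += a[i][p] * b[p][j]
    return acc
```

In a neural-network workload such as the image recognition model described here, chaining this operation over input tiles accumulates partial products without round trips to memory, which is precisely what a hardware MAC unit accelerates.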
Therefore, the invention comprehensively considers key factors such as the design, coordination and application of the hardware accelerator, and significantly improves computing efficiency and flexibility through modular division and an extensible architecture. Specifically, the hardware accelerator adopts an external independent design, avoiding the performance bottleneck of the traditional built-in VPU scheme; efficient instruction scheduling is realized through the Dispatcher and the Sequencer, which distribute vector and matrix instructions to dedicated functional modules (such as Lane, MASKU, SLDU and VLSU) for parallel processing, supporting various tasks such as vector operation, matrix multiply-accumulate, data recombination and conditional mask control. The number of Lane modules can be dynamically configured (2-16), and combined with flexible parameter adjustment (such as RLEN) of the vector register file (VRF) and matrix register file (MRF), the design adapts to calculation requirements of different scales and is particularly suitable for high-performance real-time operation in artificial-intelligence scenarios. In addition, the accelerator is compatible with the RISC-V RVV instruction set through the two-stage decoding mechanism, ensuring ecological compatibility; the memory-access optimization of the VLSU (such as AXI interface burst transmission) and the cross-channel data processing of the SLDU (such as reduction and shuffle) further reduce delay and power consumption, improve the performance of the CPU in large-scale data-parallel operation, and achieve the real-time computing goals of high performance and low power consumption.
Fig. 8 is a schematic structural diagram of a data processing task execution device according to an embodiment of the present invention, which is applied to a target hardware accelerator independently disposed outside a processor, where the device includes:
The primary decoding module 11 is used for performing primary decoding on each processing instruction in a current data processing task issued by the processor to screen out a target processing instruction of a target type, wherein the current data processing task is a task in an image recognition model constructed based on a neural network, and the target type comprises a vector type and a matrix type;
a sequence determining module 12, configured to determine an execution sequence of each of the re-decoded target processing instructions;
And the operation scheduling module 13 is configured to schedule, according to the execution order, a target executor in each operation executor to execute an operation corresponding to each re-decoded target processing instruction, so as to complete the current data processing task.
The device has the following advantages. The conventional scheme of arranging the hardware accelerator inside the processor pipeline is abandoned: the device is applied to a target hardware accelerator arranged independently outside the processor. This layout makes the hardware accelerator independent of the processor, reduces dependence on the processor pipeline, improves the expandability of the hardware accelerator, avoids the computing-power bottleneck caused by the limitation of internal processor resources, and provides larger room for improving computing power. Two-stage decoding is adopted: the processing instructions of the data processing task are first decoded simply to screen out the vector-type and matrix-type target processing instructions, and the target processing instructions are then decoded again, i.e., completely decoded. Dividing decoding into two definite stages improves instruction processing efficiency and accelerates data processing. Further, the execution sequence of the re-decoded target processing instructions is determined, and the target executors are scheduled according to this sequence, so that different target executors reasonably execute different operations; each executor performs its own duty, the division of labor is efficient, data processing tasks are handled effectively, and overall computing power is improved.
Further, the embodiment of the present application further discloses an electronic device, and fig. 9 is a block diagram of an electronic device according to an exemplary embodiment, where the content of the diagram is not to be considered as any limitation on the scope of use of the present application. The electronic device may comprise, in particular, at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25 and a communication bus 26. The memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement relevant steps in the data processing task execution method disclosed in any of the foregoing embodiments. In addition, the electronic device in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide working voltages for each hardware device on the electronic device, the communication interface 24 is configured to create a data transmission channel with an external device for the electronic device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein, and the input/output interface 25 is configured to obtain external input data or output data to the outside, where the specific interface type may be selected according to the needs of the specific application, which is not specifically limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling various hardware devices on the electronic device and the computer program 222, which may be Windows Server, netware, unix, linux, etc. The computer program 222 may further include a computer program capable of performing other specific tasks in addition to the computer program capable of performing the data processing task execution method performed by the electronic device as disclosed in any of the foregoing embodiments.
Furthermore, the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program realizes the method for executing the data processing task when being executed by a processor. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes in detail the principles and embodiments of the present application, with specific examples provided to assist understanding; these examples are illustrative only and in no way limiting. Those of ordinary skill in the art may, in light of the above teachings, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (10)
1. A data processing task execution method is characterized by being applied to a target hardware accelerator which is independently arranged outside a processor, and comprises the following steps:
Performing primary decoding on each processing instruction in a current data processing task issued by the processor to screen out a target processing instruction of a target type, wherein the current data processing task is a task in an image recognition model constructed based on a neural network, and the target type comprises a vector type and a matrix type;
determining the execution sequence of each re-decoded target processing instruction;
And scheduling a target executor in each operation executor to execute the operation corresponding to each re-decoded target processing instruction according to the execution sequence so as to complete the current data processing task.
2. The method for executing a data processing task according to claim 1, wherein the performing initial decoding on each processing instruction in the current data processing task issued by the processor includes:
Acquiring a current data processing task issued by the processor through a preset standardized interface between the processor and the target hardware accelerator;
and performing primary decoding on the current data processing task.
3. The method of claim 1, wherein the target hardware accelerator comprises an instruction decoder comprising a first stage instruction decoder and a second instruction decoder;
the primary decoding of each processing instruction in the current data processing task issued by the processor comprises the following steps:
performing primary decoding on each processing instruction in the current data processing task issued by the processor by utilizing the first-stage instruction decoder;
Correspondingly, the determining the execution sequence of the target processing instruction after each re-decoding includes:
And re-decoding each target processing instruction by using the second instruction decoder to obtain micro-operations corresponding to each target processing instruction, and determining the execution sequence of each micro-operation.
4. The method according to claim 1, wherein the target hardware accelerator includes an instruction scheduler, and wherein determining the execution order of the target processing instructions after each re-decode includes:
Re-decoding each target processing instruction to obtain micro-operations and required operands corresponding to each target processing instruction, and acquiring the required operands from the processor;
Analyzing the dependency relationship and conflict relationship between the micro-operations by using the instruction scheduler, and determining the execution sequence of the micro-operations based on the dependency relationship, the conflict relationship and the computing resources of the operation executors;
Correspondingly, the scheduling, according to the execution sequence, the target executor in each operation executor to execute the operation corresponding to each re-decoded target processing instruction includes:
And scheduling a target executor in each operation executor according to the execution sequence so that the target executor executes each micro-operation by using the required operand.
5. The method according to any one of claims 1 to 4, wherein each of the operation executors includes a vector matrix operator, a vector mask controller, a cross-lane instruction processor, and a vector memory manager, and the target executor is any one or more of the operation executors.
6. The method according to claim 5, wherein the vector matrix operator includes a vector register, a matrix register, a first operator for vector arithmetic instruction operation, a second operator for floating point and multiply-divide operation, a third operator for matrix arithmetic instruction operation, and a fourth operator for matrix multiply-and-accumulate operation, wherein each lane of the vector register includes a plurality of single ports for parallel storage of data blocks, and the matrix register includes a plurality of two-dimensional matrix registers, and the number of rows and columns of each of the two-dimensional matrix registers is determined based on the row length of the two-dimensional matrix registers.
7. The method according to claim 5, wherein the scheduling, according to the execution order, a target executor among the operation executors to execute an operation corresponding to each of the re-decoded target processing instructions includes:
Scheduling target executors in all operation executors to be current executors in sequence according to the execution sequence;
If the vector mask controller is a current executor, the vector mask controller performs bit-level operation on a target element corresponding to a current instruction to complete masking operation on the current instruction, wherein the bit-level operation is any one or more operations of bit and operation, bit or operation and bit exclusive or operation;
If the cross-channel instruction processor is the current executor, the cross-channel instruction processor performs any one or more operations of data fusion operation, data extraction operation, reduction calculation operation, data rearrangement operation and data displacement operation on the current instruction;
if the vector memory manager is the current executor, the vector memory manager obtains the memory operation type of the current instruction, performs address calculation on the current instruction according to a memory mode supported by an expandable vector instruction set to obtain memory requests containing memory addresses, merges the memory requests meeting preset conditions, and performs memory operation corresponding to the memory operation type on data on the memory address in a target storage area through an advanced expandable interface to respond to the merged memory requests;
the current instruction is any one instruction of the target processing instructions after being decoded again.
8. A data processing task execution device, characterized by being applied to a target hardware accelerator independently provided outside a processor, comprising:
the primary decoding module is used for performing primary decoding on each processing instruction in a current data processing task issued by the processor to screen out a target processing instruction of a target type, wherein the current data processing task is a task in an image recognition model constructed based on a neural network, and the target type comprises a vector type and a matrix type;
the sequence determining module is used for determining the execution sequence of each re-decoded target processing instruction;
And the operation scheduling module is used for scheduling a target executor in each operation executor to execute the operation corresponding to each re-decoded target processing instruction according to the execution sequence so as to complete the current data processing task.
9. An electronic device, comprising:
a memory for storing a computer program;
A processor for executing the computer program to perform the steps of the data processing task performing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the data processing task execution method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510694059.3A (CN120564009A) (en) | 2025-05-27 | 2025-05-27 | Data processing task execution method, device, equipment and medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120564009A (en) | 2025-08-29 |
Family
ID=96814611
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510694059.3A (CN120564009A, pending) | Data processing task execution method, device, equipment and medium | 2025-05-27 | 2025-05-27 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120564009A (en) |
- 2025-05-27: CN202510694059.3A patent application filed; CN120564009A active, pending
Similar Documents
| Publication | Title | |
|---|---|---|
| WO2024060789A1 (en) | Intelligent computing-oriented method, system and apparatus for scheduling distributed training tasks | |
| CN112381220B (en) | Neural network tensor processor | |
| US12306752B2 (en) | Processor cluster address generation | |
| US11620510B2 (en) | Platform for concurrent execution of GPU operations | |
| RU2427895C2 (en) | Multiprocessor architecture optimised for flows | |
| US8099584B2 (en) | Methods for scalably exploiting parallelism in a parallel processing system | |
| JP2020518042A (en) | Processing device and processing method | |
| CN114816529B (en) | Apparatus and method for configuring cooperative thread bundles in a vector computing system | |
| US12223011B1 (en) | Integer matrix multiplication engine using pipelining | |
| US20190130270A1 (en) | Tensor manipulation within a reconfigurable fabric using pointers | |
| CN112580792B (en) | Neural network multi-core tensor processor | |
| CN114637536B (en) | Task processing method, computing coprocessor, chip and computer equipment | |
| US8615770B1 (en) | System and method for dynamically spawning thread blocks within multi-threaded processing systems | |
| US20210342673A1 (en) | Inter-processor data transfer in a machine learning accelerator, using statically scheduled instructions | |
| US20190286971A1 (en) | Reconfigurable prediction engine for general processor counting | |
| JP2022546271A (en) | Method and apparatus for predicting kernel tuning parameters | |
| CN117808048A (en) | Operator execution method, device, equipment and storage medium | |
| JP6551751B2 (en) | Multiprocessor device | |
| US11416261B2 (en) | Group load register of a graph streaming processor | |
| CN112559053A (en) | Data synchronization processing method and device for reconfigurable processor | |
| US20210042123A1 (en) | Reducing Operations of Sum-Of-Multiply-Accumulate (SOMAC) Instructions | |
| US8959497B1 (en) | System and method for dynamically spawning thread blocks within multi-threaded processing systems | |
| CN120564009A (en) | Data processing task execution method, device, equipment and medium | |
| CN119645584A (en) | Fine-grained multi-operator parallel scheduling method and system based on heterogeneous data flow architecture | |
| US12443399B1 (en) | Method and system for code optimization based on statistical data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |