
CN121166200A - Data processing device, chip, board card and data processing method - Google Patents

Data processing device, chip, board card and data processing method

Info

Publication number
CN121166200A
CN121166200A
Authority
CN
China
Prior art keywords
data
dimension
reduction
dimensional
multidimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511329215.2A
Other languages
Chinese (zh)
Inventor
Inventor's name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202511329215.2A
Publication of CN121166200A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application relates to a data processing device, a chip, a circuit board, and a data processing method. The data processing device of this application is included in a chip, which integrates one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and meet the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.

Description

Data processing device, chip, board card and data processing method
Technical Field
The present application relates generally to the field of data processing technology. More particularly, the present application relates to a data processing apparatus, a chip, a board card, and a data processing method.
Background
In neural network computation, reduction operations are widely used in processing procedures such as data compression and normalization. They include operations such as maximum, minimum, summation, and mean, and are involved in key computations such as batch normalization (BatchNorm), commonly used in image processing networks, layer normalization (LayerNorm), commonly used in natural language processing networks, and the Softmax activation function.
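As a concrete illustration of the reductions named above, and of how Softmax is built on them, the following NumPy sketch stands in for the chip's vector instructions (the array contents are arbitrary example values):

```python
import numpy as np

# A hypothetical 2-D activation: 4 rows (non-reduction dim) x 8 features
# (reduction dim). Values are arbitrary.
x = np.arange(32, dtype=np.float32).reshape(4, 8)

# The four basic reductions, each along axis 1 (the reduction dimension).
r_max = x.max(axis=1)
r_min = x.min(axis=1)
r_sum = x.sum(axis=1)
r_mean = x.mean(axis=1)

# Softmax relies on two reductions: a max (for numerical stability)
# and a sum (for normalization).
e = np.exp(x - x.max(axis=1, keepdims=True))
softmax = e / e.sum(axis=1, keepdims=True)

print(r_sum)                 # one scalar per row
print(softmax.sum(axis=1))   # each row of the softmax sums to 1
```

BatchNorm and LayerNorm similarly compute a mean and a variance (both sum-based reductions) over the batch or feature dimension before normalizing.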
To improve the efficiency of reduction operations, modern neural network processing chips (such as cloud artificial intelligence MLU chips) are usually provided with dedicated acceleration modules and software instructions. For example, a hardware computing unit of such a chip (such as the vector operation processing device, or CT unit) provides dedicated, efficient vector instructions for high-dimensional reduction (i.e., where the reduction dimension is the high dimension). However, these instructions often carry alignment limitations: hardware performance is fully exploited only when the operand data meets certain alignment requirements (e.g., byte-level alignment). When the size of the non-reduction dimension does not meet the alignment requirement, and particularly when it is far smaller than the minimum alignment granularity, the actual performance of the reduction instruction drops markedly, instruction utilization falls, and the overall hardware efficiency of the reduction operation is severely affected.
In view of the foregoing, there is a need for a data processing scheme suited to reduction operations in which the non-reduction dimension size does not meet the alignment requirement, one that improves instruction utilization during reduction and thereby improves hardware processing efficiency.
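The efficiency cliff can be made concrete with a toy utilization model. The granularity of 64 elements and the rounding rule below are illustrative assumptions, not the MLU's actual parameters:

```python
# A toy model of the alignment limitation: assume a vector reduction
# instruction whose operands are padded up to lanes of ALIGN elements,
# so elements beyond the useful n are wasted padding.
ALIGN = 64  # hypothetical minimum alignment granularity, in elements

def lane_utilization(n: int) -> float:
    """Fraction of each padded lane that does useful work."""
    padded = -(-n // ALIGN) * ALIGN   # round n up to a multiple of ALIGN
    return n / padded

print(lane_utilization(64))   # aligned: full efficiency
print(lane_utilization(4))    # far below the granularity: ~6% useful work
```

When the non-reduction dimension holds only 4 useful elements per 64-element lane, roughly 94% of each instruction's work is spent on padding, which is exactly the scenario the folding scheme below targets.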
Disclosure of Invention
In order to solve at least one or more of the technical problems mentioned above, the present application proposes, in various aspects, a data processing apparatus, a chip, a board card, and a data processing method.
In a first aspect, the application provides a data processing device comprising a processing unit configured to execute a reduction instruction on first multidimensional data, and a storage unit configured to store data during execution of the reduction instruction. The processing unit is configured to execute the reduction instruction by: determining the reduction dimension of the first multidimensional data targeted by the reduction instruction; when a first ratio, of the size of the non-reduction dimension of the first multidimensional data to the alignment granularity required for the data to meet the instruction alignment requirement, is below a first threshold, performing a folding operation on the reduction dimension of the first multidimensional data to expand the size of the non-reduction dimension, so that a second ratio, of the expanded non-reduction dimension size to the alignment granularity, exceeds the first threshold, thereby obtaining second multidimensional data; and performing a plurality of reduction operations on the second multidimensional data to obtain a reduction result for the first multidimensional data.
In a second aspect, the application provides a chip comprising the data processing apparatus of the application described in the first aspect.
In a third aspect, the present application provides a board card comprising the chip of the application described in the second aspect.
In a fourth aspect, the application provides a data processing method implemented by a data processing device comprising a processing unit and a storage unit, the method comprising the steps of: determining the reduction dimension of first multidimensional data targeted by a reduction instruction; when a first ratio, of the size of the non-reduction dimension of the first multidimensional data to the alignment granularity required for the data to meet the instruction alignment requirement, is below a first threshold, performing a folding operation on the reduction dimension of the first multidimensional data to expand the size of the non-reduction dimension, so that a second ratio, of the expanded non-reduction dimension size to the alignment granularity, exceeds the first threshold, thereby obtaining second multidimensional data; and performing a plurality of reduction operations on the second multidimensional data to obtain a reduction result for the first multidimensional data.
Through the data processing scheme provided by embodiments of the application, the processing unit determines the reduction dimension; when the first ratio of the size of the non-reduction dimension to the alignment granularity required to meet the instruction alignment requirement is below the first threshold, it performs a folding operation on the reduction dimension to expand the non-reduction dimension until the second ratio exceeds the first threshold, and then performs a plurality of reduction operations on the resulting second multidimensional data. This improves instruction utilization in reduction operations when the non-reduction dimension is small, and thereby improves reduction processing efficiency.
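The folding scheme described above can be sketched with NumPy, with reshapes and sums standing in for the hardware reduction instructions. The sizes, the fold factor, and the divisibility assumptions (the non-reduction size divides the alignment granularity, and the fold factor divides the reduction-dim size) are hypothetical simplifications:

```python
import numpy as np

ALIGN = 64              # hypothetical alignment granularity, in elements
R, n = 1024, 4          # reduction-dim size and a small non-reduction dim

# "First multidimensional data": reduce over axis 0, keep axis 1 (size n).
x = np.arange(R * n, dtype=np.float32).reshape(R, n)

# Folding: carve a factor f out of the reduction dimension and merge it
# into the non-reduction dimension so that f * n reaches ALIGN.
f = ALIGN // n                    # fold factor (assumes n divides ALIGN, f divides R)
x2 = x.reshape(R // f, f * n)     # "second multidimensional data"; inner dim now aligned

# First reduction: one aligned pass over the folded layout yields
# f partial sums for each of the n outputs.
partial = x2.sum(axis=0)          # shape (f * n,)

# Second reduction: collapse the f partial groups to the n final outputs.
result = partial.reshape(f, n).sum(axis=0)

assert np.array_equal(result, x.sum(axis=0))  # matches the direct reduction
```

A max or min reduction folds the same way; a mean is obtained by dividing the summed result by R. The key point is that both passes now operate on an inner dimension of ALIGN elements rather than n.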
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. In the drawings, embodiments of the application are illustrated by way of example and not by way of limitation, and like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 shows a schematic diagram of a board 10 according to an embodiment of the present application;
fig. 2 is a block diagram showing a combination processing apparatus in the chip 101;
FIG. 3 shows a schematic internal structure of a computing device 201;
FIG. 4 shows a schematic diagram of the internal architecture of a processor core;
FIG. 5 shows a schematic diagram of one processor core writing data to a processor core of another cluster;
FIG. 6 illustrates exemplary three-dimensional data;
FIG. 7 illustrates a sequence of storage of multidimensional data on a memory;
FIG. 8a illustrates a reduction approach for a high-dimensional reduction;
FIG. 8b illustrates a reduction approach for a low-dimensional reduction;
FIG. 9 shows a schematic block diagram of a data processing apparatus according to some embodiments of the application;
FIG. 10 schematically illustrates a process diagram of a folding operation of an embodiment of the present application;
FIG. 11 illustrates a process diagram for performing a plurality of reduction operations on second multidimensional data;
FIG. 12 illustrates first multidimensional data having non-reduced dimensions as merged dimensions;
FIG. 13 is a schematic diagram illustrating the processing of the remaining set of data according to some embodiments of the application;
FIG. 14 is a schematic diagram illustrating the processing of the remaining set of data according to further embodiments of the present application;
FIG. 15 shows a process schematic of a conventional two-dimensional reduction operation;
FIG. 16 illustrates a schematic diagram of a reduction operation including two-dimensional reduction instructions according to an embodiment of the present application;
FIG. 17 illustrates an exemplary flow chart of a data processing method of some embodiments of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and in the claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification and claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to a determination", or "in response to detection", depending on the context. Similarly, the phrases "if it is determined" and "if a [described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining", "in response to determining", "upon detecting the [described condition or event]", or "in response to detecting the [described condition or event]".
Embodiments of the present application will be described below with reference to the accompanying drawings.
It should be noted that in embodiments of the present application, the data to be processed or already processed may be various data in a neural network, including but not limited to input neurons, weights, output neurons, gradients, and the like. The data processing device and data processing method provided by the embodiments of the application can be applied to various fields and can handle reduction operations on various types of data. For example: in image recognition and processing (facial recognition, object detection, image classification, medical image analysis, etc.), the data may include pixel data, image features, etc.; in natural language processing (NLP) (language translation, sentiment analysis, text summarization, etc.), the data may include text data, word embeddings, sentence structure, etc.; in speech recognition and processing (intelligent assistants, automatic subtitle generation, speech-to-text conversion, etc.), the data may include audio signals, spectrograms, etc.; in recommendation systems (personalized content recommendation, product recommendation, advertisement placement, etc.), the data may include user behavior data, item features, rating data, etc.; in medical health (disease diagnosis, drug discovery, gene sequence analysis, etc.), the data may include medical records, biomarker data, genomic data, etc.; in finance (risk assessment, fraud detection, stock market prediction, etc.), the data may include transaction data, user credit scores, market data, etc.; in autonomous driving (vehicle perception, decision making, path planning, etc.), the data may include sensor data, environmental features, traffic signals, etc.; in games and entertainment (game AI, virtual reality, animation generation, etc.), the data may include game state data, user interaction data, etc.; in scientific research (physical simulation, chemical compound prediction, astrophysical analysis, etc.), the data may include experimental data, simulation results, observation data, etc.; and in generative AI applications. Generative AI refers to artificial intelligence techniques that learn from large-scale data sets using complex algorithms, models, and rules to create new, original content, such as, but not limited to, text, images, sound, video, and code. Accordingly, the processed data may be image data, audio data, video data, voice data, text data, document data, etc., and the corresponding output data may likewise be image data, audio data, video data, voice data, text data, document data, etc. The output of the data processing apparatus or data processing method of the embodiments of the present application may include a likelihood score that an image belongs to a specific object category, that a document pertains to a specific subject, that a text segment in a target language is a correct translation of a text segment in a source language, or that a text segment is a correct transcription of a spoken utterance, etc. The data processing scheme of the embodiments of the application can also be used for inference and training of neural networks; in the data processing of a neural network, it can improve the utilization of reduction instructions when the network performs reduction processing for one or more tasks, improve reduction processing efficiency, and at the same time improve the inference and training speed and performance of the neural network model.
Exemplary hardware architecture
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the application. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; one notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capabilities of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations and is primarily implemented as a single-core or multi-core intelligent processor performing deep learning or machine learning computations; it may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to an on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into an on-chip control cache of the computing device 201. Alternatively or additionally, the interface device 202 may read data from the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control including, but not limited to, data handling and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors, including but not limited to a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or other general-purpose and/or special-purpose processors, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present application, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The DRAM (dynamic random access memory) 204 is used to store data to be processed. It is a DDR (double data rate) memory, typically 16 GB or larger in size, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 is configured to process input data for tasks such as computer vision, speech, natural language, and data mining. It is organized as a multi-core hierarchy: a system-on-chip comprising a plurality of clusters, each of which in turn comprises a plurality of processor cores. In other words, the computing device 201 is organized in a system-on-chip/cluster/processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 3, computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be a plurality of external memory controllers 301, 2 being shown by way of example, for accessing external memory devices, such as DRAM 204 in FIG. 2, to read data from or write data to the off-chip in response to an access request issued by the processor core. The peripheral communication module 302 is configured to receive a control signal from the processing device 203 through the interface device 202, and activate the computing device 201 to perform a task. The on-chip interconnect module 303 connects the external memory controller 301, the peripheral communication module 302, and the plurality of clusters 305 for transferring data and control signals between the respective modules. The synchronization module 304 is a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each cluster to ensure synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201, 4 being shown by way of example in the figure, and the computing device 201 of the present application may also include 8, 16, 64, or even more clusters 305 as hardware progresses. The cluster 305 is used to efficiently execute the deep learning algorithm.
At the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU cores) 306 and a memory core (MEM core) 307.
The number of processor cores 306 is illustratively shown as 4, and the present application is not limited to the number of processor cores 306. The internal architecture is shown in fig. 4. Each processor core 306 includes three major modules, a control module 41, an operation module 42, and a storage module 43.
The control module 41 is used for coordinating and controlling the operation of the operation module 42 and the storage module 43 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 411 and an instruction decode unit (instruction decode unit, IDU) 412. The instruction fetching unit 411 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 412 decodes the fetched instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations to support complex operations such as vector multiplication, addition, nonlinear transformation, etc., and the matrix operation unit 422 is responsible for core computation of the deep learning algorithm, i.e. matrix multiplication and convolution.
The storage module 43 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access module (IODMA) 433, and a move direct memory access module (MVDMA) 434. The NRAM 431 stores the feature maps computed by the processor core 306 and the intermediate results of computation; the WRAM 432 stores the weights of the deep learning network; the IODMA 433 controls memory access between the NRAM 431/WRAM 432 and the DRAM 204 through the broadcast bus 309; and the MVDMA 434 controls memory access between the NRAM 431/WRAM 432 and the SRAM 308.
Returning to FIG. 3, the storage cores 307 are primarily used to store and communicate, i.e., to store shared data or intermediate results between the processor cores 306, as well as to perform communications between the clusters 305 and the DRAM 204, between the clusters 305, between the processor cores 306, etc. In other embodiments, the memory core 307 has scalar operation capabilities to perform scalar operations.
The memory core 307 includes a shared storage unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 serves as a high-performance data transfer hub: data reused between different processor cores 306 in the same cluster 305 need not be fetched from the DRAM 204 by each processor core 306 individually, but is relayed between the processor cores 306 through the SRAM 308. The memory core 307 only needs to quickly distribute the reused data from the SRAM 308 to the multiple processor cores 306, improving inter-core communication efficiency and greatly reducing on-chip/off-chip input/output accesses.
The broadcast bus 309, the CDMA 310, and the GDMA 311 are used to perform communication between processor cores 306, communication between clusters 305, and data transfer between a cluster 305 and the DRAM 204, respectively. Each is described below.
The broadcast bus 309 is used to perform high-speed communication between the processor cores 306 in the cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to the transmission of data from point to point (i.e., single processor core to single processor core), multicast is a communication scheme that transfers a piece of data from SRAM 308 to a specific number of processor cores 306, and broadcast is a communication scheme that transfers a piece of data from SRAM 308 to all processor cores 306, a special case of multicast.
The CDMA 310 is used to control access to the SRAM 308 between different clusters 305 within the same computing device 201. Fig. 5 shows a schematic diagram of one processor core writing data to a processor core of another cluster, illustrating the operation of the CDMA 310. In this application scenario, the same computing device includes multiple clusters; for convenience of illustration, only cluster 0 and cluster 1 are shown. Each cluster includes multiple processor cores, of which the figure shows only processor core 0 in cluster 0 and processor core 1 in cluster 1. Processor core 0 is to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master end and CDMA 1 as the slave end; the master pushes the write request to the slave, that is, the master sends the write address AW and the write data W, transferring the data to SRAM 1 of cluster 1. The slave then sends a write response B as acknowledgment, and finally processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
Returning to FIG. 3, the GDMA 311 cooperates with the external memory controller 301 to control memory access from the SRAM 308 of a cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308. From the foregoing, it can be seen that communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be accomplished via two channels. The first channel connects the DRAM 204 directly with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel transfers data between the DRAM 204 and the SRAM 308 via the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 via the MVDMA 434. Although the second channel seemingly requires more components and a longer data path, in practice, in some embodiments, its bandwidth is much greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present application may select a data transmission channel based on the hardware conditions.
In other embodiments, the functions of the GDMA 311 and the IODMA 433 may be integrated in the same component. For convenience of description, the GDMA 311 and the IODMA 433 are regarded as different components; as long as the functions and technical effects achieved are similar to those of the present application, such implementations fall within the scope of protection of the present application. Further, the functions of the GDMA 311, the IODMA 433, the CDMA 310, and the MVDMA 434 may be implemented by the same component; such implementations are also within the scope of the present application as long as the functions and technical effects achieved are similar to those of the present application.
Multidimensional data
With the development of artificial intelligence technology, the data involved in tasks such as image processing and pattern recognition often takes the form of multidimensional data. To facilitate understanding of how the embodiments of the present application process multidimensional data, the related concepts of multidimensional data and some relevant dimension operations are described first below.
As the name suggests, multidimensional data is data that includes multiple dimensions. The dimension information of multidimensional data may include the number of dimensions, the size of each dimension, and the like. For N-dimensional data, for example, X_N = [x_1, x_2, …, x_i, …, x_N] may be used to represent its dimension information, where x_i (i ∈ {1, 2, …, N}) represents the size of one of the dimensions (also referred to as the dimension size). N may be an integer greater than or equal to 2. When N = 2, X_N represents two-dimensional data; when N = 3, X_N represents three-dimensional data; and so on.
FIG. 6 illustrates an exemplary three-dimensional data, or three-dimensional array. As shown in fig. 6 (a), the three-dimensional data X has three dimensions, namely, a dimension 0 (dim 0) in the data block depth direction, a dimension 1 (dim 1) in the data block height direction, and a dimension 2 (dim 2) in the data block width direction. Dimension 0 is 2, dimension 1 is 2, and dimension 2 is 3. Accordingly, the dimensional information of the three-dimensional data X may be expressed as X 3 = [2, 3] in order of the dimensions from high to low, that is, in order of the dimension 0, the dimension 1, and the dimension 2. Based on the exemplary data shown in the figures, the three-dimensional data X can be expressed as:
X = (((1,2,3),(4,5,6)) ; ((7,8,9),(10,11,12))).
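For illustration only, the three-dimensional data X of fig. 6 (a) can be sketched with numpy (numpy is our illustrative choice; the patent itself describes a hardware implementation):

```python
import numpy as np

# The three-dimensional data X of Fig. 6(a): dimension info X_3 = [2, 2, 3],
# ordered from the high dimension (dim0) to the low dimension (dim2)
X = np.arange(1, 13).reshape(2, 2, 3)

print(X.shape)              # (2, 2, 3)
print(X[0, 0, :].tolist())  # lowest dimension dim2 holds [1, 2, 3]
print(X[1, 1, :].tolist())  # last row of the second depth slice: [10, 11, 12]
```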
According to a dimension conversion rule perm_N = (p1, p2, …, pi, …, pN), a dimension-converted data arrangement may be obtained, where the value of pi (i ∈ {1, 2, …, N}) represents an original dimension of the data and the position of pi in perm_N represents the target dimension after conversion. For example, a dimension conversion rule perm_3 = (0, 2, 1) indicates that dimension 1 is to be exchanged with dimension 2, i.e., original dimension 1 becomes dimension 2 of the new array and original dimension 2 becomes dimension 1 of the new array.
Fig. 6 (b) illustrates the converted array Y obtained by applying the above exemplary dimension conversion rule perm_3 to the three-dimensional data X illustrated in fig. 6 (a). It can be seen from the figure that, compared with array X, dimension 1 and dimension 2 of array Y are swapped. The dimension information of the three-dimensional data Y may then be expressed as Y_3 = [2, 3, 2]. Still following the low-dimension-first principle, the three-dimensional data Y becomes:
Y = (((1,4),(2,5),(3,6)) ; ((7,10),(8,11),(9,12))).
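The dimension conversion above can likewise be sketched with numpy. Note that numpy's `transpose` takes, for each output axis, its source axis; for a plain swap of dim1 and dim2 this coincides with the rule perm_3 = (0, 2, 1) of the present example:

```python
import numpy as np

X = np.arange(1, 13).reshape(2, 2, 3)
# Apply perm_3 = (0, 2, 1): keep dim0, swap dim1 and dim2
Y = np.transpose(X, (0, 2, 1))

print(Y.shape)           # (2, 3, 2), i.e. Y_3 = [2, 3, 2]
print(Y[0].tolist())     # [[1, 4], [2, 5], [3, 6]]
print(Y[1].tolist())     # [[7, 10], [8, 11], [9, 12]]
```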
Although multidimensional data has a plurality of dimensions, there is a correspondence between the multidimensional data and its storage order in memory, because the layout of memory (e.g., the aforementioned DRAM and cache RAM) is always one-dimensional. Multidimensional data is typically allocated in a contiguous memory space, i.e., it can be one-dimensionally expanded (e.g., in a low-dimension-first manner) and stored sequentially in memory. Some dimension operations do not change the storage order of the elements; for example, after dimension splitting, the storage order of the multidimensional data in memory is unchanged and only its shape changes. Other dimension operations do change the storage order; for example, after the multidimensional data is transposed, its storage order in memory changes. For ease of understanding, the following description is provided in connection with fig. 7.
Fig. 7 illustrates the storage order of multidimensional data in memory, using a one-dimensional array in a block of contiguous memory to store the multidimensional data. For example, still taking the exemplary three-dimensional array X shown in fig. 6, in a low-dimension-first manner (e.g., row-first) it may be one-dimensionally expanded to X = (1,2,3,4,5,6,7,8,9,10,11,12).
Fig. 7 (a) shows the storage order of the three-dimensional data X, which coincides with its one-dimensional expansion order. In the (a) diagram, data with the same background are located in the same row (dimension dim2). It can be seen that data along the lowest dimension (the same row) are contiguous, while data along higher dimensions are spaced apart by different distances. For example, in the storage layout shown in the (a) diagram, accessing adjacent elements along dimension dim2 requires an offset of 1 position (e.g., from data 1 to data 2, data 5 to data 6, etc.), accessing adjacent elements along dimension dim1 requires an offset of 3 positions (e.g., from data 1 to data 4, data 2 to data 5, …, data 9 to data 12, etc.), and accessing adjacent elements along dimension dim0 requires an offset of 6 positions (e.g., from data 1 to data 7, data 2 to data 8, …, data 6 to data 12, etc.). This offset is called a stride. The stride of each dimension of the three-dimensional data X may be represented as S_X = (6, 3, 1).
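The per-dimension strides can be observed directly in numpy (again purely as an illustration; numpy reports strides in bytes, so dividing by the element size yields the element-count strides used in this description):

```python
import numpy as np

X = np.arange(1, 13).reshape(2, 2, 3)
# Element-count strides of the row-first layout: S_X = (6, 3, 1)
print(tuple(s // X.itemsize for s in X.strides))   # (6, 3, 1)

# When the transposed data is materialized contiguously (as when the
# converted array Y is re-stored in memory), the strides become S_Y = (6, 2, 1)
Y = np.ascontiguousarray(np.transpose(X, (0, 2, 1)))
print(tuple(s // Y.itemsize for s in Y.strides))   # (6, 2, 1)
```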
When the three-dimensional data X in fig. 6 is converted into the three-dimensional data Y by the transpose operation, it can still be one-dimensionally expanded in a low-dimension-first manner, yielding Y = (1,4,2,5,3,6,7,10,8,11,9,12).
Fig. 7 (b) shows the storage order of the three-dimensional data Y, which coincides with its one-dimensional expansion order. Similarly, in the (b) diagram, data with the same background are located in the same row, corresponding to original dimension dim1 (see Y in fig. 6 (b)). It can be seen that data along the lowest dimension (the same row) are contiguous, while data along higher dimensions are spaced apart by different distances. For example, in the storage layout shown in the (b) diagram, the stride for accessing adjacent elements along dimension dim1 is 1, i.e., an offset of 1 position is required (e.g., from data 1 to data 4, data 2 to data 5, etc.); the stride along dimension dim2 is 2, i.e., an offset of 2 positions (e.g., from data 1 to data 2, data 2 to data 3, data 4 to data 5, etc.); and the stride along dimension dim0 is 6, i.e., an offset of 6 positions (e.g., from data 1 to data 7, data 4 to data 10, data 2 to data 8, etc.). The stride of each dimension of the three-dimensional data Y may be represented as S_Y = (6, 2, 1).
As can be seen from fig. 7, when the three-dimensional data X is converted into the three-dimensional data Y by the transpose operation, not only does the dimension information change, but the storage order in memory changes as well; for example, the access stride in each dimension changes from (a) to (b).
For computer hardware implementations of multidimensional data transposition, a portion of the data is typically transferred from memory (e.g., DRAM 204 shown in FIG. 4) to a cache (e.g., the various RAMs shown in FIG. 4), where the transposition of the multidimensional data is implemented by performing one or more transpose operations with hardware instructions.
High-dimensional and low-dimensional reductions
Multidimensional data and its storage manner have been described above with reference to figs. 6 and 7; the high-dimensional and low-dimensional reductions of multidimensional data are described below with reference to figs. 8a and 8b, respectively.
Fig. 8a illustrates the reduction manner of a high-dimensional reduction. A high-dimensional reduction refers to a reduction operation performed on a higher dimension (relative to a lower dimension) of the multidimensional data: the size of the higher dimension is compressed (typically to 1), and result data retaining the size of the lower dimension is obtained.
As shown in fig. 8a, take as an example two-dimensional data [N, C] represented in order of dimensions from high to low, where N and C are the sizes of the two dimensions (dim1 and dim2, respectively). A high-dimensional reduction performs a reduction operation (such as summation, maximization, minimization, etc.) on the higher dimension (here, dim1). The original two-dimensional data [N, C] consists of N high-dimensional elements, each comprising C low-dimensional elements. After the high-dimensional reduction, the size of dim1 is compressed to 1, yielding result data with dimension information [1, C], where the element at each position of the result data is the reduction result of all N high-dimensional elements at the corresponding low-dimensional position (the same dim2 position) in the original two-dimensional data [N, C].
For example, if the original two-dimensional data is [4, 6] (N = 4, C = 6), i.e., a matrix of 4 rows and 6 columns, a high-dimensional reduction that sums over dim1 will produce [1, 6] result data, where the 1st element of the result is the sum of the 4 elements in the 1st column of the original data, the 2nd element is the sum of the 4 elements in the 2nd column, the 3rd element is the sum of the 4 elements in the 3rd column, and so on.
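As an illustrative sketch only (using numpy's axis-wise summation, not the hardware instructions of the patent), the high-dimensional reduction of a [4, 6] matrix looks like:

```python
import numpy as np

A = np.arange(24, dtype=np.float32).reshape(4, 6)   # [N, C] = [4, 6]
# High-dimensional reduction: sum along dim1 (the 4 rows), keeping dim2
high = A.sum(axis=0, keepdims=True)                 # result data [1, 6]

# Each result element is the sum of the 4 elements in its column
print(high.shape)         # (1, 6)
print(float(high[0, 0]))  # 0 + 6 + 12 + 18 = 36.0
```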
Fig. 8b illustrates the reduction manner of a low-dimensional reduction. A low-dimensional reduction refers to a reduction operation performed on a lower dimension (relative to a higher dimension) of the multidimensional data: the size of the lower dimension is compressed (typically to 1), and result data retaining the size of the higher dimension is obtained.
As shown in fig. 8b, again taking the two-dimensional data [N, C] as an example, a low-dimensional reduction performs a reduction operation (such as summation, maximization, minimization, etc.) on the lower dimension (here, dim2). The original two-dimensional data [N, C] consists of N high-dimensional elements, each comprising C low-dimensional elements. After the low-dimensional reduction, the size of dim2 is compressed to 1, yielding result data with dimension information [N, 1], where the element at each position of the result data is the reduction result of all C low-dimensional elements at the corresponding high-dimensional position (the same dim1 position) in the original two-dimensional data.
For example, if the original two-dimensional data is [4, 6] (N = 4, C = 6), i.e., a matrix of 4 rows and 6 columns, a low-dimensional reduction that sums over dim2 will produce a [4, 1] result, where the element in the 1st row of the result is the sum of the 6 elements in the 1st row of the original data, the element in the 2nd row is the sum of the 6 elements in the 2nd row, and so on.
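The mirror-image low-dimensional reduction, sketched the same illustrative way:

```python
import numpy as np

A = np.arange(24, dtype=np.float32).reshape(4, 6)   # [N, C] = [4, 6]
# Low-dimensional reduction: sum along dim2 (the 6 columns), keeping dim1
low = A.sum(axis=1, keepdims=True)                  # result data [4, 1]

print(low.shape)         # (4, 1)
print(float(low[0, 0]))  # 0 + 1 + 2 + 3 + 4 + 5 = 15.0
```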
Generally, multidimensional data is stored in a low-dimension-first manner (e.g., row-first). For a high-dimensional reduction, the corresponding data can be accessed following the storage order of the multidimensional data, which effectively avoids extra data movement and memory access overhead; however, when the size of the non-reduction dimension does not meet the alignment requirement, the instruction performance utilization is low. For a low-dimensional reduction, the corresponding data must be accessed with jumps, which greatly increases memory access overhead.
To avoid this additional memory access overhead, in a low-dimensional reduction scenario the multidimensional data could instead be stored in a high-dimension-first manner (e.g., column-first); however, the instruction performance utilization problem caused by a non-reduction dimension whose size does not meet the alignment requirement still remains.
To solve the problem of low instruction performance utilization caused by the size of the non-reduction dimension, the present application provides a technical solution that folds the reduction dimension to expand the size of the non-reduction dimension, as described in detail below.
Exemplary data processing apparatus
Fig. 9 shows a schematic block diagram of a data processing apparatus according to some embodiments of the application. As shown in fig. 9, the data processing apparatus 900 may include a processing unit 910 configured to execute a reduction instruction on first multidimensional data, and a storage unit 920 configured to store data during execution of the reduction instruction. The processing unit 910 executes the reduction instruction by: determining the reduction dimension of the first multidimensional data targeted by the reduction instruction; when a first ratio of the size of the non-reduction dimension of the first multidimensional data to the alignment granularity required for that size to meet the instruction alignment requirement is below a first threshold, performing a folding operation on the reduction dimension of the first multidimensional data to expand the size of the non-reduction dimension, obtaining second multidimensional data such that a second ratio of the expanded size of the non-reduction dimension to the alignment granularity required for the expanded size to meet the instruction alignment requirement exceeds the first threshold; and performing a plurality of reduction operations on the second multidimensional data to obtain the reduction result for the first multidimensional data.
The processing unit 910 is configured to execute a reduction instruction on the first multidimensional data, where the reduction instruction may be a high-dimensional reduction instruction or a low-dimensional reduction instruction. The reduction dimension targeted by a high-dimensional reduction instruction is a higher dimension, i.e., the dimension on which the reduction operation is performed is a higher dimension (e.g., dim1 shown in fig. 8a). The reduction dimension targeted by a low-dimensional reduction instruction is a lower dimension, i.e., the dimension on which the reduction operation is performed is a lower dimension (e.g., dim2 shown in fig. 8b).
Once the reduction dimension targeted by the reduction instruction is determined, the non-reduction dimension is determined as well. The non-reduction dimension is the dimension retained when the reduction operation is performed. For example, for a high-dimensional reduction instruction the non-reduction dimension is a lower dimension (e.g., dim2 shown in fig. 8a), and for a low-dimensional reduction instruction the non-reduction dimension is a higher dimension (e.g., dim1 shown in fig. 8b). In some embodiments, the reduction dimension is higher than the non-reduction dimension, i.e., the processing unit 910 executes a high-dimensional reduction instruction.
The storage unit 920 may be configured to store data during execution of the reduction instruction, such as the first multidimensional data, the second multidimensional data, and the reduction result. In some embodiments, the storage unit 920 may include an on-chip cache (e.g., the various RAMs shown in fig. 4), and so on.
Vector computing instructions typically have alignment constraints: the hardware reaches full performance only when the data meets specific alignment requirements. For example, an instruction may reach its theoretical peak performance when the processed data meets a 64-byte or 128-byte alignment requirement; if the data does not meet the alignment requirement, the actual performance of the instruction falls below the theoretical peak, causing a performance loss. Reduction instructions are also affected by such alignment constraints, because they rely on hardware units for parallel processing and are essentially aggregate operations on vector data; they are therefore equally subject to the hardware's data alignment requirements. When the byte size of the non-reduction dimension of the multidimensional data is much smaller than the required alignment granularity, the performance utilization of the reduction instruction is low. Specifically, the performance utilization of the instruction may be calculated by the following formula:
P = len(x) / (PAD_UP(len(x), ALIGN_SIZE) × ALIGN_SIZE)   (Equation 1);
where P represents the performance utilization of the instruction, x represents the input data, len(x) represents the byte size of the input data, PAD_UP represents the upward-alignment function, and ALIGN_SIZE is a hardware-dependent alignment granularity constant (e.g., the minimum alignment granularity). PAD_UP(len(x), ALIGN_SIZE) = ceil(len(x) / ALIGN_SIZE) (where ceil denotes rounding up), i.e., it returns the minimum number of alignment granules required. For ease of understanding, examples are given below.
Example 1: if the size of the non-reduction dimension of the first multidimensional data is 3 (i.e., it contains 3 float elements), then len(x) = 3 (float elements) × 4 (bytes per float) = 12 bytes. Assuming an alignment granularity constant of 128 bytes, PAD_UP(12, 128) = ceil(12/128) = 1 (i.e., aligned up to 1 minimum alignment granule), so the alignment granularity required for the size of the non-reduction dimension of the first multidimensional data to meet the instruction alignment requirement (i.e., the aligned byte size) is 1 × 128 = 128 bytes. Then P = len(x) / (PAD_UP(len(x), 128) × 128) = 12/128 = 0.09375 (about 9.38%). Clearly, the size of the non-reduction dimension (12 bytes) is much smaller than the minimum alignment granularity (128 bytes), resulting in an extremely low instruction performance utilization P.
Example 2: assuming the alignment granularity constant is still 128 bytes, if the size of the non-reduction dimension of the first multidimensional data is 33 (i.e., it contains 33 float elements), then len(x) = 33 × 4 = 132 bytes (exceeding the 128-byte minimum alignment granularity). Then PAD_UP(132, 128) = ceil(132/128) = 2 (i.e., aligned up to 2 minimum alignment granules), so the aligned byte size required to meet the instruction alignment requirement is 2 × 128 = 256 bytes. The performance utilization P = len(x) / (PAD_UP(len(x), 128) × 128) = 132/256 ≈ 0.5156 (about 51.56%). In this example the size of the non-reduction dimension exceeds the minimum alignment granularity, but because the byte size (132 bytes) is far below the next multiple of the minimum alignment granularity (2 × 128 = 256 bytes), the denominator increases significantly after upward alignment, and the performance utilization of the instruction remains low.
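A minimal sketch of Equation 1 and of the two examples above (function names and the 128-byte constant are our illustrative assumptions):

```python
import math

ALIGN_SIZE = 128  # assumed hardware alignment granularity, in bytes

def pad_up(nbytes: int, align: int = ALIGN_SIZE) -> int:
    """PAD_UP of Equation 1: minimum number of alignment granules covering nbytes."""
    return math.ceil(nbytes / align)

def utilization(n_elems: int, elem_bytes: int = 4) -> float:
    """P = len(x) / (PAD_UP(len(x), ALIGN_SIZE) * ALIGN_SIZE)."""
    nbytes = n_elems * elem_bytes
    return nbytes / (pad_up(nbytes) * ALIGN_SIZE)

print(utilization(3))    # Example 1: 12 / 128  = 0.09375
print(utilization(33))   # Example 2: 132 / 256 = 0.515625
```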
Both examples show that, regardless of whether the byte size of the non-reduction dimension reaches the minimum alignment granularity, instruction performance utilization may drop because of the alignment granularity required for upward alignment. Based on this, the embodiment of the application sets a first threshold and compares the first ratio with it to decide whether to trigger the folding operation, thereby ensuring that the instruction performance utilization P stays at a high level. When the first ratio is below the first threshold, the processing unit 910 triggers the folding operation to expand the non-reduction dimension; otherwise, the reduction operation is performed directly without folding. In some embodiments, the first threshold may be set according to the minimum acceptable performance utilization of the hardware; for example, it may be set to 70%, 80%, 90%, etc. as necessary, without limitation. The folding operation expands the size of the non-reduction dimension by reducing the size of the reduction dimension, merging part of the data into the non-reduction dimension so as to meet the first-threshold requirement.
In some embodiments, the processing unit 910 may be configured to perform the folding operation on the first multidimensional data by: determining the number of fold groups for the reduction dimension based on the expansion multiple required for the non-reduction dimension, where the number of fold groups is the number of data groups of the reduction dimension that are folded into one group; and performing the folding operation on the reduction dimension of the first multidimensional data according to the number of fold groups. To facilitate further understanding of the folding operation of this embodiment, a high-dimensional reduction is taken as an example below, described in connection with fig. 10.
Fig. 10 schematically illustrates the process of a folding operation according to an embodiment of the present application. Take first multidimensional data with dimension information [N, 3], i.e., N rows and 3 columns of data, and assume the first threshold is 90% and the minimum alignment granularity is 128 bytes. When the reduction dimension is dim1, the first ratio (about 9.38%) of the size of the non-reduction dimension dim2 to the alignment granularity required to satisfy the instruction alignment requirement (128 bytes) is far below the first threshold (90%), so the processing unit is triggered to perform the folding operation. For the expanded size of the non-reduction dimension to meet the instruction alignment requirement, it needs to be expanded by at least a factor of 10.
In some embodiments, the expansion multiple required for the non-reduction dimension may be determined by iteratively calculating the second ratio and comparing it with the first threshold. For example, the second ratio of the size of the non-reduction dimension expanded by a factor of 2 to the alignment granularity required for that expanded size to meet the instruction alignment requirement may be calculated first; if this second ratio does not exceed the first threshold, the next round calculates the second ratio for a factor of 3, and so on, until the second ratio exceeds the first threshold, whereupon the expansion multiple of the current round is chosen. In other embodiments, the expansion multiple may be determined from the least common multiple of the size of the non-reduction dimension of the first multidimensional data and the minimum alignment granularity; expanding the non-reduction dimension on this basis brings the second ratio to 100%, so the instruction performance utilization reaches the theoretical peak and the parallel processing capability of the hardware is fully exploited.
Returning to fig. 10, in some embodiments the number of fold groups for the reduction dimension is the same as the expansion multiple required for the non-reduction dimension. For example, based on the required expansion multiple of 10, the number of fold groups for the reduction dimension is determined to be 10, meaning that every 10 groups of data of the reduction dimension are folded into one group. As shown in fig. 10, when the reduction dimension is the higher dimension, one group of data of the reduction dimension is one row of data in the figure (also called one reduction-dimension element); every 10 rows of the first multidimensional data [N, 3] are folded into one row according to the number of fold groups (10), so that the expanded size of the non-reduction dimension reaches 30, yielding second multidimensional data [N/10, 30]. For example, data 4, data 5, and data 6 in the second row along the reduction dimension dim1 of the first multidimensional data [N, 3] shown in fig. 10 are folded into the first row, data 7, data 8, and data 9 of the third row are folded into the first row, and so on, until the folded size of the non-reduction dimension reaches 30. At this point, the second ratio (93.75%) of the expanded size of the non-reduction dimension (120 bytes) to the alignment granularity required to meet the instruction alignment requirement (128 bytes) exceeds the first threshold (e.g., 90%).
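Because the fold follows the storage order, it can be sketched as a plain reshape (a hedged numpy illustration, not the hardware mechanism; `N = 40` and all names are our own):

```python
import numpy as np

N, C, groups = 40, 3, 10                          # first data [N, 3], fold every 10 rows
first = np.arange(1, N * C + 1, dtype=np.float32).reshape(N, C)

# Folding: merge 10 reduction-dimension rows into one, expanding the
# non-reduction dimension from 3 to 30. A contiguous reshape leaves the
# flat storage order untouched, so no data is moved.
second = first.reshape(N // groups, C * groups)   # [N/10, 30]
assert np.shares_memory(first, second)            # view, not a copy

print(second.shape)             # (4, 30)
print(second[0, :6].tolist())   # rows 1 and 2 of `first` laid side by side
```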
Further, as shown in fig. 10, when the folding operation is performed on the reduction dimension of the first multidimensional data according to the number of fold groups, the non-reduction dimension is expanded sequentially following the storage order of the first multidimensional data, so the data storage order is unchanged after expansion (i.e., the second multidimensional data has the same storage order as the first multidimensional data). This avoids the extra memory access overhead and data-movement time caused by data rearrangement, and reduces the occupation of storage bandwidth and the consumption of hardware resources (such as temporary buffers and address calculation units). Meanwhile, the contiguous storage order allows the hardware processing unit to read the data efficiently according to the original data layout, without extra address-offset processing or cross-block data splicing, ensuring that data access naturally matches the hardware's parallel processing mechanism. This further improves the continuity and efficiency of reduction-instruction execution, preserves the original logical association of the data, and reduces the risk of operation errors caused by a changed storage order.
It will be appreciated that the above is by way of example and not limitation: the alignment granularity required for the expanded size of the non-reduction dimension (i.e., the denominator of the second ratio) and the alignment granularity required for the size before expansion (i.e., the denominator of the first ratio) may be the same or different. The alignment granularity required for the expanded size to meet the instruction alignment requirement may be determined from PAD_UP(len(x2), ALIGN_SIZE), and that required for the size before expansion from PAD_UP(len(x1), ALIGN_SIZE), where len(x1) represents the byte size of the non-reduction dimension of the first multidimensional data and len(x2) represents the byte size of the expanded non-reduction dimension. An example follows.
For example, assume the size of the non-reduction dimension of the first multidimensional data is 20 elements (i.e., 80 bytes) and the minimum alignment granularity is 32 elements (i.e., 128 bytes). The alignment granularity required for the size of the non-reduction dimension to meet the instruction alignment requirement is then 128 bytes, and the first ratio is 80/128 = 0.625. With the first threshold set to 0.9, the first ratio is below the first threshold, so the folding operation must be performed on the reduction dimension of the first multidimensional data. To meet the first-threshold requirement, the size of the non-reduction dimension needs to be expanded by at least a factor of 3. The expanded size will reach 60 elements (i.e., 240 bytes), which exceeds the minimum alignment granularity; from PAD_UP(len(x2), ALIGN_SIZE) = PAD_UP(240, 128) = ceil(240/128) = 2 (i.e., aligned up to 2 minimum alignment granules), the alignment granularity required for the expanded size to meet the instruction alignment requirement is 64 elements (i.e., 256 bytes). The second ratio is then 240/256 = 0.9375, exceeding the first threshold, so the size of the non-reduction dimension of the second multidimensional data is determined to be 60.
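The iterative search for the expansion multiple can be sketched as follows (a hedged illustration of the embodiment's first strategy; function names and parameters are our own):

```python
import math

def pad_up_bytes(nbytes: int, align: int) -> int:
    """Byte size after upward alignment to the granularity `align`."""
    return math.ceil(nbytes / align) * align

def expansion_factor(n_elems: int, elem_bytes: int, align: int, threshold: float) -> int:
    """Smallest fold multiple k whose second ratio exceeds the threshold."""
    k = 1
    while True:
        nbytes = n_elems * k * elem_bytes
        if nbytes / pad_up_bytes(nbytes, align) > threshold:
            return k
        k += 1

# The example above: 20 float elements (80 bytes), 128-byte granularity,
# threshold 0.9 -> factor 3 (second ratio 240/256 = 0.9375)
print(expansion_factor(20, 4, 128, 0.9))   # 3
# The [N, 3] example of Fig. 10: 3 floats -> factor 10 (120/128 = 0.9375)
print(expansion_factor(3, 4, 128, 0.9))    # 10
```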
After the folding operation is performed on the reduction dimension of the first multidimensional data to obtain second multidimensional data, the reduction result for the first multidimensional data may be obtained by performing a plurality of reduction operations on the second multidimensional data. In some embodiments, the processing unit 910 is configured to perform the plurality of reduction operations by: performing a first reduction operation along the reduction dimension of the second multidimensional data to obtain a first intermediate result; dimension-splitting the non-reduction dimension of the first intermediate result by the size of the non-reduction dimension of the first multidimensional data to obtain third multidimensional data; and performing a second reduction operation along the reduction dimension of the third multidimensional data to obtain the reduction result for the first multidimensional data. For ease of understanding, the second multidimensional data obtained by the folding operation shown in fig. 10 is taken as an example and described in connection with fig. 11.
FIG. 11 illustrates the process of performing a plurality of reduction operations on the second multidimensional data. As shown in fig. 11, still taking a high-dimensional reduction instruction as an example, a first reduction operation is performed along the reduction dimension (the higher dimension) of the second multidimensional data [N/10, 30]: the size of the reduction dimension is compressed to 1 while the size of the non-reduction dimension is maintained, yielding a first intermediate result [1, 30]. The element at each position of the first intermediate result [1, 30] is the reduction result of all N/10 high-dimensional elements at the corresponding low-dimensional position in the second multidimensional data [N/10, 30]. For example, data 01 of the first intermediate result [1, 30] in fig. 11 is obtained by reducing data 1, data 31, and so on in the first column of high-dimensional elements of the second multidimensional data [N/10, 30].
Then, according to the size of the non-reduction dimension of the first multidimensional data [N, 3], the non-reduction dimension of the first intermediate result [1, 30] is dimension-split to obtain third multidimensional data [10, 3]. Specifically, since the size of the non-reduction dimension of the first multidimensional data [N, 3] is 3, the non-reduction dimension of the first intermediate result [1, 30] is split by a size of 3, yielding third multidimensional data [10, 3] whose non-reduction dimension has size 3.
Further, a second reduction operation along the reduction dimension is performed on the third multidimensional data [10, 3]: the size of its reduction dimension is compressed to 1 while the size of the non-reduction dimension is retained, thereby obtaining the reduction result [1, 3] for the first multidimensional data [N, 3]. The element at each position of the reduction result [1, 3] is the reduction result of the 10 high-dimensional elements at the corresponding low-dimensional position of the third multidimensional data [10, 3]. For example, data 001 of the reduction result [1, 3] in fig. 11 is obtained by reducing data 01, data 04, data 07, etc. in the first column of high-dimensional elements of the third multidimensional data [10, 3].
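The full fold / first-reduce / split / second-reduce pipeline of figs. 10-11 can be checked end to end in a numpy sketch (illustrative only; `N = 40`, summation as the reduction operation, and all variable names are our assumptions), confirming that it yields the same result as a direct high-dimensional reduction:

```python
import numpy as np

N, C, groups = 40, 3, 10
first = np.arange(N * C, dtype=np.float32).reshape(N, C)   # [N, 3]

second = first.reshape(N // groups, C * groups)   # fold            -> [4, 30]
inter = second.sum(axis=0, keepdims=True)         # first reduction -> [1, 30]
third = inter.reshape(groups, C)                  # dimension split -> [10, 3]
result = third.sum(axis=0, keepdims=True)         # second reduction -> [1, 3]

direct = first.sum(axis=0, keepdims=True)         # reference: direct reduction
assert np.allclose(result, direct)
print(result.shape)   # (1, 3)
```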
Compared with directly executing a high-dimensional reduction on the first multidimensional data [N, 3], the combined scheme of the folding operation and multiple reduction operations of this embodiment raises the instruction performance utilization from 9.375% to 93.75% while avoiding extra data-movement overhead. Specifically, the second multidimensional data obtained by the folding operation meets the first-threshold requirement, with the size of the non-reduction dimension expanded by a factor of 10: the utilization of the 32 processing lanes of the hardware processing unit (taking 128 bytes as an example) rises from 3 lanes (12 bytes) to 30 lanes (120 bytes), improving computational efficiency by nearly 10 times, and the performance utilization of the instruction reaches 93.75%. Compared with the first multidimensional data, and especially when its data volume is large, the data volume of the first intermediate result obtained after the first reduction operation is very small and basically negligible, so the overall instruction performance utilization of the reduction result obtained by this embodiment is approximately 93.75%. In addition, the folding operation does not change the data storage order, nor does the dimension splitting of the first intermediate result, so extra data movement and storage overhead are effectively avoided and the overall processing efficiency of the hardware is ensured.
The data processing operations of a data processing apparatus according to embodiments of the present application have been described above exemplarily in connection with fig. 9-11, it being understood that the above description is by way of example and not limitation. For example, embodiments of the present application may not be limited to processing only two-dimensional data, but may also process three-dimensional or more-dimensional data. In some embodiments, the non-reduced dimension of the first multidimensional data may include one dimension or a dimension that is a combination of multiple dimensions. For ease of understanding, an exemplary description will be made below in connection with fig. 12.
FIG. 12 illustrates first multidimensional data whose non-reduction dimension is a merged dimension. As shown in fig. 12, it is assumed that the first multi-dimensional data to be processed is three-dimensional data X having three dimensions dim0, dim1, and dim2. dim0 has a size of 2, dim1 has a size of 2, and dim2 has a size of 3. Accordingly, the dimension information of the three-dimensional data X may be expressed as X3 = [2, 2, 3] in order of the dimensions from high to low, i.e., in the order dim0, dim1, dim2. When the reduction instruction is a high-dimensional reduction instruction, dim0 is the reduction dimension; by merging dim1 and dim2, the first multi-dimensional data can be regarded as two-dimensional data X', whose reduction dimension is still dim0 and whose non-reduction dimension is a new dimension (dim1 × dim2) formed by merging dim1 and dim2. The dimension information of the two-dimensional data X' may be expressed as X2 = [2, 6]. On this basis, the data processing apparatus of the embodiment of the present application can continue to process the two-dimensional data X' according to the operation steps described above, for example, in connection with fig. 10 and 11.
It will be appreciated that when the first multi-dimensional data is a three-dimensional, four-dimensional, or higher-dimensional data structure, if one of the original dimensions is directly used as the non-reduction dimension, the alignment efficiency and the instruction performance utilization may be extremely low because that dimension is too small. In this case, the technical solution of the embodiment of the present application may merge multiple original dimensions of the first multidimensional data into one new dimension and use the merged new dimension as the non-reduction dimension, so that the size of the non-reduction dimension is indirectly increased and the alignment requirement is more easily satisfied.
Further, whether the non-reduction dimension is a single dimension or a combination of multiple dimensions, the processing unit can uniformly calculate the first ratio and perform the folding operation in the form of a single non-reduction dimension for first multi-dimensional data with different numbers of dimensions. With this arrangement, the alignment-efficiency and performance-utilization problems caused by an undersized original non-reduction dimension in the multidimensional data can be avoided, the hardware core logic need not be modified (only a lightweight operation such as reshape for dimension merging needs to be supported), and various data scenarios in neural networks, such as two-dimensional word vectors, three-dimensional feature maps, and four-dimensional convolution tensors, can be covered, which significantly improves the universality and applicability of the scheme of the application.
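The dimension merging of fig. 12 can be sketched with a plain reshape, which leaves the storage order untouched. The shapes follow the [2, 2, 3] example; the choice of sum as the reduction operator is an illustrative assumption:

```python
import numpy as np

# Three-dimensional data X with dims dim0 = 2, dim1 = 2, dim2 = 3.
X = np.arange(2 * 2 * 3).reshape(2, 2, 3)

# Merge dim1 and dim2 into a single non-reduced dimension of size 6.
X2 = X.reshape(2, 6)  # [2, 6], storage order unchanged

# A high-dimensional reduction over dim0 now operates on two-dimensional data.
reduced = X2.sum(axis=0, keepdims=True)  # [1, 6]

# Splitting the merged dimension recovers the reduction over dim0 of X.
assert np.array_equal(reduced.reshape(2, 3), X.sum(axis=0))
```

The merged form can then be handed to the folding and multi-stage reduction steps exactly as in the two-dimensional case.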
In other embodiments, the processing unit 910 may be further configured to execute the reduction instruction by: in response to remaining group data in the second multi-dimensional data that does not satisfy the number of folded groups, performing a first reduction operation on the folded data in the second multi-dimensional data that satisfies the number of folded groups to obtain a first intermediate result; merging the remaining group data with the third multi-dimensional data to obtain merged data; and performing a second reduction operation on the merged data to obtain a reduction result. For ease of understanding, an exemplary description will be made below in connection with fig. 13.
FIG. 13 is a schematic diagram illustrating the processing of remaining group data according to some embodiments of the application. As shown in fig. 13, for the second multi-dimensional data [ N/10,30] resulting from the folding operation on the first multi-dimensional data [ N,3], if N is not divisible by the number of folded groups 10, i.e., N/10 is not an integer, there is remaining group data B in the second multi-dimensional data [ N/10,30] that does not satisfy the number of folded groups. For convenience of explanation, assuming that the first multi-dimensional data contains 66 elements in total (i.e., N = 22 rows), the second multidimensional data [ N/10,30] includes folded data A satisfying the number of folded groups (here, 10) and remaining group data B not satisfying it, where the dimension information of A is [2,30] and the dimension information of B is [2,3].
In this embodiment, a first reduction operation may be performed on the folded data A satisfying the number of folded groups in the second multi-dimensional data to obtain a first intermediate result [1,30], and the first intermediate result [1,30] may be dimension-split according to the size of the non-reduction dimension of the first multi-dimensional data [ N,3] to obtain third multi-dimensional data [10,3]. The remaining group data B may then be merged (or spliced) with the third multi-dimensional data to obtain merged data [12, 3]. Since the non-reduction dimension of the third multi-dimensional data has the same size as that of the first multi-dimensional data, the remaining group data B may be merged with the third multi-dimensional data directly, without folding.
Then, by performing a second reduction operation on the merged data [12,3], a reduction result [1,3] is obtained. The element at each position of the reduction result [1,3] is the reduction result of the 12 high-dimensional elements at the corresponding low-dimensional position in the merged data [12, 3]. For example, the data 001 of the reduction result [1,3] in fig. 13 is obtained by reducing the data 01, data 04, data 07, data 28, data 61, and data 64 in the first column of the merged data [12, 3].
As can be seen from the above description, in the remaining-group processing manner shown in fig. 13, the processing can be completed merely by splitting the second multi-dimensional data into the folded data and the remaining group data and then merging across data (i.e., merging the remaining group data with the third multi-dimensional data), without any additional folding or alignment operation on the remaining group data. This minimizes the separate operation overhead on the remaining group data, and makes full use of the dimensional characteristics of the third multi-dimensional data without changing the dimensions of the remaining group data, thereby avoiding the new hardware adaptation logic that a mismatch between the dimensions of the remaining group data and those of the third multi-dimensional data would require.
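A sketch of this remaining-group handling, assuming N = 22 rows, C = 3, 10 folded groups, and sum as the reduction operator (all illustrative assumptions):

```python
import numpy as np

N, C, groups = 22, 3, 10  # illustrative: 22 rows of 3 = 66 elements
first = np.arange(1, N * C + 1).reshape(N, C)

full_rows = (N // groups) * groups                 # 20 rows fold cleanly
A = first[:full_rows].reshape(-1, groups * C)      # folded data A: [2, 30]
B = first[full_rows:]                              # remaining group data B: [2, 3]

first_intermediate = A.sum(axis=0, keepdims=True)  # first reduction on A: [1, 30]
third = first_intermediate.reshape(groups, C)      # dimension split: [10, 3]

merged = np.concatenate([third, B], axis=0)        # merged data: [12, 3]
result = merged.sum(axis=0, keepdims=True)         # second reduction: [1, 3]

assert np.array_equal(result, first.sum(axis=0, keepdims=True))
```

Note that B enters the second reduction unmodified; only its placement changes, which is why no extra folding or alignment step is needed.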
It will be appreciated that the description above in connection with fig. 13 is by way of example and not limitation, and the present application also provides another solution for processing remaining group data. For example, in other embodiments of the present application, the processing unit may be further configured to execute the reduction instruction by: in response to remaining group data in the second multi-dimensional data that does not satisfy the number of folded groups, folding the remaining group data into one group and aligning it to the size of the expanded non-reduction dimension; and performing a plurality of reduction operations based on the aligned second multi-dimensional data to obtain a reduction result. For ease of understanding, an exemplary description will be made below in connection with fig. 14.
FIG. 14 is a schematic diagram illustrating the processing of remaining group data according to further embodiments of the present application. As shown in fig. 14, taking the second multi-dimensional data [ N/10,30] shown in fig. 13 as an example, and again assuming 66 elements in total (i.e., N = 22 rows), the second multi-dimensional data [ N/10,30] includes folded data A satisfying the number of folded groups and remaining group data B not satisfying it, where the dimension information of A is [2,30] and the dimension information of B is [2,3].
In this embodiment, the remaining group data B may be folded into one group (as shown in the drawing, folded into one row), and the size of the group in which the remaining group data B is located is aligned to the size of the expanded non-reduction dimension. As shown in fig. 14, the remaining group data B may be aligned to the size of the expanded non-reduction dimension by, for example, a zero-padding operation (data 00 in the figure), thereby yielding aligned second multi-dimensional data [3,30]. Then the first reduction operation, the dimension splitting, and the second reduction operation, for example as illustrated, are performed based on the aligned second multidimensional data to obtain the reduction result.
As can be seen from the above description, in the remaining-group processing manner shown in fig. 14, by folding the remaining group data into one group and aligning it to the expanded non-reduction dimension size, the entire second multidimensional data can be adapted to the processing granularity of the hardware, the hardware scheduling fragmentation that the irregular size of the remaining group might otherwise cause is reduced, the subsequent multiple reduction operations can be executed stably on the hardware parallel architecture, and the risk of pipeline blocking is reduced.
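The padding variant of fig. 14 can be sketched as follows (same illustrative sizes as before). Note that zero is the identity element of sum; for a PROD reduction the pad value would have to be 1 instead:

```python
import numpy as np

N, C, groups = 22, 3, 10  # illustrative: 22 rows of 3 = 66 elements
first = np.arange(1, N * C + 1).reshape(N, C)

full_rows = (N // groups) * groups
A = first[:full_rows].reshape(-1, groups * C)  # folded data A: [2, 30]
B = first[full_rows:].reshape(1, -1)           # remaining group folded into one row: [1, 6]

# Pad B with zeros (the identity element of sum) up to 30 columns.
B_pad = np.pad(B, ((0, 0), (0, groups * C - B.shape[1])))
aligned = np.concatenate([A, B_pad], axis=0)   # aligned second data: [3, 30]

# First reduction, dimension split, second reduction as before.
result = aligned.sum(axis=0, keepdims=True).reshape(groups, C).sum(axis=0, keepdims=True)
assert np.array_equal(result, first.sum(axis=0, keepdims=True))
```

The padded zeros contribute nothing to the sums, so the aligned data can flow through exactly the same multi-stage reduction pipeline as a cleanly divisible input.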
The reduction instructions executed by a data processing apparatus have been described in detail above in connection with the various figures and various embodiments from the perspective of the reduction dimension; it is to be understood that the foregoing description is by way of example and not limitation. For example, the reduction instructions may not be limited to only high-dimensional reduction instructions or low-dimensional reduction instructions; in some embodiments, the reduction instructions may also include two-dimensional reduction instructions.
Two-dimensional reduction instructions are a class of aggregate operation instructions for two-dimensional data (e.g., H×W matrices, image pixel grids, etc.). Constrained by hardware area and power consumption, current neural network processing chips generally cannot support all two-dimensional reduction instructions; for example, the current MLU chip does not support the PROD operation (a cumulative multiplication), the AND operation, the OR operation, and the like. To enable the hardware to perform these two-dimensional reduction operations, they are typically implemented in a multi-cycle fashion. This will be described below with reference to fig. 15.
Fig. 15 shows a process diagram of a conventional two-dimensional reduction operation. When performing, for example, a PROD (product) reduction operation on two-dimensional data with dimension information [ N, C ], where N is the size of the high-dimensional dimension (which may represent the number of batches or rows) and C is the size of the low-dimensional dimension (which may represent the number of channels or columns), the conventional processing is realized by performing multiplication (MUL) operations over a plurality of cycles, as shown in fig. 15.
Taking as an example a PROD reduction operation on the high-dimensional dimension of the two-dimensional data, the goal is to compress the size N of the high-dimensional dimension to 1 and finally obtain a reduction result of dimension [1, C ]. Conventionally, each high-dimensional element of the two-dimensional data, i.e., a [1, C ] vector (a single row of the two-dimensional data [ N, C ]), is taken as the basic processing unit, and MUL operations are executed sequentially. First, the first [1, C ] vector in the two-dimensional data (i.e., the first row) is selected as the initial product result; then, in a cyclic processing stage, the MUL instruction is called each cycle to multiply the currently held product result with the next [1, C ] vector and update the product result. For example, after the first cycle in the figure, the size of the remaining two-dimensional data is updated to [ N-1, C ]. The cycle is repeated until all N [1, C ] vectors in the two-dimensional data have been processed; after N-1 accumulated MUL operations, the [1, C ] PROD reduction result with the size N of the high-dimensional dimension compressed is finally obtained.
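A minimal sketch of this conventional loop (the sizes and data are illustrative assumptions):

```python
import numpy as np

# Conventional multi-cycle PROD: N - 1 element-wise MUL calls, each folding
# one more [1, C] row into the running product.
N, C = 8, 3
data = np.random.default_rng(0).integers(1, 4, size=(N, C))

prod = data[0].copy()   # initial product result: the first [1, C] row
for i in range(1, N):   # N - 1 MUL instruction calls
    prod = prod * data[i]

assert np.array_equal(prod, data.prod(axis=0))
```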
However, in this conventional processing manner, a single MUL call processes only one [1, C ] vector, so completing the PROD reduction of the entire high-dimensional dimension requires N-1 cumulative MUL instruction calls. This excessive number of instruction calls puts pressure on the continuous scheduling of the hardware pipeline, easily causes pipeline stalls or blocking, and makes it difficult to fully exploit the parallel processing capability of the hardware, thereby reducing the overall operation efficiency of the two-dimensional PROD reduction.
Based on this, the present application provides, in still other embodiments, a scheme that helps improve the computational efficiency of two-dimensional reduction operations. In still further embodiments, the processing unit is further configured to perform the corresponding reduction operation on the second or third multi-dimensional data by: in response to the reduction instruction comprising a two-dimensional reduction instruction, performing a bisection operation on the reduction dimension of the second or third multi-dimensional data to obtain a first data block and a second data block; performing a two-dimensional reduction operation on the first and second data blocks to obtain a second intermediate result; and, taking the second intermediate result obtained from the current round of operation as the input data for the next round, performing the bisection operation and the two-dimensional reduction operation in a loop until the first intermediate result or the reduction result for the first multi-dimensional data is obtained. In some embodiments, the two-dimensional reduction instructions may include at least one of a PROD instruction, an AND instruction, an OR instruction, and the like. For ease of understanding, an exemplary description will be made below in connection with fig. 16.
FIG. 16 illustrates a schematic diagram of a reduction operation involving two-dimensional reduction instructions according to an embodiment of the present application. As shown in fig. 16, assuming that the dimension information of the second or third multidimensional data is [ N, C ], and taking the reduction dimension to be the high-dimensional dimension as an example, a bisection operation may be performed on the size N of the high-dimensional dimension to obtain, for example, the first data block 1-1 and the second data block 1-2 in the illustration; when N is even, the dimension information of each data block obtained by the bisection operation is [ N/2, C ]. Then, a two-dimensional reduction operation is executed on the first data block 1-1 and the second data block 1-2 to aggregate the elements at corresponding positions of the two data blocks, obtaining a second intermediate result with dimension information [ N/2, C ]. In the next round, the bisection operation is performed again on the reduction dimension of the second intermediate result [ N/2, C ] to obtain two new data blocks, and the two-dimensional reduction operation is performed on the two new data blocks to obtain a second intermediate result with dimension information [ N/4, C ]. These steps are repeated until a result with dimension information [1, C ] is obtained.
It is to be understood that the data processing apparatus according to the embodiments of the present application is not limited to the case where N is even as shown in fig. 16, and can also handle the case where N is odd. For example, in other embodiments, the processing unit may be further configured to perform the two-dimensional reduction operation on the first data block and the second data block to obtain the second intermediate result by: in response to the size of the reduction dimension of the first data block being greater than that of the second data block, splitting the first data block into a first sub-data block and a second sub-data block such that the size of the reduction dimension of the second sub-data block is the same as that of the second data block; performing a two-dimensional reduction operation on the second data block and the second sub-data block to obtain a third intermediate result; and merging the third intermediate result with the first sub-data block to obtain the second intermediate result.
When the size of the reduction dimension of the second or third multidimensional data is odd, the bisection operation on its reduction dimension yields a first data block and a second data block with different reduction dimension sizes. To facilitate the two-dimensional reduction operation on the two blocks, the first data block, whose reduction dimension is larger, can be split. For example, the reduction dimension of the first data block is split based on the reduction dimension size of the second data block, yielding a second sub-data block whose reduction dimension has the same size as that of the second data block, so that the second data block and the second sub-data block can undergo the two-dimensional reduction operation. The split also yields a first sub-data block, whose reduction dimension size is the difference between the reduction dimension sizes of the first and second data blocks. When the third intermediate result is merged with the first sub-data block, the non-reduction dimension size is kept unchanged and the two are spliced in reduction-dimension order, yielding the second intermediate result. For ease of understanding, an example follows.
For example, assume that the dimension information of the second or third multidimensional data is [ N, C ] with N odd, e.g., N=7, and that the reduction is performed on the high-dimensional dimension. The bisection operation on the size N of the high-dimensional dimension yields two data blocks: the first data block with dimension information [4, C ] and the second data block with dimension information [3, C ]. The first data block [4, C ] is further split into a first sub-data block [1, C ] and a second sub-data block [3, C ], where the reduction dimension sizes of the second sub-data block and the second data block are both 3. Therefore, one two-dimensional reduction operation can be performed on the second data block [3, C ] and the second sub-data block [3, C ] to aggregate the elements at their corresponding positions, obtaining a third intermediate result with dimension information [3, C ]. The first sub-data block [1, C ] and the third intermediate result are then merged, following the data continuity, into a second intermediate result with dimension information [4, C ]. In the next round, the bisection operation is performed again on the reduction dimension of the second intermediate result [4, C ] to obtain a new first data block and a new second data block, and the two-dimensional reduction operation on these two new blocks yields a second intermediate result with dimension information [2, C ]. These steps are repeated until a result with dimension information [1, C ] is obtained.
Based on the above, the dimension information of the second intermediate result obtained after each round of operation is [ ceil(N/2), C ], where ceil denotes rounding up, and N and C denote the sizes of the reduction dimension and the non-reduction dimension of that round's input data, respectively. In each round, the corresponding operations described above are performed according to whether the reduction dimension size of the input data is even or odd, and the rounds are repeated until a result with dimension information [1, C ] is obtained.
With this combined scheme of bisection operations and two-dimensional reduction operations, for two-dimensional data [ N, C ] whose reduction dimension has size N, the final [1, C ] result can be obtained in only ceil(log2(N)) rounds of processing, with the amount of data processed per round successively halved. Compared with the processing of fig. 15, the processing of fig. 16 greatly reduces the number of instruction calls and increases the amount of data processed per operation, which helps reduce pipeline stalls or blocking, helps fully exploit the parallel processing capability of the hardware, and greatly improves the overall hardware utilization.
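The bisection loop, including the odd-size handling described above, can be sketched as follows with PROD as the example operator (a hypothetical reference implementation in NumPy, not the hardware instruction):

```python
import numpy as np

def bisect_prod(data):
    # Each round halves the reduction dimension with one two-dimensional MUL,
    # so only ceil(log2(N)) rounds are needed in total.
    while data.shape[0] > 1:
        n = data.shape[0]
        half = n // 2
        first_block, second_block = data[:n - half], data[n - half:]
        if first_block.shape[0] > second_block.shape[0]:
            # Odd case: split off first_sub so the two operands match in size,
            # reduce the matching parts, then splice first_sub back in front.
            first_sub = first_block[:first_block.shape[0] - second_block.shape[0]]
            third = first_block[-second_block.shape[0]:] * second_block
            data = np.concatenate([first_sub, third], axis=0)  # ceil(n/2) rows
        else:
            data = first_block * second_block                  # n/2 rows
    return data

# Odd N = 7: rounds go [7, C] -> [4, C] -> [2, C] -> [1, C].
x = np.arange(1, 7 * 3 + 1, dtype=np.int64).reshape(7, 3)
assert np.array_equal(bisect_prod(x), x.prod(axis=0, keepdims=True))
```

Replacing the `*` with an element-wise AND or OR gives the corresponding variants, since the scheme only relies on the operator being associative and commutative.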
Exemplary data processing method
The application also provides a data processing method implemented by a data processing device, where the data processing device may comprise the aforementioned processing unit and storage unit. The data processing method will be described below with reference to fig. 17.
FIG. 17 illustrates an exemplary flow chart of a data processing method of some embodiments of the application. As shown in fig. 17, the data processing method 1700 may include: in step S1702, determining a reduction dimension of first multi-dimensional data for which a reduction instruction is directed; in step S1704, when a first ratio of the size of the non-reduction dimension of the first multi-dimensional data to the alignment granularity required for the non-reduction dimension to meet the instruction alignment requirement is lower than a first threshold, performing a folding operation on the reduction dimension of the first multi-dimensional data to expand the size of the non-reduction dimension, such that a second ratio of the size of the expanded non-reduction dimension to the alignment granularity required for the expanded non-reduction dimension to meet the instruction alignment requirement exceeds the first threshold, to obtain second multi-dimensional data; and in step S1706, performing a plurality of reduction operations on the second multi-dimensional data to obtain a reduction result for the first multi-dimensional data.
In some embodiments, step S1704 may further include determining a number of collapsed groups of the reduction dimension based on the expansion factor required for the non-reduction dimension, wherein the number of collapsed groups is a number of data groups required to collapse into a set of data of the reduction dimension, and performing a collapse operation on the reduction dimension of the first multi-dimensional data according to the number of collapsed groups.
In other embodiments, step S1706 may further comprise performing a first reduction operation of the reduction dimension on the second multi-dimensional data to obtain a first intermediate result, dimension splitting the non-reduction dimension of the first intermediate result according to the size of the non-reduction dimension of the first multi-dimensional data to obtain third multi-dimensional data, and performing a second reduction operation of the reduction dimension on the third multi-dimensional data to obtain a reduction result.
In still other embodiments, the data processing method 1700 further includes: in response to remaining group data in the second multi-dimensional data not satisfying the number of folded groups, performing a first reduction operation on the folded data in the second multi-dimensional data that satisfies the number of folded groups to obtain a first intermediate result; merging the remaining group data with the third multi-dimensional data to obtain merged data; and performing a second reduction operation on the merged data to obtain the reduction result.
In some embodiments, the data processing method 1700 further includes, in response to the presence of remaining sets of data in the second multi-dimensional data that do not satisfy the number of folded sets, folding the remaining sets of data into a set and aligning to the size of the expanded non-reduction dimension, performing a plurality of reduction operations based on the aligned second multi-dimensional data to obtain a reduction result.
In other embodiments, the data processing method 1700 further includes performing the corresponding reduction operation on the second or third multi-dimensional data by: in response to the reduction instruction comprising a two-dimensional reduction instruction, performing a bisection operation on the reduction dimension of the second or third multi-dimensional data to obtain a first data block and a second data block; performing a two-dimensional reduction operation on the first data block and the second data block to obtain a second intermediate result; and performing the bisection operation and the two-dimensional reduction operation in a loop, with the second intermediate result obtained from the current round as input data for the next round, until the first intermediate result or the reduction result is obtained.
In some embodiments, the data processing method 1700 further includes performing a two-dimensional reduction operation on the first data block and the second data block to obtain a second intermediate result by splitting the first data block into a first sub-data block and a second sub-data block such that the size of the reduction dimension of the second sub-data block is the same as the size of the reduction dimension of the second data block in response to the size of the reduction dimension of the first data block being greater than the size of the reduction dimension of the second data block, performing a two-dimensional reduction operation on the second data block and the second sub-data block to obtain a third intermediate result, and merging the third intermediate result with the first sub-data block to obtain the second intermediate result.
In still other embodiments, the two-dimensional reduction instruction includes at least one of a PROD instruction, an AND instruction, and an OR instruction.
In some embodiments, the reduction dimension is higher than the non-reduction dimension.
In other embodiments, the non-reducing dimension includes one dimension, or a dimension that is a combination of multiple dimensions.
The data processing method of the embodiment of the present application has been described in detail in the foregoing with reference to the data processing apparatus, and will not be described herein.
In particular implementations, based on the disclosure and teachings of the present application, those skilled in the art will appreciate that the various types of devices described herein (e.g., computing devices or data processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned memory unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), which may be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
In embodiments of the present application, references to "instructions" may include software instructions, hardware instructions, firmware instructions, or any combination thereof. Software instructions, typically statements or commands in a programming language, are high-level abstract instructions, including, for example, function calls, machine code, byte code, etc. Hardware instructions, also referred to as machine instructions or instruction set architecture (ISA) instructions, are low-level commands that computer hardware can directly recognize and execute, including, for example, processor instructions, CPU instructions, and the like. Firmware instructions include, for example, opcodes, microcode, and the like. Further, different CPU architectures and different instruction sets may have different hardware instruction sets, which may be classified into complex instruction set computer (CISC), reduced instruction set computer (RISC), and very long instruction word (VLIW) architectures according to instruction set complexity, hardware design, execution speed, compiler complexity, instruction format, etc.
The foregoing may be better understood in light of the following clauses:
Clause a1. A data processing device comprising:
A processing unit configured to execute a reduction instruction on first multidimensional data, and
A storage unit configured to store data during execution of the reduction instruction;
wherein the processing unit is configured to execute the reduction instruction by:
determining a reduction dimension of the first multidimensional data for which the reduction instruction is directed;
When a first ratio of the size of the non-reduced dimension of the first multidimensional data to the alignment granularity required by the non-reduced dimension to meet the instruction alignment requirement is lower than a first threshold, performing a folding operation on the reduced dimension of the first multidimensional data to expand the size of the non-reduced dimension so that a second ratio of the size of the expanded non-reduced dimension to the alignment granularity required by the non-reduced dimension to meet the instruction alignment requirement exceeds the first threshold to obtain second multidimensional data, and
Performing a plurality of reduction operations on the second multi-dimensional data to obtain a reduction result for the first multi-dimensional data.
Clause A2. The data processing device of clause A1, wherein the processing unit is configured to perform the folding operation on the first multidimensional data by:
determining a number of folding groups for the reduction dimension based on the expansion multiple required for the non-reduction dimension, wherein the number of folding groups is the number of data groups to be folded into one group of data along the reduction dimension; and
performing the folding operation on the reduction dimension of the first multidimensional data according to the number of folding groups.
Clause A3. The data processing device of clause A2, wherein the processing unit is configured to perform the plurality of reduction operations on the second multidimensional data by:
performing a first reduction operation along the reduction dimension on the second multidimensional data to obtain a first intermediate result;
splitting the non-reduction dimension of the first intermediate result according to the size of the non-reduction dimension of the first multidimensional data to obtain third multidimensional data; and
performing a second reduction operation along the reduction dimension on the third multidimensional data to obtain the reduction result.
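Clauses A1-A3 together describe a fold-then-reduce-twice pipeline. The sketch below, in plain Python, shows one way this could work; `two_stage_reduce` and the use of SUM as the reduction op are hypothetical choices made for illustration, with the folded data given as a list of widened rows.

```python
def two_stage_reduce(folded, n, op=lambda a, b: a + b):
    """First reduction: collapse the (shortened) reduction dimension of the
    folded data elementwise; then split the widened row back into groups of
    the original non-reduction width n and reduce across those groups."""
    first = list(folded[0])
    for row in folded[1:]:
        first = [op(a, b) for a, b in zip(first, row)]
    # Dimension split: the widened row holds k groups of width n
    groups = [first[i:i + n] for i in range(0, len(first), n)]
    result = list(groups[0])
    for g in groups[1:]:
        result = [op(a, b) for a, b in zip(result, g)]
    return result
```

With the folded 2x4 data from the earlier example and n = 2, this yields the same answer as summing the original 4x2 data straight down its reduction dimension.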
Clause A4. The data processing device of clause A3, wherein the processing unit is further configured to execute the reduction instruction by:
in response to the second multidimensional data containing remaining group data that does not satisfy the number of folding groups, performing the first reduction operation on the folded data in the second multidimensional data that satisfies the number of folding groups to obtain the first intermediate result;
merging the remaining group data with the third multidimensional data to obtain merged data; and
performing the second reduction operation on the merged data to obtain the reduction result.
Clause A5. The data processing device of clause A2 or A3, wherein the processing unit is further configured to execute the reduction instruction by:
in response to the second multidimensional data containing remaining group data that does not satisfy the number of folding groups, folding the remaining group data into one group and aligning it to the expanded size of the non-reduction dimension; and
performing the plurality of reduction operations on the aligned second multidimensional data to obtain the reduction result.
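Clause A5's remainder handling amounts to padding the short trailing row (the leftover groups folded into one) up to the expanded width. A hedged sketch, assuming the padding value is the reduction op's identity element so that it cannot perturb the result; `pad_remainder` is an illustrative name, not from the source.

```python
def pad_remainder(folded, width, identity=0):
    """Pad any short row to `width` so every row meets the expanded
    non-reduction size; `identity` must be the neutral element of the
    reduction op (0 for SUM, 1 for PROD) so the padding is a no-op."""
    return [row + [identity] * (width - len(row)) for row in folded]
```

A SUM reduction pads with zeros; a PROD reduction would pass `identity=1` instead.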
Clause A6. The data processing device of clause A3, wherein the processing unit is further configured to perform the corresponding reduction operation on the second multidimensional data or the third multidimensional data by:
in response to the reduction instruction comprising a two-dimensional reduction instruction, performing a bisection operation on the reduction dimension of the second multidimensional data or the third multidimensional data to obtain a first data block and a second data block;
performing a two-dimensional reduction operation on the first data block and the second data block to obtain a second intermediate result; and
taking the second intermediate result obtained in the current round as the input data of the next round, and cyclically performing the bisection operation and the two-dimensional reduction operation until the first intermediate result or the reduction result is obtained.
Clause A7. The data processing device of clause A6, wherein the processing unit is further configured to perform the two-dimensional reduction operation on the first data block and the second data block to obtain the second intermediate result by:
in response to the size of the reduction dimension of the first data block being greater than the size of the reduction dimension of the second data block, splitting the first data block into a first sub-block and a second sub-block such that the size of the reduction dimension of the second sub-block equals the size of the reduction dimension of the second data block;
performing the two-dimensional reduction operation on the second data block and the second sub-block to obtain a third intermediate result; and
merging the third intermediate result with the first sub-block to obtain the second intermediate result.
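Clauses A6 and A7 describe a halving loop in which unequal halves are reconciled by carrying the larger block's unmatched portion into the next round. A plain-Python sketch under stated assumptions: `op` stands in for the elementwise two-dimensional reduction instruction (e.g. a SUM, PROD, AND, or OR), and rows are the groups along the reduction dimension.

```python
def bisect_reduce(rows, op):
    """Cyclically bisect the reduction dimension and combine the two blocks
    pairwise with `op`; when the first block is larger (odd group count), its
    unmatched rows are carried forward, mirroring clause A7's sub-block merge."""
    while len(rows) > 1:
        mid = (len(rows) + 1) // 2
        first, second = rows[:mid], rows[mid:]
        carried = first[len(second):]  # rows with no partner this round
        combined = [[op(a, b) for a, b in zip(r1, r2)]
                    for r1, r2 in zip(first, second)]
        rows = combined + carried
    return rows[0]
```

Each round roughly halves the reduction-dimension size, so the loop finishes in about log2(groups) rounds rather than a linear pass.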
Clause A8. The data processing device of clause A6 or A7, wherein the two-dimensional reduction instruction comprises at least one of a PROD instruction, an AND instruction, or an OR instruction.
Clause A9. The data processing device of any of clauses A1-A8, wherein the reduction dimension is higher than the non-reduction dimension.
Clause A10. The data processing device of any of clauses A1-A9, wherein the non-reduction dimension comprises a single dimension or a dimension formed by merging multiple dimensions.
Clause A11. A chip comprising the data processing device of any of clauses A1-A10.
Clause A12. A board card comprising the chip of clause A11.
Clause A13. A data processing method implemented by a data processing device, the data processing device comprising a processing unit and a storage unit, the data processing method comprising:
the processing unit determining a reduction dimension of first multidimensional data targeted by a reduction instruction;
when a first ratio of the size of the non-reduction dimension of the first multidimensional data to the alignment granularity required for the non-reduction dimension to meet the instruction alignment requirement is below a first threshold, performing a folding operation on the reduction dimension of the first multidimensional data to expand the size of the non-reduction dimension, so that a second ratio of the expanded size of the non-reduction dimension to that alignment granularity exceeds the first threshold, thereby obtaining second multidimensional data; and
performing a plurality of reduction operations on the second multidimensional data to obtain a reduction result for the first multidimensional data.
Clause A14. The data processing method of clause A13, wherein performing the folding operation on the reduction dimension of the first multidimensional data comprises:
determining a number of folding groups for the reduction dimension based on the expansion multiple required for the non-reduction dimension, wherein the number of folding groups is the number of data groups to be folded into one group of data along the reduction dimension; and
performing the folding operation on the reduction dimension of the first multidimensional data according to the number of folding groups.
Clause A15. The data processing method of clause A14, wherein performing the plurality of reduction operations on the second multidimensional data comprises:
performing a first reduction operation along the reduction dimension on the second multidimensional data to obtain a first intermediate result;
splitting the non-reduction dimension of the first intermediate result according to the size of the non-reduction dimension of the first multidimensional data to obtain third multidimensional data; and
performing a second reduction operation along the reduction dimension on the third multidimensional data to obtain the reduction result.
Clause A16. The data processing method of clause A15, further comprising:
in response to the second multidimensional data containing remaining group data that does not satisfy the number of folding groups, performing the first reduction operation on the folded data in the second multidimensional data that satisfies the number of folding groups to obtain the first intermediate result;
merging the remaining group data with the third multidimensional data to obtain merged data; and
performing the second reduction operation on the merged data to obtain the reduction result.
Clause A17. The data processing method of clause A14 or A15, further comprising:
in response to the second multidimensional data containing remaining group data that does not satisfy the number of folding groups, folding the remaining group data into one group and aligning it to the expanded size of the non-reduction dimension; and
performing the plurality of reduction operations on the aligned second multidimensional data to obtain the reduction result.
Clause A18. The data processing method of clause A15, further comprising performing the corresponding reduction operation on the second multidimensional data or the third multidimensional data by:
in response to the reduction instruction comprising a two-dimensional reduction instruction, performing a bisection operation on the reduction dimension of the second multidimensional data or the third multidimensional data to obtain a first data block and a second data block;
performing a two-dimensional reduction operation on the first data block and the second data block to obtain a second intermediate result; and
taking the second intermediate result obtained in the current round as the input data of the next round, and cyclically performing the bisection operation and the two-dimensional reduction operation until the first intermediate result or the reduction result is obtained.
Clause A19. The data processing method of clause A18, further comprising performing the two-dimensional reduction operation on the first data block and the second data block to obtain the second intermediate result by:
in response to the size of the reduction dimension of the first data block being greater than the size of the reduction dimension of the second data block, splitting the first data block into a first sub-block and a second sub-block such that the size of the reduction dimension of the second sub-block equals the size of the reduction dimension of the second data block;
performing the two-dimensional reduction operation on the second data block and the second sub-block to obtain a third intermediate result; and
merging the third intermediate result with the first sub-block to obtain the second intermediate result.
Clause A20. The data processing method of clause A18 or A19, wherein the two-dimensional reduction instruction comprises at least one of a PROD instruction, an AND instruction, or an OR instruction.
Clause A21. The data processing method of any of clauses A13-A20, wherein the reduction dimension is higher than the non-reduction dimension.
Clause A22. The data processing method of any of clauses A13-A21, wherein the non-reduction dimension comprises a single dimension or a dimension formed by merging multiple dimensions.
While various embodiments of the present application have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the application. It should be understood that various alternatives to the embodiments described herein may be employed in practicing the application. The appended claims are intended to define the scope of the application and therefore to cover all equivalents or alternatives falling within their scope.

Claims (22)

1. A data processing device, comprising:
a processing unit configured to execute a reduction instruction on first multidimensional data; and
a storage unit configured to store data during execution of the reduction instruction;
wherein the processing unit is configured to execute the reduction instruction by:
determining a reduction dimension of the first multidimensional data targeted by the reduction instruction;
when a first ratio of the size of the non-reduction dimension of the first multidimensional data to the alignment granularity required for the non-reduction dimension to meet the instruction alignment requirement is below a first threshold, performing a folding operation on the reduction dimension of the first multidimensional data to expand the size of the non-reduction dimension, so that a second ratio of the expanded size of the non-reduction dimension to that alignment granularity exceeds the first threshold, thereby obtaining second multidimensional data; and
performing a plurality of reduction operations on the second multidimensional data to obtain a reduction result for the first multidimensional data.
2. The data processing device of claim 1, wherein the processing unit is configured to perform the folding operation on the first multidimensional data by:
determining a number of folding groups for the reduction dimension based on the expansion multiple required for the non-reduction dimension, wherein the number of folding groups is the number of data groups to be folded into one group of data along the reduction dimension; and
performing the folding operation on the reduction dimension of the first multidimensional data according to the number of folding groups.
3. The data processing device of claim 2, wherein the processing unit is configured to perform the plurality of reduction operations on the second multidimensional data by:
performing a first reduction operation along the reduction dimension on the second multidimensional data to obtain a first intermediate result;
splitting the non-reduction dimension of the first intermediate result according to the size of the non-reduction dimension of the first multidimensional data to obtain third multidimensional data; and
performing a second reduction operation along the reduction dimension on the third multidimensional data to obtain the reduction result.
4. The data processing device of claim 3, wherein the processing unit is further configured to execute the reduction instruction by:
in response to the second multidimensional data containing remaining group data that does not satisfy the number of folding groups, performing the first reduction operation on the folded data in the second multidimensional data that satisfies the number of folding groups to obtain the first intermediate result;
merging the remaining group data with the third multidimensional data to obtain merged data; and
performing the second reduction operation on the merged data to obtain the reduction result.
5. The data processing device of claim 2 or 3, wherein the processing unit is further configured to execute the reduction instruction by:
in response to the second multidimensional data containing remaining group data that does not satisfy the number of folding groups, folding the remaining group data into one group and aligning it to the expanded size of the non-reduction dimension; and
performing the plurality of reduction operations on the aligned second multidimensional data to obtain the reduction result.
6. The data processing device of claim 3, wherein the processing unit is further configured to perform the corresponding reduction operation on the second multidimensional data or the third multidimensional data by:
in response to the reduction instruction comprising a two-dimensional reduction instruction, performing a bisection operation on the reduction dimension of the second multidimensional data or the third multidimensional data to obtain a first data block and a second data block;
performing a two-dimensional reduction operation on the first data block and the second data block to obtain a second intermediate result; and
taking the second intermediate result obtained in the current round as the input data of the next round, and cyclically performing the bisection operation and the two-dimensional reduction operation until the first intermediate result or the reduction result is obtained.
7. The data processing device of claim 6, wherein the processing unit is further configured to perform the two-dimensional reduction operation on the first data block and the second data block to obtain the second intermediate result by:
in response to the size of the reduction dimension of the first data block being greater than the size of the reduction dimension of the second data block, splitting the first data block into a first sub-block and a second sub-block such that the size of the reduction dimension of the second sub-block equals the size of the reduction dimension of the second data block;
performing the two-dimensional reduction operation on the second data block and the second sub-block to obtain a third intermediate result; and
merging the third intermediate result with the first sub-block to obtain the second intermediate result.
8. The data processing device of claim 6 or 7, wherein the two-dimensional reduction instruction comprises at least one of a PROD instruction, an AND instruction, or an OR instruction.
9. The data processing device of any one of claims 1-8, wherein the reduction dimension is higher than the non-reduction dimension.
10. The data processing device of any one of claims 1-9, wherein the non-reduction dimension comprises a single dimension or a dimension formed by merging multiple dimensions.
11. A chip, characterized in that the chip comprises the data processing device of any one of claims 1-10.
12. A board card, characterized in that the board card comprises the chip of claim 11.
13. A data processing method implemented by a data processing device, the data processing device comprising a processing unit and a storage unit, the data processing method comprising:
the processing unit determining a reduction dimension of first multidimensional data targeted by a reduction instruction;
when a first ratio of the size of the non-reduction dimension of the first multidimensional data to the alignment granularity required for the non-reduction dimension to meet the instruction alignment requirement is below a first threshold, performing a folding operation on the reduction dimension of the first multidimensional data to expand the size of the non-reduction dimension, so that a second ratio of the expanded size of the non-reduction dimension to that alignment granularity exceeds the first threshold, thereby obtaining second multidimensional data; and
performing a plurality of reduction operations on the second multidimensional data to obtain a reduction result for the first multidimensional data.
14. The data processing method of claim 13, wherein performing the folding operation on the reduction dimension of the first multidimensional data comprises:
determining a number of folding groups for the reduction dimension based on the expansion multiple required for the non-reduction dimension, wherein the number of folding groups is the number of data groups to be folded into one group of data along the reduction dimension; and
performing the folding operation on the reduction dimension of the first multidimensional data according to the number of folding groups.
15. The data processing method of claim 14, wherein performing the plurality of reduction operations on the second multidimensional data comprises:
performing a first reduction operation along the reduction dimension on the second multidimensional data to obtain a first intermediate result;
splitting the non-reduction dimension of the first intermediate result according to the size of the non-reduction dimension of the first multidimensional data to obtain third multidimensional data; and
performing a second reduction operation along the reduction dimension on the third multidimensional data to obtain the reduction result.
16. The data processing method of claim 15, further comprising:
in response to the second multidimensional data containing remaining group data that does not satisfy the number of folding groups, performing the first reduction operation on the folded data in the second multidimensional data that satisfies the number of folding groups to obtain the first intermediate result;
merging the remaining group data with the third multidimensional data to obtain merged data; and
performing the second reduction operation on the merged data to obtain the reduction result.
17. The data processing method of claim 14 or 15, further comprising:
in response to the second multidimensional data containing remaining group data that does not satisfy the number of folding groups, folding the remaining group data into one group and aligning it to the expanded size of the non-reduction dimension; and
performing the plurality of reduction operations on the aligned second multidimensional data to obtain the reduction result.
18. The data processing method of claim 15, further comprising performing the corresponding reduction operation on the second multidimensional data or the third multidimensional data by:
in response to the reduction instruction comprising a two-dimensional reduction instruction, performing a bisection operation on the reduction dimension of the second multidimensional data or the third multidimensional data to obtain a first data block and a second data block;
performing a two-dimensional reduction operation on the first data block and the second data block to obtain a second intermediate result; and
taking the second intermediate result obtained in the current round as the input data of the next round, and cyclically performing the bisection operation and the two-dimensional reduction operation until the first intermediate result or the reduction result is obtained.
19. The data processing method of claim 18, further comprising performing the two-dimensional reduction operation on the first data block and the second data block to obtain the second intermediate result by:
in response to the size of the reduction dimension of the first data block being greater than the size of the reduction dimension of the second data block, splitting the first data block into a first sub-block and a second sub-block such that the size of the reduction dimension of the second sub-block equals the size of the reduction dimension of the second data block;
performing the two-dimensional reduction operation on the second data block and the second sub-block to obtain a third intermediate result; and
merging the third intermediate result with the first sub-block to obtain the second intermediate result.
20. The data processing method of claim 18 or 19, wherein the two-dimensional reduction instruction comprises at least one of a PROD instruction, an AND instruction, or an OR instruction.
21. The data processing method of any one of claims 13-20, wherein the reduction dimension is higher than the non-reduction dimension.
22. The data processing method of any one of claims 13-21, wherein the non-reduction dimension comprises a single dimension or a dimension formed by merging multiple dimensions.
CN202511329215.2A 2025-09-17 2025-09-17 Data processing device, chip, board card and data processing method Pending CN121166200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511329215.2A CN121166200A (en) 2025-09-17 2025-09-17 Data processing device, chip, board card and data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511329215.2A CN121166200A (en) 2025-09-17 2025-09-17 Data processing device, chip, board card and data processing method

Publications (1)

Publication Number Publication Date
CN121166200A true CN121166200A (en) 2025-12-19

Family

ID=98030069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511329215.2A Pending CN121166200A (en) 2025-09-17 2025-09-17 Data processing device, chip, board card and data processing method

Country Status (1)

Country Link
CN (1) CN121166200A (en)

Similar Documents

Publication Publication Date Title
CN112633490B (en) Data processing devices, methods and related products for executing neural network models
US8131659B2 (en) Field-programmable gate array based accelerator system
CN207993065U (en) Configurable accelerator frame apparatus and the system for depth convolutional neural networks
US20210125071A1 (en) Structured Pruning for Machine Learning Model
US12067373B2 (en) Hybrid filter banks for artificial neural networks
CN113762493B (en) Compression method and device of neural network model, acceleration unit and computing system
US11620516B2 (en) Specializing neural networks for heterogeneous systems
CN107992940A (en) Implementation method and device of a kind of convolutional neural networks on FPGA
CN113761934B (en) Word vector representation method based on self-attention mechanism and self-attention model
CN115221102A (en) Method for optimizing convolution operation of system on chip and related product
US20230259780A1 (en) Neural network sparsification apparatus and method and related product
TW202429312A (en) Method and apparatus for neural network weight block compression in a compute accelerator
CN110222835A (en) A kind of convolutional neural networks hardware system and operation method based on zero value detection
US20230196093A1 (en) Neural network processing
Zhou et al. Addressing sparsity in deep neural networks
CN109871934A (en) Feature selection method for parallel binary moth to flame algorithm based on Spark distributed
CN113469365B (en) Inference and compilation methods based on neural network models and related products
KR102767977B1 (en) AI core, AI core system and load/store method of AI core system
CN115565104A (en) An action prediction method and related equipment
WO2024179485A1 (en) Image processing method and related device thereof
US20230023859A1 (en) Methods and Apparatus for Accessing External Memory in a Neural Network Processing System
CN113780539B (en) Neural network data processing method, device, equipment and storage medium
US20250086125A1 (en) Neural network accelerator with memory having bank-specific clock domain crossing buffers
CN121166200A (en) Data processing device, chip, board card and data processing method
US20250045122A1 (en) Machine learning model scalability with distributed multi-layer processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination