CN115186815A

CN115186815A - Data processing method and device, electronic device and medium

Info

Publication number: CN115186815A
Application number: CN202210916696.7A
Authority: CN
Inventors: 不公告发明人
Original assignee: Shanghai Biren Intelligent Technology Co Ltd
Current assignee: Shanghai Biren Intelligent Technology Co Ltd
Priority date: 2022-08-01
Filing date: 2022-08-01
Publication date: 2022-10-14
Anticipated expiration: 2042-08-01
Also published as: CN115186815B

Abstract

A data processing method, data processing apparatus, electronic apparatus and computer-readable storage medium. The data processing method includes: loading input data into a first register and loading mask data into a second register, the size of the input data is N*M, the size of the mask data is P*Q, and each element of the mask data is For Z bits, Q*Z=2*M, each bit of each element of the masked data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/ 2=Z, i bits stored in the same column of the second register sequentially correspond to consecutive i elements located in the same row in the input data, and N, M, P, Q, i, j, and Z are all positive integers ; Based on the correspondence between each element of the input data and each bit of each element of the masked data, perform a product calculation to obtain the output data. The method can effectively reduce the number of assembly instructions, greatly improve the running efficiency, and reduce the running time.

Description

Data processing method and device, electronic device and medium

Technical Field

Embodiments of the present disclosure relate to a data processing method, a data processing apparatus, an electronic apparatus, and a computer-readable storage medium.

Background

In the machine learning model, if the parameters of the model are too many and the training samples are too few, the trained model is easy to generate an overfitting phenomenon. The overfitting is embodied in: the model has smaller loss function on the training data and higher prediction accuracy; but the loss function is larger on the test data, and the prediction accuracy is lower. The discarding method (dropout) can effectively alleviate the occurrence of overfitting and achieve the regularization effect to a certain extent. The discarding method is to stop the activation value of a certain neuron at a certain probability during forward propagation, so that the model generalization is stronger.

Disclosure of Invention

At least one embodiment of the present disclosure provides a data processing method, including: loading input data into a first register and loading mask data into a second register, wherein the size of the input data is N x M, the size of the mask data is P x Q, each element of the mask data is Z bits, Q x Z =2*M, each bit of each element of the mask data corresponds to one element of the input data, storage units of the second register are arranged into i rows and j columns, i = j = Q/2=Z, i bits stored in the same column of the second register sequentially correspond to continuous i elements located in the same row in the input data, and N, M, P, Q, i, j and Z are positive integers; based on the correspondence of the respective elements of the input data and the respective bits of the respective elements of the mask data, product calculation is performed to obtain output data.

For example, in the data processing method provided by an embodiment of the present disclosure, the 1 st column to the j/2 th column of the second register are the first group of storage units, and the (j/2+1) th column to the j th column of the second register are the second group of storage units; performing a product calculation based on correspondence of respective elements of the input data and respective bits of respective elements of the mask data to obtain output data, including: selecting a bit stored in an s-th column of memory cells in the first group of memory cells and a bit stored in an s-th column of memory cells in the second group of memory cells with a selection instruction, the bit stored in the s-th column of memory cells in the first group of memory cells and the bit stored in the s-th column of memory cells in the second group of memory cells corresponding to elements located in consecutive i columns in every two consecutive rows in the input data; based on the corresponding relation, performing product calculation on the bits in the selected storage units and the corresponding elements of the input data to obtain corresponding elements which are positioned in every two continuous rows and are positioned in i continuous columns in the output data; shifting the first group of storage units and the second group of storage units by 1 bit to the left or 1 bit to the right to obtain a shifted first group of storage units and a shifted second group of storage units, selecting a bit stored in an s-th column storage unit in the shifted first group of storage units and a bit stored in an s-th column storage unit in the shifted second group of storage units by using a selection instruction, performing product calculation on the bit in the selected storage unit and the corresponding element of the input data based on the corresponding relation, and continuing shift operation and product calculation until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.

For example, in the data processing method provided in an embodiment of the present disclosure, N =512, m =1024, p =256, i = j = q/2= z =32.

For example, in the data processing method provided by an embodiment of the present disclosure, s =16, the bit stored in the 16 th column of the first group of memory cells corresponds to the element located in the X row, 1 st column to 32 nd column in the input data, and the bit stored in the 16 th column of the second group of memory cells corresponds to the element located in the X +1 th row, 1 st column to 32 nd column in the input data; corresponding elements positioned in every two continuous rows and i continuous columns in the obtained output data comprise elements positioned in the 1 st column to the 32 nd column of the X-th row and the X +1 th row in the output data; the bit stored in the 16 th column of memory cells in the shifted first group of memory cells corresponds to the element in the input data located in the 33 rd column to the 64 th column of the X-th row, the bit stored in the 16 th column of memory cells in the shifted second group of memory cells corresponds to the element in the input data located in the 33 rd column to the 64 th column of the X +1 th row, X is a positive integer and X is an odd number.

For example, in the data processing method provided in an embodiment of the present disclosure, the first group of memory cells includes 512 bits, the second register includes 2 × 512 bits, and the 2 × 512 bits correspond to elements in the 1 st column to the 512 th column in every two consecutive rows of the input data.

For example, in the data processing method provided in an embodiment of the present disclosure, the elements from the 513 th column to the 1024 th column in every two consecutive rows in the input data are in one-to-one correspondence with 2 × 512 bits stored in the third register, the third register stores mask data that is different from but the same in size as the mask data stored in the second register, and the mask data in the third register is arranged in the same manner as the mask data in the second register.

For example, in a data processing method provided by an embodiment of the present disclosure, performing product calculation on bits in selected storage units and corresponding elements of input data includes: taking the corresponding element of the input data as the corresponding element of the output data when the value of the bit in the selected memory cell is 1; if the value of the bit in the selected memory cell is 0, 0 is set as the corresponding element in the output data.

For example, in a data processing method provided in an embodiment of the present disclosure, performing product calculation based on correspondence between each element of input data and each bit of each element of mask data to obtain output data, further includes: the result of the multiplication is divided by (1-drop _ prob), which represents the probability of 0 per bit of each element of the mask data, to obtain output data.

For example, a data processing method provided in an embodiment of the present disclosure further includes: and storing the output data into the memory according to the corresponding position of the input data.

For example, in the data processing method provided by an embodiment of the present disclosure, the format type of the selection instruction is the same as the format type of the input data.

For example, in the data processing method provided in an embodiment of the present disclosure, the format type of the selection instruction is BF16, and the format type of the input data is BF16.

For example, in a data processing method provided by an embodiment of the present disclosure, the data processing method is used for calculation of a discardable method layer of a neural network, an input portion of the discardable method layer includes input data and mask data, and in a case where a bit of an element of the mask data has a value of 1, the corresponding element of the input data serves as an output of the discardable method layer; in the case where the value of the bit masking an element of data is 0, the corresponding element of input data is discarded.

For example, a data processing method provided in an embodiment of the present disclosure further includes: before loading the input data into the first register and the mask data into the second register, an alignment operation is performed on the input data and the mask data such that every 2*N element in the input data corresponds to every 1*Z element in the mask data.

At least one embodiment of the present disclosure also provides a data processing apparatus, including: the data loading unit is configured to load input data into a first register and load mask data into a second register, the size of the input data is N M, the size of the mask data is P Q, each element of the mask data is Z bits, Q Z =2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i = j = Q/2=Z, i bits stored in the same column of the second register sequentially correspond to i continuous elements located in the same row in the input data, and N, M, P, Q, i, j and Z are positive integers; and the calculation unit is configured to perform product calculation to obtain output data based on the corresponding relation between each element of the input data and each bit of each element of the mask data.

At least one embodiment of the present disclosure also provides a data processing apparatus, including: a processor; and a memory storing computer-executable instructions that, when executed by the processor, implement a data processing method provided by at least one embodiment of the present disclosure.

At least one embodiment of the present disclosure also provides an electronic device including the data processing device provided by at least one embodiment of the present disclosure.

At least one embodiment of the present disclosure also provides a computer-readable storage medium for non-transitory storage of computer-executable instructions that, when executed by a processor, implement a data processing method provided by at least one embodiment of the present disclosure.

Drawings

To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.

FIG. 1 is a diagram illustrating the structure of a register;

fig. 2 illustrates a schematic flow chart of a data processing method provided by at least one embodiment of the present disclosure;

FIG. 3 shows a schematic flow chart of one example of step S202 in FIG. 2;

fig. 4 shows a schematic diagram of one example of step S301 in fig. 3;

fig. 5 shows a schematic diagram of one example of step S303 in fig. 3;

fig. 6 shows a schematic block diagram of a data processing apparatus provided in at least one embodiment of the present disclosure;

FIG. 7 shows a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;

FIG. 8 shows a schematic view of an electronic device according to an embodiment of the disclosure;

fig. 9 is a schematic diagram of a storage medium according to some embodiments of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.

Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item preceding the word comprises the element or item listed after the word and its equivalent, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.

The input part of the drop method layer (dropout layer) includes input data and mask data corresponding to the input data. Each bit of each element of the masking data corresponds to an element of the input data. If the value of the bit of the element of the mask data is 1, the corresponding element of the input data needs to be preserved, and if the value of the bit of the element of the mask data is 0, the corresponding element of the input data is changed to 0. Assume that the input data is of size [512,1024], the data type is BF16, the mask data is of size [512,32], and the data type is FP32. The 32 bits of each FP32 data in the masked data correspond to 32 elements of the same row of the input data. Each row in the cloaking data has 32 elements, i.e., 32 × 32=1024 bits, which 1024 bits correspond to a complete row in the input data, i.e., 1024 elements of the input data. The mask data is read from memory into registers, the schematic structure of which is shown in fig. 1. Each register includes 32 channels (Lane 0-Lane 31), each channel containing a 32-Bit number (Bit 0-Bit 31), each Bit corresponding to an element of the input data. One register can just put an element that masks the complete row of data, i.e. 32 FP32 elements.

Typically, a 32-bit number for each lane in the register would correspond to 32 elements across the input data. Since the format of the input data is BF16, two lines of BF16 data, i.e., 2 × 32, are included after the register is read. In order to correspond to bits of elements of the mask data one by one, a register for storing input data is firstly split into two groups of independent 1 × 32 FP32 data, then only 32 bits of one channel in the register for storing the mask data can be read and stored in a scalar register, and then selective multiplication operation is carried out on the split input data by using the mask data. If the bit of the element of the masked data is 1, the corresponding element of the input data is retained, if the bit of the element of the masked data is 0, the corresponding element of the input data is changed to 0, and then two sets of calculation results of the FP32 type of 1 × 32 are merged into one set of output data of the BF16 type of 2 × 32. The calculation process is complex in instruction and extremely low in efficiency.

The data processing method provided by the embodiment of the disclosure changes the corresponding mode of the mask data and the input data, effectively reduces the number of instructions, greatly improves the operation efficiency, and reduces the operation time.

At least one embodiment of the present disclosure also provides a data processing apparatus, an electronic apparatus, and a computer-readable storage medium corresponding to the above-described data processing method.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.

Fig. 2 shows a schematic flow chart of a data processing method provided by at least one embodiment of the present disclosure.

As shown in fig. 2, the data processing method includes steps S201 to S202 as follows.

Step S201: input data is loaded into a first register and cloaking data is loaded into a second register.

For example, the size of the input data is N × M, the size of the mask data is P × Q, each element of the mask data is Z bits, Q × Z =2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i = j = Q/2=Z, i bits stored in the same column of the second register sequentially correspond to consecutive i elements located in the same row in the input data, and N, M, P, Q, i, j, and Z are positive integers.

For example, N =512, m =1024, p =256, i = j = q/2= z =32, that is, the size of the input data is 512 × 1024, the size of the mask data is 256 × 64, each element of the mask data is 32 bits, the number of columns (64) of the mask data multiplied by the number of bits (32) of each element of the mask data is equal to 2 times the number of columns (2048) of the input data, the number of rows (32) of the storage units of the second register is equal to the number of columns (32) of the storage units of the second register, and is equal to the number of columns of the mask data divided by 2 (32), and the number of bits (32) of each element of the mask data. The above specific numerical values are only an example, and the disclosure does not limit the specific numerical values of N, M, P, Q, i, j, Z, and only needs to satisfy the above relationships.

Step S202: based on the correspondence of the respective elements of the input data and the respective bits of the respective elements of the mask data, product calculation is performed to obtain output data.

For example, before step S201, the data processing method provided by the embodiment of the present disclosure may further include: the input data and the masked data are aligned such that each 2*N element in the input data corresponds to each 1*Z element in the masked data.

For example, N =512, z =32, and the number of bits per element of the mask data is 32, then every 2 × 512 elements (2 rows and 512 columns, i.e., 1024 elements) in the input data correspond to every 1 × 32 elements (1 row and 32 columns, i.e., 32 elements, and the number of bits per 32 elements is 32 × 32= 1024) in the mask data.

The method adjusts the corresponding relation between each element of the input data and each bit of each element of the masking data, namely, j bits of each channel in the second register correspond to continuous j elements of the input data in the transverse direction, but i bits stored in each column of storage units in the second register correspond to continuous i elements of the input data in the transverse direction, so that the subsequent calculation of each bit of the masking data and each element of the input data is facilitated, for example, a plurality of bits of the masking data are conveniently read in batch by combining with a selection instruction, and the format adjustment and conversion of the read bits are avoided, so that the instruction number is effectively reduced, and the operation efficiency is improved.

For example, in some embodiments of the present disclosure, the second register is divided into two groups of memory cells, the 1 st column to the j/2 th column of the second register are the first group of memory cells, and the (j/2+1) th column to the j th column of the second register are the second group of memory cells. Therefore, the selection and calculation of each bit can be realized by combining the selection instruction, so that the efficiency is improved, and the realization is convenient.

Fig. 3 shows a schematic flow chart of one example of step S202 in fig. 2.

As shown in fig. 3, for the second register being divided into two groups of memory cells, one example of step S202 may include steps S301 to S303 as follows.

Step S301: and selecting a bit stored in an s-th column storage unit in the first group of storage units and a bit stored in an s-th column storage unit in the second group of storage units by using a selection instruction, wherein the bit stored in the s-th column storage unit in the first group of storage units and the bit stored in the s-th column storage unit in the second group of storage units correspond to elements located in continuous i columns in every two continuous rows in the input data.

For example, in some embodiments of the present disclosure, the format type of the selection instruction is the same as the format type of the input data.

For example, in some embodiments of the present disclosure, the format type of the select instruction is BF16, and the format type of the input data is BF16.

Because the format type of the selection instruction is the same as that of the input data, the selection operation can be directly carried out on the input data, and the method is more concise and efficient.

Fig. 4 shows a schematic diagram of one example of step S301 in fig. 3.

As shown in fig. 4, the memory cells of the second register are arranged in 32 rows and 32 columns (the second register has 32 channels (Lane 0 to Lane 31) and 32 bits (bit 0 to bit 32) per channel), the 1 st column to 16 th column of the second register are the first group of memory cells, and the 17 th column to 32 th column of the second register are the second group of memory cells. The highest bit of the first group of memory cells and the second group of memory cells, i.e. the bit stored in the 16 th column of memory cells in the first group of memory cells and the bit stored in the 16 th column of memory cells in the second group of memory cells, is selected by a select instruction. For example, the bit stored in the 16 th column memory cell in the first group of memory cells and the bit stored in the 16 th column memory cell in the second group of memory cells correspond to elements located in the 1 st column to the 32 nd column of the first row and the second row in the input data.

In addition, the present disclosure does not limit which column of memory cells of the first group and the second group of memory cells stores a bit, as long as the selection is performed in a column unit.

Step S302: and performing product calculation on the bits in the selected storage units and the corresponding elements of the input data based on the corresponding relation to obtain corresponding elements which are positioned in every two continuous rows and i continuous columns in the output data.

For example, in some embodiments of the present disclosure, step S302 may include: if the value of the bit in the selected memory cell is 1, the corresponding element of the input data is set as the corresponding element of the output data, and if the value of the bit in the selected memory cell is 0, 0 is set as the corresponding element of the output data.

Step S303: shifting the first group of storage units and the second group of storage units by 1 bit to the left or 1 bit to the right to obtain a shifted first group of storage units and a shifted second group of storage units, selecting a bit stored in an s-th column storage unit in the shifted first group of storage units and a bit stored in an s-th column storage unit in the shifted second group of storage units by using a selection instruction, performing product calculation on the bit in the selected storage unit and the corresponding element of the input data based on the corresponding relation, and continuing shift operation and product calculation until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.

Fig. 5 shows a schematic diagram of one example of step S303 in fig. 3.

As shown in fig. 5, the size of the second register is the same as that of the second register shown in fig. 4. The selection instruction only selects the bit stored in the 16 th column (bit 15) storage unit in the first group of storage units and the bit stored in the 16 th column (bit 31) storage unit in the second group of storage units each time, the step S302 is executed after the selection operation is finished, the first group of storage units and the second group of storage units are shifted by 1 bit to the left after the step S302 is executed, so that the bit30 is shifted to the position of the bit31, the bit14 is shifted to the position of the bit15, the rest bits are shifted by one bit to the left in turn, and the product calculation of the step 302 is executed until the selection and the product calculation of all columns in the first group of storage units and the second group of storage units are finished.

It should be noted that the first group of storage units and the second group of storage units may also be shifted to the right by 1 bit, which may be adjusted according to the corresponding relationship between the element of the input data and the bit of the element of the mask data, and the disclosure is not limited thereto. In other embodiments, if the bits in the column direction are selected in other ways instead of using the selection instruction, the shift operation may be omitted as long as the bits in each column can be selected sequentially, and the embodiment of the disclosure is not limited thereto.

Returning to fig. 3, for example, in some embodiments of the present disclosure, step S202 may further include step S304: the result of the multiplication is divided by (1-drop _ prob), which represents the probability of 0 per bit of each element of the mask data, to obtain output data.

For example, each bit of each element of the masking data has a probability of 0, which is denoted herein as drop _ prob, and the probability of 1 for each bit of each element of the masking data is (1-drop _ prob). The result of the calculation is typically scaled by 1/(1-drop _ prob) or by (1-drop _ prob).

For example, the data processing method provided by the embodiment of the present disclosure may further include: and storing the output data into a memory according to the corresponding position of the input data.

Since the size and data type of the output data are completely identical to those of the input data, after the output data are obtained, the output data are stored in the memory according to the same coordinates based on the coordinates of the input data currently processed.

The data processing method provided by the present disclosure is explained below by a specific embodiment.

For example, in one embodiment of the present disclosure, N =512, m =1024, p =256, i = j = q/2= z =32. That is, the size of the input data is 512 × 1024, the size of the mask data is 256 × 64, each element of the mask data is 32 bits, and the memory cells of the second register are arranged in 32 rows and 32 columns. The format type of the input data is BF16, the format type of the masking data is FP32, and the format type of the selection instruction is BF16.

For example, columns 1 through 16 of the second register are a first group of memory cells, columns 17 through 32 of the second register are a second group of memory cells

First, input data is loaded into a first register and mask data is loaded into a second register.

Then, the bit stored in the 16 th column memory cell in the first group of memory cells and the bit stored in the 16 th column memory cell in the second group of memory cells are selected by the selection instruction, that is, the 16 th column and the 32 nd column in the 32 th column memory cell of the second register are selected by the selection instruction.

Then, based on the corresponding relationship, the product calculation is performed on the bits in the selected storage units and the corresponding elements of the input data, so as to obtain the corresponding elements of the output data, which are located in 32 continuous columns in every two continuous rows. The corresponding elements in the output data located in the consecutive i columns in every two consecutive rows include the elements in the first i columns in the X-th row and the X + 1-th row in the output data.

For example, the bit stored in the 16 th column of memory cells in the first group of memory cells corresponds to the element of the input data located in the X row, the 1 st column to the 32 nd column, the bit stored in the 16 th column of memory cells in the second group of memory cells corresponds to the element of the input data located in the X +1 th row, the 1 st column to the 32 nd column, X is a positive integer and X is an odd number. For example, X =1, the bit stored in the 16 th column of memory cells in the first group of memory cells corresponds to the element in the input data located in the 1 st row, 1 st column to 32 nd column, and the bit stored in the 16 th column of memory cells in the second group of memory cells corresponds to the element in the input data located in the 2 nd row, 1 st column to 32 nd column. And performing product calculation on the bits in the selected storage units and the corresponding elements of the input data to obtain corresponding elements positioned in the 1 st column to the 32 nd column in the first row and the second row in the output data.

Then, the first group of memory cells and the second group of memory cells are shifted to the left by 1 bit to obtain a shifted first group of memory cells and a shifted second group of memory cells, and a selection instruction is used to select a bit stored in the 16 th column of memory cells in the shifted first group of memory cells and a bit stored in the 16 th column of memory cells in the shifted second group of memory cells, which corresponds to the selection of the 15 th column and the 31 th column of the 32 th column of memory cells of the second register before shifting. Then, the bits in the selected memory cells and the corresponding elements of the input data are multiplied based on the correspondence. In a similar manner, the shifting, selecting, and product calculations are performed alternately until the selecting and product calculations for all columns in the first and second sets of memory cells are completed.

For example, the bit stored in the 16 th column of memory cells in the shifted first group of memory cells corresponds to the element in the X-th row, 33 rd column to 64 th column in the input data, the bit stored in the 16 th column of memory cells in the shifted second group of memory cells corresponds to the element in the X +1 th row, 33 rd column to 64 th column in the input data, X is a positive integer and X is an odd number. For example, X =1, the bit stored in the 16 th column memory cell in the shifted first group of memory cells corresponds to the element located in the 33 rd column to 64 th column of the 1 st row in the input data, and the bit stored in the 16 th column memory cell in the shifted second group of memory cells corresponds to the element located in the 33 rd column to 64 th column of the 2 nd row in the input data. The first group of memory cells comprises 512 bits, the second register comprises 2 x 512 bits, and the 2 x 512 bits correspond to elements from the 1 st column to the 512 th column in two consecutive rows of the input data. The elements of the 513 th column to the 1024 th column in every two consecutive rows in the input data are corresponded thereto by another register. For example, the 513 th column to the 1024 th column in every two consecutive rows of the input data are in one-to-one correspondence with 2 × 512 bits stored in the third register, the third register stores mask data different from but the same in size as the mask data stored in the second register, and the mask data in the third register is arranged in the same manner as the mask data in the second register.

For example, in the initial state, the bit stored in the 16 th column of memory cells in the first group of memory cells corresponds to the element located in the 1 st row, the 1 st column to the 32 nd column of the input data, and the bit stored in the 16 th column of memory cells in the second group of memory cells corresponds to the element located in the 2 nd row, the 1 st column to the 32 nd column of the input data; shifting the first group of memory cells and the second group of memory cells by 1 bit to the left, wherein at the moment, the bit stored in the 16 th column of memory cells in the first group of memory cells corresponds to the element positioned in the 33 rd column to the 64 th column of the 1 st row in the input data, and the bit stored in the 16 th column of memory cells in the second group of memory cells corresponds to the element positioned in the 33 rd column to the 64 th column of the 2 nd row in the input data; shifting the first group of storage units and the second group of storage units by 1 bit to the left, wherein at the moment, the bit stored in the 16 th column of storage units in the first group of storage units corresponds to the elements positioned in the 1 st row, the 65 th column to the 96 th column in the input data, and the bit stored in the 16 th column of storage units in the second group of storage units corresponds to the elements positioned in the 2 nd row, the 65 th column to the 96 th column in the input data; and so on, until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed, 512 bits in the first group of storage units correspond to the elements of the 1 st row, the 1 st column and the 512 th column in the input data, and 512 bits in the second group of storage units correspond to the elements of the 2 nd row, the 1 st column and the 512 th column in the input data.

Then, the result of the product calculation is divided by (1-drop _ prob) to obtain output data.

And finally, storing the output data into the memory according to the corresponding position of the input data.

The data processing method provided by the embodiment of the disclosure effectively reduces the number of assembly instructions, and greatly improves the operation efficiency. Assuming that about 38 assembly instructions are required for input data of four registers, namely input data of a BF16 type of 8 × 32, according to a general data processing method, while the data processing method provided according to the embodiment of the present disclosure only requires 6 assembly instructions to complete, the optimization effect is significant.

For example, in some embodiments of the present disclosure, a data processing method is used for the calculation of the discardable layer of the neural network.

For a standard neural network, the training process of the neural network is as follows: the input data is first propagated forward through the neural network and the loss results are then propagated backward to decide how to update the parameters of the neural network for the neural network to learn. After using the discard method layer, the training process becomes: firstly, randomly deleting half of hidden neurons in a neural network, and keeping an input neuron and an output neuron unchanged; then, the input data is transmitted forward through the modified neural network, the obtained loss result is transmitted backward through the modified neural network, and after a small batch of training samples finish the process, corresponding parameters are updated on the undeleted neurons according to a random gradient descent method; the process then continues to repeat.

The input part of the discarding method layer comprises input data and covering data, and under the condition that the numerical value of a bit of an element of the covering data is 1, the corresponding element of the input data is used as the output of the discarding method layer; in the case where the value of the bit masking an element of data is 0, the corresponding element of input data is discarded (e.g., a deleted neuron).

The above description has described that the result of performing the product calculation is divided by (1-drop _ prob) to obtain the output data, because the discard method layer needs to be scaled. Some neurons can be randomly discarded during training, but are not randomly discarded during prediction, and if some neurons are discarded, the problem of unstable results is caused, so that the model prediction is inaccurate. One approach is to multiply the weight of each neuron by a probability such that the prediction data is approximately the same as the training data. For example, if the output of a neuron is x, then it has a probability of drop _ prob being discarded at the time of training, and a probability of (1-drop _ prob) is involved in training, so that the output of this neuron is divided by (1-drop _ prob) at the time of prediction.

Fig. 6 illustrates a schematic block diagram of a data processing apparatus 600 provided in at least one embodiment of the present disclosure, which may be used to execute the data processing method illustrated in fig. 2.

As shown in fig. 6, the data processing apparatus 600 includes a data loading unit 601 and a calculation unit 602.

The data loading unit 601 is configured to load input data into a first register and load mask data into a second register, the input data having a size of N × M, the mask data having a size of P × Q, each element of the mask data being Z bits, Q × Z =2*M, each bit of each element of the mask data corresponding to one element of the input data, the storage units of the second register being arranged in i rows and j columns, i = j = Q/2=Z, i bits stored in the same column of the second register sequentially corresponding to i consecutive elements located in the same row in the input data, N, M, P, Q, i, j, Z being positive integers.

The calculation unit 602 is configured to perform a product calculation to obtain output data based on correspondence of respective elements of the input data and respective bits of respective elements of the mask data.

The data loading unit 601 may implement step S201 in the data processing method shown in fig. 2, and the calculating unit 602 may implement step S202 in the data processing method shown in fig. 2, and for related description, reference may be made to the above contents, which is not described herein again. The data processing apparatus 600 has the same technical effect as the data processing method shown in fig. 2, and is not described herein again.

For example, the data processing apparatus may be implemented in hardware, software, firmware, or any feasible combination thereof, and the present disclosure is not limited thereto.

For example, the data loading unit 601 and the computing unit 602 may be hardware, software, firmware, or any feasible combination thereof. For example, the data loading unit 601 and the calculating unit 602 may be a dedicated or general-purpose circuit, a chip, a device, or the like, or may be a combination of a processor and a memory. The embodiment of the present disclosure is not limited in this regard to specific implementation forms of the data loading unit 601 and the calculating unit 602.

It should be noted that, in the embodiment of the present disclosure, each unit of the data processing apparatus 600 corresponds to each step of the foregoing data processing method, and for the specific function of the data processing apparatus 600, reference may be made to the related description of the data processing method above, which is not described herein again. The components and configurations of data processing apparatus 600 shown in FIG. 6 are for illustrative purposes only and are not intended to be limiting, as data processing apparatus 600 may include other components and configurations as desired.

At least one embodiment of the present disclosure also provides a data processing apparatus, including: a memory for non-transitory storage of computer-executable instructions; and a processor for executing computer-executable instructions, wherein the computer-executable instructions, when executed by the processor, perform a data processing method provided by at least one embodiment of the present disclosure.

Fig. 7 shows a schematic diagram of a data processing apparatus 700 according to an embodiment of the present disclosure. As shown in fig. 7, a data processing apparatus 700 according to an embodiment of the present disclosure may include a processing apparatus 701 and a memory 702, which may be interconnected by a bus 703.

The processing device 701 may perform various actions and processes according to programs or code stored in the memory 702. In particular, the processing device 701 may be an integrated circuit chip having signal processing capabilities. For example, the processing means may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. The various methods, steps, flows, and logic blocks disclosed in the embodiments of the disclosure may be implemented or performed. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, either of the X86 architecture or the ARM architecture or the like.

The memory 702 stores computer-executable instructions, which when executed by the processing device 701 implement a data processing method provided by at least one embodiment of the present disclosure. The memory 702 may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), enhanced Synchronous Dynamic Random Access Memory (ESDRAM), synchronous Link Dynamic Random Access Memory (SLDRAM), and direct main bus random access memory (DRRAM). It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.

At least one embodiment of the present disclosure also provides an electronic device including the data processing device provided by at least one embodiment of the present disclosure. In one embodiment, the electronic device is, for example, a central processor, such as a single-core or multi-core processor. In one embodiment, the electronic device is a computer system, the computer system including one or more processors,

fig. 8 shows a schematic diagram of an electronic device 800 according to an embodiment of the disclosure. As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure may include a data processing device 600.

At least one embodiment of the present disclosure provides a computer-readable storage medium for non-transitory storage of computer-executable instructions that, when executed by a processor, implement a data processing method provided by at least one embodiment of the present disclosure.

Fig. 9 is a schematic diagram of a storage medium according to some embodiments of the present disclosure. As shown in fig. 9, storage medium 900 is used to store computer-executable instructions 910. For example, the computer-executable instructions 910, when executed by a computer, may perform one or more steps in accordance with the data processing methods described above.

Similarly, computer-readable storage media in embodiments of the disclosure may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.

Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method according to the embodiment of the disclosure.

The technical effects of the data processing apparatus, the electronic apparatus and the storage medium are the same as those of the data processing method shown in fig. 2, and are not described herein again.

The following points need to be explained:

(1) The drawings of the embodiments of the disclosure only relate to the structures related to the embodiments of the disclosure, and other structures can refer to common designs.

(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.

The above description is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.

Claims

1. A method of data processing, comprising:

loading input data into a first register and loading mask data into a second register, wherein the size of the input data is N M, the size of the mask data is P Q, each element of the mask data is Z bits, Q Z = 2M, N =2P, each bit of each element of the mask data corresponds to one element of the input data, storage units of the second register are arranged in i rows and j columns, i = j = Q/2=Z, i bits stored in the same column of the second register sequentially correspond to continuous i elements located in the same row in the input data, and N, M, P, Q, i, j and Z are positive integers;

and performing product calculation based on the corresponding relation between each element of the input data and each bit of each element of the covering data to obtain output data.

2. The data processing method according to claim 1, wherein columns 1 to j/2 of the second register are a first group of memory cells, and columns (j/2+1) to j are a second group of memory cells;

performing a product calculation based on correspondence between each element of the input data and each bit of each element of the mask data to obtain the output data, including:

selecting a bit stored in an s-th column of memory cells in the first group of memory cells and a bit stored in an s-th column of memory cells in the second group of memory cells with a selection instruction, wherein the bit stored in the s-th column of memory cells in the first group of memory cells and the bit stored in the s-th column of memory cells in the second group of memory cells correspond to elements located in consecutive i columns in every two consecutive rows in the input data;

based on the corresponding relation, performing product calculation on the bits in the selected storage units and the corresponding elements of the input data to obtain corresponding elements which are positioned in continuous i columns in every two continuous rows in the output data;

shifting the first group of storage units and the second group of storage units by 1 bit to the left or 1 bit to the right to obtain a shifted first group of storage units and a shifted second group of storage units, selecting bits stored in an s-th column of storage units in the shifted first group of storage units and bits stored in an s-th column of storage units in the shifted second group of storage units by using the selection instruction, performing product calculation on the bits in the selected storage units and corresponding elements of input data based on the corresponding relation, and continuing to perform shift operation and product calculation until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.

3. The data processing method according to claim 2, wherein N =512, m =1024, p =256, i = j = q/2= z =32.

4. The data processing method according to claim 3, wherein s =16, the bit stored in the 16 th column of memory cells in the first group of memory cells corresponds to an element of the input data located in the X row, 1 st column to 32 nd column, and the bit stored in the 16 th column of memory cells in the second group of memory cells corresponds to an element of the input data located in the X +1 th row, 1 st column to 32 nd column;

corresponding elements which are positioned in every two continuous rows and are positioned in i continuous columns in the obtained output data comprise elements which are positioned in 1 st column to 32 nd column of the X-th row and the X +1 th row in the output data;

the bit stored in the 16 th column of the shifted first group of memory cells corresponds to the element in the input data located in the 33 th column to the 64 th column of the X-th row, the bit stored in the 16 th column of the shifted second group of memory cells corresponds to the element in the input data located in the 33 th column to the 64 th column of the X +1 th row, X is a positive integer and X is an odd number.

5. The data processing method of claim 4, wherein the first group of memory cells contains 512 bits, the second group of memory cells contains 512 bits, and the second register contains 2 x 512 bits, the 2 x 512 bits corresponding to the elements of columns 1 to 512 in every two consecutive rows of the input data.

6. The data processing method according to claim 5, wherein the 513 th column to 1024 th column in the every two consecutive rows in the input data are in one-to-one correspondence with 2 x 512 bits stored in a third register, the third register storing mask data different from but the same in size as the mask data stored in the second register, and the mask data in the third register is arranged in the same manner as the mask data in the second register.

7. The data processing method of claim 2, wherein multiplying the bits in the selected memory cell with the corresponding elements of the input data comprises:

taking an element of the corresponding input data as a corresponding element in the output data if the value of the bit in the selected memory cell is 1;

in the case where the value of the bit in the selected memory cell is 0, 0 is taken as the corresponding element in the output data.

8. The data processing method according to claim 7, wherein performing a product calculation based on correspondence of respective elements of the input data and respective bits of respective elements of the mask data to obtain the output data, further comprises:

dividing the result of the multiplication by (1-drop _ prob) to obtain the output data, wherein drop _ prob represents the probability that each bit of each element of the mask data is 0.

9. The data processing method of any of claims 2-8, further comprising:

and storing the output data into a memory according to the corresponding position of the input data.

10. The data processing method of claim 9, wherein a format type of the selection instruction and a format type of the input data are the same.

11. The data processing method according to claim 10, wherein the format type of the selection instruction is BF16, and the format type of the input data is BF16.

12. The data processing method of claim 11, wherein the data processing method is for computation of a discardability layer of a neural network, an input portion of the discardability layer comprising the input data and the masking data,

in the case that the value of the bit of the element of the mask data is 1, the corresponding element of the input data is used as the output of the discarding method layer;

in case the value of the bit of the element of the masked data is 0, the corresponding element of the input data is discarded.

13. The data processing method of claim 1, further comprising:

prior to loading the input data into the first register and the mask data into the second register,

performing an alignment operation on the input data and the mask data such that each 2*N element in the input data corresponds to each 1*Z element in the mask data.

14. A data processing apparatus comprising:

a data loading unit configured to load input data into a first register and load mask data into a second register, wherein the input data has a size of N × M, the mask data has a size of P × Q, each element of the mask data is Z bits, Q × Z =2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i = j = Q/2=Z, i bits stored in the same column of the second register sequentially correspond to i consecutive elements of the input data in the same row, and N, M, P, Q, i, j, Z are positive integers;

a calculation unit configured to perform a product calculation to obtain output data based on a correspondence relationship between each element of the input data and each bit of each element of the mask data.

15. A data processing apparatus comprising:

a processor; and

a memory storing computer-executable instructions that,

wherein the computer-executable instructions, when executed by the processor, implement a data processing method according to any one of claims 1-13.

16. An electronic device comprising the data processing device of claim 15.

17. A computer-readable storage medium for non-transitory storage of computer-executable instructions,

wherein the computer executable instructions, when executed by a processor, implement a data processing method according to any one of claims 1-13.