
CN115186815B - Data processing method and device, electronic device and medium - Google Patents


Info

Publication number
CN115186815B
CN115186815B (application CN202210916696.7A)
Authority
CN
China
Prior art keywords
data
input data
column
register
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210916696.7A
Other languages
Chinese (zh)
Other versions
CN115186815A (en)
Inventor
Name withheld at inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Shanghai Biren Intelligent Technology Co Ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Shanghai Biren Intelligent Technology Co Ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202210916696.7A priority Critical patent/CN115186815B/en
Publication of CN115186815A publication Critical patent/CN115186815A/en
Application granted granted Critical
Publication of CN115186815B publication Critical patent/CN115186815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent


Abstract


A data processing method, a data processing apparatus, an electronic apparatus and a computer-readable storage medium. The data processing method comprises: loading input data into a first register and loading mask data into a second register, where the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers; and performing a product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain output data. The method can effectively reduce the number of assembly instructions, greatly improve operation efficiency, and reduce running time.

Description

Data processing method and device, electronic device and medium
Technical Field
Embodiments of the present disclosure relate to a data processing method, a data processing apparatus, an electronic apparatus, and a computer-readable storage medium.
Background
In a machine learning model, if the model has too many parameters and too few training samples, the trained model easily overfits. Overfitting specifically manifests as a small loss function and high prediction accuracy on the training data, but a large loss function and low prediction accuracy on the test data. The dropout method can effectively mitigate overfitting and achieves a regularization effect to a certain extent. Dropout stops the activation value of a given neuron with a certain probability during forward propagation, which makes the model generalize better.
Disclosure of Invention
At least one embodiment of the present disclosure provides a data processing method, including: loading input data into a first register and loading mask data into a second register, where the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers; and performing a product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain output data.
For example, in the data processing method provided in an embodiment of the present disclosure, the 1st to j/2-th columns of the second register form a first group of storage units, and the (j/2+1)-th to j-th columns form a second group of storage units. Performing the product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain the output data includes: selecting, by a selection instruction, the bits stored in the s-th column of the first group of storage units and the bits stored in the s-th column of the second group of storage units, where these bits correspond to the elements of i consecutive columns in every two consecutive rows of the input data; performing the product calculation on the bits in the selected storage units and the corresponding elements of the input data based on the correspondence, so as to obtain the elements of the output data located in i consecutive columns of every two consecutive rows; shifting the first group of storage units and the second group of storage units left by 1 bit or right by 1 bit to obtain a shifted first group and a shifted second group; selecting, by the selection instruction, the bits stored in the s-th column of the shifted first group and the bits stored in the s-th column of the shifted second group, and performing the product calculation on the selected bits and the corresponding elements of the input data based on the correspondence; and continuing the shifting operation and the product calculation until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.
For example, in the data processing method provided in an embodiment of the present disclosure, N=512, M=1024, P=256, i=j=Q/2=Z=32.
For example, in the data processing method provided in an embodiment of the present disclosure, s=16; the bits stored in the 16th column of the first group of storage units correspond to the elements in the 1st to 32nd columns of the X-th row of the input data; the bits stored in the 16th column of the second group of storage units correspond to the elements in the 1st to 32nd columns of the (X+1)-th row of the input data; the corresponding elements of the output data located in i consecutive columns of every two consecutive rows include the elements in the 1st to 32nd columns of the X-th and (X+1)-th rows of the output data; the bits stored in the 16th column of the shifted first group of storage units correspond to the elements in the 33rd to 64th columns of the X-th row of the input data; the bits stored in the 16th column of the shifted second group of storage units correspond to the elements in the 33rd to 64th columns of the (X+1)-th row of the input data; and X is a positive integer.
For example, in the data processing method provided in an embodiment of the present disclosure, the first group of storage units includes 512 bits, the second register includes 2×512 bits, and the 2×512 bits correspond to the elements in the 1st to 512th columns of every two consecutive rows of the input data.
For example, in the data processing method provided in an embodiment of the present disclosure, the elements in the 513th to 1024th columns of every two consecutive rows of the input data correspond one-to-one to 2×512 bits stored in a third register; the third register stores mask data that is different from, but the same size as, the mask data stored in the second register, and the mask data is arranged in the third register in the same manner as in the second register.
For example, in the data processing method provided in an embodiment of the present disclosure, performing the product calculation on the bits in the selected storage units and the corresponding elements of the input data includes: when the value of the bit in the selected storage unit is 1, taking the corresponding element of the input data as the corresponding element of the output data; and when the value of the bit in the selected storage unit is 0, taking 0 as the corresponding element of the output data.
For example, in the data processing method provided in an embodiment of the present disclosure, performing the product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain the output data further includes dividing the result of the product calculation by (1-drop_prob) to obtain the output data, where drop_prob represents the probability that each bit of each element of the mask data is 0.
For example, the data processing method provided in an embodiment of the present disclosure further includes storing output data in a memory according to a corresponding location of input data.
For example, in the data processing method provided in an embodiment of the present disclosure, the format type of the selection instruction is the same as the format type of the input data.
For example, in the data processing method provided in an embodiment of the present disclosure, the format type of the selection instruction is BF16, and the format type of the input data is BF16.
For example, in the data processing method provided in an embodiment of the present disclosure, the data processing method is used for the calculation of a dropout layer of a neural network; the input of the dropout layer includes the input data and the mask data; when the value of a bit of an element of the mask data is 1, the corresponding element of the input data is output by the dropout layer, and when the value of a bit of an element of the mask data is 0, the corresponding element of the input data is discarded.
For example, the data processing method provided in an embodiment of the present disclosure further includes, before loading the input data into the first register and loading the mask data into the second register, performing an alignment operation on the input data and the mask data such that every 2*N elements in the input data correspond to every 1*Z elements in the mask data.
At least one embodiment of the present disclosure also provides a data processing apparatus, including: a data loading unit configured to load input data into a first register and load mask data into a second register, where the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers; and a calculating unit configured to perform a product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain output data.
At least one embodiment of the present disclosure also provides a data processing apparatus including a processor, and a memory storing computer-executable instructions that, when executed by the processor, implement the data processing method provided by at least one embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides an electronic device, including a data processing device provided by at least one embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides a computer-readable storage medium for non-transitory storage of computer-executable instructions that, when executed by a processor, implement the data processing method provided by at least one embodiment of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1 shows a schematic diagram of a register;
FIG. 2 shows a schematic flow chart of a data processing method provided by at least one embodiment of the present disclosure;
FIG. 3 shows a schematic flow chart of one example of step S202 in FIG. 2;
FIG. 4 shows a schematic diagram of one example of step S301 in FIG. 3;
FIG. 5 shows a schematic diagram of one example of step S303 in FIG. 3;
FIG. 6 shows a schematic block diagram of a data processing apparatus provided by at least one embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a storage medium according to some embodiments of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
The input of the dropout layer includes input data and the mask data corresponding to it. Each bit of each element of the mask data corresponds to one element of the input data. If the value of a bit of an element of the mask data is 1, the corresponding element of the input data must be preserved; if the value is 0, the corresponding element of the input data is changed to 0. Suppose the size of the input data is 512×1024 with data type BF16, and the size of the mask data is 512×32 with data type FP32. The 32 bits of each FP32 element of the mask data correspond to 32 elements in the same row of the input data. Each row of the mask data thus holds 32 elements, i.e., 32×32 = 1024 bits, and these 1024 bits correspond to one complete row of 1024 elements of the input data. The mask data is read from memory into registers; the structure of a register is shown schematically in FIG. 1. Each register comprises 32 lanes (Lane0 to Lane31), each lane holds a 32-bit number (Bit0 to Bit31), and each bit corresponds to one element of the input data. One register can therefore hold exactly one complete row of the mask data, i.e., 32 FP32 elements.
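As a hedged illustration of the correspondence just described (not the patented implementation itself), the following Python sketch expands one 32-bit mask element into its individual bits and checks that one mask row of 32 such words covers a full 1024-element input row; all names are illustrative:

```python
# Illustrative sketch only: expand a 32-bit mask element into its bits and
# confirm that one mask row (32 words) covers one 1024-element input row.

def mask_bits(mask_word: int) -> list[int]:
    """Expand a 32-bit mask element into its 32 individual bits (bit 0 first)."""
    return [(mask_word >> b) & 1 for b in range(32)]

# One row of the mask holds 32 FP32 words -> 32 * 32 = 1024 bits,
# exactly one complete row of 1024 input elements.
row_of_mask_words = [0xFFFFFFFF] * 32
total_bits = sum(len(mask_bits(w)) for w in row_of_mask_words)
print(total_bits)  # 1024
```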
Typically, the 32-bit number in each lane of the register corresponds to 32 horizontally consecutive elements of the input data. Since the input data is in BF16 format, one register read yields two rows of BF16 data, i.e., a 2×32 block. To match the bits of the mask elements one to one, the register storing the input data must first be split into two independent 1×32 groups of FP32 data; then only the 32 bits of one lane of the register storing the mask data can be read and stored into a scalar register at a time, after which a selective product operation is performed on the split input data using the mask data. If a bit of a mask element is 1, the corresponding element of the input data is preserved; if it is 0, the corresponding element of the input data is changed to 0. Finally, the two 1×32 FP32 calculation results are merged back into one 2×32 block of BF16 output data. This procedure requires many instructions and is very inefficient.
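The baseline flow above can be sketched in Python as follows. This is a simplified, assumed model of the described steps (split into rows, read one mask word per row, select bit by bit, merge), not actual register code:

```python
# Simplified model of the baseline: a 2 x K block of input is split into two
# 1 x K rows, each masked bit by bit against its own 32-bit mask word, then
# the two results are merged back into one 2 x K block.

def baseline_dropout(two_rows, mask_word_row0, mask_word_row1):
    out = []
    for row, word in ((two_rows[0], mask_word_row0), (two_rows[1], mask_word_row1)):
        # read the mask word bit by bit; bit c governs column c of this row
        out.append([x if (word >> c) & 1 else 0.0 for c, x in enumerate(row)])
    return out

result = baseline_dropout([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], 0b0101, 0b1111)
# result == [[1.0, 0.0, 3.0, 0.0], [5.0, 6.0, 7.0, 8.0]]
```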
At least one embodiment of the present disclosure provides a data processing method, including: loading input data into a first register and loading mask data into a second register, where the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers; and performing a product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain output data.
The data processing method provided by the embodiments of the present disclosure changes the correspondence between the mask data and the input data, which effectively reduces the number of instructions, greatly improves operation efficiency, and reduces running time.
At least one embodiment of the present disclosure also provides a data processing apparatus, an electronic apparatus, and a computer-readable storage medium corresponding to the above-described data processing method.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments.
Fig. 2 shows a schematic flow chart of a data processing method provided by at least one embodiment of the present disclosure.
As shown in FIG. 2, the data processing method includes the following steps S201 to S202.
Step S201, loading the input data into the first register and loading the mask data into the second register.
For example, the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers.
For example, N=512, M=1024, P=256, i=j=Q/2=Z=32; that is, the size of the input data is 512×1024, the size of the mask data is 256×64, and each element of the mask data is 32 bits. The number of columns of the mask data (64) multiplied by the number of bits of each mask element (32) equals twice the number of columns of the input data (2×1024 = 2048). The number of rows of storage units of the second register (32) equals the number of columns of storage units of the second register (32), which equals the number of columns of the mask data divided by 2 (64/2 = 32) and equals the number of bits of each mask element (32). These specific values are merely examples; the present disclosure does not limit the specific values of N, M, P, Q, i, j, Z, which only need to satisfy the above relationships.
Step S202, product calculation is carried out based on the corresponding relation between each element of the input data and each bit of each element of the mask data so as to obtain output data.
For example, prior to step S201, the data processing method provided by the embodiments of the present disclosure may further include performing an alignment operation on the input data and the mask data, such that every 2*N elements in the input data correspond to every 1*Z elements in the mask data.
For example, n=512, z=32, and the number of bits of each element of the mask data is 32, then every 2×512 elements (2 rows and 512 columns, i.e., 1024 elements) in the input data corresponds to every 1×32 elements (1 row and 32 columns, i.e., 32 elements, and the number of bits of 32 elements is 32×32=1024) in the mask data.
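The bookkeeping in this example can be checked with a few lines of Python (values taken from the example above; this is only an arithmetic sanity check, not implementation code):

```python
# Arithmetic check of the alignment: every 2 x 512 block of input elements
# pairs with 1 x 32 mask elements of 32 bits each.
Z = 32
input_elements_per_block = 2 * 512       # 2 rows x 512 columns
mask_bits_per_block = 32 * Z             # 32 mask elements x 32 bits each
assert input_elements_per_block == mask_bits_per_block == 1024
```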
In the embodiments of the present disclosure, the correspondence between the elements of the input data and the bits of the mask elements is adjusted: instead of mapping the j bits of each lane of the second register to j horizontally consecutive elements of the input data, the i bits stored in each column of storage units of the second register are mapped to i horizontally consecutive elements of the input data. This facilitates the subsequent calculation between the bits of the mask data and the elements of the input data; for example, multiple bits of the mask data can be read in one batch in combination with a selection instruction, format adjustment and conversion of the read bits are avoided, the number of instructions is effectively reduced, and operation efficiency is improved.
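The adjusted layout can be pictured with a small Python sketch: modeling each lane as an integer word, one "column" of the register is the set of bits at the same position across all lanes. This is an assumed model for illustration only, with hypothetical names:

```python
# Illustrative model: column `col` of the register is bit position `col` taken
# from every lane word; under the adjusted layout these i bits correspond to
# i consecutive elements of one input row.

def column_bits(lanes: list[int], col: int) -> list[int]:
    """Collect bit `col` from each lane word, lane 0 first."""
    return [(w >> col) & 1 for w in lanes]

print(column_bits([0b1, 0b0, 0b11], 0))  # [1, 0, 1]
```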
For example, in some embodiments of the present disclosure, the second register is divided into two sets of memory cells, with columns 1 through j/2 of the second register being the first set of memory cells and columns (j/2+1) through j of the second register being the second set of memory cells. Therefore, the selection and calculation of each bit can be realized by combining the selection instruction, so that the efficiency is improved, and the realization is convenient.
FIG. 3 shows a schematic flow chart of one example of step S202 in FIG. 2.
As shown in fig. 3, one example of step S202 may include the following steps S301 to S303 for the second register to be divided into two sets of memory cells.
Step S301, selecting bits stored in the S-th column storage unit in the first group of storage units and bits stored in the S-th column storage unit in the second group of storage units by using a selection instruction, wherein the bits stored in the S-th column storage unit in the first group of storage units and the bits stored in the S-th column storage unit in the second group of storage units correspond to elements of continuous i columns in every two continuous rows in input data.
For example, in some embodiments of the present disclosure, the format type of the select instruction is the same as the format type of the input data.
For example, in some embodiments of the present disclosure, the format type of the select instruction is BF16, and the format type of the input data is BF16.
Because the format type of the selection instruction is the same as the format type of the input data, the input data can be directly selected, and the method is more concise and efficient.
FIG. 4 shows a schematic diagram of one example of step S301 in FIG. 3.
As shown in FIG. 4, the storage units of the second register are arranged in 32 rows and 32 columns (the second register has 32 lanes (Lane0 to Lane31), and each lane has 32 bits (bit0 to bit31)); the 1st to 16th columns of the second register form the first group of storage units, and the 17th to 32nd columns form the second group of storage units. The highest bit of each group, i.e., the bit stored in the 16th column of the first group and the bit stored in the 16th column of the second group, is selected by means of a selection instruction. For example, the bits stored in the 16th column of the first group and the bits stored in the 16th column of the second group correspond to the elements in the 1st to 32nd columns of the first and second rows of the input data.
The present disclosure does not limit which column of the first and second groups of storage units is selected, as long as the selection is made column by column.
Step S302, based on the corresponding relation, the product calculation is carried out on the bits in the selected storage unit and the corresponding elements of the input data, so as to obtain the corresponding elements of the output data, which are positioned in every two continuous rows and in every two continuous columns.
For example, in some embodiments of the present disclosure, step S302 may include taking the element of the corresponding input data as the corresponding element in the output data if the value of the bit in the selected memory cell is 1, and taking 0 as the corresponding element in the output data if the value of the bit in the selected memory cell is 0.
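A minimal Python sketch of the select-and-multiply semantics of step S302 (assumed behavior, illustrative names): a mask bit of 1 keeps the corresponding input element, and a mask bit of 0 zeroes it.

```python
def masked_product(bits: list[int], elems: list[float]) -> list[float]:
    """Apply one column of mask bits to its corresponding input elements."""
    # bit == 1 keeps the input element; bit == 0 drops it to zero
    return [x if b == 1 else 0.0 for b, x in zip(bits, elems)]

print(masked_product([1, 0, 1], [1.0, 2.0, 3.0]))  # [1.0, 0.0, 3.0]
```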
Step S303: shift the first group of storage units and the second group of storage units left by 1 bit or right by 1 bit to obtain a shifted first group and a shifted second group; select, by the selection instruction, the bits stored in the s-th column of the shifted first group and the bits stored in the s-th column of the shifted second group; perform the product calculation on the bits in the selected storage units and the corresponding elements of the input data based on the correspondence; and continue the shifting operation and product calculation until the selection and product calculation of all columns in the first and second groups of storage units are completed.
FIG. 5 shows a schematic diagram of one example of step S303 in FIG. 3.
As shown in FIG. 5, the second register is the same size as that shown in FIG. 4. The selection instruction selects only the bit stored in the 16th column (bit15) of the first group of storage units and the bit stored in the 16th column (bit31) of the second group at a time. Step S302 is executed after the selection; the first and second groups are then shifted left by 1 bit, so that bit30 moves to the bit31 position, bit14 moves to the bit15 position, and the remaining bits shift left by one position in turn. The selection and the product calculation of step S302 are then repeated until all columns of the first and second groups of storage units have been selected and computed.
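The select-and-shift loop can be modeled in Python as below. Each group is modeled as a 16-bit word per lane; the selection reads the top bit of each group, and a left shift by 1 brings the next column into the top position. This is a behavioral sketch under stated assumptions, not the actual selection instruction:

```python
# Behavioral model of the select-and-shift loop: repeatedly select the top
# column of each group, then shift every lane left by 1 so the next column
# becomes the top bit, until all columns have been visited.

def select_columns_by_shifting(group1, group2, ncols=16):
    """Yield, for each of the ncols columns, the top-bit selection of both groups."""
    g1, g2 = list(group1), list(group2)
    mask, top = (1 << ncols) - 1, ncols - 1
    for _ in range(ncols):
        yield ([(w >> top) & 1 for w in g1], [(w >> top) & 1 for w in g2])
        g1 = [(w << 1) & mask for w in g1]   # next column moves to the top
        g2 = [(w << 1) & mask for w in g2]

cols = list(select_columns_by_shifting([0b1010], [0b0110], ncols=4))
# top bits of group 1 over the 4 steps: [1, 0, 1, 0]
```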
It should be noted that the first and second groups of storage units may instead be shifted right by 1 bit; this can be adjusted according to the correspondence between the elements of the input data and the bits of the mask elements, and is not limited by the present disclosure. In other embodiments, if the bits in the column direction are selected by some means other than a selection instruction, the shifting operation may be omitted, as long as the bits of each column can be selected in turn; this is likewise not limited by the embodiments of the present disclosure.
Returning to FIG. 3, for example, in some embodiments of the present disclosure, step S202 may further include step S304 of dividing the result of the product calculation by (1-drop_prob) to obtain the output data, where drop_prob represents the probability that each bit of each element of the mask data is 0.
For example, if each bit of each element of the mask data is 0 with probability drop_prob, then each bit of each element of the mask data is 1 with probability (1-drop_prob). The result of the product calculation is therefore typically scaled, i.e., multiplied by 1/(1-drop_prob) or, equivalently, divided by (1-drop_prob).
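The scaling of step S304 can be sketched as follows (a minimal illustration with made-up data; the variable names are assumptions, not part of the patented method):

```python
import numpy as np

drop_prob = 0.25
rng = np.random.default_rng(1)

x = rng.standard_normal((2, 8)).astype(np.float32)        # input elements
mask = rng.integers(0, 2, size=x.shape).astype(np.float32)  # unpacked mask bits

product = x * mask                       # result of the product calculation
output = product / (1 - drop_prob)       # step S304: rescale the kept elements
```

Elements zeroed by the mask stay zero; only the retained elements are enlarged by the factor 1/(1-drop_prob).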
For example, the data processing method provided by the embodiment of the disclosure may further include storing the output data in the memory according to the corresponding position of the input data.
Since the size and the data type of the output data are exactly the same as those of the input data, after the output data is obtained, it is stored in the memory at the same coordinates as the input data currently being processed.
The data processing method provided by the present disclosure is described below by way of a specific embodiment.
For example, in one embodiment of the present disclosure, N=512, M=1024, P=256, i=j=Q/2=Z=32. That is, the size of the input data is 512×1024, the size of the mask data is 256×64, each element of the mask data is 32 bits, and the storage units of the second register are arranged in 32 rows and 32 columns. The format type of the input data is BF16, the format type of the mask data is FP32, and the format type of the selection instruction is BF16.
For example, columns 1 to 16 of the second register are the first group of storage units, and columns 17 to 32 of the second register are the second group of storage units.
First, input data is loaded into a first register and mask data is loaded into a second register.
Then, the bits stored in the 16th column of storage units in the first group and the bits stored in the 16th column of storage units in the second group are selected by the selection instruction; that is, the 16th column and the 32nd column of the 32 columns of storage units of the second register are selected by the selection instruction.
Then, based on the correspondence, product calculation is performed on the bits in the selected storage units and the corresponding elements of the input data, so as to obtain the corresponding elements of the output data located in 32 consecutive columns of every two consecutive rows. The corresponding elements of the output data in the consecutive i columns of every two consecutive rows include the elements of the output data in the first i columns of the X-th row and the (X+1)-th row.
For example, the bits stored in the 16th column of storage units in the first group correspond to the elements in columns 1 to 32 of the X-th row of the input data, and the bits stored in the 16th column of storage units in the second group correspond to the elements in columns 1 to 32 of the (X+1)-th row of the input data, where X is a positive odd integer. For example, for X=1, the bits stored in the 16th column of the first group correspond to the elements in row 1, columns 1 to 32 of the input data, and the bits stored in the 16th column of the second group correspond to the elements in row 2, columns 1 to 32 of the input data. Product calculation is performed on the bits in the selected storage units and the corresponding elements of the input data, so as to obtain the corresponding elements in columns 1 to 32 of the first and second rows of the output data.
Then, the first group of storage units and the second group of storage units are shifted left by 1 bit to obtain the shifted first group and the shifted second group, and the bits stored in the 16th column of the shifted first group and the bits stored in the 16th column of the shifted second group are selected by the selection instruction; these correspond to the 15th column and the 31st column of the 32 columns of storage units of the second register before shifting. Then, product calculation is performed on the bits in the selected storage units and the corresponding elements of the input data based on the correspondence. In this manner, shifting, selection, and product calculation are performed alternately until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.
For example, the bits stored in the 16th column of the shifted first group correspond to the elements in columns 33 to 64 of the X-th row of the input data, and the bits stored in the 16th column of the shifted second group correspond to the elements in columns 33 to 64 of the (X+1)-th row, where X is a positive odd integer. For example, for X=1, the bits stored in the 16th column of the shifted first group correspond to the elements in row 1, columns 33 to 64 of the input data, and the bits stored in the 16th column of the shifted second group correspond to the elements in row 2, columns 33 to 64. The first group of storage units comprises 512 bits, and the second register comprises 2×512 bits, which correspond to the elements in columns 1 to 512 of every two consecutive rows of the input data. The elements in columns 513 to 1024 of every two consecutive rows of the input data correspond to another register. For example, the elements in columns 513 to 1024 of every two consecutive rows of the input data correspond one-to-one to the 2×512 bits stored in a third register; the third register stores mask data different from, but of the same size as, the mask data stored in the second register, and the mask data is arranged in the third register in the same manner as in the second register.
For example, in the initial state, the bits stored in the 16th column of the first group of storage units correspond to the elements in row 1, columns 1 to 32 of the input data, and the bits stored in the 16th column of the second group correspond to the elements in row 2, columns 1 to 32. After one shift, the selected bits (the 15th column before shifting) of the first group correspond to the elements in row 1, columns 33 to 64, and those of the second group correspond to the elements in row 2, columns 33 to 64; after the next shift, the 14th column before shifting corresponds to columns 65 to 96, and so on, until the 1st column of the first group corresponds to the elements in row 1, columns 481 to 512, and the 1st column of the second group corresponds to the elements in row 2, columns 481 to 512. Selection and product calculation are performed after each shift until the selection and product calculation of all columns in the first group and the second group of storage units are completed, yielding the corresponding elements in columns 1 to 512 of the first two rows of the output data.
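The correspondence enumerated above can be written as a small index function (1-based coordinates, following the text; the function name and its argument convention are illustrative assumptions):

```python
def input_coords(group, k, r, i=32, half=16):
    """Input-data coordinates (row, col), 1-based, of bit r (counted top to
    bottom) in the column selected at the k-th selection step (k = 0 before
    any shift) from the given group (1 or 2), for the row pair X = 1, X+1 = 2."""
    assert group in (1, 2) and 0 <= k < half and 1 <= r <= i
    row = 1 if group == 1 else 2     # first group -> row X, second -> row X+1
    col = k * i + r                  # the k-th block of i consecutive columns
    return row, col
```

With i=32 and half=16 this reproduces the enumeration: step 0 covers columns 1 to 32, step 1 covers columns 33 to 64, and the final step covers columns 481 to 512.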
Then, the result of the product calculation is divided by (1-drop_prob) to obtain output data.
And finally, storing the output data into a memory according to the corresponding position of the input data.
The data processing method provided by the embodiments of the present disclosure effectively reduces the number of assembly instructions and greatly improves operation efficiency. For example, for input data occupying four registers (i.e., 8×32 input data of BF16 type), a general data processing method requires about 38 assembly instructions, whereas the data processing method provided by the embodiments of the present disclosure can be completed with only 6 assembly instructions, a remarkable optimization effect.
For example, in some embodiments of the present disclosure, the data processing method is used for the computation of the discarding method (dropout) layer of a neural network.
For a standard neural network, the training process is to first propagate the input data forward through the neural network and then propagate the loss result backward to determine how to update the parameters of the neural network so that the neural network learns. After the discarding method layer is used, the training flow becomes: first, half of the hidden neurons in the neural network are randomly deleted while the input and output neurons are kept unchanged; then the input data is propagated forward through the modified neural network, and the obtained loss result is propagated backward through the modified neural network; after a small batch of training samples has been processed, the corresponding parameters are updated on the neurons that were not deleted according to the stochastic gradient descent method; and this process is then repeated continuously.
The input part of the discarding method layer includes the input data and the mask data: in the case that the value of a bit of an element of the mask data is 1, the corresponding element of the input data is taken as an output of the discarding method layer, and in the case that the value of a bit of an element of the mask data is 0, the corresponding element of the input data is discarded (e.g., a deleted neuron).
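These keep/discard semantics can be sketched as a minimal Python function (the function name and the list-based representation are assumptions for illustration, not the patented register-level implementation):

```python
def dropout_layer_forward(x_row, mask_bits):
    """Forward pass of the discarding-method layer on one row of input:
    an element is kept where its mask bit is 1 and replaced by 0
    (a "deleted neuron") where its bit is 0."""
    return [xi if b == 1 else 0.0 for xi, b in zip(x_row, mask_bits)]
```

For example, the row [1.0, 2.0, 3.0] with mask bits [1, 0, 1] yields [1.0, 0.0, 3.0].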
As described above, the result of the product calculation is divided by (1-drop_prob) to obtain the output data, because the discarding method layer needs to be scaled. Some neurons are randomly discarded during training, but neurons cannot be randomly discarded during prediction; if some neurons were discarded, the results would be unstable and model prediction inaccurate. One solution is to scale the output of each neuron by a probability so that the prediction-time and training-time outputs are approximately the same. For example, if the output of a neuron is x, it is discarded with probability drop_prob during training and participates in training with probability (1-drop_prob); the output of this neuron is therefore divided by (1-drop_prob) so that its expected value during training matches its value during prediction.
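As a quick numerical check of this scaling argument (a Monte-Carlo sketch for illustration only; it is not part of the patented method, and all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
drop_prob = 0.3
x = 1.7                                  # a fixed neuron output
n = 200_000                              # number of simulated mask draws

bits = rng.random(n) >= drop_prob        # bit is 1 with probability (1 - drop_prob)
scaled = x * bits / (1 - drop_prob)      # inverted-dropout training outputs

empirical_mean = scaled.mean()           # approaches the unscaled output x
```

The empirical mean of the scaled, randomly masked outputs converges to x, which is exactly why dividing by (1-drop_prob) keeps training and prediction consistent.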
Fig. 6 illustrates a schematic block diagram of a data processing apparatus 600 that may be used to perform the data processing method illustrated in fig. 2, provided in accordance with at least one embodiment of the present disclosure.
As shown in fig. 6, the data processing apparatus 600 includes a data loading unit 601 and a computing unit 602.
The data loading unit 601 is configured to load input data into a first register and mask data into a second register, where the size of the input data is N×M, the size of the mask data is P×Q, each element of the mask data is Z bits, Q×Z=2×M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are positive integers.
The calculation unit 602 is configured to perform product calculation based on the correspondence relationship of each element of the input data and each bit of each element of the mask data to obtain output data.
The data loading unit 601 may implement step S201 of the data processing method shown in fig. 2, and the calculating unit 602 may implement step S202 of the data processing method shown in fig. 2; for the relevant description, reference may be made to the foregoing, and details are not repeated here. The technical effects of the data processing apparatus 600 are the same as those of the data processing method shown in fig. 2 and are likewise not repeated here.
For example, the data processing apparatus may be implemented in hardware, software, firmware, and any feasible combination thereof, which is not limited in this disclosure.
For example, the data loading unit 601 and the computing unit 602 may be hardware, software, firmware, and any feasible combination thereof. For example, the data loading unit 601 and the computing unit 602 may be dedicated or general purpose circuits, chips, devices, or the like, or may be a combination of a processor and a memory. With respect to specific implementations of the data loading unit 601 and the computing unit 602, embodiments of the present disclosure are not limited in this regard.
It should be noted that, in the embodiment of the present disclosure, each unit of the data processing apparatus 600 corresponds to each step of the foregoing data processing method, and the specific function of the data processing apparatus 600 may refer to the related description of the data processing method, which is not repeated herein. The components and structures of data processing apparatus 600 shown in fig. 6 are exemplary only and not limiting, and data processing apparatus 600 may include other components and structures as desired.
At least one embodiment of the present disclosure also provides a data processing apparatus including a memory for non-transitory storage of computer-executable instructions and a processor for executing the computer-executable instructions, wherein the computer-executable instructions, when executed by the processor, perform the data processing method provided by at least one embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a data processing apparatus 700 according to an embodiment of the disclosure. As shown in fig. 7, a data processing apparatus 700 according to an embodiment of the present disclosure may include a processing apparatus 701 and a memory 702, which may be interconnected by a bus 703.
The processing device 701 may perform various actions and processes in accordance with programs or code stored in the memory 702. Specifically, the processing device 701 may be an integrated circuit chip with signal processing capabilities. For example, the processing device may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, procedures, and logic blocks disclosed in the embodiments of the present disclosure. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of an X86 architecture, an ARM architecture, or the like.
The memory 702 stores computer-executable instructions that, when executed by the processing device 701, implement a data processing method provided by at least one embodiment of the present disclosure. The memory 702 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
At least one embodiment of the present disclosure also provides an electronic device, including a data processing device provided by at least one embodiment of the present disclosure. In one embodiment, the electronic device is, for example, a central processor, such as a single-core or multi-core processor. In one embodiment, the electronic device is a computer system, the computer system including one or more processors.
Fig. 8 shows a schematic diagram of an electronic device 800 according to an embodiment of the disclosure. As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure may include a data processing device 600.
At least one embodiment of the present disclosure provides a computer-readable storage medium for non-transitory storage of computer-executable instructions that, when executed by a processor, implement a data processing method provided by at least one embodiment of the present disclosure.
Fig. 9 is a schematic diagram of a storage medium according to some embodiments of the present disclosure. As shown in fig. 9, a storage medium 900 is used to store computer executable instructions 910. For example, computer-executable instructions 910, when executed by a computer, may perform one or more steps in accordance with the data processing methods described above.
Similarly, the computer readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs a data processing method according to an embodiment of the present disclosure.
The technical effects of the data processing apparatus, the electronic apparatus, and the storage medium are the same as those of the data processing method shown in fig. 2, and will not be described herein.
The following points need to be described:
(1) The drawings of the embodiments of the present disclosure relate only to the structures to which the embodiments of the present disclosure relate, and reference may be made to the general design for other structures.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely specific embodiments of the disclosure, but the scope of the disclosure is not limited thereto, and the scope of the disclosure should be determined by the claims.

Claims (17)

1. A data processing method, comprising:
Loading input data into a first register and loading mask data into a second register, wherein the size of the input data is N×M, the size of the mask data is P×Q, each element of the mask data is Z bits, Q×Z=2×M, N=2P, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are positive integers;
And carrying out product calculation based on the corresponding relation between each element of the input data and each bit of each element of the mask data to obtain output data.
2. The data processing method according to claim 1, wherein columns 1 to j/2 of the second register are a first group of memory cells, and columns (j/2+1) to j of the second register are a second group of memory cells;
Performing product calculation to obtain the output data based on the correspondence between each element of the input data and each bit of each element of the mask data, including:
Selecting bits stored in an s-th column storage unit in the first group of storage units and bits stored in an s-th column storage unit in the second group of storage units by using a selection instruction, wherein the bits stored in the s-th column storage unit in the first group of storage units and the bits stored in the s-th column storage unit in the second group of storage units correspond to elements of the input data located in consecutive i columns in every two consecutive rows;
based on the corresponding relation, carrying out product calculation on the bits in the selected storage units and the corresponding elements of the input data to obtain corresponding elements of continuous i columns in every two continuous rows in the output data;
Shifting the first group of storage units and the second group of storage units left by 1 bit or right by 1 bit to obtain a shifted first group of storage units and a shifted second group of storage units, selecting bits stored in an s-th column of storage units in the shifted first group of storage units and bits stored in an s-th column of storage units in the shifted second group of storage units by using the selection instruction, calculating products of the bits in the selected storage units and elements of corresponding input data based on the corresponding relation, and continuing to perform shifting operation and product calculation until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.
3. The data processing method according to claim 2, wherein N=512, M=1024, P=256, i=j=Q/2=Z=32.
4. A data processing method according to claim 3, wherein s=16, bits stored in a 16th column of storage units in the first group of storage units correspond to elements in columns 1 to 32 of the X-th row of the input data, and bits stored in a 16th column of storage units in the second group of storage units correspond to elements in columns 1 to 32 of the (X+1)-th row of the input data;
The elements of the output data corresponding to the consecutive i columns in every two consecutive rows comprise the elements of the output data in columns 1 to 32 of the X-th row and the (X+1)-th row;
The bits stored in the 16th column of storage units in the shifted first group of storage units correspond to elements in columns 33 to 64 of the X-th row of the input data, the bits stored in the 16th column of storage units in the shifted second group of storage units correspond to elements in columns 33 to 64 of the (X+1)-th row of the input data, X is a positive integer and X is an odd number.
5. The data processing method according to claim 4, wherein the first group of memory cells includes 512 bits, the second register includes 2 x 512 bits, and the 2 x 512 bits correspond to elements from 1 st column to 512 th column in every two consecutive rows of the input data.
6. The data processing method according to claim 5, wherein the elements from 513 th to 1024 th columns in the every two consecutive rows in the input data are one-to-one corresponding to 2 x 512 bits stored in a third register, the third register storing mask data different from but the same size as mask data stored in the second register, the mask data in the third register being arranged in the same manner as mask data in the second register.
7. The data processing method of claim 2, wherein multiplying the bits in the selected memory cells with the corresponding elements of the input data comprises:
taking the element of the corresponding input data as the corresponding element in the output data under the condition that the value of the bit in the selected storage unit is 1;
in the case where the value of the bit in the selected memory cell is 0, 0 is taken as the corresponding element in the output data.
8. The data processing method according to claim 7, wherein performing product calculation to obtain the output data based on correspondence of respective elements of the input data and respective bits of respective elements of the mask data, further comprises:
dividing the result of the product calculation by (1-drop_prob) to obtain the output data, wherein drop_prob represents a probability that each bit of each element of the mask data is 0.
9. The data processing method according to any one of claims 2 to 8, further comprising:
and storing the output data into a memory according to the corresponding position of the input data.
10. The data processing method according to claim 9, wherein a format type of the selection instruction and a format type of the input data are the same.
11. The data processing method according to claim 10, wherein the format type of the selection instruction is BF16, and the format type of the input data is BF16.
12. The data processing method according to claim 11, wherein the data processing method is used for calculation of a discarding method layer of a neural network, an input section of the discarding method layer including the input data and the mask data,
In the case that the bit value of the element of the mask data is 1, the corresponding element of the input data is used as the output of the discarding method layer;
in case that the value of the bit of the element of the mask data is 0, the corresponding element of the input data is discarded.
13. The data processing method of claim 1, further comprising:
before loading the input data into the first register and the mask data into the second register,
And performing an alignment operation on the input data and the mask data, so that every 2*N elements in the input data correspond to every 1*Z elements in the mask data.
14. A data processing apparatus comprising:
A data loading unit configured to load input data into a first register and mask data into a second register, wherein the size of the input data is N×M, the size of the mask data is P×Q, each element of the mask data is Z bits, Q×Z=2×M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are positive integers;
And a calculating unit configured to perform product calculation based on the correspondence between each element of the input data and each bit of each element of the mask data to obtain output data.
15. A data processing apparatus comprising:
Processor, and
A memory storing computer-executable instructions,
Wherein the computer executable instructions, when executed by the processor, implement the data processing method according to any of claims 1-13.
16. An electronic device comprising the data processing device of claim 15.
17. A computer-readable storage medium for non-transitory storage of computer-executable instructions,
Wherein the computer executable instructions, when executed by a processor, implement the data processing method according to any of claims 1-13.
CN202210916696.7A 2022-08-01 2022-08-01 Data processing method and device, electronic device and medium Active CN115186815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210916696.7A CN115186815B (en) 2022-08-01 2022-08-01 Data processing method and device, electronic device and medium

Publications (2)

Publication Number Publication Date
CN115186815A CN115186815A (en) 2022-10-14
CN115186815B true CN115186815B (en) 2025-07-11

Family

ID=83520666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210916696.7A Active CN115186815B (en) 2022-08-01 2022-08-01 Data processing method and device, electronic device and medium

Country Status (1)

Country Link
CN (1) CN115186815B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428879A (en) * 2020-03-04 2020-07-17 深圳芯英科技有限公司 Data processing method, device, chip and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040122887A1 (en) * 2002-12-20 2004-06-24 Macy William W. Efficient multiplication of small matrices using SIMD registers
EP1821414B1 (en) * 2004-12-07 2016-06-22 Nippon Telegraph And Telephone Corporation Information compression-coding device, method thereof, program thereof and recording medium storing the program
CN111402860B (en) * 2020-03-16 2021-11-02 恒睿(重庆)人工智能技术研究院有限公司 Parameter management method, system, medium and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428879A (en) * 2020-03-04 2020-07-17 深圳芯英科技有限公司 Data processing method, device, chip and computer readable storage medium

Also Published As

Publication number Publication date
CN115186815A (en) 2022-10-14

Similar Documents

Publication Publication Date Title
US11574031B2 (en) Method and electronic device for convolution calculation in neural network
US10592241B2 (en) Apparatus and methods for matrix multiplication
US12174908B2 (en) Method, electronic device and storage medium for convolution calculation in neural network
US20240265234A1 (en) Digital Processing Circuits and Methods of Matrix Operations in an Artificially Intelligent Environment
CN107301456B (en) Implementation method of multi-core acceleration of deep neural network based on vector processor
CN110415157B (en) Matrix multiplication calculation method and device
KR20230081697A (en) Method and apparatus for accelerating dilatational convolution calculation
CN109313663B (en) Artificial intelligence calculation auxiliary processing device, method, storage medium and terminal
CN110929854B (en) A data processing method, device and hardware accelerator
CN116152520B (en) Data processing method for neural network accelerator, chip and electronic equipment
CN117795473A (en) Partially managed and reconfigurable systolic streaming architecture for in-memory computing
CN118152713B (en) Data processing method, device, electronic equipment and computer readable storage medium
CN113743600A (en) Design method of systolic array with integrated storage and computing architecture suitable for multi-precision neural network
US20210319291A1 (en) Neural network computation apparatus having systolic array
CN115186815B (en) Data processing method and device, electronic device and medium
CN119719595B (en) Data processing method, electronic device, medium, and computer program product
KR102913162B1 (en) Mixed-precision neural processing unit(npu) using spatial fusion with load balancing
CN115480919A (en) Convolution optimization operation method and device, computer equipment and storage medium
CN116721006A (en) Feature map processing method and device
KR20200072308A (en) Method and apparatus for performing convolution operations in neural networks
US20220207332A1 (en) Scalable neural network accelerator architecture
CN115456170A (en) Neural network parallel training method with low communication overhead
US20220164127A1 (en) Memory for an Artificial Neural Network Accelerator
CN113627587A (en) Multichannel convolutional neural network acceleration method and device
JP2825133B2 (en) Parallel data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Applicant after: Shanghai Bi Ren Technology Co.,Ltd.

Address before: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant