
CN115186815B - Data processing method and device, electronic device and medium - Google Patents


Info

Publication number
CN115186815B
CN115186815B (application CN202210916696.7A)
Authority
CN
China
Prior art keywords
data
input data
column
register
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210916696.7A
Other languages
Chinese (zh)
Other versions
CN115186815A (en)
Inventor
Name withheld at inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Shanghai Biren Intelligent Technology Co Ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Shanghai Biren Intelligent Technology Co Ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202210916696.7A priority Critical patent/CN115186815B/en
Publication of CN115186815A publication Critical patent/CN115186815A/en
Application granted granted Critical
Publication of CN115186815B publication Critical patent/CN115186815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent


Abstract


A data processing method, a data processing apparatus, an electronic apparatus and a computer-readable storage medium. The data processing method comprises: loading input data into a first register and loading mask data into a second register, where the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers; and performing a product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain output data. The method can effectively reduce the number of assembly instructions, greatly improve operation efficiency, and reduce running time.

Description

Data processing method and device, electronic device and medium
Technical Field
Embodiments of the present disclosure relate to a data processing method, a data processing apparatus, an electronic apparatus, and a computer-readable storage medium.
Background
In a machine learning model, if the model has too many parameters and too few training samples, the trained model easily overfits. Overfitting specifically manifests as a small loss function and high prediction accuracy on the training data, but a large loss function and low prediction accuracy on the test data. The dropout method can effectively mitigate overfitting and achieves a regularization effect to a certain extent. Dropout stops the activation value of a given neuron with a certain probability during forward propagation, which makes the model generalize better.
Disclosure of Invention
At least one embodiment of the present disclosure provides a data processing method, including: loading input data into a first register and loading mask data into a second register, where the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers; and performing a product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain output data.
For example, in the data processing method provided in an embodiment of the present disclosure, the 1st to j/2-th columns of the second register form a first group of storage units, and the (j/2+1)-th to j-th columns form a second group of storage units. Performing the product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain the output data includes: selecting, by a selection instruction, the bits stored in the s-th column of the first group of storage units and the bits stored in the s-th column of the second group of storage units, where these bits correspond to the elements of i consecutive columns in every two consecutive rows of the input data; performing the product calculation on the bits in the selected storage units and the corresponding elements of the input data based on the correspondence, so as to obtain the elements of the output data located in i consecutive columns of every two consecutive rows; shifting the first group of storage units and the second group of storage units left by 1 bit or right by 1 bit to obtain a shifted first group and a shifted second group; selecting, by the selection instruction, the bits stored in the s-th column of the shifted first group and the bits stored in the s-th column of the shifted second group, and performing the product calculation on the selected bits and the corresponding elements of the input data based on the correspondence; and continuing the shifting operation and the product calculation until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.
For example, in the data processing method provided in an embodiment of the present disclosure, N=512, M=1024, P=256, i=j=Q/2=Z=32.
For example, in the data processing method provided in an embodiment of the present disclosure, s=16; the bits stored in the 16th column of the first group of storage units correspond to the elements in the 1st to 32nd columns of the X-th row of the input data; the bits stored in the 16th column of the second group of storage units correspond to the elements in the 1st to 32nd columns of the (X+1)-th row of the input data; the corresponding elements of the output data located in i consecutive columns of every two consecutive rows include the elements in the 1st to 32nd columns of the X-th and (X+1)-th rows of the output data; the bits stored in the 16th column of the shifted first group of storage units correspond to the elements in the 33rd to 64th columns of the X-th row of the input data; the bits stored in the 16th column of the shifted second group of storage units correspond to the elements in the 33rd to 64th columns of the (X+1)-th row of the input data; and X is a positive integer.
For example, in the data processing method provided in an embodiment of the present disclosure, the first group of storage units includes 512 bits, the second register includes 2×512 bits, and the 2×512 bits correspond to the elements in the 1st to 512th columns of every two consecutive rows of the input data.
For example, in the data processing method provided in an embodiment of the present disclosure, the elements in the 513th to 1024th columns of every two consecutive rows of the input data correspond one-to-one to 2×512 bits stored in a third register; the third register stores mask data that is different from, but the same size as, the mask data stored in the second register, and the mask data is arranged in the third register in the same manner as in the second register.
For example, in the data processing method provided in an embodiment of the present disclosure, performing the product calculation on the bits in the selected storage units and the corresponding elements of the input data includes: when the value of the bit in the selected storage unit is 1, taking the corresponding element of the input data as the corresponding element of the output data; and when the value of the bit in the selected storage unit is 0, taking 0 as the corresponding element of the output data.
For example, in the data processing method provided in an embodiment of the present disclosure, performing the product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain the output data further includes dividing the result of the product calculation by (1-drop_prob) to obtain the output data, where drop_prob represents the probability that each bit of each element of the mask data is 0.
For example, the data processing method provided in an embodiment of the present disclosure further includes storing output data in a memory according to a corresponding location of input data.
For example, in the data processing method provided in an embodiment of the present disclosure, the format type of the selection instruction is the same as the format type of the input data.
For example, in the data processing method provided in an embodiment of the present disclosure, the format type of the selection instruction is BF16, and the format type of the input data is BF16.
For example, in the data processing method provided in an embodiment of the present disclosure, the data processing method is used for the calculation of a dropout layer of a neural network; the input of the dropout layer includes the input data and the mask data; when the value of a bit of an element of the mask data is 1, the corresponding element of the input data is output by the dropout layer, and when the value of a bit of an element of the mask data is 0, the corresponding element of the input data is discarded.
For example, the data processing method provided in an embodiment of the present disclosure further includes, before loading the input data into the first register and loading the mask data into the second register, performing an alignment operation on the input data and the mask data such that every 2*N elements in the input data correspond to every 1*Z elements in the mask data.
At least one embodiment of the present disclosure also provides a data processing apparatus, including: a data loading unit configured to load input data into a first register and load mask data into a second register, where the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers; and a calculating unit configured to perform a product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain output data.
At least one embodiment of the present disclosure also provides a data processing apparatus including a processor, and a memory storing computer-executable instructions that, when executed by the processor, implement the data processing method provided by at least one embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides an electronic device, including a data processing device provided by at least one embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides a computer-readable storage medium for non-transitory storage of computer-executable instructions that, when executed by a processor, implement the data processing method provided by at least one embodiment of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1 shows a schematic diagram of a register;
FIG. 2 shows a schematic flow chart of a data processing method provided by at least one embodiment of the present disclosure;
FIG. 3 shows a schematic flow chart of one example of step S202 in FIG. 2;
FIG. 4 shows a schematic diagram of one example of step S301 in FIG. 3;
FIG. 5 shows a schematic diagram of one example of step S303 in FIG. 3;
FIG. 6 shows a schematic block diagram of a data processing apparatus provided by at least one embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a storage medium according to some embodiments of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
The input of the dropout layer includes input data and the mask data corresponding to it. Each bit of each element of the mask data corresponds to one element of the input data. If the value of a bit of an element of the mask data is 1, the corresponding element of the input data must be preserved; if the value is 0, the corresponding element of the input data is changed to 0. Suppose the size of the input data is 512×1024 with data type BF16, and the size of the mask data is 512×32 with data type FP32. The 32 bits of each FP32 element of the mask data correspond to 32 elements in the same row of the input data. Each row of the mask data thus holds 32 elements, i.e., 32×32 = 1024 bits, and these 1024 bits correspond to one complete row of 1024 elements of the input data. The mask data is read from memory into registers; the structure of a register is shown schematically in FIG. 1. Each register comprises 32 lanes (Lane0 to Lane31), each lane holds a 32-bit number (Bit0 to Bit31), and each bit corresponds to one element of the input data. One register can therefore hold exactly one complete row of the mask data, i.e., 32 FP32 elements.
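As a hedged illustration of the correspondence just described (not the patented implementation itself), the following Python sketch expands one 32-bit mask element into its individual bits and checks that one mask row of 32 such words covers a full 1024-element input row; all names are illustrative:

```python
# Illustrative sketch only: expand a 32-bit mask element into its bits and
# confirm that one mask row (32 words) covers one 1024-element input row.

def mask_bits(mask_word: int) -> list[int]:
    """Expand a 32-bit mask element into its 32 individual bits (bit 0 first)."""
    return [(mask_word >> b) & 1 for b in range(32)]

# One row of the mask holds 32 FP32 words -> 32 * 32 = 1024 bits,
# exactly one complete row of 1024 input elements.
row_of_mask_words = [0xFFFFFFFF] * 32
total_bits = sum(len(mask_bits(w)) for w in row_of_mask_words)
print(total_bits)  # 1024
```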
Typically, the 32-bit number in each lane of the register corresponds to 32 horizontally consecutive elements of the input data. Since the input data is in BF16 format, one register read yields two rows of BF16 data, i.e., a 2×32 block. To match the bits of the mask elements one to one, the register storing the input data must first be split into two independent 1×32 groups of FP32 data; then only the 32 bits of one lane of the register storing the mask data can be read and stored into a scalar register at a time, after which a selective product operation is performed on the split input data using the mask data. If a bit of a mask element is 1, the corresponding element of the input data is preserved; if it is 0, the corresponding element of the input data is changed to 0. Finally, the two 1×32 FP32 calculation results are merged back into one 2×32 block of BF16 output data. This procedure requires many instructions and is very inefficient.
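The baseline flow above can be sketched in Python as follows. This is a simplified, assumed model of the described steps (split into rows, read one mask word per row, select bit by bit, merge), not actual register code:

```python
# Simplified model of the baseline: a 2 x K block of input is split into two
# 1 x K rows, each masked bit by bit against its own 32-bit mask word, then
# the two results are merged back into one 2 x K block.

def baseline_dropout(two_rows, mask_word_row0, mask_word_row1):
    out = []
    for row, word in ((two_rows[0], mask_word_row0), (two_rows[1], mask_word_row1)):
        # read the mask word bit by bit; bit c governs column c of this row
        out.append([x if (word >> c) & 1 else 0.0 for c, x in enumerate(row)])
    return out

result = baseline_dropout([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], 0b0101, 0b1111)
# result == [[1.0, 0.0, 3.0, 0.0], [5.0, 6.0, 7.0, 8.0]]
```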
At least one embodiment of the present disclosure provides a data processing method, including: loading input data into a first register and loading mask data into a second register, where the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers; and performing a product calculation based on the correspondence between the elements of the input data and the bits of the elements of the mask data to obtain output data.
The data processing method provided by the embodiments of the present disclosure changes the correspondence between the mask data and the input data, which effectively reduces the number of instructions, greatly improves operation efficiency, and reduces running time.
At least one embodiment of the present disclosure also provides a data processing apparatus, an electronic apparatus, and a computer-readable storage medium corresponding to the above-described data processing method.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments.
Fig. 2 shows a schematic flow chart of a data processing method provided by at least one embodiment of the present disclosure.
As shown in FIG. 2, the data processing method includes the following steps S201 to S202.
Step S201, loading the input data into the first register and loading the mask data into the second register.
For example, the size of the input data is N*M, the size of the mask data is P*Q, each element of the mask data is Z bits, Q*Z = 2*M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are all positive integers.
For example, N=512, M=1024, P=256, i=j=Q/2=Z=32; that is, the size of the input data is 512×1024, the size of the mask data is 256×64, and each element of the mask data is 32 bits. The number of columns of the mask data (64) multiplied by the number of bits of each mask element (32) equals twice the number of columns of the input data (2×1024 = 2048). The number of rows of storage units of the second register (32) equals the number of columns of storage units of the second register (32), which equals the number of columns of the mask data divided by 2 (64/2 = 32) and equals the number of bits of each mask element (32). These specific values are merely examples; the present disclosure does not limit the specific values of N, M, P, Q, i, j, Z, which only need to satisfy the above relationships.
Step S202, product calculation is carried out based on the corresponding relation between each element of the input data and each bit of each element of the mask data so as to obtain output data.
For example, prior to step S201, the data processing method provided by the embodiments of the present disclosure may further include performing an alignment operation on the input data and the mask data, such that every 2*N elements in the input data correspond to every 1*Z elements in the mask data.
For example, n=512, z=32, and the number of bits of each element of the mask data is 32, then every 2×512 elements (2 rows and 512 columns, i.e., 1024 elements) in the input data corresponds to every 1×32 elements (1 row and 32 columns, i.e., 32 elements, and the number of bits of 32 elements is 32×32=1024) in the mask data.
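The bookkeeping in this example can be checked with a few lines of Python (values taken from the example above; this is only an arithmetic sanity check, not implementation code):

```python
# Arithmetic check of the alignment: every 2 x 512 block of input elements
# pairs with 1 x 32 mask elements of 32 bits each.
Z = 32
input_elements_per_block = 2 * 512       # 2 rows x 512 columns
mask_bits_per_block = 32 * Z             # 32 mask elements x 32 bits each
assert input_elements_per_block == mask_bits_per_block == 1024
```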
In the embodiments of the present disclosure, the correspondence between the elements of the input data and the bits of the mask elements is adjusted: instead of mapping the j bits of each lane of the second register to j horizontally consecutive elements of the input data, the i bits stored in each column of storage units of the second register are mapped to i horizontally consecutive elements of the input data. This facilitates the subsequent calculation between the bits of the mask data and the elements of the input data; for example, multiple bits of the mask data can be read in one batch in combination with a selection instruction, format adjustment and conversion of the read bits are avoided, the number of instructions is effectively reduced, and operation efficiency is improved.
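The adjusted layout can be pictured with a small Python sketch: modeling each lane as an integer word, one "column" of the register is the set of bits at the same position across all lanes. This is an assumed model for illustration only, with hypothetical names:

```python
# Illustrative model: column `col` of the register is bit position `col` taken
# from every lane word; under the adjusted layout these i bits correspond to
# i consecutive elements of one input row.

def column_bits(lanes: list[int], col: int) -> list[int]:
    """Collect bit `col` from each lane word, lane 0 first."""
    return [(w >> col) & 1 for w in lanes]

print(column_bits([0b1, 0b0, 0b11], 0))  # [1, 0, 1]
```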
For example, in some embodiments of the present disclosure, the second register is divided into two sets of memory cells, with columns 1 through j/2 of the second register being the first set of memory cells and columns (j/2+1) through j of the second register being the second set of memory cells. Therefore, the selection and calculation of each bit can be realized by combining the selection instruction, so that the efficiency is improved, and the realization is convenient.
FIG. 3 shows a schematic flow chart of one example of step S202 in FIG. 2.
As shown in fig. 3, one example of step S202 may include the following steps S301 to S303 for the second register to be divided into two sets of memory cells.
Step S301, selecting bits stored in the S-th column storage unit in the first group of storage units and bits stored in the S-th column storage unit in the second group of storage units by using a selection instruction, wherein the bits stored in the S-th column storage unit in the first group of storage units and the bits stored in the S-th column storage unit in the second group of storage units correspond to elements of continuous i columns in every two continuous rows in input data.
For example, in some embodiments of the present disclosure, the format type of the select instruction is the same as the format type of the input data.
For example, in some embodiments of the present disclosure, the format type of the select instruction is BF16, and the format type of the input data is BF16.
Because the format type of the selection instruction is the same as the format type of the input data, the input data can be directly selected, and the method is more concise and efficient.
FIG. 4 shows a schematic diagram of one example of step S301 in FIG. 3.
As shown in FIG. 4, the storage units of the second register are arranged in 32 rows and 32 columns (the second register has 32 lanes (Lane0 to Lane31), and each lane has 32 bits (bit0 to bit31)); the 1st to 16th columns of the second register form the first group of storage units, and the 17th to 32nd columns form the second group of storage units. The highest bit of each group, i.e., the bit stored in the 16th column of the first group and the bit stored in the 16th column of the second group, is selected by means of a selection instruction. For example, the bits stored in the 16th column of the first group and the bits stored in the 16th column of the second group correspond to the elements in the 1st to 32nd columns of the first and second rows of the input data.
The present disclosure does not limit which column of the first and second groups of storage units is selected, as long as the selection is made column by column.
Step S302, based on the corresponding relation, the product calculation is carried out on the bits in the selected storage unit and the corresponding elements of the input data, so as to obtain the corresponding elements of the output data, which are positioned in every two continuous rows and in every two continuous columns.
For example, in some embodiments of the present disclosure, step S302 may include taking the element of the corresponding input data as the corresponding element in the output data if the value of the bit in the selected memory cell is 1, and taking 0 as the corresponding element in the output data if the value of the bit in the selected memory cell is 0.
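A minimal Python sketch of the select-and-multiply semantics of step S302 (assumed behavior, illustrative names): a mask bit of 1 keeps the corresponding input element, and a mask bit of 0 zeroes it.

```python
def masked_product(bits: list[int], elems: list[float]) -> list[float]:
    """Apply one column of mask bits to its corresponding input elements."""
    # bit == 1 keeps the input element; bit == 0 drops it to zero
    return [x if b == 1 else 0.0 for b, x in zip(bits, elems)]

print(masked_product([1, 0, 1], [1.0, 2.0, 3.0]))  # [1.0, 0.0, 3.0]
```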
Step S303: shift the first group of storage units and the second group of storage units left by 1 bit or right by 1 bit to obtain a shifted first group and a shifted second group; select, by the selection instruction, the bits stored in the s-th column of the shifted first group and the bits stored in the s-th column of the shifted second group; perform the product calculation on the bits in the selected storage units and the corresponding elements of the input data based on the correspondence; and continue the shifting operation and product calculation until the selection and product calculation of all columns in the first and second groups of storage units are completed.
FIG. 5 shows a schematic diagram of one example of step S303 in FIG. 3.
As shown in FIG. 5, the second register is the same size as that shown in FIG. 4. The selection instruction selects only the bit stored in the 16th column (bit15) of the first group of storage units and the bit stored in the 16th column (bit31) of the second group at a time. Step S302 is executed after the selection; the first and second groups are then shifted left by 1 bit, so that bit30 moves to the bit31 position, bit14 moves to the bit15 position, and the remaining bits shift left by one position in turn. The selection and the product calculation of step S302 are then repeated until all columns of the first and second groups of storage units have been selected and computed.
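The select-and-shift loop can be modeled in Python as below. Each group is modeled as a 16-bit word per lane; the selection reads the top bit of each group, and a left shift by 1 brings the next column into the top position. This is a behavioral sketch under stated assumptions, not the actual selection instruction:

```python
# Behavioral model of the select-and-shift loop: repeatedly select the top
# column of each group, then shift every lane left by 1 so the next column
# becomes the top bit, until all columns have been visited.

def select_columns_by_shifting(group1, group2, ncols=16):
    """Yield, for each of the ncols columns, the top-bit selection of both groups."""
    g1, g2 = list(group1), list(group2)
    mask, top = (1 << ncols) - 1, ncols - 1
    for _ in range(ncols):
        yield ([(w >> top) & 1 for w in g1], [(w >> top) & 1 for w in g2])
        g1 = [(w << 1) & mask for w in g1]   # next column moves to the top
        g2 = [(w << 1) & mask for w in g2]

cols = list(select_columns_by_shifting([0b1010], [0b0110], ncols=4))
# top bits of group 1 over the 4 steps: [1, 0, 1, 0]
```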
It should be noted that the first and second groups of storage units may instead be shifted right by 1 bit; this can be adjusted according to the correspondence between the elements of the input data and the bits of the mask elements, and is not limited by the present disclosure. In other embodiments, if the bits in the column direction are selected by some means other than a selection instruction, the shifting operation may be omitted, as long as the bits of each column can be selected in turn; this is likewise not limited by the embodiments of the present disclosure.
Returning to FIG. 3, for example, in some embodiments of the present disclosure, step S202 may further include step S304 of dividing the result of the product calculation by (1-drop_prob) to obtain the output data, where drop_prob represents the probability that each bit of each element of the mask data is 0.
For example, if each bit of each element of the mask data is 0 with probability drop_prob, then each bit of each element of the mask data is 1 with probability (1-drop_prob). The result of the product calculation is therefore typically scaled, i.e., multiplied by 1/(1-drop_prob) or, equivalently, divided by (1-drop_prob).
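The scaling of step S304 can be sketched as follows (a minimal illustration with made-up data; the variable names are assumptions, not part of the patented method):

```python
import numpy as np

drop_prob = 0.25
rng = np.random.default_rng(1)

x = rng.standard_normal((2, 8)).astype(np.float32)        # input elements
mask = rng.integers(0, 2, size=x.shape).astype(np.float32)  # unpacked mask bits

product = x * mask                       # result of the product calculation
output = product / (1 - drop_prob)       # step S304: rescale the kept elements
```

Elements zeroed by the mask stay zero; only the retained elements are enlarged by the factor 1/(1-drop_prob).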
For example, the data processing method provided by the embodiment of the disclosure may further include storing the output data in the memory according to the corresponding position of the input data.
Since the size and the data type of the output data are exactly the same as those of the input data, after the output data is obtained, it is stored in the memory at the same coordinates as the input data currently being processed.
The data processing method provided by the present disclosure is described below by way of a specific embodiment.
For example, in one embodiment of the present disclosure, N=512, M=1024, P=256, i=j=Q/2=Z=32. That is, the size of the input data is 512×1024, the size of the mask data is 256×64, each element of the mask data is 32 bits, and the storage units of the second register are arranged in 32 rows and 32 columns. The format type of the input data is BF16, the format type of the mask data is FP32, and the format type of the selection instruction is BF16.
For example, columns 1 to 16 of the second register are the first group of storage units, and columns 17 to 32 of the second register are the second group of storage units.
First, input data is loaded into a first register and mask data is loaded into a second register.
Then, the bits stored in the 16th column of storage units in the first group and the bits stored in the 16th column of storage units in the second group are selected by the selection instruction; that is, the 16th column and the 32nd column of the 32 columns of storage units of the second register are selected by the selection instruction.
Then, based on the correspondence, product calculation is performed on the bits in the selected storage units and the corresponding elements of the input data, so as to obtain the corresponding elements of the output data located in 32 consecutive columns of every two consecutive rows. The corresponding elements of the output data in the consecutive i columns of every two consecutive rows include the elements of the output data in the first i columns of the X-th row and the (X+1)-th row.
For example, the bits stored in the 16th column of storage units in the first group correspond to the elements in columns 1 to 32 of the X-th row of the input data, and the bits stored in the 16th column of storage units in the second group correspond to the elements in columns 1 to 32 of the (X+1)-th row of the input data, where X is a positive odd integer. For example, for X=1, the bits stored in the 16th column of the first group correspond to the elements in row 1, columns 1 to 32 of the input data, and the bits stored in the 16th column of the second group correspond to the elements in row 2, columns 1 to 32 of the input data. Product calculation is performed on the bits in the selected storage units and the corresponding elements of the input data, so as to obtain the corresponding elements in columns 1 to 32 of the first and second rows of the output data.
Then, the first group of storage units and the second group of storage units are shifted left by 1 bit to obtain the shifted first group and the shifted second group, and the bits stored in the 16th column of the shifted first group and the bits stored in the 16th column of the shifted second group are selected by the selection instruction; these correspond to the 15th column and the 31st column of the 32 columns of storage units of the second register before shifting. Then, product calculation is performed on the bits in the selected storage units and the corresponding elements of the input data based on the correspondence. In this manner, shifting, selection, and product calculation are performed alternately until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.
For example, the bits stored in the 16th column of the shifted first group correspond to the elements in columns 33 to 64 of the X-th row of the input data, and the bits stored in the 16th column of the shifted second group correspond to the elements in columns 33 to 64 of the (X+1)-th row, where X is a positive odd integer. For example, for X=1, the bits stored in the 16th column of the shifted first group correspond to the elements in row 1, columns 33 to 64 of the input data, and the bits stored in the 16th column of the shifted second group correspond to the elements in row 2, columns 33 to 64. The first group of storage units comprises 512 bits, and the second register comprises 2×512 bits, which correspond to the elements in columns 1 to 512 of every two consecutive rows of the input data. The elements in columns 513 to 1024 of every two consecutive rows of the input data correspond to another register. For example, the elements in columns 513 to 1024 of every two consecutive rows of the input data correspond one-to-one to the 2×512 bits stored in a third register; the third register stores mask data different from, but of the same size as, the mask data stored in the second register, and the mask data is arranged in the third register in the same manner as in the second register.
For example, in the initial state, the bits stored in the 16th column of the first group of storage units correspond to the elements in row 1, columns 1 to 32 of the input data, and the bits stored in the 16th column of the second group correspond to the elements in row 2, columns 1 to 32. After one shift, the selected bits (the 15th column before shifting) of the first group correspond to the elements in row 1, columns 33 to 64, and those of the second group correspond to the elements in row 2, columns 33 to 64; after the next shift, the 14th column before shifting corresponds to columns 65 to 96, and so on, until the 1st column of the first group corresponds to the elements in row 1, columns 481 to 512, and the 1st column of the second group corresponds to the elements in row 2, columns 481 to 512. Selection and product calculation are performed after each shift until the selection and product calculation of all columns in the first group and the second group of storage units are completed, yielding the corresponding elements in columns 1 to 512 of the first two rows of the output data.
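The correspondence enumerated above can be written as a small index function (1-based coordinates, following the text; the function name and its argument convention are illustrative assumptions):

```python
def input_coords(group, k, r, i=32, half=16):
    """Input-data coordinates (row, col), 1-based, of bit r (counted top to
    bottom) in the column selected at the k-th selection step (k = 0 before
    any shift) from the given group (1 or 2), for the row pair X = 1, X+1 = 2."""
    assert group in (1, 2) and 0 <= k < half and 1 <= r <= i
    row = 1 if group == 1 else 2     # first group -> row X, second -> row X+1
    col = k * i + r                  # the k-th block of i consecutive columns
    return row, col
```

With i=32 and half=16 this reproduces the enumeration: step 0 covers columns 1 to 32, step 1 covers columns 33 to 64, and the final step covers columns 481 to 512.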
Then, the result of the product calculation is divided by (1-drop_prob) to obtain output data.
And finally, storing the output data into a memory according to the corresponding position of the input data.
The data processing method provided by the embodiments of the present disclosure effectively reduces the number of assembly instructions and greatly improves operation efficiency. For example, for input data occupying four registers (i.e., 8×32 input data of BF16 type), a general data processing method requires about 38 assembly instructions, whereas the data processing method provided by the embodiments of the present disclosure can be completed with only 6 assembly instructions, a remarkable optimization effect.
For example, in some embodiments of the present disclosure, the data processing method is used for the computation of the discarding method (dropout) layer of a neural network.
For a standard neural network, the training process is to first propagate the input data forward through the neural network and then propagate the loss result backward to determine how to update the parameters of the neural network so that the neural network learns. After the discarding method layer is used, the training flow becomes: first, half of the hidden neurons in the neural network are randomly deleted while the input and output neurons are kept unchanged; then the input data is propagated forward through the modified neural network, and the obtained loss result is propagated backward through the modified neural network; after a small batch of training samples has been processed, the corresponding parameters are updated on the neurons that were not deleted according to the stochastic gradient descent method; and this process is then repeated continuously.
The input part of the discarding method layer includes the input data and the mask data: in the case that the value of a bit of an element of the mask data is 1, the corresponding element of the input data is taken as an output of the discarding method layer, and in the case that the value of a bit of an element of the mask data is 0, the corresponding element of the input data is discarded (e.g., a deleted neuron).
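These keep/discard semantics can be sketched as a minimal Python function (the function name and the list-based representation are assumptions for illustration, not the patented register-level implementation):

```python
def dropout_layer_forward(x_row, mask_bits):
    """Forward pass of the discarding-method layer on one row of input:
    an element is kept where its mask bit is 1 and replaced by 0
    (a "deleted neuron") where its bit is 0."""
    return [xi if b == 1 else 0.0 for xi, b in zip(x_row, mask_bits)]
```

For example, the row [1.0, 2.0, 3.0] with mask bits [1, 0, 1] yields [1.0, 0.0, 3.0].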
As described above, the result of the product calculation is divided by (1-drop_prob) to obtain the output data, because the discarding method layer needs to be scaled. Some neurons are randomly discarded during training, but neurons cannot be randomly discarded during prediction; if some neurons were discarded, the results would be unstable and model prediction inaccurate. One solution is to scale the output of each neuron by a probability so that the prediction-time and training-time outputs are approximately the same. For example, if the output of a neuron is x, it is discarded with probability drop_prob during training and participates in training with probability (1-drop_prob); the output of this neuron is therefore divided by (1-drop_prob) so that its expected value during training matches its value during prediction.
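As a quick numerical check of this scaling argument (a Monte-Carlo sketch for illustration only; it is not part of the patented method, and all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
drop_prob = 0.3
x = 1.7                                  # a fixed neuron output
n = 200_000                              # number of simulated mask draws

bits = rng.random(n) >= drop_prob        # bit is 1 with probability (1 - drop_prob)
scaled = x * bits / (1 - drop_prob)      # inverted-dropout training outputs

empirical_mean = scaled.mean()           # approaches the unscaled output x
```

The empirical mean of the scaled, randomly masked outputs converges to x, which is exactly why dividing by (1-drop_prob) keeps training and prediction consistent.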
Fig. 6 illustrates a schematic block diagram of a data processing apparatus 600 that may be used to perform the data processing method illustrated in fig. 2, provided in accordance with at least one embodiment of the present disclosure.
As shown in fig. 6, the data processing apparatus 600 includes a data loading unit 601 and a computing unit 602.
The data loading unit 601 is configured to load input data into a first register and mask data into a second register, where the size of the input data is N×M, the size of the mask data is P×Q, each element of the mask data is Z bits, Q×Z=2×M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are positive integers.
The calculation unit 602 is configured to perform product calculation based on the correspondence relationship of each element of the input data and each bit of each element of the mask data to obtain output data.
The data loading unit 601 may implement step S201 of the data processing method shown in fig. 2, and the calculating unit 602 may implement step S202 of the data processing method shown in fig. 2; for the relevant description, reference may be made to the foregoing, and details are not repeated here. The technical effects of the data processing apparatus 600 are the same as those of the data processing method shown in fig. 2 and are likewise not repeated here.
For example, the data processing apparatus may be implemented in hardware, software, firmware, and any feasible combination thereof, which is not limited in this disclosure.
For example, the data loading unit 601 and the computing unit 602 may be hardware, software, firmware, and any feasible combination thereof. For example, the data loading unit 601 and the computing unit 602 may be dedicated or general purpose circuits, chips, devices, or the like, or may be a combination of a processor and a memory. With respect to specific implementations of the data loading unit 601 and the computing unit 602, embodiments of the present disclosure are not limited in this regard.
It should be noted that, in the embodiment of the present disclosure, each unit of the data processing apparatus 600 corresponds to each step of the foregoing data processing method, and the specific function of the data processing apparatus 600 may refer to the related description of the data processing method, which is not repeated herein. The components and structures of data processing apparatus 600 shown in fig. 6 are exemplary only and not limiting, and data processing apparatus 600 may include other components and structures as desired.
At least one embodiment of the present disclosure also provides a data processing apparatus including a memory for non-transitory storage of computer-executable instructions and a processor for executing the computer-executable instructions, wherein the computer-executable instructions, when executed by the processor, perform the data processing method provided by at least one embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a data processing apparatus 700 according to an embodiment of the disclosure. As shown in fig. 7, a data processing apparatus 700 according to an embodiment of the present disclosure may include a processing apparatus 701 and a memory 702, which may be interconnected by a bus 703.
The processing device 701 may perform various actions and processes in accordance with programs or code stored in the memory 702. Specifically, the processing device 701 may be an integrated circuit chip with signal processing capabilities. For example, the processing device may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, procedures, and logic blocks disclosed in the embodiments of the present disclosure. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of an X86 architecture, an ARM architecture, or the like.
The memory 702 stores computer-executable instructions that, when executed by the processing device 701, implement a data processing method provided by at least one embodiment of the present disclosure. The memory 702 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
At least one embodiment of the present disclosure also provides an electronic device, including a data processing device provided by at least one embodiment of the present disclosure. In one embodiment, the electronic device is, for example, a central processor, such as a single-core or multi-core processor. In one embodiment, the electronic device is a computer system, the computer system including one or more processors.
Fig. 8 shows a schematic diagram of an electronic device 800 according to an embodiment of the disclosure. As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure may include a data processing device 600.
At least one embodiment of the present disclosure provides a computer-readable storage medium for non-transitory storage of computer-executable instructions that, when executed by a processor, implement a data processing method provided by at least one embodiment of the present disclosure.
Fig. 9 is a schematic diagram of a storage medium according to some embodiments of the present disclosure. As shown in fig. 9, a storage medium 900 is used to store computer executable instructions 910. For example, computer-executable instructions 910, when executed by a computer, may perform one or more steps in accordance with the data processing methods described above.
Similarly, the computer readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs a data processing method according to an embodiment of the present disclosure.
The technical effects of the data processing apparatus, the electronic apparatus, and the storage medium are the same as those of the data processing method shown in fig. 2, and will not be described herein.
The following points need to be described:
(1) The drawings of the embodiments of the present disclosure relate only to the structures to which the embodiments of the present disclosure relate, and reference may be made to the general design for other structures.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely specific embodiments of the disclosure, but the scope of the disclosure is not limited thereto, and the scope of the disclosure should be determined by the claims.

Claims (17)

1. A data processing method, comprising:
Loading input data into a first register and loading mask data into a second register, wherein the size of the input data is N×M, the size of the mask data is P×Q, each element of the mask data is Z bits, Q×Z=2×M, N=2P, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are positive integers;
And carrying out product calculation based on the corresponding relation between each element of the input data and each bit of each element of the mask data to obtain output data.
2. The data processing method according to claim 1, wherein columns 1 to j/2 of the second register are a first group of memory cells, and columns (j/2+1) to j of the second register are a second group of memory cells;
Performing product calculation to obtain the output data based on the correspondence between each element of the input data and each bit of each element of the mask data, including:
Selecting bits stored in an s-th column storage unit in the first group of storage units and bits stored in an s-th column storage unit in the second group of storage units by using a selection instruction, wherein the bits stored in the s-th column storage unit in the first group of storage units and the bits stored in the s-th column storage unit in the second group of storage units correspond to elements of the input data located in consecutive i columns in every two consecutive rows;
based on the corresponding relation, carrying out product calculation on the bits in the selected storage units and the corresponding elements of the input data to obtain corresponding elements of continuous i columns in every two continuous rows in the output data;
Shifting the first group of storage units and the second group of storage units left by 1 bit or right by 1 bit to obtain a shifted first group of storage units and a shifted second group of storage units, selecting bits stored in an s-th column of storage units in the shifted first group of storage units and bits stored in an s-th column of storage units in the shifted second group of storage units by using the selection instruction, calculating products of the bits in the selected storage units and elements of corresponding input data based on the corresponding relation, and continuing to perform shifting operation and product calculation until the selection and product calculation of all columns in the first group of storage units and the second group of storage units are completed.
3. The data processing method according to claim 2, wherein N=512, M=1024, P=256, i=j=Q/2=Z=32.
4. A data processing method according to claim 3, wherein s=16, bits stored in a 16th column of storage units in the first group of storage units correspond to elements in columns 1 to 32 of the X-th row of the input data, and bits stored in a 16th column of storage units in the second group of storage units correspond to elements in columns 1 to 32 of the (X+1)-th row of the input data;
The elements of the output data corresponding to the consecutive i columns in every two consecutive rows comprise the elements of the output data in columns 1 to 32 of the X-th row and the (X+1)-th row;
The bits stored in the 16th column of storage units in the shifted first group of storage units correspond to elements in columns 33 to 64 of the X-th row of the input data, the bits stored in the 16th column of storage units in the shifted second group of storage units correspond to elements in columns 33 to 64 of the (X+1)-th row of the input data, X is a positive integer and X is an odd number.
5. The data processing method according to claim 4, wherein the first group of memory cells includes 512 bits, the second register includes 2 x 512 bits, and the 2 x 512 bits correspond to elements from 1 st column to 512 th column in every two consecutive rows of the input data.
6. The data processing method according to claim 5, wherein the elements from 513 th to 1024 th columns in the every two consecutive rows in the input data are one-to-one corresponding to 2 x 512 bits stored in a third register, the third register storing mask data different from but the same size as mask data stored in the second register, the mask data in the third register being arranged in the same manner as mask data in the second register.
7. The data processing method of claim 2, wherein multiplying the bits in the selected memory cells with the corresponding elements of the input data comprises:
taking the element of the corresponding input data as the corresponding element in the output data under the condition that the value of the bit in the selected storage unit is 1;
in the case where the value of the bit in the selected memory cell is 0, 0 is taken as the corresponding element in the output data.
8. The data processing method according to claim 7, wherein performing product calculation to obtain the output data based on correspondence of respective elements of the input data and respective bits of respective elements of the mask data, further comprises:
dividing the result of the product calculation by (1-drop_prob) to obtain the output data, wherein drop_prob represents a probability that each bit of each element of the mask data is 0.
9. The data processing method according to any one of claims 2 to 8, further comprising:
and storing the output data into a memory according to the corresponding position of the input data.
10. The data processing method according to claim 9, wherein a format type of the selection instruction and a format type of the input data are the same.
11. The data processing method according to claim 10, wherein the format type of the selection instruction is BF16, and the format type of the input data is BF16.
12. The data processing method according to claim 11, wherein the data processing method is used for calculation of a discarding method layer of a neural network, an input section of the discarding method layer including the input data and the mask data,
In the case that the bit value of the element of the mask data is 1, the corresponding element of the input data is used as the output of the discarding method layer;
in case that the value of the bit of the element of the mask data is 0, the corresponding element of the input data is discarded.
13. The data processing method of claim 1, further comprising:
before loading the input data into the first register and the mask data into the second register,
And performing an alignment operation on the input data and the mask data, so that every 2*N elements in the input data correspond to every 1*Z elements in the mask data.
14. A data processing apparatus comprising:
A data loading unit configured to load input data into a first register and mask data into a second register, wherein the size of the input data is N×M, the size of the mask data is P×Q, each element of the mask data is Z bits, Q×Z=2×M, each bit of each element of the mask data corresponds to one element of the input data, the storage units of the second register are arranged in i rows and j columns, i=j=Q/2=Z, the i bits stored in the same column of the second register sequentially correspond to i consecutive elements in the same row of the input data, and N, M, P, Q, i, j, Z are positive integers;
And a calculating unit configured to perform product calculation based on the correspondence between each element of the input data and each bit of each element of the mask data to obtain output data.
15. A data processing apparatus comprising:
Processor, and
A memory storing computer-executable instructions,
Wherein the computer executable instructions, when executed by the processor, implement the data processing method according to any of claims 1-13.
16. An electronic device comprising the data processing device of claim 15.
17. A computer-readable storage medium for non-transitory storage of computer-executable instructions,
Wherein the computer executable instructions, when executed by a processor, implement the data processing method according to any of claims 1-13.
CN202210916696.7A 2022-08-01 2022-08-01 Data processing method and device, electronic device and medium Active CN115186815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210916696.7A CN115186815B (en) 2022-08-01 2022-08-01 Data processing method and device, electronic device and medium

Publications (2)

Publication Number Publication Date
CN115186815A CN115186815A (en) 2022-10-14
CN115186815B true CN115186815B (en) 2025-07-11

Family

ID=83520666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210916696.7A Active CN115186815B (en) 2022-08-01 2022-08-01 Data processing method and device, electronic device and medium

Country Status (1)

Country Link
CN (1) CN115186815B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428879A (en) * 2020-03-04 2020-07-17 深圳芯英科技有限公司 Data processing method, device, chip and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040122887A1 (en) * 2002-12-20 2004-06-24 Macy William W. Efficient multiplication of small matrices using SIMD registers
EP1821414B1 (en) * 2004-12-07 2016-06-22 Nippon Telegraph And Telephone Corporation Information compression-coding device, method thereof, program thereof and recording medium storing the program
CN111402860B (en) * 2020-03-16 2021-11-02 恒睿(重庆)人工智能技术研究院有限公司 Parameter management method, system, medium and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428879A (en) * 2020-03-04 2020-07-17 深圳芯英科技有限公司 Data processing method, device, chip and computer readable storage medium

Also Published As

Publication number Publication date
CN115186815A (en) 2022-10-14

Similar Documents

Publication Publication Date Title
US11574031B2 (en) Method and electronic device for convolution calculation in neural network
US10592241B2 (en) Apparatus and methods for matrix multiplication
US12174908B2 (en) Method, electronic device and storage medium for convolution calculation in neural network
US20240265234A1 (en) Digital Processing Circuits and Methods of Matrix Operations in an Artificially Intelligent Environment
CN107301456B (en) Implementation method of multi-core acceleration of deep neural network based on vector processor
CN110415157B (en) Matrix multiplication calculation method and device
KR20230081697A (en) Method and apparatus for accelerating dilatational convolution calculation
CN109313663B (en) Artificial intelligence calculation auxiliary processing device, method, storage medium and terminal
CN110929854B (en) A data processing method, device and hardware accelerator
CN116152520B (en) Data processing method for neural network accelerator, chip and electronic equipment
CN117795473A (en) Partially managed and reconfigurable systolic streaming architecture for in-memory computing
CN118152713B (en) Data processing method, device, electronic equipment and computer readable storage medium
CN113743600A (en) Design method of systolic array with integrated storage and computing architecture suitable for multi-precision neural network
US20210319291A1 (en) Neural network computation apparatus having systolic array
CN115186815B (en) Data processing method and device, electronic device and medium
CN119719595B (en) Data processing method, electronic device, medium, and computer program product
KR102913162B1 (en) Mixed-precision neural processing unit(npu) using spatial fusion with load balancing
CN115480919A (en) Convolution optimization operation method and device, computer equipment and storage medium
CN116721006A (en) Feature map processing method and device
KR20200072308A (en) Method and apparatus for performing convolution operations in neural networks
US20220207332A1 (en) Scalable neural network accelerator architecture
CN115456170A (en) Neural network parallel training method with low communication overhead
US20220164127A1 (en) Memory for an Artificial Neural Network Accelerator
CN113627587A (en) Multichannel convolutional neural network acceleration method and device
JP2825133B2 (en) Parallel data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Applicant after: Shanghai Bi Ren Technology Co.,Ltd.

Address before: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant