WO2019165946A1 - Integrated circuit chip device, board card and related product - Google Patents
- Publication number
- WO2019165946A1 (PCT/CN2019/076088)
- Authority
- WO
- WIPO (PCT)
- Legal status: Ceased (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Definitions
- the present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device, a board, and related products.
- An Artificial Neural Network (ANN), or simply a neural network, is a computational model consisting of a large number of interconnected nodes (or neurons).
- The calculation of an existing neural network is performed on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such calculation involves a large amount of computation and high power consumption.
- Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can improve the processing speed and the efficiency of a computing device.
- In a first aspect, an integrated circuit chip device is provided, comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
- The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
- The main processing circuit is configured to perform the successive operations in the neural network operation and to transmit data to the basic processing circuits connected to it;
- The plurality of basic processing circuits are configured to perform operations in the neural network in a parallel manner according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.
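The topology just described, one main circuit distributing work to an array of basic circuits and collecting their results, can be sketched in software. Everything below (function names, chunking scheme) is an illustrative assumption, not the hardware's actual interconnect:

```python
# Hypothetical software model of the main/basic processing circuit pattern:
# the main circuit splits a workload into chunks, each basic circuit
# computes its chunk independently, and the results are gathered back.

def basic_circuit(chunk):
    """Each basic processing circuit performs a simple local operation
    (here: a sum of products, i.e. an inner product with itself)."""
    return sum(x * x for x in chunk)

def main_circuit(data, n_basic=4):
    """Split `data` into n_basic chunks, 'send' one to each basic
    circuit, and collect the partial results."""
    chunks = [data[i::n_basic] for i in range(n_basic)]
    partials = [basic_circuit(c) for c in chunks]  # conceptually parallel
    return sum(partials)

print(main_circuit([1.0, 2.0, 3.0, 4.0]))  # 1 + 4 + 9 + 16 = 30.0
```

Because the partial results are independent, the chunking scheme does not affect the final sum, which is what makes the parallel distribution possible.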
- In a second aspect, an integrated circuit chip device is provided, comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
- The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
- The main processing circuit is configured to acquire an input data block, a convolution kernel data block, and a convolution instruction; divide the input data block into vertical data blocks and the convolution kernel data block into horizontal data blocks according to the convolution instruction; determine, according to the operation control of the convolution instruction, whether to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the convolution instruction;
- The plurality of basic processing circuits are configured to determine, according to the operation control of the convolution instruction, whether to start the second mapping circuit to process a second data block, to perform operations in the neural network in a parallel manner according to the processed second data block to obtain operation results, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and is associated with the processed first data block;
- the main processing circuit is configured to process the operation result to obtain an instruction result of the convolution instruction.
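A convolution instruction of this kind ultimately reduces to inner products between kernel data and windows of the input. The sketch below is a minimal generic illustration that ignores the horizontal/vertical block splitting and the compression, which are device-specific:

```python
def conv2d_valid(inp, kernel):
    """Naive 'valid' 2-D convolution (strictly, cross-correlation):
    each output element is the inner product of the kernel with one
    window of the input -- the operation the basic circuits perform."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(inp) - kh + 1
    ow = len(inp[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(inp[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

inp = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # picks out inp[i][j] + inp[i+1][j+1]
print(conv2d_valid(inp, kernel))  # [[6, 8], [12, 14]]
```

In the device, the windows of the input and the kernel data would be partitioned into blocks and the inner products computed across the array of basic processing circuits rather than in a double loop.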
- In a third aspect, an integrated circuit chip device is provided, comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
- The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
- The main processing circuit is configured to acquire an input data block, a weight data block, and a multiplication instruction; divide the input data block into horizontal data blocks and the weight data block into vertical data blocks according to the multiplication instruction; determine, according to the operation control of the multiplication instruction, whether to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the multiplication instruction;
- The plurality of basic processing circuits are configured to determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process a second data block, to perform operations in the neural network in a parallel manner according to the processed second data block to obtain operation results, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and is associated with the processed first data block;
- the main processing circuit is configured to process the operation result to obtain an instruction result of the multiplication instruction.
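The multiplication-instruction flow above (distribute one operand's blocks, broadcast the other, compute in parallel, reassemble the instruction result) can be sketched as follows. The interleaved row partitioning is an assumption made for illustration; the actual block layout is governed by the instruction:

```python
def matmul(a, b):
    """Reference matrix product C = A @ B."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def distributed_matmul(a, b, n_basic=2):
    """Rows of A (the 'horizontal' operand) are distributed among
    n_basic circuits; B (the 'vertical' operand) is broadcast to all.
    Each circuit multiplies its row block by B; the main circuit
    reassembles the rows into the instruction result."""
    row_blocks = [a[i::n_basic] for i in range(n_basic)]
    partial = [matmul(rb, b) for rb in row_blocks]   # parallel in hardware
    out = [None] * len(a)
    for c, rows in enumerate(partial):               # interleave rows back
        for r, row in enumerate(rows):
            out[c + r * n_basic] = row
    return out

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
assert distributed_matmul(a, b) == matmul(a, b)
```

The assertion checks the key property: however the rows are partitioned, reassembling the partial products reproduces the plain matrix product.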
- In a fourth aspect, an integrated circuit chip device for performing a neural network forward operation is provided; the neural network includes n layers. The integrated circuit chip device comprises: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
- The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
- The main processing circuit is configured to receive a forward operation instruction and parse the forward operation instruction to obtain the first operation instruction contained in the i-th layer of the neural network forward operation, as well as the input data block and the weight data block required by the first operation instruction, where i is an integer such that 1 ≤ i ≤ n; when i ≥ 2, the input data block is the output data block of the (i-1)-th layer;
- The main processing circuit is further configured to divide the input data block into vertical data blocks and the weight data block into horizontal data blocks according to the first operation instruction; determine, according to the operation control of the first operation instruction, whether to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the forward operation instruction;
- The plurality of basic processing circuits are configured to determine, according to the operation control of the first operation instruction, whether to start the second mapping circuit to process a second data block, to perform operations in the neural network in a parallel manner according to the processed second data block to obtain operation results, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and is associated with the processed first data block;
- The main processing circuit is configured to process the operation results to obtain an instruction result of the first operation instruction, thereby completing the operation of the first operation instruction contained in the i-th layer.
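The per-layer rule in the fourth aspect (the output data block of layer i-1 becomes the input data block of layer i) is the ordinary forward pass. A minimal sketch, assuming simple fully connected layers with a ReLU activation (the device is not restricted to this layer type):

```python
def forward(x, layers):
    """Run an n-layer forward pass: the output of layer i-1 becomes
    the input of layer i. Each layer is a (weights, biases) pair and
    applies ReLU after the weighted sum."""
    for w, b in layers:
        x = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
             for row, bi in zip(w, b)]
    return x

layers = [
    ([[1.0, -1.0]], [0.0]),   # layer 1: 2 inputs -> 1 output
    ([[2.0]], [1.0]),         # layer 2: 1 input  -> 1 output
]
print(forward([3.0, 1.0], layers))  # layer 1: relu(3-1)=2; layer 2: relu(2*2+1)=5
```

In the device, each layer's weighted sums would themselves be block-partitioned and compressed as described above; only the layer-to-layer data flow is shown here.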
- In a fifth aspect, an integrated circuit chip device for performing training of a neural network is provided; the neural network includes n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
- The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
- The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the forward operation of the n layers of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation;
- The main processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result, and to obtain, according to the training instruction, the n-th layer reverse operation instruction and the data blocks required by the n-th layer reverse operation, dividing them into a horizontal data block and a vertical data block; to determine, according to the operation control of the n-th reverse operation instruction, whether to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and to send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the n-th reverse operation instruction;
- The plurality of basic processing circuits are configured to determine, according to the operation control of the n-th reverse operation instruction, whether to start the second mapping circuit to process a second data block, to perform operations in the neural network in a parallel manner according to the processed second data block to obtain operation results, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and is associated with the processed first data block;
- The main processing circuit is further configured to process the operation results to obtain the n-th layer weight group gradient and the n-th layer input data gradient, and to update the n-th layer weight group data using the n-th layer weight group gradient.
- The integrated circuit chip device is further configured to take the n-th layer input data gradient as the output result gradient of the (n-1)-th layer and perform the reverse operations of the remaining n-1 layers, obtaining each layer's weight group gradient and updating the weight group data of the corresponding layer with that gradient; the weight group data includes at least two weights.
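The reverse pass described for the fifth aspect propagates the output gradient from layer n downward, producing a weight group gradient and an input data gradient per layer and updating each layer's weights. A sketch for plain linear layers follows; the function names and the SGD-style update rule are illustrative assumptions, not the device's prescribed update:

```python
def backward_update(layers_w, acts, grad_out, lr=0.1):
    """Sketch of the layered reverse operation for linear layers:
    starting from the output gradient of layer n, compute each layer's
    weight gradient, update its weights, and pass the input-data
    gradient down as the output gradient of the layer below."""
    g = grad_out
    for i in range(len(layers_w) - 1, -1, -1):
        w, x = layers_w[i], acts[i]            # acts[i] = input of layer i
        w_grad = [[gj * xi for xi in x] for gj in g]          # dL/dW
        g = [sum(g[j] * w[j][k] for j in range(len(g)))       # dL/dx
             for k in range(len(x))]
        for j, row in enumerate(w_grad):                      # W -= lr * dL/dW
            for k, v in enumerate(row):
                layers_w[i][j][k] -= lr * v
    return g  # gradient with respect to the network input

w = [[[0.5, -0.5]]]          # one 2 -> 1 linear layer
acts = [[2.0, 4.0]]          # its recorded input
g_in = backward_update(w, acts, [1.0], lr=0.1)
print(w, g_in)
```

Note the input gradient is computed with the pre-update weights, matching the order in which the hardware would consume and then update the weight group data.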
- In a sixth aspect, a neural network processor board card is provided, comprising: a neural network chip package structure, a first electrical and non-electrical connection device, and a first substrate. The neural network chip package structure includes: a neural network chip, a second electrical and non-electrical connection device, and a second substrate; the second substrate carries the neural network chip, and the neural network chip is connected to the second substrate through the second electrical and non-electrical connection device;
- The neural network chip includes the integrated circuit chip device provided by any one of the first to fifth aspects.
- In a seventh aspect, a neural network computing device is provided, comprising the integrated circuit chip device provided by any one of the first to fifth aspects.
- In an eighth aspect, a combined processing apparatus is provided, comprising: the neural network computing apparatus provided by the seventh aspect, a universal interconnection interface, and a general-purpose processing apparatus;
- The neural network computing device is coupled to the general-purpose processing device via the universal interconnection interface.
- In a ninth aspect, a chip is provided that integrates the apparatus provided by any one of the first to eighth aspects.
- In a tenth aspect, an electronic device is provided, comprising the chip of the ninth aspect.
- In an eleventh aspect, a method for computing a neural network is provided; the method is applied to an integrated circuit chip device comprising the integrated circuit chip device of any one of the first to fifth aspects, the device being configured to perform the operations of a neural network.
- By providing the mapping circuits to compress data blocks before operations are performed, transmission resources and computing resources are saved, so that the device has the advantages of low power consumption and a small amount of computation.
- Figure 1a is a schematic structural diagram of an integrated circuit chip device.
- FIG. 1b is a schematic structural view of another integrated circuit chip device.
- Figure 1c is a schematic structural view of a basic processing circuit.
- Figure 1d is a schematic diagram of the structure of a main processing circuit.
- Figure 2a is a schematic diagram of the use of a basic processing circuit.
- Figure 2b is a schematic diagram of the data transmitted by the main processing circuit.
- Figure 2c is a schematic diagram of a matrix multiplied by a vector.
- Figure 2d is a schematic structural diagram of an integrated circuit chip device.
- Figure 2e is a schematic structural diagram of still another integrated circuit chip device.
- Figure 2f is a schematic diagram of a matrix multiplied by a matrix.
- Figure 3a is a schematic diagram of convolution input data.
- Figure 3b is a schematic diagram of a convolution kernel.
- Figure 3c is a schematic diagram of the operation window of a three-dimensional data block of input data.
- Figure 3d is a schematic diagram of another operational window of a three-dimensional data block of input data.
- Figure 3e is a schematic diagram of still another operational window of a three-dimensional data block of input data.
- Figure 4a is a schematic diagram of a training method of a neural network.
- Figure 4b is a schematic diagram of the forward operation of a neural network.
- Figure 4c is a schematic diagram of a neural network operation.
- Figures 5a-5b are schematic structural diagrams of two mapping circuits provided by an embodiment of the present application.
- Figure 6a is a flow chart of a method of multiplying a matrix by a matrix.
- Figure 6b is a flow chart of a method of multiplying a matrix by a vector.
- Figure 7a is a schematic diagram of neural network training.
- Figure 7b is a schematic diagram of another neural network training.
- Figure 7c is a schematic diagram of the forward and reverse operations of the neural network.
- Figure 7d is a schematic diagram of a multi-layer structure of neural network training.
- FIG. 8 is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure.
- FIG. 9a is a schematic structural diagram of a combined processing apparatus according to the present disclosure.
- FIG. 9b is another schematic structural diagram of a combined processing device according to the present disclosure.
- FIG. 10a is a schematic structural diagram of a neural network processor board card provided by an embodiment of the present disclosure.
- FIG. 10b is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
- FIG. 11a is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
- FIG. 11b is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.
- The apparatus includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
- The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
- The main processing circuit is configured to perform the successive operations in the neural network operation and to transmit data to the basic processing circuits connected to it;
- The plurality of basic processing circuits are configured to perform operations in the neural network in a parallel manner according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.
- The main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; split the horizontal data block and the pre-stored identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected thereto; and broadcast the vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto.
- The identification data block may specifically be represented by a direct index or a step index; optionally, representations such as List of Lists (LIL), Coordinate list (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELLPACK (ELL), and Hybrid (HYB) may also be used, which are not limited in this application.
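Two of the listed representations can be illustrated concretely. The sketch below builds generic COO and CSR encodings of a small matrix; it shows only the standard formats, not the device's internal layout:

```python
def to_coo(dense):
    """Coordinate list (COO): one (row, col, value) triple per nonzero."""
    return [(i, j, v) for i, row in enumerate(dense)
            for j, v in enumerate(row) if v != 0]

def to_csr(dense):
    """Compressed Sparse Row (CSR): nonzero values and their column
    indices in row order, plus row pointers marking where each row
    starts in the value array."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

m = [[0, 3, 0],
     [1, 0, 2]]
print(to_coo(m))   # [(0, 1, 3), (1, 0, 1), (1, 2, 2)]
print(to_csr(m))   # ([3, 1, 2], [1, 0, 2], [0, 1, 3])
```

CSR trades COO's explicit row coordinates for a compact row-pointer array, which is why it is a common choice when rows are scanned sequentially, as in row-wise distribution.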
- When the identification data block is represented by a direct index, it may specifically be a data block composed of 0s and 1s, where 0 indicates that the absolute value of the corresponding data contained in the data block (such as a weight or an input neuron) is less than or equal to a first threshold, and 1 indicates that the absolute value of the corresponding data is greater than the first threshold. The first threshold is a value set arbitrarily on the user side or the device side, such as 0.05 or 0.
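Under the direct-index representation, building the 0/1 identification block and using it to keep only the target data might look like the following sketch (the threshold value and helper names are illustrative):

```python
def direct_index_mask(block, threshold=0.05):
    """Build the direct-index identification data block: 1 where the
    absolute value of the data exceeds the threshold, 0 otherwise."""
    return [[1 if abs(v) > threshold else 0 for v in row] for row in block]

def compress(block, mask):
    """Keep only the 'target data' flagged by the mask, row by row --
    a minimal view of what the mapping circuits' compression does."""
    return [[v for v, m in zip(row, mrow) if m]
            for row, mrow in zip(block, mask)]

weights = [[0.9, 0.01, -0.3],
           [0.0, 0.2, -0.04]]
mask = direct_index_mask(weights)        # [[1, 0, 1], [0, 1, 0]]
print(compress(weights, mask))           # [[0.9, -0.3], [0.2]]
```

Only the compressed values and the mask need to travel to the basic processing circuits, which is the source of the transmission savings claimed above.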
- Specifically, the target data in the plurality of basic data blocks and the identification data blocks associated with the plurality of basic data blocks may be distributed to the basic processing circuits connected thereto; optionally, the target data in the processed vertical data block and the identification data block associated with the vertical data block may likewise be broadcast to them. The target data refers to the data in a data block (here, specifically the processed horizontal data block or the processed vertical data block) whose absolute value is greater than the first threshold, or to the non-zero data in the data block.
- The basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block; perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit;
- The main processing circuit is configured to process the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
- For example, the horizontal data block is a matrix of M1 rows and N1 columns, and a basic data block is a matrix of M2 rows and N2 columns, where M1 > M2 and N1 > N2. Accordingly, the identification data block associated with the horizontal data block is also a matrix of M1 rows and N1 columns, and the identification data block associated with a basic data block is also a matrix of M2 rows and N2 columns.
- For example, when the first threshold is 0.05, the identification data block associated with the basic data block is obtained accordingly. The processing of data blocks by the first mapping circuit and the second mapping circuit will be specifically described later.
- The main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; start the first mapping circuit to process the horizontal data block and the vertical data block to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, as well as the processed vertical data block and the identification data block associated with the vertical data block; split the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected thereto; and broadcast the vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
- The basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block; perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit;
- The main processing circuit is configured to process the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
- The main processing circuit is further configured to split the vertical data block (or the processed vertical data block) and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data block associated with each of the plurality of partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
- The basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data block to obtain a processed partial vertical data block and a processed basic data block; and perform an inner product operation on the processed partial vertical data block and the processed basic data block.
- The connection identification data block is a data block obtained by performing an element-by-element operation on the identification data block associated with the basic data block and the identification data block associated with the partial vertical data block; it is used to indicate the data whose absolute values are greater than the threshold in both of the two data blocks (specifically, the basic data block and the vertical data block). The details will be described later.
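Assuming the element-by-element operation is a logical AND, which matches the stated purpose of flagging positions significant in both blocks though the text does not fix the exact operation, the connection identification block and the resulting compressed inner product can be sketched as:

```python
def connect_masks(mask_a, mask_b):
    """Connection identification block: element-wise AND of the two
    identification blocks, flagging positions significant in BOTH."""
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]

def masked_inner_product(a, b, mask):
    """Inner product over only the positions kept by the connection
    mask -- the compressed form of the basic circuit's computation."""
    return sum(av * bv
               for ra, rb, rm in zip(a, b, mask)
               for av, bv, m in zip(ra, rb, rm) if m)

basic = [[2.0, 0.0], [0.0, 3.0]]
vert  = [[4.0, 5.0], [0.0, 6.0]]
mask = connect_masks([[1, 0], [0, 1]], [[1, 1], [0, 1]])
print(mask)                                     # [[1, 0], [0, 1]]
print(masked_inner_product(basic, vert, mask))  # 2*4 + 3*6 = 26.0
```

Positions masked out contribute nothing to the inner product, so skipping them changes no results while reducing the multiply-accumulate work, which is the computing-resource saving claimed for the mapping circuits.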
- For example, the identification data block associated with the horizontal data block is a 2*3 matrix, and the identification data block associated with a partial vertical data block is a 2*2 matrix; the corresponding connection identification data block is obtained accordingly.
- The main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; start the first mapping circuit to process the horizontal data block to obtain a processed horizontal data block and the identification data block associated with the horizontal data block, or start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block to obtain a processed horizontal data block; split the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected thereto; and broadcast the vertical data block to the basic processing circuits connected thereto;
- The basic processing circuit is configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block; perform an inner product operation on the processed vertical data block and the basic data block to obtain an operation result; and send the operation result to the main processing circuit;
- the main processing circuit is configured to process the operation result to obtain an instruction result.
- The main processing circuit is further configured to split the vertical data block to obtain a plurality of partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
- The basic processing circuit is configured to process the partial vertical data block according to the identification data block associated with the basic data block to obtain a processed partial vertical data block, and to perform an inner product operation on the basic data block and the processed partial vertical data block.
- The main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; start the first mapping circuit to process the vertical data block to obtain a processed vertical data block and the identification data block associated with the vertical data block, or start the first mapping circuit to process the vertical data block according to the pre-stored identification data block associated with the vertical data block to obtain a processed vertical data block; and split the horizontal data block to obtain a plurality of basic data blocks;
- The basic data blocks are distributed to the basic processing circuits connected thereto, and the processed vertical data block and the identification data block associated with the vertical data block are broadcast to the basic processing circuits connected thereto;
- the basic processing circuit is configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block, to obtain a processed basic data block; perform an inner product operation on the vertical data block and the processed basic data block to obtain an operation result, and transmit the operation result to the main processing circuit;
- the main processing circuit is configured to process the operation result to obtain an instruction result.
- the main processing circuit is further configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the partial vertical data blocks, and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
- the basic processing circuit is configured to process the basic data block according to the identifier data block associated with the partial vertical data block to obtain a processed basic data block; and the processed basic data block and The partial vertical data block performs an inner product operation.
- the main processing circuit is specifically configured to send the vertical data block (specifically, the vertical data block or the processed vertical data block) to the basic processing circuit connected thereto by one broadcast.
- the basic processing circuit is specifically configured to receive the vertical data block (specifically, the vertical data block or the processed vertical data block) sent by the main processing circuit connected thereto by one broadcast.
- the basic processing circuit is specifically configured to perform inner product processing on the basic data block (which may be the basic data block or the processed basic data block) and the vertical data block to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and transmit the operation result to the main processing circuit.
- the main processing circuit is configured to, when the operation result is an inner product processing result, accumulate the operation result to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be calculated and the operation instruction.
- the main processing circuit is specifically configured to divide the vertical data block into a plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks to the basic processing circuit; the plurality of partial vertical data blocks are combined to form the vertical data block.
- the basic processing circuit is specifically configured to perform an inner product operation on the partial vertical data block (specifically, a partial vertical data block or a processed partial vertical data block) and the basic data block to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main processing circuit.
- for example, the basic data block here is a 3*3 kernel and the partial vertical data block is a 3*3 matrix; a multiplication of corresponding positions is performed between each row of the 3*3 matrix and the matching row of the 3*3 kernel, so three inner product processing results are obtained, and the three inner product processing results are accumulated to obtain a partial operation result.
- the three inner product processing results are Out0 (the inner product of the 0th row of the 3*3 matrix and the 0th row of the 3*3 kernel), Out1 (the inner product of the 1st row of the 3*3 matrix and the 1st row of the 3*3 kernel), and Out2 (the inner product of the 2nd row of the 3*3 matrix and the 2nd row of the 3*3 kernel). In the notation r00 and k0[0], r denotes the partial vertical data block, 00 denotes the element in the 0th row and 0th column, k denotes the basic data block, and 0[0] denotes the element in the 0th row and 0th column.
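The worked example above can be written out as a short sketch (the concrete values of `r` and `k` are ours, chosen only to make the arithmetic visible; the disclosure does not fix them):

```python
# Sketch of the 3*3 example: each row of the partial vertical data block r is
# multiplied position-wise with the matching row of the kernel k and summed,
# giving Out0, Out1, Out2; accumulating them yields one partial operation result.

r = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]          # partial vertical data block (3*3 matrix)
k = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1]]          # basic data block (3*3 kernel)

# outs[i] is the inner product of row i of r with row i of k
outs = [sum(r[i][j] * k[i][j] for j in range(3)) for i in range(3)]
partial_result = sum(outs)  # accumulate the three inner product results
```

With these values `outs` is `[4, 5, 16]` and the partial operation result is 25.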
- the basic processing circuit is specifically configured to multiplex the partial vertical data block n times, performing inner product operations between the partial vertical data block and n basic data blocks to obtain n inner product processing results; accumulate the n inner product processing results respectively to obtain n partial operation results; and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
- for example, the basic data blocks here are p 3*3 kernels and the partial vertical data block is a 3*3 matrix; the 3*3 matrix is multiplexed p times to perform p corresponding-position multiplications with the p kernels, that is, p groups of inner product results are obtained, each group containing three inner product results; the three inner product results of each of the p groups are accumulated to obtain p partial operation results.
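A minimal sketch of this p-kernel reuse, with p = 2 and invented values (the tile and kernel contents are illustrative only):

```python
# Sketch of the p-kernel example: one 3*3 partial vertical data block is
# multiplexed against p different 3*3 kernels; each group of three row-wise
# inner products is accumulated into one of p partial operation results.

def row_inner_products(mat, kernel):
    # inner product of row i of mat with row i of kernel, for each row
    return [sum(m * w for m, w in zip(mrow, krow))
            for mrow, krow in zip(mat, kernel)]

tile = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]   # partial vertical data block, reused
kernels = [                                 # p = 2 basic data blocks
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]

partial_results = []
for kern in kernels:                        # the tile is multiplexed p times
    group = row_inner_products(tile, kern)  # one group of three results
    partial_results.append(sum(group))      # accumulated per group
```

Here `partial_results` comes out as `[6, 18]`, one accumulated result per kernel.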
- the main processing circuit includes: a main register or a main on-chip buffer circuit;
- the basic processing circuit includes a basic register or a basic on-chip buffer circuit.
- the main processing circuit comprises one or any combination of: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a first mapping circuit, and a data rearrangement circuit.
- the basic processing circuit is further configured to forward the vertical data block and the basic data block to other basic processing circuits, which perform data processing and then an inner product operation to obtain an operation result, and send the operation result to the main processing circuit;
- the main processing circuit is configured to process the operation result to obtain the data block to be calculated and the instruction result of the operation instruction.
- the data block may be represented by a tensor, which may specifically be: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
- when the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is a vertical data block and the multiplicand data block is a horizontal data block; when the operation instruction is a convolution instruction, the main processing circuit determines that the convolution input data block is a vertical data block and the convolution kernel is a horizontal data block.
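As a hypothetical sketch of this role assignment (the opcode strings and the function name are invented for illustration, not taken from the disclosure): the operand that will be reused by every basic processing circuit becomes the vertical (broadcast) data block, while the operand that gets split up becomes the horizontal (distributed) data block.

```python
# Hypothetical sketch: the main processing circuit assigns operand roles
# per instruction type before dispatch.

def assign_roles(opcode, a, b):
    if opcode == "MUL":     # multiplier broadcast, multiplicand distributed
        return {"vertical": a, "horizontal": b}
    if opcode == "CONV":    # convolution input broadcast, kernel distributed
        return {"vertical": a, "horizontal": b}
    raise ValueError(f"unsupported operation instruction: {opcode}")

roles = assign_roles("MUL", "multiplier_block", "multiplicand_block")
```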
- the operation of the neural network includes one or any combination of the following: a convolution operation, a matrix-multiplying-matrix operation, a matrix-multiplying-vector operation, an offset operation, a full connection operation, a GEMM operation, a GEMV operation, and an activation operation.
- the operation instructions according to the present disclosure include, but are not limited to, a convolution instruction, a multiplication instruction, a forward operation instruction, and a training instruction.
- the inner product operation referred to above may specifically be the operation indicated by the operation instruction: when the operation instruction is a convolution instruction, the inner product operation above is a convolution operation; when the operation instruction is a multiplication instruction, it is a multiplication operation; when the operation instruction is a forward operation instruction, it is a forward operation; and when the operation instruction is a training instruction, it is a reverse operation.
- the main processing circuit is configured to acquire an input data block, a convolution kernel data block, and a convolution instruction; divide the input data block into a vertical data block and the convolution kernel data block into a horizontal data block according to the convolution instruction; determine, according to the operation control of the convolution instruction, whether to start the first mapping circuit to process the first data block to obtain the processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send the processed first data block to at least one basic processing circuit of the basic processing circuits connected to the main processing circuit according to the convolution instruction;
- the plurality of basic processing circuits are configured to determine, according to the operation control of the convolution instruction, whether to start the second mapping circuit to process the second data block, execute the operation in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, the second data block being associated with the processed first data block;
- the main processing circuit is configured to process the operation result to obtain an instruction result of the convolution instruction.
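The dispatch just described can be sketched as follows; this is an illustrative model under our own assumptions (flattened 1-D tiles, one kernel per basic processing circuit), not the device's actual circuitry:

```python
# Hypothetical sketch: the convolution kernels (horizontal data block) are
# split and distributed, the input tile (vertical data block) is broadcast,
# and each basic processing circuit returns one inner-product result.

def basic_circuit(kernel, tile):
    # one basic processing circuit: inner product of its kernel and the tile
    return sum(k * t for k, t in zip(kernel, tile))

input_tile = [1.0, 2.0, 3.0, 4.0]       # broadcast to every circuit
kernels = [[1, 0, 0, 0],                # distributed, one per circuit
           [0, 1, 0, 0],
           [0.5, 0.5, 0, 0]]

# the main processing circuit gathers the parallel operation results
results = [basic_circuit(kern, input_tile) for kern in kernels]
```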
- the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining the processed horizontal data block and the identification data block associated with the horizontal data block, and the processed vertical data block and the identification data block associated with the vertical data block;
- perform split processing on the horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distribute the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected thereto, and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
- the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block, process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block, perform a convolution operation on the processed vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit;
- the main processing circuit is configured to process the operation result to obtain the instruction result.
- the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block, obtaining the processed horizontal data block and the identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block, obtaining the processed horizontal data block.
- the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block, perform a convolution operation on the processed vertical data block and the processed basic data block to obtain an operation result, and transmit the operation result to the main processing circuit.
- the main processing circuit is further configured to split the vertical data block or the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the respective partial vertical data blocks, and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
- the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data block to obtain a processed partial vertical data block and a processed basic data block; and perform a convolution operation on the processed partial vertical data block and the processed basic data block.
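One plausible reading of the second mapping circuit, sketched below, is that each identification data block marks the non-zero elements of its data block, so the element-wise AND of the two masks gives the connection identification data block, and both operands can be compressed to the positions where both actually contribute before the inner product. This interpretation and all variable names are our assumption, not stated by the disclosure:

```python
# Hypothetical sketch: identification data blocks as non-zero masks; the
# second mapping circuit ANDs them into a connection identification data
# block and compresses both operands before the inner product.

part_vertical = [5, 0, 7, 0]
mask_vertical = [1, 0, 1, 0]   # identification data block of the partial block
basic = [0, 3, 2, 4]
mask_basic = [0, 1, 1, 1]      # identification data block of the basic block

# connection identification data block: positions non-zero in both operands
connection = [a & b for a, b in zip(mask_vertical, mask_basic)]

# processed operands: keep only the surviving positions
pv_proc = [v for v, c in zip(part_vertical, connection) if c]
b_proc = [v for v, c in zip(basic, connection) if c]

result = sum(x * y for x, y in zip(pv_proc, b_proc))  # inner product
```

With these values only one position survives, so the inner product touches a single multiply instead of four, which is the point of the mapping circuits.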
- the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block, obtaining the processed vertical data block and the identification data block associated with the vertical data block, or to start the first mapping circuit to process the vertical data block according to the pre-stored identification data block associated with the vertical data block, obtaining the processed vertical data block; perform split processing on the horizontal data block to obtain a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected thereto, and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
- the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block, to obtain a processed basic data block; perform an inner product operation on the vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit.
- the main processing circuit is configured to acquire an input data block, a weight data block, and a multiplication instruction; divide the input data block into a horizontal data block and the weight data block into a vertical data block according to the multiplication instruction; determine, according to the operation control of the multiplication instruction, whether to start the first mapping circuit to process the first data block to obtain the processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send the processed first data block to at least one basic processing circuit of the basic processing circuits connected to the main processing circuit according to the multiplication instruction;
- the plurality of basic processing circuits are configured to determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process the second data block, perform the operation in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, the second data block being associated with the processed first data block;
- the main processing circuit is configured to process the operation result to obtain an instruction result of the multiplication instruction.
- the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining the processed horizontal data block and the identification data block associated with the horizontal data block, and the processed vertical data block and the identification data block associated with the vertical data block;
- perform split processing on the horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distribute the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected thereto, and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
- the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block, process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block, perform a multiplication operation on the processed vertical data block and the processed basic data block to obtain an operation result, and transmit the operation result to the main processing circuit;
- the main processing circuit is configured to process the operation result to obtain the instruction result.
- the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block, obtaining the processed horizontal data block and the identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block, obtaining the processed horizontal data block.
- the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block, perform a multiplication operation on the processed vertical data block and the processed basic data block to obtain an operation result, and transmit the operation result to the main processing circuit.
- the main processing circuit is further configured to split the vertical data block or the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the respective partial vertical data blocks, and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
- the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data block to obtain a processed partial vertical data block and a processed basic data block; and perform a multiplication operation on the processed partial vertical data block and the processed basic data block.
- the main processing circuit is configured to receive a forward operation instruction and parse the forward operation instruction to obtain a first operation instruction included in the i-th layer of the neural network forward operation and the input data block and weight data block required by the first operation instruction, where i is an integer greater than or equal to 1 and less than or equal to n; if i is greater than or equal to 2, the input data block is the output data block of the (i-1)-th layer;
- the main processing circuit is further configured to divide the input data block into a vertical data block and the weight data block into a horizontal data block according to the first operation instruction; determine, according to the operation control of the first operation instruction, whether to start the first mapping circuit to process the first data block to obtain the processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send the processed first data block to at least one basic processing circuit of the basic processing circuits connected to the main processing circuit according to the forward operation instruction;
- the plurality of basic processing circuits are configured to determine, according to the operation control of the first operation instruction, whether to start the second mapping circuit to process the second data block, execute the operation in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, the second data block being associated with the processed first data block;
- the main processing circuit is configured to process the operation result to obtain an instruction result of the first operation instruction, and complete an operation of the first operation instruction included in the ith layer.
- the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining the processed horizontal data block and the identification data block associated with the horizontal data block, and the processed vertical data block and the identification data block associated with the vertical data block;
- perform split processing on the horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distribute the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected thereto, and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
- the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block, process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block, perform a forward operation on the processed vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit; wherein the forward operation includes, but is not limited to, one or any combination of the following: a convolution operation (i.e., an inner product operation), a product operation, an offset operation, a full connection operation, a GEMM operation, a GEMV operation, and an activation operation;
- the main processing circuit is configured to process the operation result to obtain the instruction result.
- the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block, obtaining the processed horizontal data block and the identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block, obtaining the processed horizontal data block.
- the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block, perform a forward operation on the processed vertical data block and the processed basic data block to obtain an operation result, and transmit the operation result to the main processing circuit.
- the main processing circuit is further configured to split the vertical data block or the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the respective partial vertical data blocks, and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
- the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data block to obtain a processed partial vertical data block and a processed basic data block; and perform a forward operation on the processed partial vertical data block and the processed basic data block.
- the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block, obtaining the processed vertical data block and the identification data block associated with the vertical data block, or to start the first mapping circuit to process the vertical data block according to the pre-stored identification data block associated with the vertical data block, obtaining the processed vertical data block; perform split processing on the horizontal data block to obtain a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected thereto, and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
- the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block, to obtain a processed basic data block; perform a forward operation on the vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit.
- the main processing circuit is further configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the partial vertical data blocks, and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
- the basic processing circuit is configured to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and perform a forward operation on the processed basic data block and the partial vertical data block.
- the operation of the i-th layer further comprises: one of an offset operation, a full connection operation, a GEMM operation, a GEMV operation, and an activation operation, or any combination thereof.
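The layer chaining in the forward operation (the output data block of layer i-1 becoming the input data block of layer i) can be sketched as follows. The layer contents, the use of ReLU, and the weight values are all illustrative assumptions; the disclosure only requires that each layer's operation be one of the listed kinds:

```python
# Hypothetical sketch of the n-layer forward operation: each layer here is a
# full connection operation followed by an activation operation (ReLU), and
# the output of layer i-1 feeds layer i.

def fully_connected(x, weights):
    # one output per weight row: inner product of the input with that row
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

layers = [  # weight group data per layer (n = 2)
    [[1.0, -1.0], [0.5, 0.5]],
    [[2.0, 0.0], [0.0, 2.0]],
]

out = [1.0, 2.0]          # first-layer input data
for w in layers:          # layer i consumes the output of layer i-1
    out = relu(fully_connected(out, w))
```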
- the integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform an n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operation;
- the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and obtain the nth reverse operation instruction of the nth-layer reverse operation and the data blocks required by the nth reverse operation instruction according to the training instruction; determine whether to start the first mapping circuit to process the first data block to obtain the processed first data block, where the first data block includes the horizontal data block and/or the vertical data block; and send the processed first data block to at least one basic processing circuit of the basic processing circuits connected to the main processing circuit according to the nth reverse operation instruction;
- the plurality of basic processing circuits are configured to determine, according to the operation control of the nth reverse operation instruction, whether to start the second mapping circuit to process the second data block, execute the operation in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, the second data block being associated with the processed first data block;
- the main processing circuit is further configured to process the operation result to obtain an nth layer weight group gradient and an nth layer input data gradient, and apply the nth layer weight group gradient to update the nth layer weight group data;
- the integrated circuit chip device is further configured to use the nth-layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer and perform (n-1) layers of inverse operations to obtain the weight group gradient of each of those layers, and to update the weight group data of the corresponding layer by applying its weight group gradient; the weight group data includes at least two weights.
- the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining the processed horizontal data block and the identification data block associated with the horizontal data block, and the processed vertical data block and the identification data block associated with the vertical data block;
- the main processing circuit performs split processing on the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; it distributes the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected to it, and broadcasts the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected to it;
- the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; to process the vertical data block and the basic data block according to the connection identification data block, obtaining a processed vertical data block and a processed basic data block; to perform an inverse operation on the processed vertical data block and the processed basic data block to obtain an operation result; and to send the operation result to the main processing circuit. The inverse operation includes, but is not limited to, one or any combination of the following: a convolution operation (i.e., an inner product operation), a product operation, a bias (offset) operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation;
- the main processing circuit is configured to process the operation result to obtain the instruction result.
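The split-distribute-broadcast pattern described above (the main processing circuit splits the horizontal data block into basic data blocks, distributes them, broadcasts the vertical data block, and collects the results) can be sketched roughly as follows. This is a minimal Python illustration, not the patent's circuit: the row-interleaved tiling, the unit count, and all function names are assumptions.

```python
# Hedged sketch: distribute row tiles of the horizontal block, broadcast the
# vertical block, let each "basic processing circuit" compute inner products.
def split_horizontal(horizontal, num_units):
    """Split the horizontal data block (a list of rows) among basic units."""
    return [horizontal[i::num_units] for i in range(num_units)]

def basic_unit_multiply(basic_block, vertical):
    """One basic processing circuit: inner products of its rows with the
    columns of the broadcast vertical data block."""
    cols = list(zip(*vertical))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in basic_block]

def main_unit_matmul(horizontal, vertical, num_units=4):
    tiles = split_horizontal(horizontal, num_units)               # distribute
    partials = [basic_unit_multiply(t, vertical) for t in tiles]  # in parallel
    # Reassemble rows in original order (inverse of the i::num_units split).
    result = [None] * len(horizontal)
    for u, tile_result in enumerate(partials):
        for j, row in enumerate(tile_result):
            result[u + j * num_units] = row
    return result
```

On real hardware the per-tile computations run concurrently on separate circuits; the list comprehension over `tiles` only models that the tiles are independent.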
- the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block, obtaining the processed horizontal data block and the identification data block associated with the horizontal data block; or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block, obtaining the processed horizontal data block.
- the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, obtaining a processed vertical data block; and to perform an inverse operation on the processed vertical data block and the processed basic data block to obtain an operation result, and transmit the operation result to the main processing circuit.
- the main processing circuit is further configured to split the vertical data block, or the processed vertical data block and the identification data block associated with it, to obtain a plurality of partial vertical data blocks and the identification data blocks associated with each of the plurality of partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
- the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; to process the partial vertical data block and the basic data block according to the connection identification data block, obtaining a processed partial vertical data block and a processed basic data block; and to perform an inverse operation on the processed partial vertical data block and the processed basic data block.
- the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block, obtaining the processed vertical data block and the identification data block associated with the vertical data block, or to start the first mapping circuit to process the vertical data block according to the pre-stored identification data block associated with the vertical data block, obtaining the processed vertical data block; to perform split processing on the horizontal data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the basic processing circuits connected to it; and to broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected to it;
- the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block, obtaining a processed basic data block; and to perform a reverse operation on the vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit.
- the main processing circuit is further configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the plurality of partial vertical data blocks, and to broadcast the partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
- the basic processing circuit is configured to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and to perform an inverse operation on the processed basic data block and the partial vertical data block.
- the inverse operations of the n layers further include one or any combination of: a bias (offset) operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
- the nth output result gradient is: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
- the nth layer input data may be represented by a tensor, which may specifically be: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
- the n-layer weight group data may be represented by a tensor, which may specifically be: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
- FIG. 1a shows an integrated circuit chip device according to the present disclosure.
- the integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits, the plurality of basic processing circuits being arranged in an m*n array, where m and n are integers greater than or equal to 1 and at least one of m and n is greater than or equal to 2.
- each of the basic processing circuits is connected to its adjacent basic processing circuits, and the main processing circuit is connected to k of the plurality of basic processing circuits; the k basic processing circuits may be: the n basic processing circuits in the first row, the n basic processing circuits in the mth row, and the m basic processing circuits in the first column.
- the main processing circuit includes a first mapping circuit for compressing data to obtain processed data and identification data.
- the identification data is used to indicate whether the absolute value of the data is greater than a first threshold.
- the main processing circuit may only send the processed data (specifically, the data whose absolute value is greater than the first threshold) and the identification data associated with the data to the basic processing circuit.
- the advantage of this is that the amount of data sent to the basic processing circuit is reduced and the data processing rate is improved.
- the first threshold is customized on the user side or the device side, for example 0.05 or 0.5, and is not limited here.
- For example, the input data of the main processing circuit is a matrix data block; after processing, the processed matrix data block and the identification data block associated with the matrix data block can be obtained. The specific processing of the first mapping circuit will be described in detail later.
- When the main processing circuit distributes data to the basic processing circuit, only the two data 1 and 0.5 may be transmitted rather than all eight data of the processed matrix data block; the identification data block associated with the matrix data block must also be sent to the basic processing circuit, so that the basic processing circuit can determine, from the received identification data block and the two received data (1 and 0.5), the positions of those two data in the original matrix data block. That is, the basic processing circuit can restore the processed matrix data block of the main processing circuit from the received identification data block and the received data.
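The compress-then-restore exchange in this example can be sketched as follows. This is a minimal Python illustration under assumed semantics (a flat list layout and a 0/1 mask); the function names are inventions for this sketch, not the patent's circuit. The 8-element block with surviving values 1 and 0.5 mirrors the example above.

```python
# Hedged sketch of the first mapping circuit's compression: keep only values
# whose absolute value exceeds the first threshold, record a 0/1 mask.
def compress(block, threshold):
    """Return (target_data, identification_mask) for a flat data block."""
    mask = [1 if abs(x) > threshold else 0 for x in block]
    target = [x for x, m in zip(block, mask) if m]
    return target, mask

def restore(target, mask):
    """Basic-processing-circuit side: rebuild the processed block from the
    received target data and identification (mask) block."""
    it = iter(target)
    return [next(it) if m else 0 for m in mask]

block = [0, 1, 0, 0.5, 0, 0, 0, 0]   # 8 data; only 1 and 0.5 survive
target, mask = compress(block, threshold=0.05)
assert target == [1, 0.5]            # only two values are transmitted
assert restore(target, mask) == block
```

Transmitting two values plus a one-bit-per-element mask instead of eight values is the data reduction the passage describes.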
- At least one of the plurality of basic processing circuits may include a second mapping circuit.
- a part of the basic processing circuits may include a second mapping circuit.
- For example, the k basic processing circuits may be configured with the second mapping circuit, so that each of the n basic processing circuits of the first row may be responsible for the data compression step on behalf of the m basic processing circuits of its column. This arrangement can improve operation efficiency and reduce power consumption: because the n basic processing circuits of the first row receive the data transmitted by the main processing circuit first, compressing the received data there reduces the calculation amount of the subsequent basic processing circuits and the amount of data transmitted to them.
- Similarly, configuring the second mapping circuit for the m basic processing circuits of the first column also offers a small calculation amount and low power consumption.
- In addition, the main processing circuit can adopt a dynamic data transmission strategy; for example, the main processing circuit broadcasts data to the m basic processing circuits of the first column and distributes data to the n basic processing circuits of the first row.
- the specific processing regarding the second mapping circuit will be described later in detail.
- the main processing circuit is configured to perform each continuous operation in the neural network operation and to transmit data to the basic processing circuits connected to it; the continuous operations include but are not limited to: accumulation operations, ALU operations, activation operations, and the like.
- the plurality of basic processing circuits are configured to perform an operation in the neural network in a parallel manner according to the transmitted data, and transmit the operation result to the main processing circuit through a basic processing circuit connected to the main processing circuit.
- the above-described parallel execution of operations in the neural network includes, but is not limited to, inner product operations, matrix or vector multiplication operations, and the like.
- the main processing circuit may include a data transmitting circuit and a data receiving circuit or interface; the data transmitting circuit may integrate a horizontal data distributing circuit and a vertical data distributing circuit, and the horizontal data distributing circuit and the vertical data distributing circuit may also be set separately.
- For horizontal data, that is, data that needs to be sent to each of the basic processing circuits in the row direction (or horizontal direction):
- the horizontal data is transmitted to the basic processing circuits in any one or more of the m rows shown in FIG. 1a.
- For vertical data, that is, data that needs to be sent to part of the basic processing circuits in the column direction (or vertical direction): for example, in a convolution operation, the convolution input data needs to be sent to all the basic processing circuits.
- The specific manner in which the horizontal data is sent to the basic processing circuits can be determined by the main processing circuit according to the load and other allocation strategies.
- The data can be sent to each basic processing circuit in broadcast form.
- (Horizontal/vertical data may be transmitted to each basic processing circuit by a single broadcast, or by multiple broadcasts; the embodiments of the present disclosure do not limit the number of broadcasts.)
- The main processing circuit can also selectively send data to only a part of the basic processing circuits.
- the main processing circuit may include a register and/or an on-chip buffer circuit, and may further include: a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and other circuits; of course, in practical applications, the main processing circuit may also add other circuits such as a conversion circuit (e.g., a matrix transposition circuit), a data rearrangement circuit, or an activation circuit.
- Each of the basic processing circuits may include a base register and/or a base on-chip buffer circuit; each of the base processing circuits may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like.
- the inner product operator circuit, the vector operator circuit, and the accumulator circuit may be integrated into one circuit, or may be separately provided circuits.
- the accumulator circuits of the n basic processing circuits of the mth row can perform the accumulation operation of the inner product operation, because each mth-row basic processing circuit can receive the products of all the basic processing circuits of its column.
- Performing the accumulation operation of the inner product operation through the n basic processing circuits of the mth row allows computing resources to be allocated effectively and saves power consumption.
- This technical solution is especially suitable for cases where m is large.
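A minimal sketch of the column-wise accumulation just described, under the assumption that each circuit in a column forwards its inner-product partial result one hop downward so that the mth-row circuit performs the sum (the function name and list representation are illustrative only):

```python
# Hedged sketch: partial products flow down one column of the array; the
# circuit in the m-th row accumulates them into the final inner-product term.
def column_accumulate(partial_products):
    """partial_products: the values produced by the m circuits of one column,
    in top-to-bottom order."""
    acc = 0
    for p in partial_products:   # each hop forwards the running value downward
        acc += p                 # the m-th row circuit holds the final sum
    return acc
```

Concentrating the accumulation in one row per column means only m adders per column need to carry running sums, which is the resource-allocation point made above.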
- the main processing circuit can allocate which circuits perform the compression, either explicitly or implicitly. In the explicit mode, the main processing circuit can be configured with a special instruction or indication: when the basic processing circuit receives the special indication or instruction, it performs data compression processing; if the basic processing circuit does not receive a special indication or instruction, it does not perform data compression processing. Alternatively, this can be done in a suggestive (implicit) manner; for example, if the basic processing circuit receives sparse data (that is, data including zeros, or including more than a preset amount of data smaller than a preset threshold) and determines that an inner product operation needs to be performed, it compresses the sparse data.
- the special instruction or indication may be configured with a descending sequence; the descending sequence is decremented by one each time it passes through a basic processing circuit. The basic processing circuit reads the value of the descending sequence: if the value is greater than zero, it performs data compression processing; if the value is equal to or less than zero, it does not perform data compression processing.
- This setting is configured according to the basic processing circuits allocated in the array. For example, for the m basic processing circuits of the i-th column, if the main processing circuit requires the first five basic processing circuits to perform data compression processing, it issues a special instruction containing a descending sequence whose initial value is 5. The value of the descending sequence is decremented by 1 each time it passes through a basic processing circuit, so at the 5th basic processing circuit the value is 1, and at the 6th basic processing circuit the value is 0; at that point the 6th basic processing circuit does not perform data compression processing. In this way, the main processing circuit can dynamically configure which basic processing circuits perform data compression.
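The descending-sequence mechanism can be sketched as follows. Only the decrement-per-circuit rule and the "value greater than zero means compress" rule come from the text; the counter encoding and the function name are assumptions for this sketch:

```python
# Hedged sketch of the descending-sequence instruction: each basic processing
# circuit reads the counter, decides, then the counter is decremented before
# reaching the next circuit.
def propagate(initial_value, num_units):
    """Return, per circuit in order, whether it performs data compression."""
    value = initial_value
    decisions = []
    for _ in range(num_units):
        decisions.append(value > 0)   # compress only while the value is > 0
        value -= 1                    # decremented at each circuit passed
    return decisions
```

With an initial value of 5 over a column of 8 circuits, the first five compress and the sixth onward do not, matching the example in the text.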
- An embodiment of the present disclosure provides an integrated circuit chip device including a main processing circuit (also referred to as a main unit) and a plurality of basic processing circuits (also referred to as a base unit); the structure of the embodiment is as shown in FIG. 1b.
- the dotted line frame is the internal structure of the neural network computing device;
- the gray filled arrow indicates the data transmission path between the main processing circuit and the basic processing circuit array, and
- the hollow arrows indicate the data transmission paths between the basic processing circuits in the basic processing circuit array (that is, between adjacent basic processing circuits).
- the length and width of the basic processing circuit array may be different, that is, the values of m and n may be different, or of course they may be the same; the present disclosure does not limit the specific values of m and n.
- the circuit structure of the basic processing circuit is shown in Figure 1c; the dotted line in the figure indicates the boundary of the basic processing circuit, and the thick arrows crossing the dotted frame indicate data input and output channels (an arrow pointing into the dotted frame is an input channel, and an arrow pointing out of it is an output channel); the rectangular boxes in the dashed frame indicate memory cell circuits (registers and/or on-chip buffers), holding input data 1, input data 2, multiplication or inner product results, and accumulated data; the diamond boxes indicate operator circuits, including a multiplication or inner product operator and an adder.
- the neural network computing device includes a main processing circuit and 16 basic processing circuits (16 basic processing circuits are for illustrative purposes only, and other values may be used in practical applications);
- the basic processing circuit has two data input interfaces and two data output interfaces; in the subsequent description of this example, the horizontal input interface (the horizontal arrow pointing to the unit in FIG. 1b) is referred to as input 0, and the vertical input interface (the vertical arrow pointing to the unit in FIG. 1b) is referred to as input 1; each horizontal data output interface (the horizontal arrow pointing away from the unit in FIG. 1b) is referred to as output 0, and each vertical data output interface (the vertical arrow pointing away from the unit in FIG. 1b) is referred to as output 1.
- each basic processing circuit can be respectively connected to different units, including a main processing circuit and other basic processing circuits;
- the input 0 of the four basic processing circuits 0, 4, 8, 12 (numbered as shown in FIG. 1b) is connected to the data output interface of the main processing circuit;
- the input 1 of the four basic processing circuits of the basic processing circuits 0, 1, 2, 3 is connected to the data output interface of the main processing circuit;
- the output 1 of the four basic processing circuits of the basic processing circuits 12, 13, 14, 15 is connected to the data input interface of the main processing circuit;
- connection of the output interface of the basic processing circuit to the input interfaces of other basic processing circuits is shown in Figure 1b, and will not be enumerated one by one;
- if the output interface S1 of unit S is connected to the input interface P1 of unit P, unit P can receive, from its P1 interface, the data that unit S sends to its S1 interface.
- the embodiment includes a main processing circuit connected with an external device (having both an input interface and an output interface); a part of the data output interfaces of the main processing circuit are connected with the data input interfaces of a part of the basic processing circuits, and a part of the data input interfaces of the main processing circuit are connected to the data output interfaces of a part of the basic processing circuits.
- the data involved in the usage method provided by the present disclosure may be compressed data.
- the data in the present application may be input neurons or weights in a neural network, which may be matrix data or vector data, etc.; this is not limited here. That is, the data or data blocks set forth below in this application may be input neurons or weights in a neural network, embodied in the form of a matrix or a vector.
- the data compression processing involved in the present application is performed in the first mapping circuit and the second mapping circuit described above. It should be understood that since a neural network is an algorithm with high computational complexity and high memory access, the more weights there are, the larger the calculation amount and the memory access amount become. In particular, when a weight is small (for example, 0, or less than a set value), compressing such small-weight data increases the calculation rate and reduces overhead. In practical applications, data compression processing is most effective in sparse neural networks, for example in reducing the workload of data calculation, reducing data overhead, and increasing the data calculation rate.
- the input data includes, but is not limited to, at least one input neuron and/or at least one weight.
- After the first mapping circuit receives the first input data (specifically, a data block to be calculated sent by the main processing circuit, such as a horizontal data block or a vertical data block), the first mapping circuit may process the first input data to obtain the processed first input data and the identification (mask) data associated with the first input data; the mask data is used to indicate whether the absolute value of the first input data is greater than a first threshold, such as 0.5 or 0.
- For example, suppose the input matrix data block is processed by the first mapping circuit with a first threshold of 0.05; the processed matrix data block and the identification data block (also referred to as the mask matrix) associated with the matrix data block can then be obtained.
- In order to increase the data transmission rate, only the target data in the processed matrix data block (in this example, 1, 0.06, and 0.5) and the identification data block associated with the matrix data block may be transmitted.
- the main processing circuit may distribute the target data in the processed matrix data block to the basic processing circuit according to a setting rule, for example, sequentially transmitting in a row order or sequentially in a column order, etc., the present application Not limited.
- the basic processing circuit restores the target data block to the processed matrix data block according to a setting rule (for example, a row order).
- That is, the basic processing circuit can determine, from the received data (1, 0.06, and 0.5) and the received identification data block, the matrix data block to which those data correspond (namely, the matrix data block processed by the first mapping circuit in the main processing circuit).
- the first input data may be a horizontal data block and/or a vertical data block.
- the second mapping circuit can process the second input data by using the identification data associated with the first input data, thereby obtaining the processed second input data; wherein the first input data is different from the second input data.
- when the first input data is at least one weight, the second input data may be at least one input neuron; or, when the first input data is at least one input neuron, the second input data may be at least one weight.
- the second input data is different from the first input data, and the second input data may be any one of the following: a horizontal data block, a basic data block, a vertical data block, and a partial vertical data block.
- For example, the second input data is a partial vertical data block, given in the form of a matrix data block; after processing, the processed partial vertical data block is obtained. Since the dimensions of the matrix data blocks involved in input data are large in practical applications, the examples in the present application are merely illustrative and not limiting.
- the first mapping circuit is configured to process the first input data and the second input data to obtain the processed first input data and the first identifier mask data associated with the first input data, and the processed second Input data and second identification mask data associated with the second input data.
- the first mask data or the second mask data is used to indicate whether the absolute value of the first or second input data is greater than a second threshold; the second threshold is customized by the user side or the device side, for example 0.05 or 0.
- the processed first input data or second input data may be processed input data or may be input data before processing.
- For example, the first input data is a horizontal data block, such as the matrix data block in the above example; after processing, the processed horizontal data block is obtained, and the processed horizontal data block may simply be the original matrix data block.
- In contrast, the processed input data referred to below (such as a processed basic data block or a processed partial vertical data block) is data that has undergone compression processing.
- the data sent by the main processing circuit to the basic processing circuit may be the target data in the processed input data, and the target data may be data with an absolute value greater than a preset threshold, or may be non-zero. Data and more.
- the second mapping circuit may obtain connection identification data according to the first identification data associated with the first input data and the second identification data associated with the second input data; the connection identification data is used to indicate the positions at which the absolute values of both the first input data and the second input data are greater than a third threshold, where the third threshold is customized by the user side or the device side, such as 0.05 or 0. Further, the second mapping circuit may process the received first input data and second input data respectively according to the connection identification data, thereby obtaining the processed first input data and the processed second input data.
- For example, the first input data is a matrix data block and the second input data is also a matrix data block. Through the first mapping circuit, the first identification data block associated with the first input data and the processed first input data block are obtained, and correspondingly the second identification data block associated with the second input data and the processed second input data block are obtained.
- Correspondingly, in order to increase the data transmission rate, the main processing circuit may send to the basic processing circuit only the target data 1, 0.06, and 0.5 in the processed first input data block together with the first identification data block associated with the first input data block; at the same time, the target data 1, 1.1, 0.6, 0.3, and 0.5 in the processed second input data block and the second identification data block associated with the second input data block are sent to the basic processing circuit.
- After receiving them, the basic processing circuit may perform the element-by-element operation on the first identification data block and the second identification data block by using the second mapping circuit, obtaining the connection identification data block; the second mapping circuit then processes the processed first input data block and the processed second input data block by using the connection identification data block, so that the final processed first input data block and processed second input data block are obtained.
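A hedged sketch of the connection identification step: the element-by-element operation on the two mask blocks is assumed here to be a logical AND (keep only positions significant in both operands), which matches the intent of the example but is an interpretation, not a quotation; the mask values and function names are illustrative.

```python
# Hedged sketch of the second mapping circuit: combine two identification
# (mask) blocks, then re-filter each operand's target data to the combined mask.
def connection_mask(mask_a, mask_b):
    """Assumed element-by-element operation: logical AND of the two masks."""
    return [a & b for a, b in zip(mask_a, mask_b)]

def apply_mask(target, own_mask, conn_mask):
    """Re-filter one operand's target data down to the connection mask."""
    it = iter(target)
    out = []
    for m_own, m_conn in zip(own_mask, conn_mask):
        v = next(it) if m_own else 0   # rebuild this position's value
        if m_conn:
            out.append(v)              # keep only jointly significant positions
    return out

mask1 = [1, 0, 1, 1]          # first identification data block (assumed)
mask2 = [1, 1, 0, 1]          # second identification data block (assumed)
conn = connection_mask(mask1, mask2)
# Each operand shrinks further: e.g. target data 1, 0.06, 0.5 of the first block
# reduces to the values at positions kept by BOTH masks.
```

Only elements that are nonzero in both operands contribute to an inner product, which is why intersecting the two masks before the multiply removes work without changing the result.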
- the basic processing circuit is configured to determine, according to the first identification data block and the target data in the received first data block, the first data block corresponding to the target data (that is, the first data block processed by the first mapping circuit).
- In another embodiment, the first mapping circuit is not disposed in the main processing circuit; instead, the main processing circuit may send the third input data and the pre-stored third identification data associated with the third input data to the basic processing circuit connected to it, and a second mapping circuit is provided in the basic processing circuit. A specific embodiment of the data compression processing involved in the second mapping circuit is explained below.
- the third input data includes, but is not limited to, a basic data block, a partial vertical data block, a vertical data block, and the like.
- the third input data may also be at least one weight and/or at least one input neuron, which is not limited herein.
- the second mapping circuit may process the third input data according to the third identification data associated with the received third input data, thereby obtaining the processed third input data, so that related arithmetic operations, such as inner product operations, can subsequently be performed on the processed third input data.
- For example, the third input data received by the second mapping circuit is a matrix data block; the second mapping circuit processes the third input data block according to the third identification data block to obtain the processed third input data block.
- the input neurons and output neurons mentioned in the embodiments of the present invention do not refer to neurons in the input layer of the entire neural network and neurons in the output layer, but to any adjacent two layers in the neural network.
- the neurons in the lower layer of the network feedforward operation are the input neurons
- the neurons in the upper layer of the network feedforward operation are the output neurons.
- for the K-th layer and the K+1-th layer, the K-th layer is called the input layer and the neurons in that layer are the above input neurons; the K+1-th layer is called the output layer and the neurons in that layer are the above-mentioned output neurons. That is, except for the top layer, each layer can be used as an input layer, and the next layer is the corresponding output layer.
- a mapping circuit is not provided in the main processing circuit, and a first mapping circuit and a second mapping circuit are disposed in the basic processing circuit.
- the mapping circuit is not disposed in the basic processing circuit; the first mapping circuit and the second mapping circuit are both disposed in the main processing circuit. The main processing circuit completes the compression processing of the data and sends the processed input data to the basic processing circuit, so that the basic processing circuit performs the corresponding arithmetic operations using the processed input data (specifically, the processed neurons and the processed weights).
- The mapping circuit involved in the present application is described below. Two possible mapping circuits are shown in Figures 5a and 5b.
- the mapping circuit shown in FIG. 5a includes a comparator and a selector; the number of comparators and selectors is not limited in this application.
- a comparator and two selectors are shown in Fig. 5a, wherein the comparator is used to determine whether the input data satisfies a preset condition.
- the preset condition may be customized by the user side or the device side, for example, that the absolute value of the input data is greater than or equal to a preset threshold.
- if the input data satisfies the preset condition, the comparator may determine to allow output of the input data, and the identification data associated with the input data is 1; otherwise, it may determine that the input data is not output, or that the input data is 0 by default, and the identification data associated with the input data is 0. That is, after passing through the comparator, the identification data associated with the input data is known.
- further, the obtained identification data may be input into the selector, so that the selector uses the identification data to determine whether to output the corresponding input data, that is, to obtain the processed input data.
- a comparator can evaluate the preset condition for each data in the matrix data block, so that an identification data block (i.e., a mask matrix) associated with the matrix data block can be obtained. Further, the first selector may filter the matrix data block by using the identification data block: the data in the matrix data block whose absolute value is greater than or equal to a preset threshold (i.e., satisfies the preset condition) is retained, and the remaining data is deleted, so as to output the processed matrix data block.
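- The comparator-and-selector behavior just described can be sketched in software as follows; the threshold value and the sample matrix are illustrative assumptions, not values from this disclosure.

```python
# Sketch of the FIG. 5a mapping circuit (assumed values): a comparator
# derives the identification data block (mask matrix) from the input
# matrix, and a selector uses that mask to keep only the data whose
# absolute value meets the preset threshold.

def comparator(block, threshold):
    """Identification data: 1 where |x| >= threshold, else 0."""
    return [[1 if abs(x) >= threshold else 0 for x in row] for row in block]

def selector(block, mask):
    """Keep values whose mask bit is 1; the rest are output as 0."""
    return [[x if m else 0 for x, m in zip(row, mrow)]
            for row, mrow in zip(block, mask)]

matrix = [[0.9, 0.01], [-0.5, 0.02]]    # illustrative input data block
mask = comparator(matrix, 0.1)          # -> [[1, 0], [1, 0]]
processed = selector(matrix, mask)      # -> [[0.9, 0], [-0.5, 0]]
```

The second selector of FIG. 5a can reuse the same `selector` helper on a different input block, which is how the identification data block is used to process other input data.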
- the identification data block may also be used in the second selector to process other input data (for example, a second matrix data block), for example by performing an element-by-element AND operation, so that the data in the second matrix data block whose absolute value is greater than or equal to the preset threshold is retained, to output the processed second matrix data block.
- the specific structure of the first mapping circuit may include at least one comparator and at least one selector, such as the comparator and the first selector in FIG. 5a in the above example.
- the specific structure of the second mapping circuit may include one or more selectors, such as the second selector of Figure 5a in the above example.
- the mapping circuit includes a selector, and the number of the selectors is not limited, and may be one or multiple.
- the selector is configured to select from the input data, according to the identification data associated with the input data, the data whose absolute value is greater than or equal to a preset threshold for output; the remaining data is deleted/not output, thereby obtaining the processed input data.
- the matrix data block and the identification data block associated with the matrix data block are input to the mapping circuit, and the selector may select, according to the identification data block, the data in the matrix data block whose absolute value is greater than or equal to the preset threshold for output; the remaining data is not output, thereby outputting the processed matrix data block.
- the structure shown in FIG. 5b can be applied to the second mapping circuit in the third embodiment described above; that is, the specific structure of the second mapping circuit in the third embodiment described above may include at least one selector.
- the first mapping circuit and the second mapping circuit designed in the main processing circuit and the basic processing circuit may be cross-combined or split according to the functional components shown in FIG. 5a and FIG. 5b, which is not limited in this application.
- the main processing circuit first enables the first mapping circuit to process the first input data to obtain the processed first input data and the first identification data associated with the first input data; and then the processed first input data and The first identification data associated with the first input data is transmitted to a base processing circuit for operation.
- the main processing circuit can process the data to be calculated (such as a horizontal block/vertical block) and then transmit the data to the basic processing circuit, which has the advantages of reducing the bit width of the transmitted data and reducing the total number of bits transmitted.
- the basic processing circuit performs data operations with smaller bit widths and is also more efficient and consumes less power.
- the basic processing circuit enables the second mapping circuit to process the received second input data by using the first identification data, obtains the processed second input data, and then performs the relevant arithmetic operations on the processed first input data and the processed second input data.
- the basic processing circuit receives the second input data (such as sparse data and vertical data blocks) transmitted by the main processing circuit, and performs compression processing before the operation, which improves operation efficiency and reduces power consumption.
- the main processing circuit may first transmit the first input data (such as a basic data block), the first identification data associated with the first input data, the second input data (such as a partial vertical data block), and the second identification data associated with the second input data to the basic processing circuit for operation.
- the basic processing circuit may first enable the second mapping circuit to obtain the connection identification data block according to the first identification data and the second identification data, and then use the connection identification data to process the first input data and the second input data; the arithmetic operation on the processed first input data and the processed second input data can then be completed in the basic processing circuit, which reduces the amount of data computation, improves operation efficiency, and reduces power consumption.
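- A minimal sketch of this connection-identification step, with illustrative data: the two identification data blocks are ANDed element by element, and the resulting connection mask is applied to both input data blocks before the operation.

```python
# Connection identification (assumed 1-D blocks for brevity): AND the two
# masks, then zero out positions where the connection bit is 0.

def connection_mask(first_id, second_id):
    return [a & b for a, b in zip(first_id, second_id)]

def apply_mask(data, mask):
    return [x if m else 0 for x, m in zip(data, mask)]

first_id, second_id = [1, 1, 0, 1], [1, 0, 1, 1]
conn = connection_mask(first_id, second_id)     # -> [1, 0, 0, 1]

first_in  = [2.0, 0.3, 0.0, 1.5]                # illustrative input blocks
second_in = [1.0, 0.0, 0.7, 0.6]
processed_first  = apply_mask(first_in, conn)   # -> [2.0, 0, 0, 1.5]
processed_second = apply_mask(second_in, conn)  # -> [1.0, 0, 0, 0.6]
```

Only positions where both operands are significant survive, so the subsequent inner product touches fewer terms.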
- the first identification data associated with the first input data and the second identification data associated with the second input data sent by the main processing circuit may be pre-stored in the main processing circuit, or may be obtained by the main processing circuit enabling the first mapping circuit to process the first/second input data, which is not limited in this application.
- the main processing circuit receives input data to be calculated from outside the device
- the main processing circuit performs arithmetic processing on the data by using the various operation circuits of the unit, such as a vector operation circuit, an inner product operator circuit, an accumulator circuit, and the like;
- the main processing circuit sends data to the basic processing circuit array through the data output interface (as shown in FIG. 2b);
- the method of sending data here may be to send the same data directly to a part of the basic processing circuits, that is, a multi-broadcast mode;
- the method of transmitting data here may separately send different data to different basic processing circuits, that is, a distribution method
- the basic processing circuit array calculates the data
- the basic processing circuit performs an operation after receiving the input data
- the basic processing circuit transmits the data from the data output interface of the unit after receiving the data; (transferred to other basic processing circuits that do not directly receive data from the main processing circuit.)
- the basic processing circuit transmits the operation result from the data output interface; (intermediate calculation result or final calculation result)
- the main processing circuit receives the output data returned from the basic processing circuit array
- the main processing circuit continues to process (eg, accumulate or activate the operation) the data received from the basic processing circuit array;
- after the main processing circuit finishes processing, the processing result is transmitted from the data output interface to the outside of the device.
- the tensor-multiply-tensor operation is performed using the circuit device; the tensor is the same as the data block described above, and may be any one or a combination of a matrix, a vector, a three-dimensional data block, a four-dimensional data block, and a higher-dimensional data block; the specific implementations of the matrix-multiply-vector and matrix-multiply-matrix operations are shown in Figures 2c and 2f, respectively.
- the matrix multiplication vector operation is performed using the circuit device; (the matrix multiplication vector may be an inner product operation of each row in the matrix with the vector, and the results are arranged into a vector in the order of the corresponding rows.)
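- The definition in the parenthesis above can be stated directly in code; this is a plain reference sketch, not the distributed implementation described below.

```python
# Matrix-multiply-vector: the inner product of each row of S with the
# vector P, results arranged into a vector in row order.

def mat_vec(S, P):
    return [sum(s * p for s, p in zip(row, P)) for row in S]

S = [[1, 2], [3, 4]]    # illustrative data
P = [5, 6]
result = mat_vec(S, P)  # -> [17, 39]
```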
- This method uses all or a portion of the basic processing circuit of the neural network computing device, assuming that K basic processing circuits are used;
- the main processing circuit transmits data in part or all of the rows of the matrix S to each of the K basic processing circuits;
- the control circuit of the main processing circuit sends data of a certain number of rows in the matrix S to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the first time the 1st number of each of the 3rd, 4th, and 5th rows is sent, the second time the 2nd number of each of the 3rd, 4th, and 5th rows, the third time the 3rd number of each of the 3rd, 4th, and 5th rows, and so on; or the first time the first two numbers of each of the 3rd, 4th, and 5th rows are sent, the second time the 3rd and 4th numbers of each of the 3rd, 4th, and 5th rows, the third time the 5th and 6th numbers of each of the 3rd, 4th, and 5th rows, and so on).
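- The transmission schedule in the example above (the k-th number, or the k-th pair of numbers, of each assigned row per transfer) can be sketched as follows; the row indices and chunk size are illustrative assumptions.

```python
# One transfer carries the next `chunk` numbers of each assigned row.

def transfers(matrix, rows, chunk):
    width = len(matrix[rows[0]])
    for start in range(0, width, chunk):
        # one transfer: `chunk` numbers from each assigned row
        yield [matrix[r][start:start + chunk] for r in rows]

S = [[i * 10 + j for j in range(6)] for i in range(6)]  # 6x6 sample matrix
sched = list(transfers(S, rows=[2, 3, 4], chunk=2))
# first transfer: the first two numbers of rows 3, 4, and 5 (0-indexed 2, 3, 4)
```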
- the control circuit of the main processing circuit sequentially transmits the data in the vector P to the 0th basic processing circuit
- after receiving the data of the vector P, the 0th basic processing circuit sends the data to the next basic processing circuit connected thereto, that is, the basic processing circuit 1;
- some basic processing circuits cannot obtain all the data required for calculation directly from the main processing circuit.
- the basic processing circuit 1 in FIG. 2d has only one data input interface connected to the main processing circuit, so it can only obtain the data of the matrix S directly from the main processing circuit, while the data of the vector P needs to be output to the basic processing circuit 1 by the basic processing circuit 0; similarly, the basic processing circuit 1 continues to output the data of the vector P to the basic processing circuit 2.
- Each of the basic processing circuits performs operations on the received data, including but not limited to: inner product operations, multiplication operations, addition operations, and the like;
- the basic processing circuit calculates the multiplication of one or more sets of two data at a time, and then accumulates the result in the register and/or the on-chip buffer;
- the basic processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result in the register and/or the on-chip buffer;
- the result is transmitted from the data output interface (ie, transmitted to other basic processing circuits connected thereto);
- the result of the calculation may be the final result or an intermediate result of the inner product operation
- after receiving the calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or the main processing circuit connected thereto;
- the main processing circuit receives the result of the inner product operation of each basic processing circuit, and processes the result to obtain a final result (the processing may be an accumulation operation or an activation operation, etc.).
- the plurality of basic processing circuits used in the method are arranged in the manner as shown in FIG. 2d or FIG. 2e as follows;
- the main processing circuit can respectively obtain a mask matrix corresponding to each of the matrix S and the matrix P (ie, the identification data/identification data block described above).
- the mask matrices corresponding to the matrix S and the matrix P may be pre-stored in the high-speed memory in the main processing circuit; or the main processing circuit may enable the first mapping circuit to obtain the corresponding mask matrices from the matrix S and the matrix P, respectively.
- the control circuit of the main processing circuit divides the M rows of data of the matrix S into K groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of rows in that group of data is denoted as Ai); correspondingly, the control circuit also divides the M rows of data of the first mask matrix corresponding to the matrix S into K groups, and sends each group, together with the corresponding group of rows of the matrix S, to the corresponding basic processing circuit, where the arithmetic processing of the related data is completed.
- the method of grouping M rows of data is any grouping method that does not repeatedly allocate;
- the following allocation mode is adopted: the jth row is allocated to the j%K (% is a remainder operation) basic processing circuit;
- a part of the lines may be equally distributed first, and the remaining lines may be allocated in an arbitrary manner.
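- The j%K allocation above can be sketched as follows (row indices here are 0-based, an assumption for the sketch):

```python
# Allocate row j of the matrix S (and the matching mask row) to basic
# processing circuit j % K; no row is assigned twice.

def group_rows(num_rows, K):
    groups = [[] for _ in range(K)]
    for j in range(num_rows):
        groups[j % K].append(j)
    return groups

# 8 rows over K = 3 circuits -> [[0, 3, 6], [1, 4, 7], [2, 5]]
```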
- the control circuit of the main processing circuit sequentially sends the data in part or all of the rows of the matrix S to the corresponding basic processing circuit; correspondingly, the control circuit also sends the identification data in the first mask matrix corresponding to that data in the matrix S to the corresponding basic processing circuit.
- for example, assume the matrix S is a 50*50 matrix data block; the main processing circuit can divide the matrix S into 10 small matrices, each of size 5*50, and the main processing circuit can send the first small matrix S0 (5 rows and 50 columns) together with the identification data block (5 rows and 50 columns) associated with the small matrix S0 to the first basic processing circuit, where the arithmetic processing of the related data is completed.
- the control circuit of the main processing circuit sends, each time, one or more data of the i-th group of data Mi for which the i-th basic processing circuit is responsible, to the i-th basic processing circuit; the data in Mi may be data in the matrix S, or data in the first mask matrix corresponding to the matrix S;
- control circuit of the main processing circuit transmits one or more data of each of some or all of the i-th group of data Mi to which it is responsible to the i-th basic processing circuit;
- the control circuit of the main processing circuit sequentially transmits the data in the vector P to the 1st basic processing circuit; correspondingly, the control circuit of the main processing circuit can sequentially send the data in the second mask matrix associated with the vector P to the 1st basic processing circuit;
- control circuit of the main processing circuit can send one or more data in the second mask matrix associated with the vector P or the vector P each time;
- the i-th basic processing circuit may also send the data to the i+1th basic processing circuit connected thereto;
- Each basic processing circuit receives one or more data from a certain row or rows of the matrix S and one or more data from the vector P, and performs operations (including but not limited to multiplication or addition);
- each of the basic processing circuits receives the data in the matrix S and the associated first identification data in the first mask matrix, and the data in the vector P and the associated second identification data in the second mask matrix; after receiving the data, it may obtain the connection identification data according to the first identification data and the second identification data, and then use the connection identification data to determine whether to perform the correlation operation on the data in the matrix S and the data in the vector P.
- the connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that the data at a certain position in the matrix S and the data at the same position in the vector P both have absolute values greater than the preset threshold; 0 indicates that the data at that position in the matrix S and/or the data at the same position in the vector P has an absolute value less than or equal to the preset threshold.
- each of the basic processing circuits starts the second mapping circuit to perform correlation operations, such as multiplication, addition, and the like, on the data in the matrix S and the vector P according to the first mask matrix of the matrix S and the second mask matrix of the vector P.
- the first mask matrix and the second mask matrix are used to select the data in the matrix S and the vector P whose absolute values are greater than a preset threshold, and a correlation operation, such as a multiplication operation, is performed on the selected data.
- the data of two rows in the matrix S received by the basic processing circuit is denoted as the matrix S0, with the associated first mask matrix of S0; part of the data received from the vector P is the vector P0 = [1 0.01 1.1 0.6]T, with the associated second mask vector [1 0 1 1]T. Further, the basic processing circuit can enable the second mapping circuit to perform an element-by-element AND operation of the first mask matrix with [1 0 1 1]T to obtain the connection mask matrix, and then process the received matrix S0 and vector P0 by using the connection mask matrix to obtain the processed matrix S0 and the processed vector P0 = [1 0 0 0.6]T, so that the basic processing circuit performs the associated arithmetic operation on the processed matrix S0 and the processed vector P0.
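- The vector side of this example can be replayed in a short sketch. The first mask row [1, 1, 0, 1] below is a hypothetical value (the matrices of the example are shown only in the figures); the vector P0 and its mask are taken from the text.

```python
# Connection-mask processing of P0 = [1, 0.01, 1.1, 0.6] with the second
# mask [1, 0, 1, 1]; the first mask row is a hypothetical stand-in.

def apply_connection(data, mask_a, mask_b):
    return [x if (a & b) else 0 for x, a, b in zip(data, mask_a, mask_b)]

P0 = [1, 0.01, 1.1, 0.6]
second_mask = [1, 0, 1, 1]
first_mask_row = [1, 1, 0, 1]   # hypothetical row of the S0 mask matrix
processed_P0 = apply_connection(P0, first_mask_row, second_mask)
# -> [1, 0, 0, 0.6], matching the processed vector P0 in the text
```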
- each basic processing circuit may buffer the data block to be calculated, such as the data of a certain row/column in the matrix S or the vector P and the corresponding identification data in the mask matrix; when the buffer/storage space is insufficient, the basic processing circuit will no longer receive the new input data that the main processing circuit transmits sequentially (for example, the data of the matrix S or the vector P and the corresponding identification data in the mask matrix) until there is sufficient buffer/storage space, after which it receives the newly transmitted data of the main processing circuit.
- the basic processing circuit calculates the multiplication of one or more sets of two data at a time, and then accumulates the result in the register and/or the on-chip buffer;
- the basic processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result in the register and/or the on-chip buffer;
- the data received by the basic processing circuit may also be an intermediate result, stored in a register or an on-chip buffer;
- the basic processing circuit transmits the local calculation result to the next basic processing circuit or main processing circuit connected thereto;
- only the output interface of the last basic processing circuit of each column is connected to the main processing circuit; in this case, only the last basic processing circuit can directly transmit its local calculation result to the main processing circuit, while the calculation results of the other basic processing circuits are transmitted to the next basic processing circuit, and each next basic processing circuit passes them on, until all results are transmitted to the last basic processing circuit.
- the last basic processing circuit performs an accumulation of the local calculation result and the results of the other basic processing circuits received in the column to obtain an intermediate result, and sends the intermediate result to the main processing circuit; alternatively, the last basic processing circuit may send the results of the other basic processing circuits in the column and the local processing result directly to the main processing circuit.
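- The two result paths just described (chain accumulation in the last circuit of a column versus forwarding every local result) can be sketched as:

```python
# Column result paths (illustrative): either the last circuit accumulates
# everything into one intermediate result, or all local results are
# forwarded unchanged to the main processing circuit.

def chain_accumulate(local_results):
    acc = 0
    for r in local_results:   # results hop down the column one circuit at a time
        acc += r              # the last circuit accumulates them
    return acc                # single value sent to the main processing circuit

def forward_all(local_results):
    return list(local_results)  # the main processing circuit accumulates instead

partials = [1.5, 2.0, 0.5]
assert chain_accumulate(partials) == sum(forward_all(partials))
```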
- each of the basic processing circuits has an output interface connected to the main processing circuit; in this case, each of the basic processing circuits directly transmits the local calculation result to the main processing circuit;
- after receiving the calculation result transmitted by other basic processing circuits, the basic processing circuit transmits it to the next basic processing circuit or the main processing circuit connected thereto.
- the main processing circuit receives the result of the M inner product operations as the operation result of the matrix multiplication vector.
- the first mapping circuit of the main processing circuit acquires the mask matrices corresponding to the matrix S and the matrix P; for example, the first mapping circuit is started to process the matrix S and the matrix P respectively to obtain the first mask matrix corresponding to the matrix S and the second mask matrix corresponding to the matrix P;
- the control circuitry of the main processing circuit transmits data in part or all of the rows of the matrix S to the underlying processing circuitry directly connected to the main processing circuitry via the lateral data input interface (eg, the top gray filled vertical data path in Figure 1b) At the same time, the control circuit also transmits identification data corresponding to some or all of the rows in the first mask matrix to the base processing circuit connected thereto. For example, the control circuit transmits the first two rows of data in the matrix S and the first two rows of data corresponding to the first two rows of data in the first mask matrix to the base circuit connected to the main processing circuit.
- the control circuit of the main processing circuit sends a certain number or part of the data of a certain row in the matrix S to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the first time the 1st number in the 3rd row is sent, the second time the 2nd number in the 3rd row, the third time the 3rd number in the 3rd row, and so on; or the first time the first two numbers in the 3rd row are sent, the second time the 3rd and 4th numbers in the 3rd row, the third time the 5th and 6th numbers in the 3rd row, and so on);
- correspondingly, the control circuit also transmits, each time, one or a portion of the identification data in the first mask matrix corresponding to that row of the matrix S to the basic processing circuit.
- the control circuit of the main processing circuit sends, a certain number or part at a time, the data of certain rows in the matrix S together with the identification data of the corresponding rows in the first mask matrix to a certain basic processing circuit (for example, for a certain basic processing circuit, the first time the 1st number of each of the 3rd, 4th, and 5th rows is sent, the second time the 2nd number of each of the 3rd, 4th, and 5th rows, the third time the 3rd number of each of the 3rd, 4th, and 5th rows, and so on; or the first time the first two numbers of each of the 3rd, 4th, and 5th rows are sent, the second time the 3rd and 4th numbers of each of the 3rd, 4th, and 5th rows, the third time the 5th and 6th numbers of each of the 3rd, 4th, and 5th rows, and so on);
- the control circuit of the main processing circuit sends data in some or all of the columns in the matrix P to those basic processing circuits that are directly connected to the main processing circuit through the vertical data input interface (for example, the gray-filled horizontal data path to the left of the basic processing circuit array in Figure 1b); at the same time, the control circuit also transmits the identification data corresponding to some or all of the columns in the second mask matrix to the basic processing circuits connected thereto. For example, the control circuit sends the first two columns of data in the matrix P and the corresponding first two columns of data in the second mask matrix to the basic processing circuits connected to the main processing circuit.
- the control circuit of the main processing circuit sends a certain number or part of the data of a certain column in the matrix P to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the first time the 1st number in the 3rd column is sent, the second time the 2nd number in the 3rd column, the third time the 3rd number in the 3rd column, and so on; or the first time the first two numbers in the 3rd column are sent, and so on);
- correspondingly, the control circuit also transmits, each time, one or a portion of the identification data in the second mask matrix corresponding to that column of the matrix P to a certain basic processing circuit.
- the control circuit of the main processing circuit sends, a certain number or part at a time, the data of certain columns in the matrix P together with the identification data of the corresponding columns in the second mask matrix to a certain basic processing circuit (for example, for a certain basic processing circuit, the first time the 1st number of each of the 3rd, 4th, and 5th columns is sent, the second time the 2nd number of each of the 3rd, 4th, and 5th columns, the third time the 3rd number of each of the 3rd, 4th, and 5th columns, and so on; or the first time the first two numbers of each of the 3rd, 4th, and 5th columns are sent, the second time the 3rd and 4th numbers of each of the 3rd, 4th, and 5th columns, the third time the 5th and 6th numbers of each of the 3rd, 4th, and 5th columns, and so on);
- after receiving the data of the matrix S and the identification data of the first mask matrix associated with the matrix S, the basic processing circuit transmits the data (specifically, the data of the matrix S and the corresponding identification data in the first mask matrix) through its horizontal data output interface to the next basic processing circuit (for example, the white-filled horizontal data path in the middle of the basic processing circuit array in FIG. 1b); after the basic processing circuit receives the data of the matrix P, it transmits the data through its vertical data output interface to the next basic processing circuit (for example, the white-filled vertical data path in the middle of the basic processing circuit array in FIG. 1b);
- Each of the basic processing circuits performs operations on the received data. Specifically, after each basic processing circuit receives the data of a certain row or rows of the matrix S together with the corresponding first identification data in the first mask matrix, and the data of a certain column or columns of the matrix P together with the corresponding second identification data in the second mask matrix, it may obtain the connection identification data according to the first identification data and the second identification data; the connection identification data is then used to determine whether to perform the correlation operation on the data in the matrix S and the data in the matrix P.
- the connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that the data at a certain position in the matrix S and the data at the same position in the matrix P both have absolute values greater than the preset threshold; 0 indicates that the data at that position in the matrix S and/or the data at the same position in the matrix P has an absolute value less than or equal to the preset threshold.
- each of the basic processing circuits starts the second mapping circuit, selects the data whose identification data is 1 at the same position according to the first mask matrix of the matrix S and the second mask matrix of the matrix P, and performs correlation operations, such as multiplication, addition, and the like, on the selected data.
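- One cell's mask-gated inner product in this matrix-multiply-matrix case can be sketched as follows (values are illustrative):

```python
# Only positions where both identification bits are 1 contribute to the
# inner product of a row of S with a column of P.

def gated_inner(s_row, p_col, s_mask, p_mask):
    return sum(s * p
               for s, p, ms, mp in zip(s_row, p_col, s_mask, p_mask)
               if ms & mp)

# positions 0 and 2 survive: 1*4 + 3*6 = 22
value = gated_inner([1, 2, 3], [4, 5, 6], [1, 0, 1], [1, 1, 1])
```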
- each basic processing circuit may buffer the data block to be calculated, such as the data of certain rows/columns in the matrix S or the matrix P and the corresponding identification data in the mask matrices; when the buffer/storage space is insufficient, the basic processing circuit will no longer receive the new input data that the main processing circuit transmits sequentially (for example, the data of a certain row/column of the matrix S or the matrix P and the corresponding identification data in the mask matrices) until there is sufficient buffer/storage space in the basic processing circuit, after which it receives the newly transmitted data of the main processing circuit.
- the basic processing circuit calculates the multiplication of one or more sets of two data at a time, and then accumulates the result in the register and/or the on-chip buffer;
- the basic processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result in the register and/or the on-chip buffer;
- the result can be transmitted from the data output interface
- the result of the calculation may be the final result or an intermediate result of the inner product operation
- if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs results directly to the main processing circuit, while the other basic processing circuits pass the operation result downward through the vertical output interface).
- after receiving a calculation result from another basic processing circuit, the basic processing circuit transmits the data to the other basic processing circuits or the main processing circuit connected thereto;
- the main processing circuit receives the inner product operation results of each basic processing circuit to obtain the output result.
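The result-routing scheme described above can be illustrated with a small sketch; the grid shape, the values, and the bottom-up draining order are all hypothetical — the text only specifies that non-bottom circuits relay results downward through their vertical output interfaces:

```python
# Hypothetical sketch of result routing in an m-by-n basic-processing-circuit
# grid: only the bottom row connects directly to the main processing circuit,
# so every other circuit forwards its result downward, as described above.
def collect_results(results):
    """results[i][j] is the partial result held by circuit (i, j).

    Returns the per-column lists that arrive at the main processing circuit,
    ordered bottom row first (the order in which they would be received).
    """
    m = len(results)
    n = len(results[0])
    received = [[] for _ in range(n)]
    # Rows are drained bottom-up: row m-1 outputs directly, the rest relay down.
    for i in range(m - 1, -1, -1):
        for j in range(n):
            received[j].append(results[i][j])
    return received

grid = [[1, 2], [3, 4], [5, 6]]   # 3 rows, 2 columns of partial results
print(collect_results(grid))       # [[5, 3, 1], [6, 4, 2]]
```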
- the method uses a basic processing circuit array arranged in the manner shown in FIG. 1b;
- the first mapping circuit of the main processing circuit obtains the identification mask matrix corresponding to each of the matrix S and the matrix P. For example, the first mapping circuit is started to process the matrix S and the matrix P respectively, obtaining the first mask matrix corresponding to the matrix S and the second mask matrix corresponding to the matrix P; optionally, the processed matrix S and matrix P may also be obtained. Assume that the processed matrix S has h rows and the processed matrix P has w columns.
- the control circuit of the main processing circuit divides the h rows of data of the matrix S into h groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of rows in this group of data is denoted Hi); meanwhile, the control circuit also sends the identification data of some or all of the corresponding rows in the first mask matrix to the basic processing circuit connected thereto. For example, the control circuit transmits the first two rows of data of the matrix S, together with the corresponding first two rows of identification data in the first mask matrix, to a basic processing circuit connected to the main processing circuit.
- the method of grouping the h rows of data may be any grouping method that does not allocate repeatedly;
- for example, the control circuit of the main processing circuit distributes the j-th row to the j-th basic processing circuit;
- alternatively, a part of the rows may be equally distributed first, and the remaining rows may be allocated in an arbitrary manner.
- the control circuit of the main processing circuit divides the w columns of data of the matrix P into w groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of columns in this group of data is denoted Wi); accordingly, the control circuit also sends one or a part of the identification data in the second mask matrix corresponding to the columns of the matrix P to a certain basic processing circuit.
- the method of grouping the w columns of data here may be any grouping method that does not allocate repeatedly;
- for example, the control circuit of the main processing circuit distributes the j-th column to the j-th basic processing circuit;
- a part of the columns may be equally distributed first, and the remaining columns may be allocated in an arbitrary manner.
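The non-repeating grouping described above (equal shares first, any leftover allocated freely) can be sketched as follows; the function name and the choice of giving the leftover rows to the leading groups are illustrative assumptions:

```python
# Sketch of one non-repeating grouping scheme consistent with the text:
# distribute an equal share of rows (or columns) first, then hand the
# remaining ones out, one each, to the first groups.
def group_rows(h, k):
    """Partition row indices 0..h-1 into k disjoint, contiguous groups."""
    base, rem = divmod(h, k)
    groups, start = [], 0
    for i in range(k):
        size = base + (1 if i < rem else 0)   # first `rem` groups get one extra row
        groups.append(list(range(start, start + size)))
        start += size
    return groups

print(group_rows(7, 3))  # [[0, 1, 2], [3, 4], [5, 6]]
```

Because the groups are disjoint and jointly cover all h rows, no row is allocated repeatedly, as the text requires.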
- the control circuit of the main processing circuit transmits data in part or all of the rows of the matrix S to the first basic processing circuit of each row in the basic processing circuit array;
- the control circuit of the main processing circuit transmits, each time, one or more data of one row of the i-th group of data Hi for which it is responsible to the first basic processing circuit of the i-th row in the basic processing circuit array; the identification data of the i-th group of data Hi corresponding to the mask matrix can be sent to the first basic processing circuit in the same manner at the same time;
- the control circuit of the main processing circuit transmits, each time, one or more data of each of some or all of the rows of the i-th group of data Hi for which it is responsible to the first basic processing circuit of the i-th row in the basic processing circuit array; the identification data corresponding to the i-th group of data Hi in the mask matrix can likewise be sent to the first basic processing circuit;
- the control circuit of the main processing circuit sends the data in part or all of the columns of the matrix P to the first basic processing circuit of each column in the basic processing circuit array; at the same time, the control circuit also sends the identification data in some or all of the corresponding columns of the second mask matrix to the basic processing circuit connected thereto.
- for example, the control circuit sends the first two columns of data of the matrix P, together with the corresponding first two columns of identification data in the second mask matrix, to the basic processing circuit connected to the main processing circuit.
- the control circuit of the main processing circuit transmits, each time, one or more data of one column of the i-th group of data Wi for which it is responsible to the first basic processing circuit of the i-th column of the basic processing circuit array;
- the control circuit of the main processing circuit transmits, each time, one or more data of each of some or all of the columns of the i-th group of data Wi for which it is responsible to the first basic processing circuit of the i-th column of the basic processing circuit array;
- after receiving the data of the matrix S, the basic processing circuit transmits the data through its horizontal data output interface to the next basic processing circuit connected thereto (for example, the white-filled horizontal data path in the middle of the basic processing circuit array in FIG. 1b); after receiving the data of the matrix P, the basic processing circuit transmits the data through its vertical data output interface to the next basic processing circuit connected thereto (for example, the white-filled vertical data path in the middle of the basic processing circuit array in FIG. 1b);
- each of the basic processing circuits performs operations on the received data. Specifically, each basic processing circuit receives the data of one or more rows of the matrix S together with the corresponding first identification data of the first mask matrix, and the data of one or more columns of the matrix P together with the corresponding second identification data of the second mask matrix; connection identification data can be obtained from the first identification data and the second identification data, and the connection identification data is then used to determine whether to perform the correlation operation on the data of the matrix S and the data of the matrix P.
- the connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that both the data at a certain position of the matrix S and the data at the same position of the matrix P are data whose absolute value is greater than the preset threshold; conversely, 0 indicates that the data at a certain position of the matrix S and/or the data at the same position of the matrix P is data whose absolute value is less than or equal to the preset threshold.
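As a hedged illustration of this AND-based connection identification, the following sketch builds both masks from a preset threshold and accumulates only the positions where the connection identification data is 1 (function names, data, and threshold are hypothetical):

```python
import numpy as np

# Sketch of the connection-identification step: a mask bit is 1 where the
# absolute value exceeds a preset threshold, and only positions where BOTH
# masks are 1 (an AND of the two mask matrices) enter the multiply-accumulate.
def masked_inner_product(s_row, p_col, threshold=0.0):
    mask_s = (np.abs(s_row) > threshold).astype(int)   # first mask (for S)
    mask_p = (np.abs(p_col) > threshold).astype(int)   # second mask (for P)
    connection = mask_s & mask_p                        # connection identification data
    return float(np.sum(s_row * p_col * connection))

s_row = np.array([1.0, 0.0, 2.0, 3.0])
p_col = np.array([4.0, 5.0, 0.0, 1.0])
print(masked_inner_product(s_row, p_col))  # 1*4 + 3*1 = 7.0
```

With a threshold of 0 this reduces to skipping zero operands, which matches the text's option of transmitting only non-zero data.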
- each of the basic processing circuits starts the second mapping circuit, and, according to the first mask matrix of the matrix S and the second mask matrix of the matrix P, selects the data whose identification data is 1 at the same position to perform the correlation operations, such as multiplication, addition, and the like.
- specifically, each basic processing circuit caches the data block to be calculated, such as the data of certain rows/columns of the matrix S or the matrix P and the corresponding identification data of the mask matrix;
- when its buffer/storage space is full, the basic processing circuit no longer receives new input data (for example, the data of a certain row/column of the matrix S or the matrix P and the corresponding identification data of the mask matrix that the main processing circuit transmits in sequence) until sufficient buffer/storage space becomes available in the basic processing circuit, after which it receives the newly transmitted data of the main processing circuit.
- the basic processing circuit calculates the multiplication of one or more sets of two data at a time, and then accumulates the result in a register and/or an on-chip buffer;
- the basic processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result in a register and/or an on-chip buffer;
- the result can be transmitted from the data output interface
- the result of the calculation may be the final result or an intermediate result of the inner product operation
- if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output toward the basic processing circuit that can output directly to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits transfer the operation result downward through the vertical output interface);
- after receiving a calculation result from another basic processing circuit, the basic processing circuit transmits the data to the other basic processing circuits or the main processing circuit connected thereto;
- the main processing circuit receives the inner product operation results of each basic processing circuit to obtain the output result.
- the weight matrix of the fully connected layer is used as the matrix S, the input vector is used as the matrix P, and the operation is performed by the matrix-multiplied-by-matrix method of the device;
- alternatively, the weight matrix of the fully connected layer is used as the matrix P, the input vector is used as the matrix S, and the operation is performed by the matrix-multiplied-by-matrix method of the device;
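A minimal sketch of this mapping, with a small weight matrix and input vector whose values are purely illustrative:

```python
import numpy as np

# The fully connected layer's weight matrix plays the role of matrix S and the
# input plays the role of matrix P, so the layer's forward pass is one
# matrix-multiplied-by-matrix operation on the device.
weights = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # S: 3 outputs x 2 inputs
x = np.array([[1.0], [2.0]])                               # P: input vector as a column

output = weights @ x              # the device computes exactly this S x P product
print(output.ravel().tolist())    # [5.0, 11.0, 17.0]
```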
- the convolution operation is performed using the circuit device:
- the convolution operation is described below.
- one block in the following figures represents one piece of data;
- the input data is represented by Figure 3a (N samples, each sample has C channels, and the feature map of each channel has a height H and a width W).
- the weight, that is, the convolution kernel is represented by Figure 3b (there are M convolution kernels, each convolution kernel has C channels, and the height and width are KH and KW, respectively).
- the rules of the convolution operation are the same for the N samples of the input data. The following describes the process of convolution on one sample; on one sample, each of the M convolution kernels performs the same operation.
- each convolution kernel operation obtains one plane feature map, and the M convolution kernels finally compute M plane feature maps (for one sample, the convolution output is M feature maps); for one convolution kernel, an inner product operation is performed at each position of the sample, and the kernel is then slid in the H and W directions;
- FIG. 3c shows a convolution kernel performing inner product operations at the position of the lower right corner of a sample of input data.
- FIG. 3d shows that the convolution position slides one space to the left;
- FIG. 3e shows that the convolution position slides one space upward.
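The sliding inner-product procedure of FIG. 3c-3e can be sketched as follows, assuming stride 1 and no padding (neither is specified above):

```python
import numpy as np

# Illustrative sketch of the convolution layout described above: input of shape
# (N, C, H, W), M kernels of shape (C, KH, KW); each kernel produces one plane
# feature map per sample by sliding and taking inner products, so the output
# has shape (N, M, H-KH+1, W-KW+1) with stride 1 and no padding.
def conv_forward(inputs, kernels):
    N, C, H, W = inputs.shape
    M, Ck, KH, KW = kernels.shape
    assert C == Ck, "input and kernel channel counts must match"
    out = np.zeros((N, M, H - KH + 1, W - KW + 1))
    for n in range(N):
        for m in range(M):
            for i in range(H - KH + 1):
                for j in range(W - KW + 1):
                    patch = inputs[n, :, i:i + KH, j:j + KW]
                    out[n, m, i, j] = np.sum(patch * kernels[m])  # inner product at one position
    return out

x = np.ones((1, 2, 4, 4))          # N=1 sample, C=2 channels, 4x4 feature maps
k = np.ones((3, 2, 3, 3))          # M=3 kernels, each with C=2 channels, KH=KW=3
print(conv_forward(x, k).shape)    # (1, 3, 2, 2) -- M feature maps per sample
print(conv_forward(x, k)[0, 0, 0, 0])  # 2*3*3 = 18.0
```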
- the first mapping circuit of the main processing circuit can process the data in part or all of the convolution kernels of the weights to obtain the corresponding mask data and the processed weight data (that is, the data in part or all of the convolution kernels of the processed weights);
- the control circuit of the main processing circuit sends the data in part or all of the convolution kernels of the weights (the data can be the original weight data or the processed weight data) through the horizontal data input interface to those basic processing circuits directly connected to the main processing circuit (for example, the uppermost gray-filled vertical data path in FIG. 1b); at the same time, the control circuit sends the mask data associated with the data to the basic processing circuits connected to the main processing circuit;
- in one case, the control circuit of the main processing circuit sends one number or a part of the data of one convolution kernel to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st number of the 3rd row is transmitted the first time, the 2nd number of the 3rd row the second time, the 3rd number of the 3rd row the third time, ..., or the first two numbers of the 3rd row are transmitted the first time, the 3rd and 4th numbers of the 3rd row the second time, ...); correspondingly, the control circuit also sends the mask data corresponding to that convolution kernel of the weights, one number or a part of the data at a time, to the above basic processing circuit;
- in another case, the control circuit of the main processing circuit sends one number or a part of the data of some convolution kernels of the weights to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st numbers of the 3rd, 4th, and 5th rows are transmitted the first time, the 2nd numbers of the 3rd, 4th, and 5th rows the second time, the 3rd numbers the third time, ...); correspondingly, the control circuit also sends the mask data associated with those convolution kernels of the weights, one number or a part of the data at a time, to the basic processing circuit in the same manner;
- the control circuit of the main processing circuit divides the input data according to the positions of the convolution, and sends the data in part or all of the convolution positions of the input data through the vertical data input interface to those basic processing circuits directly connected to the main processing circuit; correspondingly, the control circuit also divides the mask data associated with the input data according to the positions of the convolution, and sends the mask data corresponding to the data in some or all of the convolution positions of the input data to the basic processing circuits electrically connected to the main processing circuit;
- in one case, the control circuit of the main processing circuit sends one number or a part of the data of a certain convolution position in the input data, and the mask data associated with the data, to a certain basic processing circuit each time; for example, for a certain basic processing circuit, the 1st number of the 3rd column is transmitted the first time, the 2nd number of the 3rd column the second time, the 3rd number of the 3rd column the third time, ..., or the first two numbers of the 3rd column the first time, the 3rd and 4th numbers of the 3rd column the second time, and the 5th and 6th numbers of the 3rd column the third time, ...;
- in another case, the control circuit of the main processing circuit sends one number or a part of the data of some convolution positions in the input data, and the mask data associated with the data, to a certain basic processing circuit each time; for example, for a certain basic processing circuit, the 1st numbers of the 3rd, 4th, and 5th columns are sent the first time, the 2nd numbers of the 3rd, 4th, and 5th columns the second time, the 3rd numbers of the 3rd, 4th, and 5th columns the third time, ..., or the first two numbers of each of the 3rd, 4th, and 5th columns the first time, the 3rd and 4th numbers of each of the 3rd, 4th, and 5th columns the second time, the 5th and 6th numbers of each of the 3rd, 4th, and 5th columns the third time, ...;
- after the basic processing circuit receives the data of the weights (specifically, the data of the convolution kernels of the weights (referred to as weight data) or the mask data associated with the weight data), it transmits the data through its horizontal data output interface to the next basic processing circuit connected thereto (for example, the white-filled horizontal data path in the middle of the basic processing circuit array in FIG. 1b); after the basic processing circuit receives the data (which can be the input data of the main processing circuit and the identification mask data associated with the input data), it transmits the data through its vertical data output interface to the next basic processing circuit connected thereto (for example, the white-filled vertical data path in the middle of the basic processing circuit array in FIG. 1b);
- control circuit of the main processing circuit may send the input data and the mask data associated with the input data to the basic processing circuit, and the basic processing circuit receives the input data and the mask data associated with the input data;
- each of the basic processing circuits operates on the received data; specifically, the basic processing circuit can enable the second mapping circuit to process the mask data associated with the input data and the mask data associated with the weight data (i.e., the mask data associated with the convolution kernels of the weights) to obtain the connection identification data; the connection identification data is then used to select, from the input data and the weight data, the data whose absolute values are both greater than a preset threshold, and these data are multiplied;
- specifically, each basic processing circuit caches the data block to be calculated, such as the data in the convolution kernels of the weights and the associated mask data, or the input data and the associated mask data;
- when its buffer/storage space is full, the basic processing circuit no longer receives new input data, such as the data in some convolution kernels of the weights and the associated mask data subsequently sent by the main processing circuit, until the basic processing circuit has sufficient buffer/storage space, after which it receives the newly transmitted data of the main processing circuit.
- the base processing circuit calculates a multiplication of one or more sets of two data at a time, and then accumulates the result on a register and/or an on-chip buffer;
- the base processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result on the register and/or the on-chip buffer;
- the result can be transmitted from the data output interface
- the result of the calculation may be the final result or an intermediate result of the inner product operation
- if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output toward the basic processing circuit that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits transfer the operation result downward through the vertical output interface);
- after receiving a calculation result from another basic processing circuit, the basic processing circuit transmits the data to the other basic processing circuits or the main processing circuit connected thereto;
- the main processing circuit receives the inner product operation results of each basic processing circuit to obtain the output result.
- the steps of neural network training include:
- Each layer in a (multi-layer) neural network performs a forward operation in sequence
- the entire training process requires this process to be executed repeatedly, i.e., multiple iterations.
- the device is used for the training of a neural network, the neural network including n layers, where n is an integer greater than or equal to 2, characterized in that the integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are each configured to perform compression processing of data in the neural network operation;
- the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
- the integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the forward operation of the n layers of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation;
- the main processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result, obtain, according to the training instruction, the n-th inverse operation instruction of the n-th layer inverse operation and the first data block required by the n-th inverse operation, and start the first mapping circuit to process the first data block to obtain a processed first data block; the first data block includes the horizontal data block and/or the vertical data block; the processed first data block is transmitted, according to the n-th inverse operation instruction, to at least one of the basic processing circuits connected to the main processing circuit;
- the plurality of basic processing circuits are configured to determine, according to the operation control of the n-th inverse operation instruction, whether to start the second mapping circuit to process the second data block, perform the operations in the neural network in parallel according to the processed second data block to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit;
- the second data block is the data block that the basic processing circuit determines to receive from the main processing circuit, and the second data block is associated with the processed first data block;
- the main processing circuit is further configured to process the operation results to obtain the n-th layer weight group gradient and the n-th layer input data gradient, and update the n-th layer weight group data by applying the n-th layer weight group gradient;
- the integrated circuit chip device is further configured to use the n-th layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer to perform the (n-1)-th layer inverse operation, obtaining the (n-1)-th layer weight group gradient, and to update the weight group data of the corresponding layer by applying the (n-1)-th layer weight group gradient, the weight group data including at least two weights.
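The forward-then-inverse flow of the training instruction can be sketched as below; linear layers, a plain squared-error gradient, and SGD are illustrative assumptions, not the operations the device prescribes:

```python
import numpy as np

# Sketch of the n-layer training flow described above: a forward pass through
# all layers, then an inverse pass that turns the n-th output-result gradient
# into a per-layer weight-group gradient and input-data gradient, updating each
# layer's weights in turn.
def train_step(weights, x, target, lr=0.01):
    # Forward operation, layer by layer; cache each layer's input data.
    inputs = []
    for W in weights:
        inputs.append(x)
        x = W @ x
    # n-th output result gradient (here: gradient of 0.5 * ||x - target||^2).
    grad = x - target
    # Inverse operation from layer n down to layer 1.
    for W, inp in zip(reversed(weights), reversed(inputs)):
        w_grad = np.outer(grad, inp)   # weight group gradient of this layer
        grad = W.T @ grad              # input data gradient -> previous layer's output gradient
        W -= lr * w_grad               # apply the weight group gradient (update step)
    return weights

weights = [np.eye(2), 2 * np.eye(2)]
train_step(weights, np.array([1.0, 0.0]), np.array([1.0, 0.0]))
```

The "multiple iterations" of the training process correspond to calling `train_step` repeatedly on the same weights.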
- each layer uses its own input data and weights to calculate corresponding output data according to an operation rule specified by a layer type;
- the forward operation process (also called inference) of the neural network is a process of processing the input data of each layer layer by layer, and obtaining the output data after a certain calculation, which has the following characteristics:
- the input to a layer can be the input data of the neural network
- the input of one layer can be the output of other layers;
- the input of a layer can be the output of this layer at the previous moment (corresponding to the case of a recurrent neural network);
- a layer can simultaneously acquire input from a plurality of the above input sources
- the output of a layer can be used as the output of the neural network
- the output of one layer can be the input of other layers
- the output of a layer can be the input of this layer at the next moment (in the case of a recurrent neural network);
- the output of a layer may output a result to the plurality of output directions described above;
- the types of operations of the layers in the neural network include, but are not limited to, the following:
- Normalization layer, including the LRN (Local Response Normalization) layer, the BN (Batch Normalization) layer, and the like;
- Activation layer, including but not limited to the following types: Sigmoid layer, ReLU layer, PReLU layer, LeakyReLU layer, Tanh layer;
- the inverse operation of a layer requires two parts of operations: one part calculates the weight gradient using the output data gradient, which may be a sparse representation, and the input data, which may be a sparse representation (the weight update step then uses the weight gradient to update the weights of this layer); the other part calculates the input data gradient using the output data gradient, which may be a sparse representation, and the weights, which may be a sparse representation (the input data gradient is then used as the output data gradient of the next layer in the inverse operation, for its inverse operation);
- the inverse operation propagates the gradient back starting from the last layer, in the reverse order of the forward operation.
- the output data gradient used in the inverse operation of a layer can come from:
- the input data gradient of this layer at the next moment (corresponding to the case of a recurrent neural network);
- a layer can simultaneously obtain an output data gradient from a plurality of the above sources
- after the gradient of the weights of the layer is calculated, the first input buffer and the second input buffer of the device are respectively used to store the weights and the weight gradient of the layer, and the weight gradient is then used to update the weights in the arithmetic unit;
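The two parts of the inverse operation, followed by the weight update step, can be sketched for a single fully connected layer (dense rather than sparse representation; names and learning rate are hypothetical):

```python
import numpy as np

# Sketch of the two-part inverse operation described above for one layer:
#   part 1: weight gradient from the output-data gradient and the input data,
#           used by the weight update step;
#   part 2: input-data gradient from the output-data gradient and the weights,
#           passed to the previous layer as its output-data gradient.
def layer_backward(W, inp, out_grad, lr=0.1):
    w_grad = np.outer(out_grad, inp)   # part 1: weight gradient
    in_grad = W.T @ out_grad           # part 2: input data gradient
    W_new = W - lr * w_grad            # weight update step
    return W_new, in_grad

W = np.eye(2)
W_new, in_grad = layer_backward(W, inp=np.array([1.0, 2.0]), out_grad=np.array([0.5, 0.0]))
print(in_grad.tolist())   # [0.5, 0.0]
print(W_new[0].tolist())  # [0.95, -0.1]
```

In the device, `W` and `w_grad` would occupy the first and second input buffers respectively before the update is applied in the arithmetic unit.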
- the operations mentioned above are all operations of a layer in a neural network.
- the implementation process is as follows: in the forward operation, after the forward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output data calculated in the operation unit as the input data of the next layer (or performs some operations on the output data before using it as the input data of the next layer), and at the same time replaces the weights with the weights of the next layer; in the inverse operation, after the inverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input data gradient calculated in the operation unit as the output data gradient of the next layer, and at the same time replaces the weights with the weights of the next layer.
- the tensor-multiplied-by-tensor operation is performed as shown in FIG. 1a; the tensor is the same as the data block described above, and may be a matrix, a vector, a three-dimensional data block, a four-dimensional data block, or a higher-dimensional data block.
- the data block described above may be a matrix, a vector, a three-dimensional data block, a four-dimensional data block, or a higher-dimensional data block.
- when the forward operation indicated by the first operation instruction is a matrix-multiplied-by-matrix operation, the input data is the first matrix of the matrix-multiplied-by-matrix operation, and the weight is the second matrix of the matrix-multiplied-by-matrix operation.
- Step S401b the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit saves the received data in an on-chip buffer and/or a register;
- the data of the matrix S is processed data.
- the main processing circuit enables the first mapping circuit to process the matrix S, thereby obtaining the processed matrix S and a first mask matrix associated with the matrix S.
- the first mapping circuit of the main processing circuit processes the matrix S according to the first mask matrix associated with the pre-stored matrix S to obtain the processed matrix S.
- each row of data in the processed matrix S, together with the corresponding identification data in the first mask matrix, is sent by the control circuit to one or more of the K basic processing circuits;
- when the main processing circuit sends data to the basic processing circuit, only the data in the processed matrix S whose absolute value is greater than the preset threshold, or the non-zero data, may be sent to the basic processing circuit, so as to reduce the amount of data transmitted.
- the control circuit of the main processing circuit distributes one row of the matrix S to each of the M basic processing circuits; optionally, the identification data corresponding to that row in the first mask matrix is also sent along with the row;
- the control circuit of the main processing circuit distributes one or more rows of data of the matrix S to each basic processing circuit; optionally, the identification data corresponding to the one or more rows in the first mask matrix is also sent along with them;
- the set of Mi rows of S distributed to the i-th basic processing circuit is called Ai, and FIG. 2e shows the calculation to be performed on the i-th basic processing circuit;
- in each basic processing circuit, for example the i-th basic processing circuit: the received matrix Ai distributed by the main processing circuit is stored in the register and/or on-chip buffer of the i-th basic processing circuit; the advantage is that the amount of data transmitted afterwards is reduced, the calculation efficiency is improved, and the power consumption is reduced.
- Step S402b the control circuit of the main processing circuit transmits each part of the matrix P to each basic processing circuit in a broadcast manner;
- the data (parts) of the matrix P can be processed data.
- specifically, the main processing circuit enables the first mapping circuit to process the matrix P, thereby obtaining the processed matrix P and the second mask matrix associated with the matrix P.
- the first mapping circuit of the main processing circuit processes the matrix P according to the second mask matrix associated with the pre-stored matrix P to obtain the processed matrix P.
- the data in the processed matrix P (i.e., each part), together with the corresponding identification data in the second mask matrix, is sent by the control circuit to one or more of the K basic processing circuits;
- when the main processing circuit sends data to the basic processing circuit, specifically, only the data in the processed matrix P whose absolute value is greater than the preset threshold, or the non-zero data, may be sent to the basic processing circuit, so as to reduce the amount of data transmitted.
- each part of the matrix P can be broadcast only once to the register or on-chip buffer of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the matrix P obtained this time to complete the inner product operation corresponding to each row of the matrix Ai; the multiplexing in this embodiment means that the basic processing circuit uses the data repeatedly in the calculation, for example, multiplexing the data of the matrix P means using the data of the matrix P many times.
- the control circuit of the main processing circuit can broadcast each part of the matrix P to the register or on-chip buffer of each basic processing circuit multiple times; the i-th basic processing circuit does not multiplex the data of the matrix P obtained each time, and completes the inner product operation corresponding to each row of the matrix Ai in stages;
- the control circuit of the main processing circuit can broadcast each part of the matrix P to the register or on-chip buffer of each basic processing circuit multiple times; the i-th basic processing circuit partially multiplexes the data of the matrix P obtained each time to complete the inner product operation corresponding to each row of the matrix Ai;
- each of the basic processing circuits calculates an inner product of the data of the matrix Ai and the data of the matrix P;
- Step S403b the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits it back to the main processing circuit.
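Steps S401b-S403b can be simulated in a few lines; the round-robin assignment of rows to the K basic processing circuits is an assumption, since the text allows any non-repeating grouping:

```python
import numpy as np

# Simulation of steps S401b-S403b: rows of S are distributed among K basic
# processing circuits, the matrix P is broadcast to all of them, each circuit
# accumulates the inner products for its rows, and the main processing circuit
# reassembles the output.
def distributed_matmul(S, P, K):
    h = S.shape[0]
    result = np.zeros((h, P.shape[1]))
    for i in range(K):                  # one pass per basic processing circuit
        my_rows = range(i, h, K)        # rows this circuit is responsible for (assumed round-robin)
        for r in my_rows:
            # inner product of one row of S (the circuit's Ai) with each column
            # of the broadcast P, then sent back to the main processing circuit
            result[r] = S[r] @ P
    return result

S = np.arange(6.0).reshape(3, 2)
P = np.arange(4.0).reshape(2, 2)
assert np.array_equal(distributed_matmul(S, P, K=2), S @ P)
```

Because the row groups are disjoint, the K circuits can run these passes in parallel without conflicting writes.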
- step S403b the inner product operator of the basic processing circuit needs to calculate the inner product of the data of the matrix S and the matrix P.
- the basic processing circuit receives the data in the processed matrix S and the identification data associated with that data in the first mask matrix, and also receives the data in the processed matrix P.
- the basic processing circuit enables the second mapping circuit to process the data of the received matrix P according to the identification data in the received first mask matrix to obtain the data of the processed matrix P.
- the basic processing circuit enables the inner product operator circuit to perform an inner product operation on the received data in the processed matrix S and the processed matrix P data to obtain a result of the inner product operation.
- the basic processing circuit receives the data in the processed matrix P and the identification data associated with that data in the second mask matrix, and also receives the data in the processed matrix S.
- the basic processing circuit enables the second mapping circuit to process the data of the received matrix S according to the identification data in the received second mask matrix to obtain the data of the processed matrix S.
- the basic processing circuit enables the inner product operator circuit to perform an inner product operation on the received data of the processed matrix P and the processed data in the matrix S to obtain a result of the inner product operation.
- the basic processing circuit receives the data in the processed matrix S and the identification data associated with that data in the first mask matrix, and also receives the data in the processed matrix P and the identification data associated with that data in the second mask matrix.
- the basic processing circuit enables the second mapping circuit to obtain a relationship identification matrix from the identification data in the received first mask matrix and the identification data in the second mask matrix, and then uses the identification data in the relationship identification matrix to process the received data in the matrix S and the data in the matrix P respectively, obtaining the data of the processed matrix S and the data of the processed matrix P.
- the inner product operator circuit is enabled to perform an inner product operation on the data in the processed matrix S and the processed matrix P data to obtain a result of the inner product operation.
- that is, the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, the matrix P, and the second identification matrix associated with the matrix P; at this time, the second mapping circuit can be enabled to obtain the relationship identification matrix by using Bi and the second identification matrix, and the relationship identification matrix is then used to process the matrix Ai and the matrix P simultaneously or separately to obtain the processed matrix Ai and the processed matrix P.
- the inner product operator circuit is enabled to perform an inner product operation on the processed matrix Ai and the processed matrix P.
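As a sketch of the mask-based processing enumerated above, the following hypothetical Python model builds a 0/1 mask matrix for a data block, transmits only the surviving values, and restores the processed block on the receiving side. The function names and the threshold parameter are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def compress_with_mask(block, threshold=0.0):
    """Build a mask marking entries whose absolute value exceeds the
    threshold, and keep only those values (reducing transmission)."""
    mask = (np.abs(block) > threshold).astype(np.uint8)
    values = block[mask.astype(bool)]
    return values, mask

def decompress(values, mask):
    """Restore the processed block from the values and the mask matrix."""
    block = np.zeros(mask.shape, dtype=float)
    block[mask.astype(bool)] = values
    return block
```

A receiving circuit holding only `values` and `mask` can reconstruct the block exactly, which models why sending the mask alongside the compacted data suffices for the inner product operation.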
- the basic processing circuit may accumulate the partial sums obtained by each inner product operation and transmit them back to the main processing circuit;
- alternatively, the partial sums obtained by the inner product operations performed by each basic processing circuit may be stored in a register and/or an on-chip buffer of the basic processing circuit, accumulated there, and then returned to the main processing circuit;
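The row-block distribution, broadcast, per-circuit inner product, and result assembly described above can be modeled functionally as follows. This is a sketch only: the split count K and the function name are assumptions, and the real device performs these steps in hardware with the data movements described in the text.

```python
import numpy as np

def distributed_matmul(S, P, K):
    """Model: distribute row blocks Ai of S to K basic circuits,
    broadcast P, compute inner products per circuit, and let the
    main circuit assemble the returned results."""
    row_blocks = np.array_split(S, K, axis=0)        # Ai for each basic circuit
    partial_results = [Ai @ P for Ai in row_blocks]  # per-circuit inner products
    return np.vstack(partial_results)                # main circuit assembles
```

The assembled result equals the ordinary product S @ P, which is the correctness condition of the distribution scheme.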
- FIG. 6b provides a method for implementing a matrix multiplication vector, which may specifically include:
- Step S401 the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit saves the received distribution data in an on-chip buffer of the basic processing circuit and/or In the register;
- the data of the matrix S is processed data.
- the main processing circuit enables the first mapping circuit to process the matrix S, thereby obtaining the processed matrix S and a first mask matrix associated with the matrix S.
- the first mapping circuit of the main processing circuit processes the matrix S according to the first mask matrix associated with the pre-stored matrix S to obtain the processed matrix S.
- each row of data in the processed matrix S and the identification data associated with that row in the first mask matrix are sent by the control circuit to one or more of the K basic processing circuits.
- when the main processing circuit sends data to the basic processing circuit, only the data in the processed matrix S whose absolute value is greater than the preset threshold, or only the non-zero data, may be sent to the basic processing circuit to reduce the data transmission amount.
- suppose the set of rows of the processed matrix S distributed to the i-th basic processing circuit is Ai, with Mi rows in total; correspondingly, the identification matrix Bi corresponding to Ai is also distributed, where Bi is a part of the first mask matrix containing at least Mi rows.
- the control circuit of the main processing circuit distributes the rows of the matrix S one at a time to the K basic processing circuits; optionally, the identification data corresponding to each row is also sent row by row.
- the control circuitry of the main processing circuit distributes one or more rows of data in the matrix S to each of the basic processing circuits; the identification data corresponding to those one or several rows in the first identification matrix is also sent;
- the set of rows in S distributed to the i-th basic processing circuit is Ai, with Mi rows in total; FIG. 2c shows the calculation to be performed on the i-th basic processing circuit.
- the received distribution data, such as the matrix Ai, may be stored in a register and/or an on-chip buffer of the i-th basic processing circuit.
- Step S402 the control circuit of the main processing circuit transmits the parts in the vector P to the K basic processing circuits in a broadcast manner;
- the data (parts) of the vector P may be processed data.
- the main processing circuit enables the first mapping circuit to process the vector P to obtain a processed vector P and a second mask matrix associated with the vector P.
- the first mapping circuit of the main processing circuit processes the vector P according to the second mask matrix associated with the pre-stored vector P to obtain the processed vector P.
- the data in the processed vector P (ie, each part) and the identification data corresponding to that data in the second mask matrix are sent by the control circuit to one or more of the K basic processing circuits.
- when the main processing circuit sends data to the basic processing circuit, only the data of the processed vector P having an absolute value greater than a preset threshold, or only the non-zero data, may be sent to the basic processing circuit to reduce the data transmission amount.
- the control circuit of the main processing circuit can broadcast each part of the vector P only once to the register or the on-chip buffer of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the vector P obtained this time to complete the inner product operation corresponding to each row in the matrix Ai.
- the control circuit of the main processing circuit can broadcast each part of the vector P to the register or the on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit does not multiplex the data of the vector P obtained each time, completing the inner product operation corresponding to each row in the matrix Ai in stages; the advantage is that the amount of vector P data transmitted in a single transmission inside the basic processing circuit is reduced, the required capacity of the basic processing circuit's buffer and/or register can be reduced, execution efficiency is improved, transmission power consumption is reduced, and cost is reduced.
- the control circuit of the main processing circuit can broadcast each part of the vector P to the register or the on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit partially multiplexes the data of the vector P obtained each time to complete the inner product operation corresponding to each row in the matrix Ai; the advantage is that the data transmission amount from the main processing circuit to the basic processing circuit is reduced, the data transmission amount inside the basic processing circuit is also reduced, execution efficiency is improved, and transmission power consumption is reduced.
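The multiple-broadcast strategy without multiplexing can be modeled as staged partial inner products, where each broadcast delivers one part of the vector P and the circuit consumes it immediately. A minimal sketch, with the function name and the number of broadcasts as illustrative assumptions:

```python
import numpy as np

def staged_matvec(Ai, P, num_broadcasts):
    """Accumulate the inner products for the rows of Ai in stages,
    one partial inner product per broadcast part of the vector P."""
    acc = np.zeros(Ai.shape[0])
    col_chunks = np.array_split(np.arange(Ai.shape[1]), num_broadcasts)
    vec_chunks = np.array_split(P, num_broadcasts)
    for cols, p_part in zip(col_chunks, vec_chunks):
        acc += Ai[:, cols] @ p_part  # stage uses only this broadcast's data
    return acc
```

After all broadcasts the staged accumulation equals the full inner product Ai @ P, while at any moment only one chunk of P needs to reside in the circuit's buffer.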
- Step S403: the inner product operator circuits of the K basic processing circuits calculate the inner product of the data of the matrix S and the vector P; for example, the i-th basic processing circuit calculates the inner product of the data of the matrix Ai and the data of the vector P;
- the basic processing circuit receives the data in the processed matrix S and the identification data associated with that data in the first mask matrix, and also receives the data in the processed vector P.
- the basic processing circuit enables the second mapping circuit to process the data of the received vector P according to the identification data in the received first mask matrix to obtain the data of the processed vector P.
- the basic processing circuit enables the inner product operator circuit to perform an inner product operation on the received data in the processed matrix S and the processed vector P data to obtain a result of the inner product operation.
- that is, the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, and the vector P; at this time, the second mapping circuit can be enabled to process the vector P by using Bi to obtain the processed vector P, and the inner product operator circuit then performs an inner product operation on the matrix Ai and the processed vector P.
- the basic processing circuit receives the data in the processed vector P and the identification data associated with that data in the second mask matrix, and also receives the data in the processed matrix S.
- the basic processing circuit enables the second mapping circuit to process the data of the received matrix S according to the identification data in the received second mask matrix to obtain the data of the processed matrix S.
- the basic processing circuit enables the inner product operator circuit to perform an inner product operation on the received data of the processed vector P and the processed data in the matrix S to obtain a result of the inner product operation.
- that is, the i-th basic processing circuit receives the matrix Ai, the processed vector P, and the second identification matrix associated with the vector P; at this time, the second mapping circuit can be enabled to process Ai by using the second identification matrix to obtain the processed matrix Ai; the inner product operator circuit is further enabled to perform an inner product operation on the processed matrix Ai and the processed vector P.
- the basic processing circuit receives the data in the processed matrix S and the identification data associated with that data in the first mask matrix, and also receives the data in the processed vector P and the identification data associated with that data in the second mask matrix.
- the basic processing circuit enables the second mapping circuit to obtain a relationship identification matrix from the identification data in the received first mask matrix and the identification data in the second mask matrix, and then uses the identification data in the relationship identification matrix to process the received data in the matrix S and the data in the vector P respectively, obtaining the data of the processed matrix S and the data of the processed vector P.
- the inner product operator circuit is enabled to perform an inner product operation on the data in the processed matrix S and the processed vector P data to obtain a result of the inner product operation.
- that is, the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, the vector P, and the second identification matrix associated with the vector P; at this time, the second mapping circuit can be enabled to obtain the relationship identification matrix by using Bi and the second identification matrix, and the relationship identification matrix is then used to process the matrix Ai and the vector P simultaneously or separately to obtain the processed matrix Ai and the processed vector P.
- the inner product operator circuit is enabled to perform an inner product operation on the processed matrix Ai and the processed vector P.
- Step S404: the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain an accumulated result, and transmit the accumulated result back to the main processing circuit as a fixed-point type.
- each basic processing circuit may transmit the partial sum obtained from each inner product operation back to the main processing circuit for accumulation (a partial sum is a part of the accumulated result; for example, if the accumulated result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3); the advantage is that the amount of calculation inside the basic processing circuit is reduced, and the computational efficiency of the basic processing circuit is improved.
- the partial sums obtained by the inner product operations performed by each basic processing circuit may be stored in the register and/or on-chip buffer of the basic processing circuit and transferred to the main processing circuit after the accumulation is completed; the advantage is that the data transmission between the basic processing circuit and the main processing circuit is reduced, operation efficiency is improved, and data transmission power consumption is reduced.
- in some cases, the partial sums may be partly accumulated in the register and/or on-chip buffer of the basic processing circuit and partly transmitted to the main processing circuit for accumulation, and transferred back after the accumulation is completed; the advantage is that the data transmission amount between the basic processing circuit and the main processing circuit is reduced, operation efficiency is improved, data transmission power consumption is reduced, the amount of calculation inside the basic processing circuit is reduced, and the computational efficiency of the basic processing circuit is improved.
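The partial-sum trade-off described above can be illustrated numerically; in this sketch the split point between "sent back early" and "accumulated on-chip" is an arbitrary illustrative choice.

```python
def inner_product_partial_sum(F, G, split):
    """Return the partial sum over the first `split` products (what a
    basic circuit might transmit back early) and the full accumulated
    result (what complete on-chip accumulation would produce)."""
    partial = sum(f * g for f, g in zip(F[:split], G[:split]))
    total = partial + sum(f * g for f, g in zip(F[split:], G[split:]))
    return partial, total
```

With F = G = (1, 2, 3, 4, 5) and split = 3, the partial sum is F1*G1+F2*G2+F3*G3 = 14 and the full accumulated result is 55, matching the example in the text.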
- All or part of the data involved in the neural network training process may be processed data.
- for details of the processed data obtained by the first mapping circuit and/or the second mapping circuit, refer to the foregoing embodiments; the details are not repeated here.
- a data block (ie, any of the multiple input data blocks or the output data block), or a sub-block divided from a different portion of the same data block, may refer to a processed data block.
- Figure 4c shows the specific calculation of neural network training for single-layer operation.
- Figure 4c shows the forward operation of single-layer neural network.
- the inverse of the single layer neural network is shown by the dashed line in Figure 4c.
- the forward operation of the layer is performed according to the input data and the weight or parameter to obtain the output data, and a preset-rule calculation is then performed on the output data to obtain the output data gradient of the layer (the preset rule can be set by the manufacturer according to its own needs; the specific operation steps of the preset rule are not limited here).
- the inverse operation of the layer of the neural network can then be performed according to the input data, the weight or parameter of the layer, and the output data gradient, to obtain the gradient of the input data and the gradient of the weight or parameter of the layer; the weight or parameter of the layer is updated using the calculated gradient of the weight or parameter, and the neural network training of the layer is thereby completed.
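The per-layer forward/backward/update cycle described above can be sketched with dense matrices. The difference-from-target preset rule, the learning rate, and the function name are illustrative assumptions; the device realizes these steps as matrix operations on its circuits.

```python
import numpy as np

def train_layer_once(x, W, target, lr=0.01):
    """One training pass of a single layer: forward operation, output
    data gradient from a preset rule (here a simple difference from a
    target), inverse operation for the input and weight gradients,
    then the weight update that completes the layer's training."""
    y = x @ W                        # forward operation: output data
    grad_y = y - target              # preset rule -> output data gradient
    grad_x = grad_y @ W.T            # gradient of the input data
    grad_W = x.T @ grad_y            # gradient of the weight
    return grad_x, W - lr * grad_W   # input gradient, updated weight
```

When the target equals the forward output, all gradients vanish and the weight is unchanged, a quick sanity check on the gradient flow.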
- the data involved in the forward operation or the reverse operation may be the processed data.
- in the technical solution provided by the embodiment of the present application, it may be determined according to the operation instruction of the layer whether to enable the relevant mapping circuit (specifically, the first mapping circuit and/or the second mapping circuit) to process the input data and/or the weight, and the layer operation is then performed using the processed input data and/or weights.
- the following takes FIG. 7a and FIG. 7b as examples to describe the structure of neural network training for matrix multiplication and convolution.
- the calculation mode of the layer shown in FIG. 7a is matrix multiplication, and the operation mode of the layer shown in FIG. 7b is a convolution operation; assume that the input data and the weight of the layer are both matrices, and for convenience the input data is denoted matrix I.
- a larger dimension can be understood as a larger sum of the numbers of columns and rows of the matrix I and the matrix W, that is, the matrix I and the matrix W occupy a larger space in the memory and/or the register, and the amount of data calculation is larger; in order to improve data processing efficiency, the matrix I and the matrix W need to be processed before the matrix multiplication operation is performed.
- the matrix I is a sparse matrix of 1000*1000
- the matrix W is also a sparse matrix of 1000*1000.
- the sum of the number of columns and the number of rows is 2000, which is large, and the corresponding calculation amount is larger: the matrix multiplication requires 10^9 (ie, 1000*1000*1000) scalar multiplications for the inner products.
- processing the two sparse matrices can greatly reduce the dimensions (ie, the amount of data) of the matrix I and the matrix W, thereby greatly reducing the amount of data transmitted and the amount of calculation, which in turn reduces transmission overhead and computational overhead.
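The multiplication count for the 1000*1000 example above follows from the standard cost of a naive matrix multiplication, as this small check shows:

```python
def matmul_mult_count(m, k, n):
    """A naive (m x k) by (k x n) matrix multiplication performs one
    scalar multiplication per (row, column, inner-index) triple."""
    return m * k * n
```

For two 1000*1000 matrices this gives 1000*1000*1000 = 10^9 multiplications, which is why reducing the effective dimension of sparse operands pays off directly in compute.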
- a schematic diagram of the specific structure of multilayer neural network training is shown in FIG. 7c and FIG. 7d.
- the direction of the dashed arrow shows a reverse operation.
- the input of the inverse operation includes the output data gradient; when the output data gradient is for the last layer of the iterative calculation of the multilayer neural network, the output data gradient is obtained by subjecting the output data of the last layer to a preset operation (the preset operation can be set by the manufacturer according to its own needs, and the specific operation steps of the preset operation are not limited here); if the output data gradient is not for the last layer of the iterative calculation of the multilayer neural network (for example, it is for the nth layer), then the output data gradient of the nth layer can be the input data gradient calculated by the inverse operation of the (n+1)th layer.
- FIG. 7d can be understood as a schematic diagram of multi-layer convolutional neural network training (including the forward operation and the reverse operation); the other operations in the figure can represent layers other than the convolution layer, or operations between layers, without limitation.
- the present disclosure also provides an integrated circuit chip device for performing training of a neural network, the neural network comprising a plurality of layers, the integrated circuit chip device comprising: a processing circuit and an external interface;
- the external interface is configured to receive a training instruction
- the processing circuit is configured to determine first layer input data and first layer weight data according to the training instruction, and perform the n-layer forward operation of the neural network by using the first layer input data and the first layer weight data to obtain the nth output result;
- the processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain an nth reverse operation instruction of the nth layer reverse operation according to the training instruction, together with the nth layer input data and the nth layer weight group data required by the nth reverse operation instruction, and perform the n-layer reverse operation of the neural network according to the nth reverse operation instruction, the nth output result gradient, the nth layer input data, and the nth layer weight group data to obtain n weight gradients of the n-layer operation;
- the processing circuit is further configured to update the n weights of the n-layer operation by applying the n weight gradients.
- the present disclosure also discloses a neural network computing device including one or more chips as shown in FIG. 8, configured to acquire data to be processed and control information from other processing devices, perform the specified neural network operations, and pass the execution result to peripheral devices through the I/O interface. Peripheral devices include, for example, cameras, monitors, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one chip as shown in FIG. 8 is included, the chips can be linked and transmit data through a specific structure, for example, interconnected through the PCIE bus, to support larger-scale neural network operations. In this case, the chips may share the same control system or have separate control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection method can be any interconnection topology.
- the neural network computing device has high compatibility and can be connected to various types of servers through a PCIE interface.
- a method of performing an offset operation using the circuit device
- the vector operator circuit of the main processing circuit can realize the function of adding two vectors or two matrices
- the vector operator circuit of the main processing circuit can be used to add a vector to each row of a matrix, or to each column of a matrix.
- the matrix may be derived from the result of the matrix multiplication matrix operation performed by the apparatus;
- the vector may be from a result of the device performing a matrix multiplication vector operation
- the matrix may be derived from data received from outside by the main processing circuit of the device.
- the vector may be derived from data received from outside by the main processing circuit of the device.
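The row/column vector addition performed by the vector operator circuit can be sketched as follows; the function name and the `to` keyword are illustrative assumptions.

```python
import numpy as np

def add_vector(M, v, to="rows"):
    """Add vector v to each row of M (v has one entry per column),
    or to each column of M (v has one entry per row)."""
    if to == "rows":
        return M + v               # broadcast v across the rows
    return M + v[:, np.newaxis]    # broadcast v across the columns
```

This is the offset (bias) operation: the matrix typically comes from a preceding matrix-multiplication result and the vector from the bias parameters.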
- the method of activating the function operation is performed using the circuit device:
- the activation circuit of the main processing circuit passes each value in the input vector through an activation function (the input of the activation function is a value and the output is also a value), and outputs a value to the corresponding position of the output vector;
- the activation function can be a piecewise linear function
- the activation function can be any function that inputs a number and outputs a number.
- the source of the input vector is (including but not limited to):
- the input data is from the device for performing a matrix multiplication vector
- the input data is from the device for performing a matrix multiplication matrix
- the input data is from a result calculated by the main processing circuit of the device
- the input data is derived from the calculation result after the main processing circuit of the device implements the offset.
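The element-wise activation described above maps each input value to one output value at the corresponding position. A minimal sketch, with ReLU as an illustrative piecewise-linear choice (the disclosure allows any one-in/one-out function):

```python
def activate(vec, fn=None):
    """Pass each value of the input vector through a one-in/one-out
    activation function and write the result to the corresponding
    position of the output vector."""
    fn = fn or (lambda x: x if x > 0 else 0.0)  # ReLU as an example
    return [fn(x) for x in vec]
```

Any scalar function can be substituted for `fn`, for example a sigmoid or another piecewise-linear function, without changing the element-wise structure.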
- GEMM calculation refers to the operation of matrix-matrix multiplication in the BLAS library.
- auxiliary integers as parameters to describe the width and height of the matrix A and B;
- the main processing circuit can perform data type conversion on the input matrix S and the matrix P before performing the OP operation;
- the conversion circuit of the main processing circuit performs respective op operations on the input matrix S and the matrix P;
- op can be a transposition operation of the matrix; the matrix transposition operation is implemented by using the vector operation function of the main processing circuit or the data rearrangement function (it is mentioned above that the main processing circuit has a circuit for data rearrangement).
- the above OP can also be directly realized by a conversion circuit, for example, when the matrix transposition operation is performed, the OP operation is directly realized by the matrix transposition circuit;
- the op of a certain matrix may be empty, in which case the OP operation is not performed;
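GEMM in the BLAS sense computes C = alpha*op(A) @ op(B) + beta*C, where each op is either a transposition or empty. A functional sketch; the boolean flags are an assumed encoding of op, not the device's instruction format:

```python
import numpy as np

def gemm(alpha, A, B, beta, C, trans_a=False, trans_b=False):
    """BLAS-style GEMM: apply op (transpose or nothing) to each input
    matrix, multiply, scale by alpha, and accumulate into beta*C."""
    opA = A.T if trans_a else A
    opB = B.T if trans_b else B
    return alpha * (opA @ opB) + beta * C
```

With both flags false the op is empty and the call reduces to alpha*A@B + beta*C, matching the "op may be empty" case above.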
- GEMV calculation refers to the operation of matrix-vector multiplication in the BLAS library.
- the main processing circuit can perform data type conversion on the input matrix S and the matrix P before performing the OP operation;
- the conversion circuit of the main processing circuit performs a corresponding op operation on the input matrix S;
- the op can be a transposition operation of the matrix; the matrix transposition operation is implemented by the matrix transposition circuit of the main processing circuit;
- the op of a certain matrix may be empty, in which case the op operation is not performed;
- the matrix-vector multiplication calculation between the matrix op(S) and the vector P is completed by the matrix multiplication vector calculation method;
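Similarly, GEMV computes result = alpha*op(S) @ P + beta*C for a matrix S and vector P; the boolean flag encoding of op is assumed as in the GEMM sketch.

```python
import numpy as np

def gemv(alpha, S, P, beta, C, trans_s=False):
    """BLAS-style GEMV: op(S) is S or its transpose; the matrix-vector
    product is scaled by alpha and accumulated into beta*C."""
    opS = S.T if trans_s else S
    return alpha * (opS @ P) + beta * C
```

The matrix-vector product inside this formula is what the distributed matrix-multiplication-vector method of FIG. 6b computes on the basic processing circuits.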
- the vector updater circuit of the main processing circuit is used to implement the weight update function in the neural network training process.
- the weight update refers to a method of updating the weight using the gradient of the weight.
- specifically, the vector operator circuit of the main processing circuit adds or subtracts the two vectors of the weight and the weight gradient to obtain an operation result, and the operation result is the updated weight.
- optionally, the vector operator circuit of the main processing circuit multiplies or divides the weight and the weight gradient by a number to obtain an intermediate weight and an intermediate weight gradient value, and then adds or subtracts the intermediate weight and the intermediate weight gradient value to obtain the operation result, which is the updated weight.
- optionally, a set of momentum values can first be calculated by using the gradient of the weights, and the updated weights are then obtained by adding and subtracting the momentum and the weights.
- the reverse operation of the fully connected layer can be divided into two parts. As shown in the following figure, the solid arrow indicates the forward calculation process of the fully connected layer, and the broken line indicates the reverse calculation process of the fully connected layer.
- the method of performing matrix multiplication operation using the device can complete the inverse operation of the fully connected layer
- the inverse operation of the convolutional layer can be divided into two parts. As shown in Fig. 9a, the solid arrow indicates the forward calculation process of the convolutional layer, as shown in Fig. 9b, which represents the reverse calculation process of the convolutional layer.
- the reverse operation of the convolutional layer can be accomplished using the apparatus shown in FIG. 1a or the apparatus shown in FIG. 1b.
- whether a forward operation or a reverse operation, it is actually a plurality of operations of the neural network, including but not limited to one or more of: matrix multiplied by matrix, matrix multiplied by vector, convolution operation, activation operation, and the like.
- the manners of performing the above operations are described in the present disclosure and will not be repeated here.
- Embodiments of the present disclosure provide a neural network processor board that can be used in numerous general-purpose or dedicated computing system environments or configurations, for example: personal computers (PCs), server computers, handheld or portable devices, tablet devices, smart homes, home appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronics devices, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
- FIG. 10a is a schematic structural diagram of a neural network processor card according to an embodiment of the present disclosure.
- the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.
- the disclosure is not limited to the specific structure of the neural network chip package structure 11.
- the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a first Two substrates 113.
- the specific form of the neural network chip 111 involved in the disclosure is not limited.
- the above neural network chip 111 includes, but is not limited to, a neural network wafer integrating a neural network processor, and the wafer may be made of silicon material, germanium material, quantum material, molecular material, or the like.
- the neural network wafer can be packaged according to the actual situation (for example, a harsher environment) and different application requirements, so that most of the neural network wafer is wrapped, and the pins on the neural network wafer are connected to the outside of the package structure through conductors such as gold wires for electrical connection with the outer circuit.
- the specific structure of the neural network chip 111 is not limited in this disclosure. Alternatively, please refer to the device shown in FIG. 1a or 1b.
- the present disclosure is not limited to the types of the first substrate 13 and the second substrate 113, and may be a printed circuit board (PCB) or a printed wiring board (PWB), and may be other circuit boards. There are no restrictions on the materials used to make the PCB.
- the second substrate 113 of the present disclosure is used to carry the neural network chip 111, and the neural network chip package structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, is used to protect the neural network chip 111 and facilitate further packaging of the neural network chip package structure 11 with the first substrate 13.
- the packaging may adopt a Flip Chip Ball Grid Array Package (FCBGAP), a Low-profile Quad Flat Package (LQFP), a Quad Flat Package with Heat Sink (HQFP), a Fine-Pitch Ball Grid Array Package (FBGA), etc.
- Flip Chip is suitable for cases where a high requirement is placed on the area after packaging, or where the design is sensitive to the inductance of the wires and the transmission time of the signals.
- wire bonding can be used to reduce the cost and increase the flexibility of the package structure.
- Ball Grid Array can provide more pins, and the average lead length of the pins is short, enabling high-speed signal transmission.
- the package can be packaged in Pin Grid Array (PGA). Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA), etc.
- the neural network chip 111 and the second substrate 113 are encapsulated using a Flip Chip Ball Grid Array package.
- as shown in FIG. 11a, the neural network chip package structure includes a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
- the pads 22 are connected to the neural network chip 21, and the solder balls 23 are soldered between the pads 22 and the connection points 25 on the second substrate 24 to connect the neural network chip 21 and the second substrate 24, thereby realizing the packaging of the neural network chip 21.
- the pins 26 are used for connecting to circuits external to the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling transmission of external and internal data and facilitating data processing by the neural network chip 21 or its corresponding neural network processor.
- the type and number of pins are not limited in this disclosure; different pin types may be selected according to different packaging technologies and arranged according to certain rules.
- the neural network chip package structure further includes an insulating filler disposed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between solder balls.
- the material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
- the neural network chip package structure further includes a heat dissipation device for dissipating heat of the neural network chip 21 during operation.
- the heat dissipation device may be a piece of metal with good thermal conductivity, a heat sink, or a fan.
- the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal case heat sink 29.
- the thermal grease 28 and the metal case heat sink 29 are used to dissipate the heat generated by the neural network chip 21 during operation.
- the neural network chip package structure 11 further includes a reinforcing structure, which is connected to the pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the pad 22.
- the reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
- the present disclosure does not limit the specific form of the first electrical and non-electrical connection device 12; reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be connected by soldering, or the second substrate 113 and the first substrate 13 may be connected by wires or by plugging and unplugging, which facilitates subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
- the first substrate 13 includes an interface for a memory unit for expanding storage capacity, for example, Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR SDRAM), etc.; expanding the memory improves the processing capability of the neural network processor.
- the first substrate 13 may further include a Peripheral Component Interconnect-Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, etc., used for data transmission between the package structure and external circuits, which can improve the operation speed and convenience of operation.
- the neural network processor is packaged as a neural network chip 111, the neural network chip 111 is packaged as a neural network chip package structure 11, and the neural network chip package structure 11 is packaged as a neural network processor board 10; the board performs data interaction with an external circuit (for example, a computer motherboard) through an interface (slot or socket) on the board, that is, the neural network processor board 10 directly implements the function of the neural network processor while protecting the neural network chip 111.
- other modules can be added to the neural network processor board 10, which improves the application range and computational efficiency of the neural network processor.
- the present disclosure discloses an electronic device that includes the neural network processor board 10 or neural network chip package structure 11 described above.
- Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, camcorders, projectors, watches, headphones, mobile storage devices, wearables, vehicles, household appliances, and/or medical equipment.
- the vehicle includes an airplane, a ship, and/or a car;
- the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood;
- the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Image Processing (AREA)
Abstract
The present disclosure relates to an integrated circuit chip device and a related product. The integrated circuit chip device comprises: a main processing circuit and a plurality of basic processing circuits; the main processing circuit, or at least one basic processing circuit among the plurality of basic processing circuits, comprises a compression mapping circuit, the compression mapping circuit being used to perform compression processing on each data element in a neural network operation.
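The abstract describes a compression mapping circuit that compresses data elements in a neural network operation, without specifying the scheme. As a hedged illustration only, the sketch below shows one plausible interpretation in software: dropping zero-valued elements (common in sparse neural-network data) and keeping a bitmask of their positions. The function names `compress_map` and `decompress_map` are hypothetical and not taken from the patent.

```python
# Illustrative sketch of a mask-based compression mapping for sparse data.
# This is an assumption about the scheme, not the patent's actual circuit.

def compress_map(data):
    """Split a list of values into (position mask, nonzero values)."""
    mask = [1 if v != 0 else 0 for v in data]
    values = [v for v in data if v != 0]
    return mask, values

def decompress_map(mask, values):
    """Rebuild the original list from the mask and the nonzero values."""
    it = iter(values)
    return [next(it) if m else 0 for m in mask]

if __name__ == "__main__":
    original = [0, 3, 0, 0, 7, 1, 0, 2]
    mask, vals = compress_map(original)
    assert decompress_map(mask, vals) == original
    print(mask, vals)
```

Under this reading, the main or basic processing circuit would operate on the compacted `values` list, reducing the amount of data transmitted and computed when many elements are zero.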
Applications Claiming Priority (12)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810164844.8A CN110197275B (zh) | 2018-02-27 | 2018-02-27 | 集成电路芯片装置及相关产品 |
| CN201810164331.7A CN110197269B (zh) | 2018-02-27 | 2018-02-27 | 集成电路芯片装置及相关产品 |
| CN201810161819.4A CN110197263B (zh) | 2018-02-27 | 2018-02-27 | 集成电路芯片装置及相关产品 |
| CN201810164843.3A CN110197274B (zh) | 2018-02-27 | 2018-02-27 | 集成电路芯片装置及相关产品 |
| CN201810164331.7 | 2018-02-27 | ||
| CN201810161819.4 | 2018-02-27 | ||
| CN201810161886.6 | 2018-02-27 | ||
| CN201810161820.7 | 2018-02-27 | ||
| CN201810161820.7A CN110197264B (zh) | 2018-02-27 | 2018-02-27 | 神经网络处理器板卡及相关产品 |
| CN201810161886.6A CN110197265B (zh) | 2018-02-27 | 2018-02-27 | 集成电路芯片装置及相关产品 |
| CN201810164843.3 | 2018-02-27 | ||
| CN201810164844.8 | 2018-02-27 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019165946A1 true WO2019165946A1 (fr) | 2019-09-06 |
Family
ID=67805195
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2019/076088 Ceased WO2019165946A1 (fr) | 2018-02-27 | 2019-02-25 | Dispositif à microcircuit intégré, carte de circuit imprimé et produit associé |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2019165946A1 (fr) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021070006A1 (fr) * | 2019-10-11 | 2021-04-15 | International Business Machines Corporation | Parallélisme de modèle de données hybride pour un apprentissage profond efficace |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2000039658A2 (fr) * | 1998-12-30 | 2000-07-06 | Irvine Sensors Corporation | Module neural de traitement a architectures d'entree optimisant l'utilisation d'un reseau pondere de synapses |
| WO2014085975A1 (fr) * | 2012-12-04 | 2014-06-12 | 中国科学院半导体研究所 | Système de traitement de réseau multi-données à instruction unique parallèle multi-étape, reconfigurable dynamiquement |
| CN107003989A (zh) * | 2014-12-19 | 2017-08-01 | 英特尔公司 | 用于人工神经网络中的分布式与协作计算的方法和装置 |
| CN107341544A (zh) * | 2017-06-30 | 2017-11-10 | 清华大学 | 一种基于可分割阵列的可重构加速器及其实现方法 |
| CN107578095A (zh) * | 2017-09-01 | 2018-01-12 | 中国科学院计算技术研究所 | 神经网络计算装置及包含该计算装置的处理器 |
-
2019
- 2019-02-25 WO PCT/CN2019/076088 patent/WO2019165946A1/fr not_active Ceased
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2000039658A2 (fr) * | 1998-12-30 | 2000-07-06 | Irvine Sensors Corporation | Module neural de traitement a architectures d'entree optimisant l'utilisation d'un reseau pondere de synapses |
| WO2014085975A1 (fr) * | 2012-12-04 | 2014-06-12 | 中国科学院半导体研究所 | Système de traitement de réseau multi-données à instruction unique parallèle multi-étape, reconfigurable dynamiquement |
| CN107003989A (zh) * | 2014-12-19 | 2017-08-01 | 英特尔公司 | 用于人工神经网络中的分布式与协作计算的方法和装置 |
| CN107341544A (zh) * | 2017-06-30 | 2017-11-10 | 清华大学 | 一种基于可分割阵列的可重构加速器及其实现方法 |
| CN107578095A (zh) * | 2017-09-01 | 2018-01-12 | 中国科学院计算技术研究所 | 神经网络计算装置及包含该计算装置的处理器 |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021070006A1 (fr) * | 2019-10-11 | 2021-04-15 | International Business Machines Corporation | Parallélisme de modèle de données hybride pour un apprentissage profond efficace |
| CN114424214A (zh) * | 2019-10-11 | 2022-04-29 | 国际商业机器公司 | 用于高效深度学习的混合数据-模型并行性 |
| GB2604060A (en) * | 2019-10-11 | 2022-08-24 | Ibm | Hybrid data-model parallelism for efficient deep learning |
| JP2022552803A (ja) * | 2019-10-11 | 2022-12-20 | インターナショナル・ビジネス・マシーンズ・コーポレーション | ハイブリッド・データ-モデル並列処理方法、システム、プログラム |
| US11556450B2 (en) | 2019-10-11 | 2023-01-17 | International Business Machines Corporation | Hybrid data-model parallelism for efficient deep learning |
| JP7497946B2 (ja) | 2019-10-11 | 2024-06-11 | インターナショナル・ビジネス・マシーンズ・コーポレーション | ハイブリッド・データ-モデル並列処理方法、システム、プログラム |
| GB2604060B (en) * | 2019-10-11 | 2024-10-16 | Ibm | Hybrid data-model parallelism for efficient deep learning |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11748605B2 (en) | Integrated circuit chip device | |
| US12217162B2 (en) | Integrated circuit chip apparatus | |
| TWI793225B (zh) | 神經網絡訓練方法及相關產品 | |
| CN111105033B (zh) | 神经网络处理器板卡及相关产品 | |
| CN111242294B (zh) | 集成电路芯片装置及相关产品 | |
| TWI767097B (zh) | 集成電路芯片裝置及相關產品 | |
| CN110197264B (zh) | 神经网络处理器板卡及相关产品 | |
| TWI793224B (zh) | 集成電路芯片裝置及相關產品 | |
| WO2019165946A1 (fr) | Dispositif à microcircuit intégré, carte de circuit imprimé et produit associé | |
| CN110197267B (zh) | 神经网络处理器板卡及相关产品 | |
| CN109978152B (zh) | 集成电路芯片装置及相关产品 | |
| TWI768160B (zh) | 集成電路芯片裝置及相關產品 | |
| CN109977071A (zh) | 神经网络处理器板卡及相关产品 | |
| WO2019165940A1 (fr) | Appareil à microcircuit intégré, carte et produit associé | |
| CN109961137B (zh) | 集成电路芯片装置及相关产品 | |
| CN109978130A (zh) | 集成电路芯片装置及相关产品 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19760275 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 19760275 Country of ref document: EP Kind code of ref document: A1 |