
WO2019165946A1 - Integrated circuit chip device, board card and related product - Google Patents


Info

Publication number
WO2019165946A1
WO2019165946A1 (PCT/CN2019/076088)
Authority
WO
WIPO (PCT)
Prior art keywords
data block
processing circuit
data
basic
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/076088
Other languages
French (fr)
Chinese (zh)
Inventor
刘少礼
宋新开
王秉睿
张尧
胡帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201810164844.8A external-priority patent/CN110197275B/en
Priority claimed from CN201810164331.7A external-priority patent/CN110197269B/en
Priority claimed from CN201810161819.4A external-priority patent/CN110197263B/en
Priority claimed from CN201810164843.3A external-priority patent/CN110197274B/en
Priority claimed from CN201810161820.7A external-priority patent/CN110197264B/en
Priority claimed from CN201810161886.6A external-priority patent/CN110197265B/en
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Publication of WO2019165946A1 publication Critical patent/WO2019165946A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device, a board, and related products.
  • ANN Artificial Neural Network
  • a neural network is an operational model consisting of a large number of nodes (or neurons) connected to each other.
  • existing neural network calculation is based on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit) to implement the operation of the neural network. Such an approach involves a large amount of computation and high power consumption.
  • Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can improve the processing speed of the computing device and improve efficiency.
  • an integrated circuit chip device comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
  • the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
  • the main processing circuit is configured to perform the successive operations in the neural network computation and to transmit data to the basic processing circuits connected to it;
  • the plurality of basic processing circuits are configured to perform operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to it.
  • a second aspect provides an integrated circuit chip device, which includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
  • the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
  • the main processing circuit is configured to acquire an input data block, a convolution kernel data block, and a convolution instruction; divide the input data block into vertical data blocks and the convolution kernel data block into horizontal data blocks according to the convolution instruction; determine, according to the operation control of the convolution instruction, whether to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data block and/or the vertical data block; and transmit the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the convolution instruction;
  • the plurality of basic processing circuits are configured to determine, according to the operation control of the convolution instruction, whether to start the second mapping circuit to process a second data block, perform the operations in the neural network in parallel according to the processed second data block to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected to it;
  • the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and is associated with the processed first data block;
  • the main processing circuit is configured to process the operation results to obtain the instruction result of the convolution instruction.
  • a third aspect provides an integrated circuit chip device comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
  • the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
  • the main processing circuit is configured to acquire an input data block, a weight data block, and a multiplication instruction; divide the input data block into horizontal data blocks and the weight data block into vertical data blocks according to the multiplication instruction; determine, according to the operation control of the multiplication instruction, whether to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data block and/or the vertical data block; and transmit the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the multiplication instruction;
  • the plurality of basic processing circuits are configured to determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process a second data block, perform the operations in the neural network in parallel according to the processed second data block to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected to it; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and is associated with the processed first data block;
  • the main processing circuit is configured to process the operation results to obtain the instruction result of the multiplication instruction.
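The multiplication-instruction flow above — the input split into horizontal (row) blocks, the weights split into vertical (column) blocks, each basic processing circuit computing the inner products for one block pair, and the main circuit stitching the results — can be sketched in plain Python. The two-way split and the split positions are illustrative assumptions, not part of the claim:

```python
def matmul(a, b):
    # Plain inner-product matrix multiply: a is p x q, b is q x r.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def blocked_matmul(x, w, row_split, col_split):
    """Split x into horizontal (row) blocks and w into vertical (column)
    blocks; each (row block, column block) pair stands in for one basic
    processing circuit computing its inner products; the 'main circuit'
    then stitches the partial results back together."""
    top, bottom = x[:row_split], x[row_split:]
    left = [row[:col_split] for row in w]
    right = [row[col_split:] for row in w]
    out = []
    for row_block in (top, bottom):
        part_l = matmul(row_block, left)    # one basic circuit's work
        part_r = matmul(row_block, right)   # another basic circuit's work
        out.extend([pl + pr for pl, pr in zip(part_l, part_r)])
    return out
```

The result equals the ordinary product of the unsplit matrices, which is why the partitioning can be chosen freely to balance load across the basic processing circuits.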
  • a fourth aspect provides an integrated circuit chip device for performing a neural network forward operation, the neural network comprising n layers; the integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and both the first mapping circuit and the second mapping circuit are used to perform compression processing on data in the neural network operation;
  • the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
  • the main processing circuit is configured to receive a forward operation instruction and parse it to obtain the first operation instruction included in the i-th layer of the neural network forward operation, together with the input data block and the weight data block required by the first operation instruction, where i is an integer with 1 ≤ i ≤ n; when i ≥ 2, the input data block is the output data block of the (i-1)-th layer;
  • the main processing circuit is further configured to divide the input data block into a vertical data block and the weight data block into a horizontal data block according to the first operation instruction; determine, according to the operation control of the first operation instruction, whether to activate the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data block and/or the vertical data block; and send the processed first data block, according to the forward operation instruction, to at least one of the basic processing circuits connected to the main processing circuit;
  • the plurality of basic processing circuits are configured to determine, according to the operation control of the first operation instruction, whether to start the second mapping circuit to process a second data block, perform the operations in the neural network in parallel according to the processed second data block to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected to it; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and is associated with the processed first data block;
  • the main processing circuit is configured to process the operation result to obtain an instruction result of the first operation instruction, and complete an operation of the first operation instruction included in the ith layer.
  • a fifth aspect provides an integrated circuit chip device for performing training of a neural network; the neural network includes n layers, where n is an integer greater than or equal to 2; the integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and both the first mapping circuit and the second mapping circuit are used to perform compression processing on data in the neural network operation;
  • the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
  • the integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation;
  • the main processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result, and to obtain, according to the training instruction, the n-th layer inverse operation instruction and the data required by the n-th inverse operation; the first data block includes the horizontal data block and/or the vertical data block; the processed first data block is transmitted, according to the n-th inverse operation instruction, to at least one of the basic processing circuits connected to the main processing circuit;
  • the plurality of basic processing circuits are configured to determine, according to the operation control of the n-th inverse operation instruction, whether to start the second mapping circuit to process a second data block, perform the operations in the neural network in parallel according to the processed second data block to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected to it; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and is associated with the processed first data block;
  • the main processing circuit is further configured to process the operation results to obtain the n-th layer weight group gradient and the n-th layer input data gradient, and to update the n-th layer weight group data using the n-th layer weight group gradient;
  • the integrated circuit chip device is further configured to use the n-th layer input data gradient as the output result gradient of the (n-1)-th layer and perform the (n-1)-layer inverse operations to obtain the weight group gradient of each layer, updating the weight group data of the corresponding layer with that gradient; the weight group data includes at least two weights.
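The layer-by-layer update described above (each layer's weight group data updated with its weight group gradient) can be sketched as follows. The text only says the gradient is applied to the weight group data; the plain-SGD update rule and the learning rate `lr` here are assumptions for illustration:

```python
def update_weight_groups(weight_groups, weight_gradients, lr=0.01):
    """Apply each layer's weight-group gradient to that layer's weight
    group data. Plain SGD (w -= lr * g) is an assumed update rule; the
    claim itself does not fix the rule or the learning rate."""
    return [[w - lr * g for w, g in zip(layer_w, layer_g)]
            for layer_w, layer_g in zip(weight_groups, weight_gradients)]
```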
  • a neural network processor board card includes: a neural network chip package structure, a first electrical and non-electrical connection device, and a first substrate; the neural network chip package structure includes: a neural network chip, a second electrical and non-electrical connection device, and a second substrate; the second substrate carries the neural network chip, and the neural network chip is connected to the second substrate through the second electrical and non-electrical connection device;
  • the neural network chip includes the integrated circuit chip device provided by any one of the first to fifth aspects.
  • a neural network computing device comprising the integrated circuit chip device provided by any one of the first to fifth aspects.
  • a combined processing apparatus includes: the neural network computing device provided by the seventh aspect, a universal interconnect interface, and a general-purpose processing device;
  • the neural network computing device is coupled to the general-purpose processing device via the universal interconnect interface.
  • a chip is provided, the chip integrating the apparatus provided in any one of the first to eighth aspects above.
  • an electronic device comprising the chip of the ninth aspect.
  • a method for computing a neural network is provided, the method being applied to an integrated circuit chip device; the integrated circuit chip device includes the integrated circuit chip device of any one of the first to fifth aspects, which is configured to perform the operations of the neural network.
  • since the compression mapping circuit is provided to compress the data blocks before operations are performed, transmission resources and computing resources are saved, which yields the advantages of low power consumption and a reduced amount of computation.
  • FIG. 1a is a schematic structural view of an integrated circuit chip device.
  • FIG. 1b is a schematic structural view of another integrated circuit chip device.
  • Figure 1c is a schematic structural view of a basic processing circuit.
  • Figure 1d is a schematic diagram of the structure of a main processing circuit.
  • Figure 2a is a schematic diagram of the use of a basic processing circuit.
  • Figure 2b is a schematic diagram of the data transmitted by the main processing circuit.
  • Figure 2c is a schematic diagram of a matrix multiplied by a vector.
  • FIG. 2d is a schematic structural view of an integrated circuit chip device.
  • FIG. 2e is a schematic structural view of still another integrated circuit chip device.
  • Figure 2f is a schematic diagram of a matrix multiplied by a matrix.
  • Figure 3a is a schematic diagram of convolution input data.
  • Figure 3b is a schematic diagram of a convolution kernel.
  • Figure 3c is a schematic diagram of the operation window of a three-dimensional data block of input data.
  • Figure 3d is a schematic diagram of another operational window of a three-dimensional data block of input data.
  • Figure 3e is a schematic diagram of still another operational window of a three-dimensional data block of input data.
  • Figure 4a is a schematic diagram of a training method of a neural network.
  • Figure 4b is a schematic diagram of the forward operation of a neural network.
  • Figure 4c is a schematic diagram of a neural network operation.
  • FIGS. 5a-5b are schematic structural diagrams of two mapping circuits provided by an embodiment of the present application.
  • Figure 6a is a flow chart of a method of multiplying a matrix by a matrix.
  • Figure 6b is a flow chart of a method of multiplying a matrix by a vector.
  • Figure 7a is a schematic diagram of neural network training.
  • Figure 7b is a schematic diagram of another neural network training.
  • Figure 7c is a schematic diagram of the forward and reverse operations of the neural network.
  • Figure 7d is a schematic diagram of a multi-layer structure of neural network training.
  • FIG. 8 is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure.
  • FIG. 9a is a schematic structural diagram of a combined processing apparatus according to the present disclosure.
  • FIG. 9b is another schematic structural diagram of a combined processing device according to the present disclosure.
  • FIG. 10 is a schematic structural diagram of a neural network processor card provided by an embodiment of the present disclosure.
  • FIG. 10b is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
  • FIG. 11a is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
  • FIG. 11b is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.
  • the apparatus includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing on data in a neural network operation;
  • the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
  • the main processing circuit is configured to perform the successive operations in the neural network computation and to transmit data to the basic processing circuits connected to it;
  • the plurality of basic processing circuits are configured to perform operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to it.
  • the main processing circuit is configured to acquire a data block to be computed and an operation instruction, and to divide the data block to be computed into a horizontal data block and a vertical data block according to the operation instruction; to split the horizontal data block and the pre-stored identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks associated with those basic data blocks; to distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to it; and to broadcast the vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected to it.
  • the identification data block may be represented by a direct index or a step index; optionally, it may also be represented as a List of Lists (LIL), a Coordinate list (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELLPACK (ELL), or Hybrid (HYB), which is not limited in this application.
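As a concrete illustration of one of the formats named above, a minimal sketch of Compressed Sparse Row (CSR) encoding in plain Python follows; the function name and tuple layout are conventions chosen here, not part of the application:

```python
def dense_to_csr(matrix):
    """Encode a dense matrix in CSR form: the non-zero values in row-major
    order, their column indices, and row pointers marking where each row's
    entries start in the value list."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's entries
    return values, col_idx, row_ptr
```

For a sparse weight matrix, only `values` and the two small index arrays need to be transmitted, which is the transmission saving the mapping circuits aim at.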
  • when the identification data block is represented by a direct index, it may be a data block composed of 0s and 1s, where 0 indicates that the absolute value of the corresponding element in the data block (such as a weight or an input neuron) is less than or equal to a first threshold, and 1 indicates that the absolute value is greater than the first threshold; the first threshold is a value set freely on the user side or the device side, such as 0.05 or 0.
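The direct-index identification data block described above can be sketched in a few lines of Python; the default threshold of 0.05 is the example value from the text:

```python
def direct_index_mask(block, threshold=0.05):
    """Build the direct-index identification data block for a 2-D data
    block: 1 where the element's absolute value exceeds the threshold,
    0 where it is less than or equal to the threshold."""
    return [[1 if abs(v) > threshold else 0 for v in row] for row in block]
```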
  • specifically, the target data in the plurality of basic data blocks and the identification data blocks associated with those basic data blocks may be distributed to the basic processing circuits connected to the main processing circuit; optionally, the target data in the processed vertical data block and the identification data block associated with the vertical data block may also be broadcast to them; the target data refers to the data in a data block (here, the processed horizontal data block or the processed vertical data block) whose absolute value is greater than the first threshold, or to the non-zero data in the data block.
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; to process the vertical data block and the basic data block according to the connection identification data block, obtaining a processed vertical data block and a processed basic data block; to perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result; and to send the operation result to the main processing circuit;
  • the main processing circuit is configured to process the operation result to obtain the instruction result of the operation instruction for the data block to be computed.
  • optionally, the horizontal data block is a matrix of M1 rows and N1 columns, and the basic data block is a matrix of M2 rows and N2 columns, where M1 > M2 and N1 > N2; accordingly, the identification data block associated with the horizontal data block is also a matrix of M1 rows and N1 columns, and the identification data block associated with the basic data block is also a matrix of M2 rows and N2 columns.
  • for example, the first threshold is 0.05; the processing of the data blocks by the first mapping circuit and the second mapping circuit is described in detail later.
  • the main processing circuit is configured to acquire a data block to be computed and an operation instruction, and to divide the data block to be computed into a horizontal data block and a vertical data block according to the operation instruction; to start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining a processed horizontal data block with its associated identification data block and a processed vertical data block with its associated identification data block; to split the processed horizontal data block and the identification data block associated with it, obtaining a plurality of basic data blocks and the identification data block associated with each basic data block; to distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to it; and to broadcast the vertical data block and the identification data block associated with it to the basic processing circuits connected to it;
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; to process the vertical data block and the basic data block according to the connection identification data block, obtaining a processed vertical data block and a processed basic data block; to perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result; and to transmit the operation result to the main processing circuit;
  • the main processing circuit is configured to process the operation result to obtain the instruction result of the operation instruction for the data block to be computed.
  • the main processing circuit is further configured to split the vertical data block, or the processed vertical data block together with the identification data block associated with it, to obtain a plurality of partial vertical data blocks and the identification data block associated with each partial vertical data block; and to broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuits in one or more transmissions; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; to process the partial vertical data block and the basic data block according to the connection identification data block, obtaining a processed partial vertical data block and a processed basic data block; and to perform an inner product operation on the processed partial vertical data block and the processed basic data block.
  • connection identification data block is a data block obtained by performing an element-by-element operation on the identification data block associated with the basic data block and the identification data block associated with the partial vertical data block.
  • the connection identification data block is used to indicate the positions at which the data in both data blocks (specifically, the basic data block and the vertical data block) have absolute values greater than the first threshold; the details are described later.
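Given two 0/1 identification data blocks, the element-by-element combination described above can be sketched as a logical AND: a position survives only if both data blocks hold above-threshold data there. The AND choice is an assumption consistent with the description, since the source names only an "element-by-element operation":

```python
def connection_mask(mask_a, mask_b):
    """Combine two direct-index identification data blocks element by
    element (here with logical AND), keeping only positions where both
    underlying data blocks hold above-threshold data."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]
```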
  • for example, the identification data block associated with the horizontal data block is a 2*3 matrix, and the identification data block associated with a partial vertical data block is a 2*2 matrix; the corresponding connection identification data block is obtained from them.
  • the main processing circuit is configured to acquire a data block to be computed and an operation instruction, and to divide the data block to be computed into a horizontal data block and a vertical data block according to the operation instruction; to start the first mapping circuit to process the horizontal data block, obtaining a processed horizontal data block and the identification data block associated with it, or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block, obtaining a processed horizontal data block; to split the processed horizontal data block and the identification data block associated with it, obtaining a plurality of basic data blocks and the identification data block associated with each basic data block; to distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to it; and to broadcast the vertical data block to the basic processing circuits connected to it;
  • the basic processing circuit is configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, obtaining a processed vertical data block; to perform an inner product operation on the processed vertical data block and the basic data block to obtain an operation result; and to transmit the operation result to the main processing circuit;
  • the main processing circuit is configured to process the operation result to obtain an instruction result.
  • the main processing circuit is further configured to split the vertical data block to obtain a plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to process the partial vertical data block according to the identification data block associated with the basic data block to obtain a processed partial vertical data block, and perform an inner product operation on the basic data block and the processed partial vertical data block.
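The distribute/broadcast scheme in the bullets above can be sketched as a small software model, assuming the horizontal data block is split by rows into basic data blocks, the vertical data block is broadcast whole, and the main circuit re-assembles the partial results; `simulate`, the row-wise split, and the block count are assumptions for illustration:

```python
import numpy as np

def simulate(horizontal, vertical, num_basic=4):
    # Main processing circuit: split the horizontal data block into
    # basic data blocks (here, groups of rows) and distribute them.
    basic_blocks = np.array_split(horizontal, num_basic, axis=0)
    # The vertical data block is broadcast to every basic processing circuit,
    # which performs the inner product operation in parallel.
    partials = [blk @ vertical for blk in basic_blocks]
    # Main processing circuit: arrange the partial results into the instruction result.
    return np.vstack(partials)

H = np.arange(12.0).reshape(4, 3)  # horizontal (distributed) data block
V = np.arange(6.0).reshape(3, 2)   # vertical (broadcast) data block
assert np.allclose(simulate(H, V), H @ V)
```

The model only illustrates the data movement pattern; in the device, each basic data block is processed by a separate basic processing circuit.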
  • the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction, and start the first mapping circuit to process the vertical data block to obtain the processed vertical data block and the identification data block associated with the vertical data block, or start the first mapping circuit to process the vertical data block according to the pre-stored identification data block associated with the vertical data block to obtain a processed vertical data block; split the horizontal data block to obtain a plurality of basic data blocks;
  • the basic data blocks are distributed to the basic processing circuits connected thereto, and the processed vertical data block and the identification data block associated with the vertical data block are broadcast to the basic processing circuits connected thereto;
  • the basic processing circuit is configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block, perform an inner product operation on the vertical data block and the processed basic data block to obtain an operation result, and transmit the operation result to the main processing circuit;
  • the main processing circuit is configured to process the operation result to obtain an instruction result.
  • the main processing circuit is further configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the plurality of partial vertical data blocks; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and perform an inner product operation on the processed basic data block and the partial vertical data block.
  • the main processing circuit is specifically configured to send the vertical data block (specifically, the vertical data block or the processed vertical data block) to the basic processing circuits connected thereto by one broadcast.
  • the basic processing circuit is specifically configured to perform an inner product operation on the basic data block (which may be the basic data block or the processed basic data block) and the vertical data block to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and transmit the operation result to the main processing circuit.
  • the main processing circuit is configured to, when the operation result is an inner product processing result, accumulate the operation result to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be calculated and the operation instruction.
  • the main processing circuit is specifically configured to divide the vertical data block into a plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks to the basic processing circuit; the plurality of partial vertical data blocks are combined to form the vertical data block.
  • the basic processing circuit is specifically configured to perform an inner product operation on the partial vertical data block (specifically, the partial vertical data block or the processed partial vertical data block) and the basic data block to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main processing circuit.
  • for example, the basic data block here takes a 3*3 kernel as an example, and the partial vertical data block takes a 3*3 matrix as an example; corresponding-position multiplications are performed between the 3*3 matrix and the 3*3 kernel row by row, yielding three inner product processing results, and the three inner product processing results are accumulated to obtain a partial operation result.
  • the three inner product processing results Out0 (the inner product of row 0 of the 3*3 matrix with row 0 of the 3*3 kernel), Out1 (the inner product of row 1 of the 3*3 matrix with row 1 of the 3*3 kernel), and Out2 (the inner product of row 2 of the 3*3 matrix with row 2 of the 3*3 kernel) can specifically be:
    Out0 = r00*k0[0] + r01*k0[1] + r02*k0[2]
    Out1 = r10*k1[0] + r11*k1[1] + r12*k1[2]
    Out2 = r20*k2[0] + r21*k2[1] + r22*k2[2]
  • where in r00, r denotes the partial vertical data block and 00 denotes the element in row 0, column 0; in k0[0], k denotes the basic data block (the kernel) and 0[0] denotes the element in row 0, column 0.
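The 3*3 example above can be checked numerically. A small NumPy sketch, assuming Out_i is the inner product of row i of the partial vertical data block with row i of the kernel, and that the three results are then accumulated into one partial operation result (the concrete values are illustrative only):

```python
import numpy as np

r = np.arange(9.0).reshape(3, 3)   # partial vertical data block (3*3 matrix)
k = np.ones((3, 3))                # basic data block (3*3 kernel), all ones

# Out_i: inner product of row i of the matrix with row i of the kernel,
# e.g. Out0 = r00*k0[0] + r01*k0[1] + r02*k0[2].
out = [float(np.dot(r[i], k[i])) for i in range(3)]
partial_result = sum(out)          # accumulate the three inner product results
print(out, partial_result)         # → [3.0, 12.0, 21.0] 36.0
```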
  • the basic processing circuit is specifically configured to multiplex the partial vertical data block n times, performing inner product operations of the partial vertical data block with n basic data blocks to obtain n partial processing results; accumulate the n partial processing results respectively to obtain n partial operation results; and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
  • for example, the basic data block takes p 3*3 kernels as an example, and the partial vertical data block takes a 3*3 matrix as an example; the 3*3 matrix is multiplexed p times to perform corresponding-position multiplications with each of the p kernels. Each pass yields three inner product results forming one group, and the three inner product results of each of the p groups are accumulated to obtain p partial operation results.
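The p-fold multiplexing described above can be sketched as reusing one partial vertical data block against p kernels; the loop structure and function name below are assumptions based on the description, not the actual circuit behavior:

```python
import numpy as np

def multiplex(partial_block, kernels):
    # Reuse (multiplex) one 3*3 partial vertical data block against p kernels:
    # each pass yields a group of three row-wise inner products, which are
    # accumulated into one partial operation result.
    results = []
    for kern in kernels:                       # p passes over the same block
        group = [np.dot(partial_block[i], kern[i]) for i in range(3)]
        results.append(sum(group))             # accumulate the group of three
    return results                             # p partial operation results

block = np.eye(3)                              # partial vertical data block
kernels = [np.full((3, 3), c) for c in (1.0, 2.0)]  # p = 2 kernels
print(multiplex(block, kernels))               # → [3.0, 6.0]
```

Multiplexing saves bandwidth: the partial vertical data block is transmitted once and reused for all p kernels.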
  • the main processing circuit includes: a main register or a main on-chip buffer circuit;
  • the basic processing circuit includes a basic register or a basic on-chip buffer circuit.
  • the main processing circuit comprises one or any combination of: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a first mapping circuit, and a data rearrangement circuit.
  • the basic processing circuit is further configured to forward the vertical data block and the basic data block to other basic processing circuits to perform data processing and then perform an inner product operation to obtain an operation result, and The operation result is sent to the main processing circuit;
  • the main processing circuit is configured to process the operation result to obtain the data block to be calculated and the instruction result of the operation instruction.
  • the data block may be represented by a tensor, which may specifically be: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
  • if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is a vertical data block and the multiplicand data block is a horizontal data block;
  • if the operation instruction is a convolution instruction, the main processing circuit determines that the convolution input data block is a vertical data block and the convolution kernel is a horizontal data block.
  • the operation of the neural network includes one or any combination of: a convolution operation, a matrix-multiplying-matrix operation, a matrix-multiplying-vector operation, an offset operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
  • the operation instructions according to the present invention include, but are not limited to, a convolution instruction, a multiplication instruction, a forward operation instruction, and a training instruction.
  • the inner product operation referred to above may specifically be an operation indicated by the operation instruction.
  • if the operation instruction is a convolution instruction, the inner product operation above is a convolution operation;
  • if the operation instruction is a multiplication instruction, the inner product operation above is a multiplication operation;
  • if the operation instruction is a forward operation instruction, the inner product operation above is a forward operation;
  • if the operation instruction is a training instruction, the inner product operation above is a reverse operation.
  • the main processing circuit is configured to acquire an input data block, a convolution kernel data block, and a convolution instruction, divide the input data block into a vertical data block and the convolution kernel data block into a horizontal data block according to the convolution instruction, determine, according to the operation control of the convolution instruction, whether to start the first mapping circuit to process the first data block to obtain the processed first data block, where the first data block includes the horizontal data block and/or the vertical data block, and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the convolution instruction;
  • the plurality of basic processing circuits are configured to determine, according to the operation control of the convolution instruction, whether to start the second mapping circuit to process the second data block, perform operations in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit;
  • the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, the second data block being associated with the processed first data block;
  • the main processing circuit is configured to process the operation result to obtain an instruction result of the convolution instruction.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, and the processed vertical data block and the identification data block associated with the vertical data block;
  • the horizontal data block and its associated identification data block are split to obtain a plurality of basic data blocks and the identification data block associated with each basic data block; the plurality of basic data blocks and their associated identification data blocks are distributed to the basic processing circuits connected thereto, and the processed vertical data block and the identification data block associated with the vertical data block are broadcast to the basic processing circuits connected thereto;
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block, process the vertical data block and the basic data block according to the connection identification data block to obtain the processed vertical data block and basic data block, perform a convolution operation on the processed vertical data block and basic data block to obtain an operation result, and send the operation result to the main processing circuit;
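One plausible reading of processing "according to the connection identification data block" is zeroing (or skipping) the positions that are not significant in both blocks before the multiply-accumulate. A hedged NumPy sketch under that assumption; the zeroing strategy and function name are illustrative, and a real circuit would more likely compress the data rather than store zeros:

```python
import numpy as np

def masked_inner_product(vertical_blk, basic_blk, mask_v, mask_b):
    # Connection identification: keep only positions significant in BOTH blocks.
    conn = mask_v & mask_b
    # "Processing" is sketched here as zeroing the non-connected positions,
    # so the subsequent inner product skips insignificant data.
    v = np.where(conn, vertical_blk, 0.0)
    b = np.where(conn, basic_blk, 0.0)
    return float(np.sum(v * b))   # corresponding-position multiply-accumulate

v = np.array([[1.0, 2.0], [0.0, 4.0]])   # vertical data block
b = np.array([[5.0, 0.0], [7.0, 8.0]])   # basic data block
mv = (np.abs(v) > 0).astype(np.int8)     # identification data blocks
mb = (np.abs(b) > 0).astype(np.int8)
print(masked_inner_product(v, b, mv, mb))  # → 37.0 (1*5 + 4*8)
```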
  • the main processing circuit is configured to process the operation result to obtain the instruction result.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block to obtain the processed horizontal data block.
  • the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block, perform a convolution operation on the processed vertical data block and the basic data block to obtain an operation result, and transmit the operation result to the main processing circuit.
  • the main processing circuit is further configured to split the vertical data block or the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data block associated with each of the plurality of partial vertical data blocks; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block, process the partial vertical data block and the basic data block according to the connection identification data block to obtain the processed partial vertical data block and the processed basic data block, and perform a convolution operation on the processed partial vertical data block and the processed basic data block.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block to obtain the processed vertical data block and the identification data block associated with the vertical data block, or to start the first mapping circuit to process the vertical data block according to the pre-stored identification data block associated with the vertical data block to obtain a processed vertical data block; split the horizontal data block to obtain a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected thereto; and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
  • the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block, perform an inner product operation on the vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit.
  • the main processing circuit is configured to acquire an input data block, a weight data block, and a multiplication instruction, divide the input data block into a horizontal data block and the weight data block into a vertical data block according to the multiplication instruction, determine, according to the operation control of the multiplication instruction, whether to start the first mapping circuit to process the first data block to obtain the processed first data block, where the first data block includes the horizontal data block and/or the vertical data block, and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the multiplication instruction;
  • the plurality of basic processing circuits are configured to determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process the second data block, perform operations in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit;
  • the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, the second data block being associated with the processed first data block;
  • the main processing circuit is configured to process the operation result to obtain an instruction result of the multiplication instruction.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, and the processed vertical data block and the identification data block associated with the vertical data block;
  • the horizontal data block and its associated identification data block are split to obtain a plurality of basic data blocks and the identification data block associated with each basic data block; the plurality of basic data blocks and their associated identification data blocks are distributed to the basic processing circuits connected thereto, and the processed vertical data block and the identification data block associated with the vertical data block are broadcast to the basic processing circuits connected thereto;
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block, process the vertical data block and the basic data block according to the connection identification data block to obtain the processed vertical data block and basic data block, perform a multiplication operation on the processed vertical data block and basic data block to obtain an operation result, and transmit the operation result to the main processing circuit;
  • the main processing circuit is configured to process the operation result to obtain the instruction result.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block to obtain the processed horizontal data block.
  • the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block, perform a multiplication operation on the processed vertical data block and the basic data block to obtain an operation result, and transmit the operation result to the main processing circuit.
  • the main processing circuit is further configured to split the vertical data block or the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data block associated with each of the plurality of partial vertical data blocks; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block, process the partial vertical data block and the basic data block according to the connection identification data block to obtain the processed partial vertical data block and the processed basic data block, and perform a multiplication operation on the processed partial vertical data block and the processed basic data block.
  • the main processing circuit is configured to receive a forward operation instruction and parse the forward operation instruction to obtain the first operation instruction included in the i-th layer of the neural network forward operation and the input data block and weight data block required by the first operation instruction, where i is an integer greater than or equal to 1 and less than or equal to n; if i is greater than or equal to 2, the input data block is the output data block of the (i-1)-th layer;
  • the main processing circuit is further configured to divide the input data block into a vertical data block and the weight data block into a horizontal data block according to the first operation instruction; determine, according to the operation control of the first operation instruction, whether to activate the first mapping circuit to process the first data block to obtain the processed first data block, where the first data block includes the horizontal data block and/or the vertical data block; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the forward operation instruction;
  • the plurality of basic processing circuits are configured to determine, according to the operation control of the first operation instruction, whether to start the second mapping circuit to process the second data block, perform operations in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, the second data block being associated with the processed first data block;
  • the main processing circuit is configured to process the operation result to obtain an instruction result of the first operation instruction, and complete an operation of the first operation instruction included in the ith layer.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, and the processed vertical data block and the identification data block associated with the vertical data block;
  • the horizontal data block and its associated identification data block are split to obtain a plurality of basic data blocks and the identification data block associated with each basic data block; the plurality of basic data blocks and their associated identification data blocks are distributed to the basic processing circuits connected thereto, and the processed vertical data block and the identification data block associated with the vertical data block are broadcast to the basic processing circuits connected thereto;
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block, process the vertical data block and the basic data block according to the connection identification data block to obtain the processed vertical data block and basic data block, perform a forward operation on the processed vertical data block and basic data block to obtain an operation result, and send the operation result to the main processing circuit; wherein the forward operation includes but is not limited to one or any combination of: a convolution operation (i.e., an inner product operation), a product operation, an offset operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation;
  • the main processing circuit is configured to process the operation result to obtain the instruction result.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block to obtain the processed horizontal data block.
  • the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block, perform a forward operation on the processed vertical data block and the basic data block to obtain an operation result, and transmit the operation result to the main processing circuit.
  • the main processing circuit is further configured to split the vertical data block or the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data block associated with each of the plurality of partial vertical data blocks; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block, process the partial vertical data block and the basic data block according to the connection identification data block to obtain the processed partial vertical data block and the processed basic data block, and perform a forward operation on the processed partial vertical data block and the processed basic data block.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block to obtain the processed vertical data block and the identification data block associated with the vertical data block, or to start the first mapping circuit to process the vertical data block according to the pre-stored identification data block associated with the vertical data block to obtain a processed vertical data block; split the horizontal data block to obtain a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected thereto; and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
  • the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block, perform a forward operation on the vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit.
  • the main processing circuit is further configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the plurality of partial vertical data blocks; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuit one or more times; wherein the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and perform a forward operation on the processed basic data block and the partial vertical data block.
  • the operation of the i-th layer further comprises: one of an offset operation, a full connection operation, a GEMM operation, a GEMV operation, and an activation operation, or any combination thereof.
  • the integrated circuit chip device is configured to receive a training instruction, determine first layer input data and first layer weight group data according to the training instruction, and perform n layers of forward operations of the neural network on the first layer input data and the first layer weight group data, to obtain the nth output result of the forward operation;
  • the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and to obtain, according to the training instruction, an nth inverse operation instruction of the nth layer inverse operation and a first data block of the nth inverse operation, where the first data block includes the horizontal data block and/or the vertical data block; and to transmit the processed first data block to at least one basic processing circuit among the basic processing circuits connected to the main processing circuit according to the nth inverse operation instruction;
  • the plurality of basic processing circuits are configured to determine, according to the operation control of the nth inverse operation instruction, whether to start the second mapping circuit to process the second data block; to execute operations in the neural network in parallel according to the processed second data block to obtain operation results; and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is the data block that the basic processing circuit determines to have received from the main processing circuit, and the second data block is associated with the processed first data block;
  • the main processing circuit is further configured to process the operation results to obtain an nth layer weight group gradient and an nth layer input data gradient, and to update the nth layer weight group data by applying the nth layer weight group gradient.
  • the integrated circuit chip device is further configured to use the nth layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer to perform the (n-1)th layer inverse operation, to obtain an (n-1)th layer weight group gradient, and to update the weight group data of the corresponding layer by applying the (n-1)th layer weight group gradient; the weight group data includes at least two weights.
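The training flow above — n forward layers, an nth output-result gradient, and layer-by-layer inverse operations that each update a layer's weight group — can be illustrated with a minimal software sketch. This is a plain NumPy analogue under assumed simplifications (dense layers, a ReLU-style activation, gradient-step updates); the function names are hypothetical and not part of the disclosure.

```python
import numpy as np

def forward(layers, x):
    """n-layer forward operation; keep each layer's input, which is needed
    later to form the weight-group gradients."""
    inputs = []
    for w in layers:
        inputs.append(x)
        x = np.maximum(x @ w, 0.0)       # linear layer followed by activation
    return x, inputs

def backward_and_update(layers, inputs, out_grad, lr=0.01):
    """Propagate the nth output-result gradient back through all n layers;
    each layer's weight group is updated with its weight-group gradient, and
    the input-data gradient becomes the output gradient of the layer below."""
    g = out_grad
    for i in reversed(range(len(layers))):
        pre_act = inputs[i] @ layers[i]
        g = g * (pre_act > 0)                 # gradient through the activation
        w_grad = inputs[i].T @ g              # ith layer weight-group gradient
        in_grad = g @ layers[i].T             # ith layer input-data gradient
        layers[i] = layers[i] - lr * w_grad   # apply the weight-group gradient
        g = in_grad
    return layers
```

In the device, the weight-group gradient of each layer is produced by the basic processing circuits and applied by the main processing circuit; here both steps are folded into one loop for clarity.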
  • the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block, to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, and the processed vertical data block and the identification data block associated with the vertical data block;
  • to perform split processing on the horizontal data block and the identification data block associated with the horizontal data block, to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; to distribute the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected thereto; and to broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; to process the vertical data block and the basic data block according to the connection identification data block, to obtain a processed vertical data block and a processed basic data block; and to perform an inverse operation on the processed vertical data block and the processed basic data block to obtain an operation result and send the operation result to the main processing circuit; wherein the inverse operation includes, but is not limited to, any one or combination of the following: a convolution operation (that is, an inner product operation), a product operation, an offset operation, a full connection operation, a GEMM operation, a GEMV operation, and an activation operation;
  • the main processing circuit is configured to process the operation result to obtain the instruction result.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block, to obtain the processed horizontal data block and the identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to the pre-stored identification data block associated with the horizontal data block, to obtain the processed horizontal data block.
  • the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, to obtain a processed vertical data block;
  • to perform an inverse operation on the processed vertical data block and the processed basic data block to obtain an operation result, and to transmit the operation result to the main processing circuit.
  • the main processing circuit is further configured to split the vertical data block or the processed vertical data block and the identification data block associated with the vertical data block, to obtain a plurality of partial vertical data blocks and the identification data blocks respectively associated with the plurality of partial vertical data blocks; and to broadcast the plurality of partial vertical data blocks and their respectively associated identification data blocks to the basic processing circuits one or more times; wherein the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block;
  • to process the partial vertical data block and the basic data block according to the connection identification data block, to obtain a processed partial vertical data block and a processed basic data block; and to perform an inverse operation on the processed partial vertical data block and the processed basic data block.
  • the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block, to obtain the processed vertical data block and the identification data block associated with the vertical data block, or to start the first mapping circuit to process the vertical data block according to the pre-stored identification data block associated with the vertical data block, to obtain the processed vertical data block; to perform split processing on the horizontal data block to obtain a plurality of basic data blocks; and to distribute the plurality of basic data blocks to the basic processing circuits connected thereto, while broadcasting the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
  • the basic processing circuit is specifically configured to start, by the second mapping circuit, processing the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block;
  • to perform a reverse operation on the processed vertical data block and the processed basic data block to obtain an operation result, and to send the operation result to the main processing circuit.
  • the main processing circuit is further configured to split the processed vertical data block and the identification data block associated with the vertical data block, to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the plurality of partial vertical data blocks; and to broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuits one or more times; wherein the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
  • the basic processing circuit is configured to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and to perform an inverse operation on the processed basic data block and the partial vertical data block.
  • the inverse operations of the n layers further include one of an offset operation, a full connection operation, a GEMM operation, a GEMV operation, and an activation operation, or any combination thereof.
  • the nth output result gradient is: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
  • the nth layer input data may be represented by a tensor, which may specifically be: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
  • the nth layer weight group data may be represented by a tensor, which may specifically be: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
  • FIG. 1a shows an integrated circuit chip device according to the present disclosure.
  • the integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits, the plurality of basic processing circuits being arranged in an array (an m*n array), wherein m and n are integers greater than or equal to 1, and at least one of m and n is greater than or equal to 2.
  • each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits of the plurality of basic processing circuits; the k basic processing circuits may be: the n basic processing circuits of the first row, the n basic processing circuits of the mth row, and the m basic processing circuits of the first column.
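As a sketch of the topology just described, the positions of the k basic processing circuits that the main processing circuit connects to (first row, mth row, and first column of the m*n array) can be enumerated as follows; the helper name and the 0-indexed coordinates are illustrative, not from the disclosure.

```python
def directly_connected(m, n):
    """Return the (row, col) positions of the k basic processing circuits
    that the main processing circuit connects to: the n circuits of the
    first row, the n circuits of the mth row, and the m circuits of the
    first column (0-indexed; a sketch of the topology, not the hardware)."""
    positions = set()
    for c in range(n):
        positions.add((0, c))        # first row
        positions.add((m - 1, c))    # mth row
    for r in range(m):
        positions.add((r, 0))        # first column
    return sorted(positions)
```

For a 4*4 array this gives k = 10 edge positions, since the corners of the first column are shared with the first and mth rows.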
  • the main processing circuit includes a first mapping circuit for compressing data to obtain processed data and identification data.
  • the identification data is used to indicate whether the absolute value of the data is greater than a first threshold.
  • the main processing circuit may only send the processed data (specifically, the data whose absolute value is greater than the first threshold) and the identification data associated with the data to the basic processing circuit.
  • the advantage is that the amount of data sent to the basic processing circuit for data processing is reduced, and the data processing rate is improved.
  • the first threshold is customized on the user side or the device side, for example, 0.05, 0.5, etc., and is not limited.
  • the input data of the main processing circuit is a matrix data block.
  • after processing, the processed matrix data block and the identification data block associated with the matrix data block can be obtained; the specific processing performed by the first mapping circuit will be described in detail later.
  • when the main processing circuit distributes data to the basic processing circuit, only the two data 1 and 0.5 may be transmitted, instead of the processed matrix data block of 8 data; the identification data block associated with the matrix data block is also sent to the basic processing circuit, so that the basic processing circuit knows, from the received identification data block and the received two data (1 and 0.5), the positions of the two data in the original matrix data block. That is, the basic processing circuit can restore the processed matrix data block in the main processing circuit according to the received identification data block and the received data.
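The transmit-and-restore scheme described above — sending only the data whose absolute value exceeds the first threshold, together with a 0/1 identification data block recording their positions — can be sketched as follows. The example values are illustrative, and the function names are hypothetical.

```python
import numpy as np

FIRST_THRESHOLD = 0.05   # customized on the user or device side

def compress(block, threshold=FIRST_THRESHOLD):
    """First mapping circuit: return the identification data block (mask)
    and only the retained target data, read out in row order."""
    mask = (np.abs(block) > threshold).astype(np.uint8)
    target = block[mask.astype(bool)]      # only the above-threshold data
    return mask, target

def restore(mask, target):
    """Basic processing circuit: rebuild the processed data block from the
    identification data block and the received target data; suppressed
    positions are restored as zero."""
    block = np.zeros(mask.shape, dtype=float)
    block[mask.astype(bool)] = target
    return block
```

Only `target` and `mask` cross the interconnect, which is the data-volume saving the text describes.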
  • at least one of the plurality of basic processing circuits may include a second mapping circuit.
  • a part of the basic processing circuits may include a second mapping circuit.
  • the k basic processing circuits may be configured with the second mapping circuit, so that each of the n basic processing circuits of the first row may be responsible for the data processing steps of the m basic processing circuits of its column. This arrangement can improve operation efficiency and reduce power consumption: because the n basic processing circuits of the first row are the first to receive the data transmitted by the main processing circuit, compressing the received data there reduces the calculation amount of the subsequent basic processing circuits and the amount of data transmitted to those subsequent circuits. Similarly, configuring the second mapping circuit for the m basic processing circuits of the first column has the advantages of a small calculation amount and low power consumption.
  • the main processing circuit can adopt a dynamic data transmission strategy; for example, the main processing circuit broadcasts data to the m basic processing circuits of the first column, and distributes data to the n basic processing circuits of the first row.
  • the specific processing regarding the second mapping circuit will be described later in detail.
  • the main processing circuit is configured to perform the successive operations in the neural network operation and to transmit data to the basic processing circuits connected thereto; the successive operations include, but are not limited to: an accumulation operation, an ALU operation, an activation operation, and the like.
  • the plurality of basic processing circuits are configured to perform an operation in the neural network in a parallel manner according to the transmitted data, and transmit the operation result to the main processing circuit through a basic processing circuit connected to the main processing circuit.
  • the above-described parallel execution of operations in the neural network includes, but is not limited to, inner product operations, matrix or vector multiplication operations, and the like.
  • the main processing circuit may include: a data transmitting circuit and a data receiving circuit or interface; the data transmitting circuit may integrate a horizontal data distributing circuit and a vertical data distributing circuit, or the horizontal data distributing circuit and the vertical data distributing circuit may be set separately.
  • horizontal data refers to data that needs to be sent to each basic processing circuit in the row direction (or horizontal direction); as shown in FIG. 1a, the horizontal data is transmitted to the basic processing circuits in any one or more of the m rows;
  • vertical data refers to data that needs to be sent to part of the basic processing circuits in the column direction (or vertical direction); specifically, for example, in a convolution operation, the convolution input data of the convolution operation needs to be sent to all the basic processing circuits.
  • the specific manner in which the horizontal data is sent to the basic processing circuits can be determined by the main processing circuit according to the load and other allocation considerations.
  • data can be sent to each basic processing circuit in broadcast form; the horizontal/vertical data may be transmitted to each basic processing circuit in one broadcast or in multiple broadcasts (the specific embodiments of the present disclosure do not limit the number of broadcasts); the main processing circuit can also selectively send data to only part of the basic processing circuits.
  • the main processing circuit may include a register and/or an on-chip buffer circuit, and may further include: a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and other circuits; of course, in practical applications, the main processing circuit may also add other circuits such as a conversion circuit (for example, a matrix transposition circuit), a data rearrangement circuit, or an activation circuit.
  • Each of the basic processing circuits may include a base register and/or a base on-chip buffer circuit; each of the base processing circuits may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like.
  • the inner product operator circuit, the vector operator circuit, and the accumulator circuit may be integrated into one circuit, or may be separately provided circuits.
  • the accumulator circuits of the n basic processing circuits of the mth row can perform the accumulation operation of the inner product operation, because an mth-row basic processing circuit can receive the products of all the basic processing circuits of its column;
  • performing the accumulation operation of the inner product operation through the n basic processing circuits of the mth row allows computing resources to be allocated effectively and saves power consumption; this technical solution is especially suitable for cases where m is large.
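To illustrate why only the mth-row circuits need to run the accumulation, the column dataflow can be mimicked in software: each circuit in a column contributes one product, and the final circuit in the column accumulates everything it receives. A hypothetical sketch:

```python
def inner_product_by_column(a_col, b_col):
    """Each of the m basic processing circuits in one column computes one
    product and forwards it downward; only the mth-row circuit runs an
    accumulator over the m products to produce the inner-product result."""
    products = [a * b for a, b in zip(a_col, b_col)]  # one product per circuit
    acc = 0.0
    for p in products:       # accumulation happens only in the last row
        acc += p
    return acc
```

Since the intermediate circuits only multiply and forward, the accumulator hardware can be concentrated in a single row, which is the resource saving the text points to.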
  • the main processing circuit can allocate which circuits perform the compression; specifically, the allocation can be performed in an explicit or implicit manner. In the explicit mode, the main processing circuit can be configured with a special indication or instruction: when the basic processing circuit receives the special indication or instruction, it determines that data compression processing is to be performed, and if it does not receive a special indication or instruction, it determines that data compression processing is not to be performed. As another example, the allocation may be performed in an implicit manner; for example, if the basic processing circuit receives sparse data (that is, data that includes 0, or includes more than a preset amount of data smaller than a preset threshold) and determines that an inner product operation needs to be performed, it compresses the sparse data.
  • the special instruction or indication may be configured with a descending sequence; the descending sequence is decremented by one each time it passes a basic processing circuit, and the basic processing circuit reads the value of the descending sequence: if the value is greater than zero, it performs data compression processing; if the value is equal to or less than zero, it does not perform data compression processing.
  • this setting is configured according to the basic processing circuits allocated in the array. For example, for the m basic processing circuits of the ith column, suppose the main processing circuit needs the first five basic processing circuits to perform data compression processing; the main processing circuit then issues a special instruction that includes a descending sequence whose initial value may be 5. The value of the descending sequence is decremented by 1 each time it passes a basic processing circuit, so at the 5th basic processing circuit the value of the descending sequence is 1, and at the 6th basic processing circuit the descending sequence is 0; at this time, the 6th basic processing circuit does not perform data compression processing. In this way, the main processing circuit can dynamically configure which basic processing circuits perform data compression.
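The descending-sequence mechanism can be sketched as follows: the counter carried by the special instruction is read and then decremented at each basic processing circuit in turn, and a circuit compresses only while the value it reads is greater than zero. The function name is hypothetical.

```python
def propagate_decrement(initial, m):
    """Sketch of the descending-sequence allocation: the special instruction
    carries a counter that each of the m basic processing circuits in a
    column reads and then decrements before passing it on. Returns, per
    circuit, whether it performs data compression processing."""
    decisions = []
    value = initial
    for _ in range(m):
        decisions.append(value > 0)   # read the sequence value
        value -= 1                    # decrement before passing it on
    return decisions
```

With an initial value of 5 and a column of 8 circuits, the first five circuits compress and the remaining three do not, matching the worked example in the text.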
  • An embodiment of the present disclosure provides an integrated circuit chip device including a main processing circuit (also referred to as a main unit) and a plurality of basic processing circuits (also referred to as a base unit); the structure of the embodiment is as shown in FIG. 1b.
  • the dotted line frame is the internal structure of the neural network computing device;
  • the gray filled arrows indicate the data transmission paths between the main processing circuit and the basic processing circuit array, and the hollow arrows indicate the data transmission paths between the respective basic processing circuits in the basic processing circuit array (between adjacent basic processing circuits).
  • the length and width of the basic processing circuit array may be different, that is, the values of m and n may be different (they may of course also be the same); the present disclosure does not limit the specific values.
  • the circuit structure of the basic processing circuit is shown in FIG. 1c; the dotted line in the figure indicates the boundary of the basic processing circuit, and the thick arrows crossing the dotted frame indicate the data input and output channels (an arrow pointing into the dotted frame is an input channel, and an arrow pointing out of the dotted frame is an output channel); the rectangular boxes in the dotted frame indicate the memory cell circuits (register and/or on-chip buffer), including input data 1, input data 2, the multiplication or inner product result, and the accumulated data; the diamond box indicates the operator circuits, including the multiplication or inner product operator and the adder.
  • the neural network computing device includes a main processing circuit and 16 basic processing circuits (16 basic processing circuits are for illustrative purposes only, and other values may be used in practical applications);
  • the basic processing circuit has two data input interfaces and two data output interfaces; in the subsequent description of this example, the horizontal input interface (the horizontal arrow pointing to the unit in FIG. 1b) is referred to as input 0, and the vertical input interface (the vertical arrow pointing to the unit in FIG. 1b) is referred to as input 1; each horizontal data output interface (the horizontal arrow pointing away from the unit in FIG. 1b) is referred to as output 0, and each vertical data output interface (the vertical arrow pointing away from the unit in FIG. 1b) is referred to as output 1.
  • each basic processing circuit can be respectively connected to different units, including a main processing circuit and other basic processing circuits;
  • the input 0 of the four basic processing circuits 0, 4, 8, 12 (numbered as shown in FIG. 1b) is connected to the data output interface of the main processing circuit;
  • the input 1 of the four basic processing circuits of the basic processing circuits 0, 1, 2, 3 is connected to the data output interface of the main processing circuit;
  • the output 1 of the four basic processing circuits of the basic processing circuits 12, 13, 14, 15 is connected to the data input interface of the main processing circuit;
  • connection of the output interface of the basic processing circuit to the input interfaces of other basic processing circuits is shown in Figure 1b, and will not be enumerated one by one;
  • the output interface S1 of the S unit is connected to the input interface P1 of the P unit, indicating that the P unit will be able to receive data from its P1 interface that the S unit sends to its S1 interface.
  • the embodiment includes a main processing circuit; the main processing circuit is connected with an external device (having both an input interface and an output interface), a part of the data output interfaces of the main processing circuit is connected with the data input interfaces of a part of the basic processing circuits, and a part of the data input interfaces of the main processing circuit is connected with the data output interfaces of a part of the basic processing circuits.
  • the data involved in the usage method provided by the present disclosure may be compressed data.
  • the data in the present application may be an input neuron or a weight in a neural network, which may be a matrix data or a vector data, etc., which is not limited herein. That is, the data or data blocks set forth below in this application may be input neurons or weights in a neural network, which may be embodied in the form of a matrix or a vector.
  • the data compression processing involved in the present application is specifically performed in the first mapping circuit and the second mapping circuit described above. It should be understood that, since the neural network is an algorithm with high computational complexity and high memory access, the more weights there are, the larger the calculation amount and the memory access amount. In particular, where a weight is small (for example, 0, or smaller than a set value), the data with smaller weights needs to be compressed in order to increase the calculation rate and reduce the overhead. In practical applications, data compression processing is most effective when applied in sparse neural networks, for example in reducing the workload of data calculation, reducing data overhead, and increasing the data calculation rate.
  • the input data includes, but is not limited to, at least one input neuron and/or at least one weight.
  • after the first mapping circuit receives the first input data (specifically, the data block to be calculated sent by the main processing circuit, such as a horizontal data block or a vertical data block), the first mapping circuit may process the first input data to obtain the processed first input data and the identification (mask) data associated with the first input data; the mask data is used to indicate whether the absolute value of the first input data is greater than a first threshold, such as 0.5, 0, and so on.
  • for example, the first threshold is 0.05; after the input matrix data block is processed by the first mapping circuit, the processed matrix data block can be obtained, together with the identification data block (also referred to as the mask matrix) associated with the matrix data block;
  • to reduce the amount of data transmitted, only the target data in the processed matrix data block (in this example, 1, 0.06, and 0.5) and the identification data block associated with the matrix data block may be transmitted.
  • the main processing circuit may distribute the target data in the processed matrix data block to the basic processing circuit according to a setting rule, for example, transmitting sequentially in row order or in column order, which is not limited in the present application.
  • the basic processing circuit restores the target data to the processed matrix data block according to a setting rule (for example, a row order); that is, based on the received data (1, 0.06, and 0.5) and the identification data block, the basic processing circuit can determine the matrix data block to which the data corresponds (that is, the matrix data block processed by the first mapping circuit in the main processing circuit).
  • the first input data may be a horizontal data block and/or a vertical data block.
  • the second mapping circuit can process the second input data by using the identification data associated with the first input data, thereby obtaining the processed second input data; wherein the first input data is different from the second input data.
  • when the first input data is at least one weight, the second input data may be at least one input neuron; or, when the first input data is at least one input neuron, the second input data may be at least one weight.
  • the second input data is different from the first input data, and the second input data may be any one of the following: a horizontal data block, a basic data block, a vertical data block, and a partial vertical To the data block.
  • for example, the second input data is a partial vertical data block; assuming the second input data is a matrix data block, the partial vertical data block is obtained after processing. Since the dimensions of the matrix data blocks involved in the input data are large in practical applications, the present application is merely illustrative and is not intended to be limiting.
  • the first mapping circuit is configured to process the first input data and the second input data to obtain the processed first input data and the first identifier mask data associated with the first input data, and the processed second Input data and second identification mask data associated with the second input data.
  • the first mask data or the second mask data is used to indicate whether the absolute value of the first or second input data is greater than a second threshold; the second threshold is customized by the user side or the device side, for example, 0.05, 0, and so on.
  • the processed first input data or second input data may be processed input data or may be input data before processing.
  • the first input data is a horizontal data block, such as the matrix data block in the above example.
  • the processed horizontal data block can be obtained, and the processed horizontal data block can be the original matrix data block.
  • the processed input data (such as a processed basic data block or a partial vertical data block) should be understood as data obtained after compression processing.
  • the data sent by the main processing circuit to the basic processing circuit may be the target data in the processed input data; the target data may be data whose absolute value is greater than a preset threshold, or may be non-zero data, and so on.
  • the second mapping circuit may obtain connection identification data according to the first identification data associated with the first input data and the second identification data associated with the second input data; the connection identification data is used to indicate whether the absolute values of both the first input data and the second input data are greater than a third threshold, wherein the third threshold is customized by the user side or the device side, such as 0.05, 0, and the like. Further, the second mapping circuit may process the received first input data and second input data respectively according to the connection identification data, thereby obtaining the processed first input data and the processed second input data.
  • for example, the first input data is a matrix data block, and the second input data block is also a matrix data block; the first mapping circuit obtains the first identification data block associated with the first input data and the processed first input data block, and correspondingly obtains the second identification data block associated with the second input data and the processed second input data block. Correspondingly, in order to increase the data transmission rate, only the target data 1, 0.06, and 0.5 in the processed first input data block and the first identification data block associated with the first input data block may be sent to the basic processing circuit; at the same time, the target data 1, 1.1, 0.6, 0.3, and 0.5 in the processed second input data block and the second identification data block associated with the second input data block are sent to the basic processing circuit.
  • the basic processing circuit may perform the element-by-element operation on the first identification data block and the second identification data block by using the second mapping circuit to obtain the connection identification data block.
  • the second mapping circuit separately processes the processed first input data block and the processed second input data block by using the connection identification data block, respectively, so that the processed first input data block is obtained as The processed second input data block is
  • the basic processing circuit is configured to determine, according to the first identification data block and the received target data, the first data block corresponding to the target data (that is, the first data block processed by the first mapping circuit).
  • the first mapping circuit is not disposed in the main processing circuit; instead, the main processing circuit may send the third input data, together with the pre-stored third identification data associated with the third input data, to the basic processing circuit connected thereto.
  • a second mapping circuit is provided in the basic processing circuit. A specific embodiment of the data compression processing involved in the second mapping circuit is explained below.
  • the third input data includes, but is not limited to, a basic data block, a partial vertical data block, a vertical data block, and the like.
  • the third input data may also be at least one weight and/or at least one input neuron, which is not limited herein.
  • the second mapping circuit may process the third input data according to the third identification data associated with the received third input data, thereby obtaining the processed third input data, so that
  • related arithmetic operations, such as inner product operations, may subsequently be performed on the processed third input data.
  • for example, the third input data received by the second mapping circuit is a matrix data block.
  • the second mapping circuit processes the third input data block according to the third identification data block, thereby obtaining the processed third input data block.
  • the input neurons and output neurons mentioned in the embodiments of the present invention do not refer to the neurons in the input layer and the neurons in the output layer of the entire neural network; rather, for any two adjacent layers in the neural network,
  • the neurons in the lower layer of the network feedforward operation are the input neurons,
  • and the neurons in the upper layer of the network feedforward operation are the output neurons.
  • for the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer and the neurons in this layer are the above input neurons, while the (K+1)-th layer is called the output layer and the neurons in this layer are the above output neurons; that is, except for the top layer, each layer can be used as an input
  • layer, and the next layer is the corresponding output layer.
  • a mapping circuit is not provided in the main processing circuit, and a first mapping circuit and a second mapping circuit are disposed in the basic processing circuit.
  • no mapping circuit is disposed in the basic processing circuit; the first mapping circuit and the second mapping circuit are both disposed in the main processing circuit.
  • the main processing circuit completes the compression processing of the data and sends the processed input data to the basic processing circuit, so that the basic processing circuit performs the corresponding arithmetic operations using the processed input data (specifically, the processed neurons and the processed weights).
  • the following describes the mapping circuit involved in the present application.
  • Two possible mapping circuits are shown in Figures 5a and 5b.
  • the mapping circuit shown in FIG. 5a includes a comparator and a selector;
  • the number of comparators and selectors is not limited in the present application.
  • a comparator and two selectors are shown in Fig. 5a, wherein the comparator is used to determine whether the input data satisfies a preset condition.
  • the preset condition may be customized by the user side or the device side, for example, that the absolute value of the input data described above in the application is greater than or equal to a preset threshold.
  • if the preset condition is satisfied, the comparator may determine to allow output of the input data, and the identification data associated with the input data is 1; otherwise, it may determine that the input data is not output, or the input data is set to 0 by default,
  • and the identification data corresponding to the input data at this time is 0. That is, after passing through the comparator, the identification data associated with the input data can be known.
  • further, the obtained identification data may be input into the selector, so that the selector uses the identification data to determine whether to output the corresponding input data, that is, to obtain the processed input data.
  • for example, if the input data is a matrix data block, the comparator can apply the preset condition to each data in the matrix data block, so that an identification data block (i.e., a mask matrix) associated with the matrix data block can be obtained. Further, the first selector may filter the matrix data block by using the identification data block: the data in the matrix data block whose absolute value is greater than or equal to a preset threshold (i.e., the preset condition is satisfied) is reserved,
  • and the remaining data is deleted, so as to output the processed matrix data block.
  • the identification data block may also be used in the second selector to process other input data (for example, a second matrix data block), for example by performing an element-wise AND operation, so that the
  • data in the second matrix data block whose absolute value is greater than or equal to the preset threshold is reserved, so as to output the processed second matrix data block.
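A minimal software model of the Figure 5a structure (the comparator and the two selectors are modeled as plain functions here; the threshold of 0.1 and the sample matrices are assumed for illustration):

```python
THRESHOLD = 0.1  # preset threshold of the comparator's preset condition

def comparator(matrix):
    """Build the identification data block (mask matrix):
    1 where the preset condition |x| >= THRESHOLD holds, else 0."""
    return [[1 if abs(x) >= THRESHOLD else 0 for x in row] for row in matrix]

def selector(matrix, mask):
    """Keep data whose mask bit is 1; delete (zero out) the rest."""
    return [[x if m else 0 for x, m in zip(r, mr)] for r, mr in zip(matrix, mask)]

first_matrix = [[0.5, 0.01], [0.0, 2.0]]   # hypothetical input data
second_matrix = [[3.0, 4.0], [5.0, 6.0]]   # hypothetical second input data

mask = comparator(first_matrix)                   # identification data block
processed_first = selector(first_matrix, mask)    # first selector
processed_second = selector(second_matrix, mask)  # second selector
```

Note that the second selector reuses the mask derived from the first matrix, mirroring how the identification data block of one input can filter another.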
  • correspondingly, the specific structure of the first mapping circuit may include at least one comparator and at least one selector, such as the comparator and the first
  • selector in FIG. 5a in the above example;
  • the specific structure of the second mapping circuit may include one or more selectors, such as the second selector of Figure 5a in the above example.
  • the mapping circuit shown in FIG. 5b includes a selector; the number of selectors is not limited, and may be one or more.
  • specifically, the selector is configured to select from the input data according to the identification data associated with the input data, so as to output the data whose absolute value is greater than or equal to a preset threshold and delete/not output the remaining data, thereby obtaining the processed input data.
  • for example, the matrix data block and the identification data block associated with the matrix data block are input to the mapping circuit, and the selector may select from the matrix data block according to the identification data block:
  • the data whose absolute value is greater than or equal to the preset threshold is output, and the remaining data is not output, thereby outputting the processed matrix data block.
  • the structure shown in FIG. 5b can be applied to the second mapping circuit in the third embodiment described above; that is, the specific structure of the second mapping circuit in the third embodiment described above may include at least one selector.
  • the first mapping circuit and the second mapping circuit designed in the main processing circuit and the basic processing circuit may be cross-combined or split according to the functional components shown in FIG. 5a and FIG. 5b, which is not limited in this application.
  • the main processing circuit first enables the first mapping circuit to process the first input data to obtain the processed first input data and the first identification data associated with the first input data; and then the processed first input data and The first identification data associated with the first input data is transmitted to a base processing circuit for operation.
  • the main processing circuit can process the data to be calculated (such as a horizontal data block/vertical data block) before transmitting it to the basic processing circuit, which has the advantages of reducing the bit width of the transmitted data and reducing the total number of bits transmitted;
  • the basic processing circuit also performs data operations with smaller bit widths more efficiently and with lower power consumption.
  • the basic processing circuit enables the second mapping circuit to process the received second input data by using the first identification data to obtain the processed second input data, and then performs the related arithmetic operations on the processed first input data and second input data.
  • in this way, the basic processing circuit performs compression processing on the second input data (such as a vertical data block) transmitted by the main processing circuit before the operation, thereby improving the operation efficiency and reducing the power consumption.
  • the main processing circuit may first transmit the first input data (such as a basic data block), the first identification data associated with the first input data, the second input data (such as a partial vertical data block), and
  • the second identification data associated with the second input data to the basic processing circuit for operation.
  • the basic processing circuit may first enable the second mapping circuit to obtain the connection identification data according to the first identification data and the second identification data, and then use the connection identification data to process the first input data and the second input data;
  • the arithmetic operation on the processed first input data and second input data can then be completed in the basic processing circuit. The benefits are a reduced amount of data computation, improved operation efficiency, and reduced power consumption.
  • the first identification data associated with the first input data and the second identification data associated with the second input data sent by the main processing circuit may be pre-stored in the main processing circuit, or may be obtained by the main processing circuit enabling the first
  • mapping circuit to process the first/second input data, which is not limited in this application.
  • the main processing circuit receives input data to be calculated from outside the device
  • the main processing circuit performs arithmetic processing on the data by using the various operation circuits of the unit, such as a vector operation circuit, an inner product operator circuit, an accumulator circuit, and the like;
  • the main processing circuit sends data to the array composed of the basic processing circuits (referred to as a basic processing circuit array) through the data output interface (as shown in FIG. 2b);
  • the method of sending data here may be to directly send the same data to a part of the basic processing circuits, that is, a broadcast mode (possibly broadcast in multiple passes);
  • the method of sending data here may also be to separately send different data to different basic processing circuits, that is, a distribution mode;
  • the basic processing circuit array calculates the data
  • the basic processing circuit performs an operation after receiving the input data
  • after receiving the data, the basic processing circuit transmits the data from the data output interface of the unit (i.e., transfers it to other basic processing circuits that do not directly receive data from the main processing circuit);
  • the basic processing circuit transmits the operation result from the data output interface; (intermediate calculation result or final calculation result)
  • the main processing circuit receives the output data returned from the basic processing circuit array
  • the main processing circuit continues to process (eg, accumulate or activate the operation) the data received from the basic processing circuit array;
  • after the main processing circuit finishes processing, the processing result is transmitted from the data output interface to the outside of the device.
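The usage steps above can be modeled end to end as a toy software sketch (the round-robin row distribution and the ReLU-style activation step are illustrative assumptions; the actual device may distribute and post-process differently):

```python
def basic_circuit(rows, vec):
    """Each basic circuit computes inner products for the rows it received."""
    return [sum(a * b for a, b in zip(row, vec)) for row in rows]

def main_circuit(matrix, vec, k=2):
    """Distribute rows round-robin to k basic circuits, gather the returned
    results, then post-process them (here: an activation operation)."""
    groups = [matrix[i::k] for i in range(k)]  # distribution mode
    gathered = []
    for g in groups:
        gathered.extend(basic_circuit(g, vec))  # results returned to main
    return [max(0.0, x) for x in gathered]      # activation in main circuit

out = main_circuit([[1, -2], [3, 4]], [1, 1], k=2)
```

With this grouping, the first circuit handles row 0 and the second handles row 1, so the gathered results are already in row order for k dividing the work evenly.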
  • the tensor-multiply-tensor operation is performed using the circuit device; the tensor is the same as the data block described above, and
  • may be any one or a combination of a matrix, a vector, a three-dimensional data block, a four-dimensional data block, and a high-dimensional data block;
  • the specific implementations of the matrix-multiply-vector and matrix-multiply-matrix operations are shown in Figures 2c and 2f, respectively.
  • the matrix multiplication vector operation is performed using the circuit device; (the matrix multiplication vector may be an inner product operation of each row in the matrix with the vector, and the results are arranged into a vector in the order of the corresponding rows.)
  • This method uses all or a portion of the basic processing circuit of the neural network computing device, assuming that K basic processing circuits are used;
  • the main processing circuit transmits data in part or all of the rows of the matrix S to each of the K basic processing circuits;
  • the control circuit of the main processing circuit sends a number of data of certain rows in the matrix S to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st time the 1st number in each of the 3rd, 4th and 5th rows is sent, the 2nd time the 2nd number in each of the 3rd, 4th and 5th rows, the 3rd time
  • the 3rd number in each of the 3rd, 4th and 5th rows...; or the 1st time the first two numbers in each of the 3rd, 4th and 5th rows, the 2nd time the 3rd and 4th numbers in each of the 3rd, 4th and 5th rows, the 3rd time the 5th and 6th numbers in each of the 3rd, 4th and 5th rows...);
  • the control circuit of the main processing circuit sequentially transmits the data in the vector P to the 0th basic processing circuit
  • after receiving the data of the vector P, the 0th basic processing circuit sends the data to the next basic processing circuit connected thereto, that is, the basic processing circuit 1;
  • some basic processing circuits cannot obtain all the data required for calculation directly from the main processing circuit;
  • for example, the basic processing circuit 1 in FIG. 2d has only one data input interface connected to the main processing circuit, so it can only directly
  • obtain the data of the matrix S from the main processing circuit, while the data of the vector P needs to be output to the basic processing circuit 1 by the basic processing circuit 0;
  • similarly, the basic processing circuit 1 continues to output the data of the vector P to the basic processing circuit 2 after receiving it.
  • Each of the basic processing circuits performs operations on the received data, including but not limited to: inner product operations, multiplication operations, addition operations, and the like;
  • the basic processing circuit calculates the multiplication of one or more sets of two data at a time, and then accumulates the result in a register and/or in the on-chip buffer;
  • the basic processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result in a register and/or in the on-chip buffer;
  • the result is transmitted from the data output interface (ie, transmitted to other basic processing circuits connected thereto);
  • the result of the calculation may be the final result or an intermediate result of the inner product operation
  • after receiving a calculation result from another basic processing circuit, the basic processing circuit transmits the data to other basic processing circuits or the main processing circuit connected thereto;
  • the main processing circuit receives the result of the inner product operation of each basic processing circuit, and processes the result to obtain a final result (the processing may be an accumulation operation or an activation operation, etc.).
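The chained transfer of the vector P described above, where circuit 0 forwards P to circuit 1 and so on while each circuit computes inner products for its own rows, can be sketched as follows (the row grouping and sample values are assumptions for illustration):

```python
def chain_matvec(row_groups, p):
    """row_groups[i] = the rows of S assigned to basic circuit i.

    Circuit 0 receives P from the main circuit; each circuit computes
    inner products for its rows, then forwards P to the next circuit.
    """
    results = []
    received = p  # circuit 0 receives P from the main circuit
    for rows in row_groups:
        # this circuit uses the forwarded copy of P for its inner products
        results.extend(sum(a * b for a, b in zip(r, received)) for r in rows)
        received = list(received)  # forward P down the chain
    return results  # gathered by the main processing circuit

res = chain_matvec([[[1, 0], [0, 1]], [[2, 2]]], [3, 4])
```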
  • the plurality of basic processing circuits used in the method are arranged in the manner shown in FIG. 2d or FIG. 2e;
  • the main processing circuit can respectively obtain the mask matrices corresponding to the matrix S and the matrix P (i.e., the identification data/identification data blocks described above).
  • the mask matrices corresponding to the matrix S and the matrix P may be pre-stored in the high-speed memory in the main processing circuit; or the main processing circuit may enable the first mapping circuit to obtain the corresponding
  • mask matrices according to the matrix S and the matrix P, respectively.
  • the control circuit of the main processing circuit divides the M rows of data of the matrix S into K groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of rows in that group of data is denoted as Ai); correspondingly,
  • the control circuit also divides the M rows of data of the first mask matrix corresponding to the matrix S into K groups, and sends them, together with the K groups into which the matrix S is divided, to the corresponding basic processing circuits, where the arithmetic processing of the related data is completed.
  • the method of grouping the M rows of data is any grouping method that does not allocate repeatedly;
  • for example, the following allocation mode is adopted: the j-th row is allocated to the (j%K)-th basic processing circuit (% is a remainder operation);
  • alternatively, for rows that cannot be grouped evenly, a part of the rows may be equally distributed first, and the remaining rows may be allocated in an arbitrary manner.
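The j%K allocation rule above can be sketched as follows (illustrative only):

```python
def group_rows(m, k):
    """Return, for each basic circuit i, the row indices it is allocated
    under the j % k rule (no row is allocated twice)."""
    groups = [[] for _ in range(k)]
    for j in range(m):
        groups[j % k].append(j)
    return groups

allocation = group_rows(10, 4)  # 10 rows of S over 4 basic circuits
```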
  • the control circuit of the main processing circuit sequentially sends the data in part or all of the rows of the matrix S to the corresponding basic processing circuits; correspondingly, the control circuit also sends the identification data corresponding to that data in the first mask matrix of the matrix S
  • together to the corresponding basic processing circuits.
  • for example, the matrix S is a 50*50 matrix data block;
  • the main processing circuit can divide the matrix S into 10 small matrices, each of size 5*50, and the main processing circuit can send the first small matrix
  • S0 (5 rows and 50 columns), together with the identification data block (5 rows and 50 columns) associated with the small matrix S0, to the first basic processing circuit, so as to complete the arithmetic processing of the related data in the first basic processing circuit.
  • the control circuit of the main processing circuit sends one or more data of one row of the i-th group of data Mi for which it is responsible to the i-th basic processing circuit each time; the data may be data in the matrix S, or data in the first mask matrix corresponding to the matrix S;
  • the control circuit of the main processing circuit sends one or more data of each row of some or all of the rows of the i-th group of data Mi for which it is responsible to the i-th basic processing circuit each time;
  • the control circuit of the main processing circuit sequentially sends the data in the vector P to the 1st basic processing circuit; correspondingly, the control circuit of the main processing circuit can sequentially send the data in the second mask matrix associated with the vector P to the 1st basic processing circuit;
  • the control circuit of the main processing circuit can send one or more data of the vector P, or of the second mask matrix associated with the vector P, each time;
  • the i-th basic processing circuit may also send the data to the i+1th basic processing circuit connected thereto;
  • Each basic processing circuit receives one or more data from a certain row or rows of the matrix S and one or more data from the vector P, and performs operations (including but not limited to multiplication or addition);
  • after each basic processing circuit receives the data in the matrix S together with the first identification data associated with it in the first mask matrix, and the data in the vector P together with the second identification data associated with it in the second mask matrix, it may obtain the connection identification data according to the first identification data and the second identification data; the connection identification data is then used to determine whether to perform a correlation operation on the data in the matrix S and the data in the vector P.
  • specifically, the connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that the data at a certain position in the matrix S and the data at the same position in the vector P are both data whose absolute value is greater than the preset threshold; conversely, 0 indicates that the data at that position in the matrix S and/or the data at the same position in the vector P is data whose absolute value is less than or equal to the preset threshold.
  • that is, each basic processing circuit starts the second mapping circuit to perform correlation operations,
  • such as multiplication and addition operations, on the data in the matrix S and the vector P according to the first mask matrix of the matrix S and the second mask matrix of the vector P.
  • in other words, the first mask matrix and the second mask matrix are used to select the data in the matrix S and the vector P whose absolute values are greater than a preset threshold, and the correlation operation, such as a multiplication operation, is performed on the selected data.
  • for example, the basic processing circuit receives the data of two rows in the matrix S as a matrix S0, together with the first mask matrix associated with S0; part of the data received from the vector P is the vector P0 = [1 0.01 1.1 0.6]T, together with the second mask vector [1 0 1 1]T associated with P0. The basic processing circuit can then enable the second mapping circuit to first perform an element-wise AND operation between the first mask matrix and [1 0 1 1]T to obtain the connection mask matrix, and then process the received matrix S0 and vector P0 by using the connection mask matrix, obtaining the processed matrix S0 and the processed vector P0 = [1 0 0 0.6]T, so that the basic processing circuit performs the associated arithmetic operations on the processed matrix S0 and the processed vector P0.
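The masked inner product that a basic processing circuit might perform with the connection identification data can be sketched as follows; the vector P0 and its mask are taken from the example above, while the row of S and its mask are hypothetical stand-ins for the unshown S0:

```python
def masked_inner_product(row, vec, row_mask, vec_mask):
    """Only positions where the connection identification data (the AND of
    the two masks) is 1 contribute to the inner product."""
    conn = [a & b for a, b in zip(row_mask, vec_mask)]
    return sum(r * v for r, v, c in zip(row, vec, conn) if c)

p0 = [1, 0.01, 1.1, 0.6]      # vector P0 from the example above
p0_mask = [1, 0, 1, 1]        # its second mask vector
s_row = [2.0, 3.0, 0.0, 4.0]  # hypothetical row of S
s_mask = [1, 1, 0, 1]         # its hypothetical identification data

value = masked_inner_product(s_row, p0, s_mask, p0_mask)
```

Here the connection mask is [1 0 0 1], so only the first and last positions are multiplied and accumulated.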
  • after each basic processing circuit receives the data block to be calculated, such as the data of a certain row/column in the matrix S or the vector P together with the identification data corresponding to it in the mask matrix, if its buffer/storage space is full,
  • the basic processing circuit will no longer receive new input data; for example, the main processing circuit will stop sequentially transmitting the data of the matrix S or the vector P and the identification data corresponding to it in the mask matrix until there is sufficient buffer/storage space in the basic processing circuit.
  • the basic processing circuit calculates the multiplication of one or more sets of two data at a time, and then accumulates the result in a register and/or in the on-chip buffer;
  • the basic processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result in a register and/or in the on-chip buffer;
  • the data received by the basic processing circuit may also be an intermediate result, stored in a register or an on-chip buffer;
  • the basic processing circuit transmits the local calculation result to the next basic processing circuit or main processing circuit connected thereto;
  • in one case, only the output interface of the last basic processing circuit of each column is connected to the main processing circuit.
  • in this case, only the last basic processing circuit can directly transmit the local
  • calculation result to the main processing circuit, while the calculation results of the other basic processing circuits are transmitted to the next basic processing circuit, which in turn transfers them to its next basic processing circuit, until all results are transmitted to the last basic processing circuit.
  • the last basic processing circuit performs an accumulation calculation on the local calculation result and the results of the other basic processing circuits received in this column to obtain an intermediate result, and sends the intermediate result to the main processing circuit; of course, the last basic processing circuit
  • may also send the results of the other basic processing circuits in this column and the local processing result directly to the main processing circuit.
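The column-wise result chain can be modeled as a toy sketch (in hardware, forwarding and accumulation happen transfer by transfer; here they are collapsed into one function):

```python
def column_accumulate(local_results):
    """All local results of one column, as they arrive at the last circuit.

    Intermediate circuits forward their local results down the column; the
    last circuit accumulates them with its own result into one intermediate
    result for the main processing circuit.
    """
    total = 0.0
    for r in local_results:
        total += r  # accumulate in a register and/or the on-chip buffer
    return total

intermediate = column_accumulate([1.0, 2.5, 4.0])
```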
  • in another case, each basic processing circuit has an output interface connected to the main processing circuit; in this case, each basic processing circuit directly transmits the local calculation result to the main processing circuit;
  • after receiving a calculation result transmitted by another basic processing circuit, the basic processing circuit transmits it to the next basic processing circuit or the main processing circuit connected thereto.
  • the main processing circuit receives the results of the M inner product operations as the operation result of the matrix-multiply-vector.
  • the first mapping circuit of the main processing circuit acquires the mask matrices corresponding to the matrix S and the matrix P; for example, the first mapping circuit is enabled to process the matrix S and the matrix P respectively, obtaining the first mask matrix corresponding to the matrix S and the second mask matrix corresponding to the matrix P;
  • the control circuit of the main processing circuit transmits the data in part or all of the rows of the matrix S, through the lateral data input interface, to the basic processing circuits directly connected to the main processing circuit (e.g., the gray-filled vertical data path at the top of Figure 1b); at the same time, the control circuit also transmits the identification data corresponding to some or all of those rows in the first mask matrix to the basic processing circuits connected thereto. For example, the control circuit transmits the first two rows of data in the matrix S, together with the first two rows of identification data corresponding to them in the first mask matrix, to the basic circuits connected to the main processing circuit.
  • the control circuit of the main processing circuit sends a number or a part of the data of a certain row in the matrix S to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st time the 1st number in the 3rd row is sent,
  • the 2nd time the 2nd number in the 3rd row,
  • the 3rd time the 3rd number in the 3rd row...; or the 1st time the first two numbers in the 3rd row, the 2nd time the 3rd and 4th numbers in the 3rd row, the 3rd time the 5th and 6th numbers in the 3rd row...);
  • correspondingly, the control circuit also transmits one or a part of the identification data corresponding to that row of the matrix S in the first mask matrix to the basic processing circuit each time.
  • the control circuit of the main processing circuit sends the data of certain rows in the matrix S, together with the identification data of the corresponding rows in the first mask matrix, a number or a part at a time, to a certain basic
  • processing circuit (for example, for a certain basic processing circuit, the 1st time the 1st number in each of the 3rd, 4th and 5th rows is sent, the 2nd time the 2nd number in each of the 3rd, 4th and 5th rows,
  • the 3rd time the 3rd number in each of the 3rd, 4th and 5th rows...; or the 1st time the first two numbers in each of the 3rd, 4th and 5th rows, the 2nd time the 3rd and 4th numbers in each of the 3rd, 4th and 5th rows, the 3rd time the 5th and 6th numbers in each of the 3rd, 4th and 5th rows...);
  • the control circuit of the main processing circuit sends the data in some or all of the columns of the matrix P to the basic processing circuits directly connected to the main processing circuit through the vertical data input interface (e.g., the gray-filled horizontal data path on the left of the basic processing circuit array in Figure 1b); at the same time, the control circuit also transmits the identification data corresponding to some or all of those columns in the second mask matrix to the basic processing circuits connected thereto. For example, the control circuit sends the first two columns of data in the matrix P, together with the first two columns of identification data corresponding to them in the second mask matrix, to the basic circuits connected to the main processing circuit.
  • the control circuit of the main processing circuit sends a number or a part of the data of a certain column in the matrix P to a certain basic processing circuit each time (for example, for a certain basic processing circuit,
  • the 1st time the 1st number in the 3rd column is sent,
  • the 2nd time the 2nd number in the 3rd column,
  • the 3rd time the 3rd number in the 3rd column...; or the 1st time the first two numbers in the 3rd column, the 2nd time the 3rd and 4th numbers in the 3rd column, the 3rd time the 5th and 6th numbers in the 3rd column...);
  • correspondingly, the control circuit also transmits one or a part of the identification data corresponding to that column of the matrix P
  • in the second mask matrix to the basic processing circuit each time.
  • the control circuit of the main processing circuit sends the data of certain columns in the matrix P, together with the identification data of the corresponding columns in the second mask matrix, a number or a part at a time, to a certain basic
  • processing circuit (for example, for a certain basic processing circuit, the 1st time the 1st number in each of the 3rd, 4th and 5th columns is sent, the 2nd time the 2nd number in each of the 3rd, 4th and 5th columns,
  • the 3rd time the 3rd number in each of the 3rd, 4th and 5th columns...; or the 1st time the first two numbers in each of the 3rd, 4th and 5th columns, the 2nd time the 3rd and 4th numbers in each of the 3rd, 4th and 5th columns, the 3rd time the 5th and 6th numbers in each of the 3rd, 4th and 5th columns...);
  • after receiving the data of the matrix S and the identification data of the first mask matrix associated with the matrix S, the basic processing circuit transmits the data (specifically, the data of the matrix S and the identification data corresponding to it in the first mask matrix) through its
  • horizontal data output interface to the next basic processing circuit (e.g., the white-filled horizontal data path in the middle of the basic processing circuit array in FIG. 1b); after the basic processing circuit receives the data of the matrix P, it transmits the data through its vertical data output interface to the next basic processing circuit (e.g., the white-filled vertical data path in the middle of the basic processing circuit array in FIG. 1b);
  • Each of the basic processing circuits performs operations on the received data. Specifically, each of the basic processing circuits receives data of a certain row or rows of the matrix S and the data corresponding to the first identification data associated with the first mask matrix, The data of a column or columns in the matrix P and the data corresponding to the second identifier data associated with the second mask data; the connection identifier data may be obtained according to the first identifier data and the second identifier data; and then the connection identifier is used The data determines whether or not the correlation operation is performed on the data in the matrix S and the data in the matrix P.
• The connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that the data at a certain position in the matrix S and the data at the same position in the matrix P both have absolute values greater than the preset threshold; 0 indicates that the data at a certain position in the matrix S and/or the data at the same position in the matrix P has an absolute value less than or equal to the preset threshold.
• Each basic processing circuit starts its second mapping circuit to select, according to the first mask matrix of the matrix S and the second mask matrix of the matrix P, the data whose identification data at the same position is 1, and performs the associated operations, such as multiplication and addition, on the selected data.
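As an illustrative sketch (plain Python, not the claimed circuit), the connection identification step can be modeled as follows: the two mask bits at each position are ANDed, and a multiply-accumulate is performed only where the combined bit is 1.

```python
# Hypothetical software model of the second mapping circuit's selection step.
# Mask bit 1 means |value| > threshold; 0 means the value was pruned.

def masked_inner_product(s_row, s_mask, p_col, p_mask):
    """Multiply-accumulate only the positions whose connection id is 1."""
    acc = 0
    for s, ms, p, mp in zip(s_row, s_mask, p_col, p_mask):
        conn = ms & mp          # connection identification data (AND)
        if conn:                # skip multiplications involving pruned data
            acc += s * p
    return acc

# Example: masks mark the surviving (unpruned) data.
s_row  = [2.0, 0.0, -3.0, 1.5]
s_mask = [1,   0,    1,   1]
p_col  = [4.0, 5.0,  0.0, 2.0]
p_mask = [1,   1,    0,   1]
print(masked_inner_product(s_row, s_mask, p_col, p_mask))  # only positions 0 and 3 contribute: 11.0
```

Only two of the four multiplications are actually performed, which is the point of the compression scheme.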
• Each basic processing circuit caches the data blocks to be calculated, such as the data of certain rows/columns of the matrix S or the matrix P and the corresponding identification data of the mask matrix. When the buffer/storage space of the basic processing circuit is full, it no longer receives new input data, for example, the data of a certain row/column of the matrix S or the matrix P and the corresponding identification data of the mask matrix subsequently transmitted by the main processing circuit, until there is sufficient buffer/storage space in the basic processing circuit, after which it receives the newly transmitted data of the main processing circuit.
• The basic processing circuit calculates the multiplication of one or more groups of two data at a time, and then accumulates the result in the register and/or the on-chip buffer;
• The basic processing circuit calculates the inner product of one or more groups of two vectors at a time, and then accumulates the result in the register and/or the on-chip buffer;
  • the result can be transmitted from the data output interface
  • the result of the calculation may be the final result or an intermediate result of the inner product operation
• If the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits transfer their operation results downward through the vertical output interface).
• After receiving a calculation result from another basic processing circuit, the basic processing circuit transmits that data to another basic processing circuit or the main processing circuit connected to it;
• The main processing circuit receives the inner product operation results of each basic processing circuit to obtain the output result.
  • the method uses a basic processing circuit array arranged in the manner shown in Figure 1b;
• The first mapping circuit of the main processing circuit acquires the mask matrix corresponding to each of the matrix S and the matrix P. For example, the first mapping circuit is started to process the matrix S and the matrix P respectively, obtaining the first mask matrix corresponding to the matrix S and the second mask matrix corresponding to the matrix P. Optionally, the processed matrix S and the processed matrix P may also be obtained; it is assumed here that the processed matrix S has h rows and the processed matrix P has w columns.
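As an illustrative sketch (an assumption, not the claimed first mapping circuit), the mask matrix and the processed matrix can be modeled like this: the mask marks the elements whose absolute value exceeds a preset threshold, and the processed matrix zeroes out the rest.

```python
# Hypothetical model of the first mapping circuit: build a mask matrix for S
# (or P) and a processed matrix in which pruned entries are zeroed.

def first_mapping(matrix, threshold):
    mask = [[1 if abs(v) > threshold else 0 for v in row] for row in matrix]
    processed = [[v if abs(v) > threshold else 0 for v in row] for row in matrix]
    return mask, processed

S = [[0.1, 2.0, -0.05],
     [3.0, 0.0,  1.2]]
mask_S, proc_S = first_mapping(S, threshold=0.5)
print(mask_S)   # [[0, 1, 0], [1, 0, 1]]
print(proc_S)   # [[0, 2.0, 0], [3.0, 0, 1.2]]
```

In a hardware realization only the surviving values and the mask bits would need to be transmitted, which is the stated data-transmission saving.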
• The control circuit of the main processing circuit divides the h rows of data of the matrix S into h groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of rows in that group of data is recorded as Hi); at the same time, the control circuit also sends the identification data in some or all of the rows of the first mask matrix corresponding to that data to the basic processing circuit connected to it. For example, the control circuit transmits the first two rows of data in the matrix S, together with the identification data of the corresponding first two rows in the first mask matrix, to a basic processing circuit connected to the main processing circuit.
• Here, the method of grouping the h rows of data is any grouping method that does not allocate repeatedly;
• For example, the control circuit of the main processing circuit distributes the j-th row to the j-th basic processing circuit;
• For example, a part of the rows may be distributed equally first, and the remaining rows may be allocated in an arbitrary manner.
• The control circuit of the main processing circuit divides the w columns of data of the matrix P into w groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of columns in that group of data is recorded as Wi); correspondingly, the control circuit also sends one or a part of the identification data in the columns of the second mask matrix corresponding to the columns of the matrix P to a certain basic processing circuit.
• Here, the method of grouping the w columns of data is any grouping method that does not allocate repeatedly;
• For example, the control circuit of the main processing circuit distributes the j-th column to the j-th basic processing circuit;
• For example, a part of the columns may be distributed equally first, and the remaining columns may be allocated in an arbitrary manner.
  • the control circuit of the main processing circuit transmits data in part or all of the rows of the matrix S to the first basic processing circuit of each row in the basic processing circuit array;
• For example, the control circuit of the main processing circuit transmits, each time, one or more data of one row of the i-th group of data Hi that it is responsible for to the first basic processing circuit of the i-th row in the basic processing circuit array; at the same time, the identification data of the mask matrix corresponding to the i-th group of data Hi can be sent to that first basic processing circuit in the same manner;
• For example, the control circuit of the main processing circuit transmits, each time, one or more data of each of some or all of the rows of the i-th group of data Hi that it is responsible for to the first basic processing circuit of the i-th row in the basic processing circuit array; the identification data of the mask matrix corresponding to the i-th group of data Hi can be sent to that first basic processing circuit in the same manner;
• The control circuit of the main processing circuit sends the data in some or all of the columns of the matrix P to the first basic processing circuit of each column in the basic processing circuit array; at the same time, the control circuit also sends the identification data in the corresponding some or all of the columns of the second mask matrix to the basic processing circuit connected to it.
• For example, the control circuit sends the first two columns of data in the matrix P, together with the identification data of the corresponding first two columns in the second mask matrix, to a basic processing circuit connected to the main processing circuit.
• For example, the control circuit of the main processing circuit transmits, each time, one or more data of one column of the i-th group of data Wi that it is responsible for to the first basic processing circuit of the i-th column of the basic processing circuit array.
• For example, the control circuit of the main processing circuit transmits, each time, one or more data of each of some or all of the columns of the i-th group of data Wi that it is responsible for to the first basic processing circuit of the i-th column of the basic processing circuit array.
• After receiving the data of the matrix S, the basic processing circuit transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data path in the middle of the basic processing circuit array in FIG. 1b); after receiving the data of the matrix P, the basic processing circuit transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data path in the middle of the basic processing circuit array in FIG. 1b);
• Each basic processing circuit performs operations on the received data. Specifically, each basic processing circuit receives the data of a certain row or rows of the matrix S together with the corresponding first identification data from the first mask matrix, and the data of a certain column or columns of the matrix P together with the corresponding second identification data from the second mask matrix; connection identification data can be obtained from the first identification data and the second identification data, and the connection identification data then determines whether the associated operation is performed on the data in the matrix S and the data in the matrix P.
• The connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that the data at a certain position in the matrix S and the data at the same position in the matrix P both have absolute values greater than the preset threshold; 0 indicates that the data at a certain position in the matrix S and/or the data at the same position in the matrix P has an absolute value less than or equal to the preset threshold.
• Each basic processing circuit starts its second mapping circuit to select, according to the first mask matrix of the matrix S and the second mask matrix of the matrix P, the data whose identification data at the same position is 1, and performs the associated operations, such as multiplication and addition, on the selected data.
• Each basic processing circuit caches the data blocks to be calculated, such as the data of certain rows/columns of the matrix S or the matrix P and the corresponding identification data of the mask matrix. When the buffer/storage space of the basic processing circuit is full, it no longer receives new input data, for example, the data of a certain row/column of the matrix S or the matrix P and the corresponding identification data of the mask matrix subsequently transmitted by the main processing circuit, until there is sufficient buffer/storage space in the basic processing circuit, after which it receives the newly transmitted data of the main processing circuit.
• The basic processing circuit calculates the multiplication of one or more groups of two data at a time, and then accumulates the result in the register and/or the on-chip buffer;
• The basic processing circuit calculates the inner product of one or more groups of two vectors at a time, and then accumulates the result in the register and/or the on-chip buffer;
  • the result can be transmitted from the data output interface
  • the result of the calculation may be the final result or an intermediate result of the inner product operation
• If the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits transfer their operation results downward through the vertical output interface).
• After receiving a calculation result from another basic processing circuit, the basic processing circuit transmits that data to another basic processing circuit or the main processing circuit connected to it;
• The main processing circuit receives the inner product operation results of each basic processing circuit to obtain the output result.
• The weight matrix of the fully connected layer is used as the matrix S and the input vector is used as the vector P, and the operation is performed by the matrix-multiply-vector method of the device.
• Alternatively, the weight matrix of the fully connected layer is used as the matrix S and the input is used as the matrix P, or the weight matrix of the fully connected layer is used as the matrix P and the input is used as the matrix S, and the operation is performed by the matrix-multiply-matrix method of the device;
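A minimal numerical sketch (plain Python, a reference model rather than the device) of treating a fully connected layer as a matrix product, with the weight matrix as S and the input as P:

```python
# Hypothetical reference model: fully connected layer as S (weights) x P (inputs).
def matmul(S, P):
    h, k, w = len(S), len(P), len(P[0])
    return [[sum(S[i][t] * P[t][j] for t in range(k)) for j in range(w)]
            for i in range(h)]

W = [[1, 2],        # weight matrix S: 2 outputs x 2 inputs
     [3, 4]]
x = [[5],           # input vector P, written as a 2x1 matrix
     [6]]
print(matmul(W, x))  # [[17], [39]]
```

If P has several columns (a batch of input vectors), the same routine yields the matrix-multiply-matrix case.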
  • the convolution operation is performed using the circuit device:
  • the convolution operation is described below.
• In the following figures, one block represents one data element.
  • the input data is represented by Figure 3a (N samples, each sample has C channels, and the feature map of each channel has a height H and a width W).
  • the weight, that is, the convolution kernel is represented by Figure 3b (there are M convolution kernels, each convolution kernel has C channels, and the height and width are KH and KW, respectively).
• The rules of the convolution operation are the same for each of the N samples of input data. The following describes the process of convolution on one sample: on one sample, each of the M convolution kernels performs the same operation; each convolution kernel operation obtains one plane feature map, and the M convolution kernels finally compute M plane feature maps (for one sample, the convolution output is M feature maps). For one convolution kernel:
  • FIG. 3c shows a convolution kernel performing inner product operations at the position of the lower right corner of a sample of input data.
  • Fig. 3d shows that the position of the convolution slides one space to the left
  • Fig. 3e shows that the position of the convolution slides up one space.
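The sliding inner product described above can be sketched in plain Python (an illustrative reference model, not the circuit): one kernel of shape C x KH x KW slides over one sample of shape C x H x W, and each kernel produces one feature map.

```python
# Hypothetical reference model of the convolution described above
# (stride 1, no padding): M kernels over one sample -> M feature maps.

def conv2d_one_sample(sample, kernels):
    C, H, W = len(sample), len(sample[0]), len(sample[0][0])
    M = len(kernels)
    KH, KW = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for m in range(M):
        # inner product of the kernel with the sample at each sliding position
        fmap = [[sum(sample[c][i + p][j + q] * kernels[m][c][p][q]
                     for c in range(C) for p in range(KH) for q in range(KW))
                 for j in range(W - KW + 1)]
                for i in range(H - KH + 1)]
        out.append(fmap)
    return out

sample  = [[[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]]          # C=1, H=W=3
kernels = [[[[1, 0],
             [0, 1]]]]           # M=1, C=1, KH=KW=2
print(conv2d_one_sample(sample, kernels))  # [[[6, 8], [12, 14]]]
```

Each output value is exactly the inner product at one convolution position, matching the sliding shown in FIG. 3c to FIG. 3e.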
• The first mapping circuit of the main processing circuit can process the data in some or all of the convolution kernels of the weights to obtain the corresponding mask data and the processed weight data (that is, the data in some or all of the convolution kernels of the processed weights).
• The control circuit of the main processing circuit sends the data in some or all of the convolution kernels of the weights (the data can be the original weight data or the processed weight data) through the horizontal data input interface to those basic processing circuits directly connected to the main processing circuit (for example, the uppermost gray-filled vertical data path in FIG. 1b); at the same time, the control circuit sends the mask data associated with that data to the basic processing circuits connected to the main processing circuit;
• The control circuit of the main processing circuit sends one number or a part of the numbers of the data of a certain convolution kernel of the weights to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the first number of the 3rd row is transmitted the first time, the second number of the 3rd row is transmitted the second time, the third number of the 3rd row is transmitted the third time, and so on; or the first two numbers of the 3rd row are transmitted the first time, the third and fourth numbers of the 3rd row the second time, and so on);
• Correspondingly, the control circuit also sends the mask data corresponding to that convolution kernel of the weights, one number or a part of the numbers at a time, to the above basic processing circuit;
• The control circuit of the main processing circuit sends one number or a part of the numbers of the data of some convolution kernels of the weights to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the first number in each of the 3rd, 4th and 5th rows is transmitted the first time, the second number in each of the 3rd, 4th and 5th rows is transmitted the second time, the third number in each the third time, and so on; or the first two numbers in each of the 3rd, 4th and 5th rows are transmitted the first time, and so on);
• Correspondingly, the control circuit also sends the mask data associated with those convolution kernels of the weights, one number or a part of the numbers at a time, to the basic processing circuit in the same manner;
• The control circuit of the main processing circuit divides the input data according to the positions of the convolution, and sends the data in some or all of the convolution positions of the input data through the vertical data input interface to those basic processing circuits directly connected to the main processing circuit; at the same time, the control circuit also divides the mask data associated with the input data according to the positions of the convolution, and correspondingly sends the mask data corresponding to the data in some or all of the convolution positions of the input data to the basic processing circuits electrically connected to the main processing circuit;
• The control circuit of the main processing circuit sends one number or a part of the numbers of the data of a certain convolution position in the input data, together with the mask data associated with that data, to a certain basic processing circuit each time; for example, for a certain basic processing circuit, the first number of the 3rd column is transmitted the first time, the second number of the 3rd column is transmitted the second time, the third number of the 3rd column is transmitted the third time, and so on; or the first two numbers of the 3rd column are transmitted the first time, the third and fourth numbers of the 3rd column the second time, and the fifth and sixth numbers of the 3rd column the third time, and so on;
• The control circuit of the main processing circuit sends one number or a part of the numbers of the data of some convolution positions in the input data, together with the mask data associated with that data, to a certain basic processing circuit each time; for example, for a certain basic processing circuit, the first number in each of the 3rd, 4th and 5th columns is transmitted the first time, the second number in each of the 3rd, 4th and 5th columns the second time, the third number in each of the 3rd, 4th and 5th columns the third time, and so on; or the first two numbers in each of the 3rd, 4th and 5th columns are transmitted the first time, the third and fourth numbers in each of the 3rd, 4th and 5th columns the second time, and the fifth and sixth numbers in each of the 3rd, 4th and 5th columns the third time, and so on;
• After the basic processing circuit receives the data of the weights (specifically, the data of the convolution kernels in the weights (referred to as weight data) and/or the mask data associated with the weight data), it transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data path in the middle of the basic processing circuit array in FIG. 1b); after the basic processing circuit receives the input-side data (which can be the input data and the mask data associated with the input data), it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data path in the middle of the basic processing circuit array in FIG. 1b);
• That is, the control circuit of the main processing circuit may send the input data and the mask data associated with the input data to the basic processing circuits, and the basic processing circuits receive the input data and the associated mask data;
• Each basic processing circuit operates on the received data; specifically, the basic processing circuit can enable its second mapping circuit to obtain connection identification data from the mask data associated with the input data and the mask data associated with the weight data (i.e., the mask data associated with the convolution kernels of the weights), and then use the connection identification data to select, from the input data and the weight data, the data whose absolute values are both greater than the preset threshold, and multiply them;
• Each basic processing circuit caches the data blocks to be calculated, such as the data in the convolution kernels of the weights and their mask data, or the input data and its associated mask data. When the buffer/storage space of the basic processing circuit is full, it no longer receives new input data, such as the data in some convolution kernels of the weights and the associated mask data subsequently sent by the main processing circuit, until the basic processing circuit has sufficient buffer/storage space, after which it receives the newly transmitted data of the main processing circuit.
  • the base processing circuit calculates a multiplication of one or more sets of two data at a time, and then accumulates the result on a register and/or an on-chip buffer;
  • the base processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result on the register and/or the on-chip buffer;
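The accumulate-in-register behavior can be sketched as follows (plain Python, an assumption about the dataflow): the operands of one long inner product arrive in several transmissions, and the partial sums are accumulated locally until the full result is ready.

```python
# Hypothetical model of a basic processing circuit accumulating one inner
# product that arrives in batches: each call multiplies one group of operand
# pairs and accumulates the partial result in a "register".

class BasicCircuitMAC:
    def __init__(self):
        self.register = 0.0               # models the on-chip accumulator

    def receive(self, weights, inputs):   # one batch of operand pairs
        self.register += sum(w * x for w, x in zip(weights, inputs))

mac = BasicCircuitMAC()
mac.receive([1, 2], [3, 4])   # first transmission:  1*3 + 2*4 = 11
mac.receive([5],    [6])      # second transmission: 5*6 = 30
print(mac.register)           # 41.0
```

Accumulating partial results locally is what lets the main processing circuit stream operands in small pieces without holding the whole vector in any one circuit.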
  • the result can be transmitted from the data output interface
  • the result of the calculation may be the final result or an intermediate result of the inner product operation
• If the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits transfer their operation results downward through the vertical output interface).
• After receiving a calculation result from another basic processing circuit, the basic processing circuit transmits that data to another basic processing circuit or the main processing circuit connected to it;
• The main processing circuit receives the inner product operation results of each basic processing circuit to obtain the output result.
  • the steps of neural network training include:
  • Each layer in a (multi-layer) neural network performs a forward operation in sequence
• The entire training process requires this procedure to be executed repeatedly (i.e., multiple iterations).
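A miniature, self-contained sketch of the iterative structure described above (plain Python with an assumed one-weight "layer", not the device): each iteration runs the forward operation, then the reverse operation, then a weight update.

```python
# Hypothetical miniature of the training loop: one "layer" y = w * x,
# squared-error loss, repeated forward/reverse/update iterations.

class Linear:
    def __init__(self, w):
        self.w, self.x, self.dw = w, None, 0.0

    def forward(self, x):
        self.x = x
        return self.w * x

    def backward(self, dy):            # dy: output data gradient
        self.dw = dy * self.x          # weight gradient
        return dy * self.w             # input data gradient for the layer below

    def update(self, lr=0.1):
        self.w -= lr * self.dw         # apply the weight gradient

layer = Linear(w=0.0)
for _ in range(50):                    # multiple iterations
    y = layer.forward(2.0)             # forward operation
    dy = 2 * (y - 6.0)                 # gradient of the loss (y - target)^2
    layer.backward(dy)                 # reverse operation
    layer.update()
print(round(layer.w, 3))               # w converges toward 3.0
```

A real multi-layer network repeats the forward pass layer by layer and the reverse pass in the opposite order, as the following bullets describe.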
• The device is used for training of a neural network, the neural network including n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are each configured to perform compression processing of data in the neural network operation;
• The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
• The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the forward operation of the n layers of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation;
• The main processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result, obtain an n-th-layer reverse operation instruction and the first data block required by the n-th-layer reverse operation according to the training instruction, and determine, according to the operation control of the n-th reverse operation instruction, whether to start the first mapping circuit to process the first data block, the first data block including the horizontal data block and/or the vertical data block; and to transmit the processed first data block, according to the n-th reverse operation instruction, to at least one basic processing circuit among the basic processing circuits connected to the main processing circuit;
• The plurality of basic processing circuits are configured to determine, according to the operation control of the n-th reverse operation instruction, whether to start the second mapping circuit to process the second data block, perform operations in the neural network in parallel according to the processed second data block to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit;
• The second data block is the data block that the basic processing circuit determines to have received from the main processing circuit, and the second data block is associated with the processed first data block;
• The main processing circuit is further configured to process the operation results to obtain an n-th-layer weight group gradient and an n-th-layer input data gradient, and to update the n-th-layer weight group data by applying the n-th-layer weight group gradient.
• The integrated circuit chip device is further configured to use the n-th-layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer to perform the (n-1)-th-layer reverse operation to obtain an (n-1)-th-layer weight group gradient, and to apply the (n-1)-th-layer weight group gradient to update the weight group data of the corresponding layer; the weight group data includes at least two weights.
  • each layer uses its own input data and weights to calculate corresponding output data according to an operation rule specified by a layer type;
  • the forward operation process (also called inference) of the neural network is a process of processing the input data of each layer layer by layer, and obtaining the output data after a certain calculation, which has the following characteristics:
  • the input to a layer can be the input data of the neural network
  • the input of one layer can be the output of other layers;
• the input of a layer can be the output of this layer at a previous moment (corresponding to the case of a recurrent neural network);
  • a layer can simultaneously acquire input from a plurality of the above input sources
  • the output of a layer can be used as the output of the neural network
  • the output of one layer can be the input of other layers
• the output of a layer can be the input of this layer at the next moment (corresponding to the case of a recurrent neural network);
  • the output of a layer may output a result to the plurality of output directions described above;
  • the types of operations of the layers in the neural network include, but are not limited to, the following:
• Normalization layer, including the LRN (Local Response Normalization) layer, the BN (Batch Normalization) layer, and the like;
• Activation layer, including but not limited to the following types: Sigmoid layer, ReLU layer, PReLU layer, LeakyReLU layer, Tanh layer;
• The inverse operation of a layer: the inverse operation of each layer needs to perform two parts of operations: one part is to calculate the weight gradient using the output data gradient, which may be sparsely represented, and the input data, which may be sparsely represented (for the weight update step to update the weights of this layer); the other part is to calculate the input data gradient using the output data gradient, which may be sparsely represented, and the weights, which may be sparsely represented (used as the output data gradient of the next layer in the inverse operation, for its inverse operation).
• The inverse operation propagates the gradient back from the last layer, in the reverse order of the forward operation.
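The two-part computation above can be sketched for a fully connected layer (plain Python, an illustrative assumption: with y = W x, the weight gradient is dW[i][j] = dy[i] * x[j] and the input gradient is dx = W-transpose times dy):

```python
# Hypothetical sketch of one layer's inverse operation (fully connected, y = W x):
# part 1: weight gradient from the output data gradient and the input data;
# part 2: input data gradient from the output data gradient and the weights.

def backward_fc(W, x, dy):
    rows, cols = len(W), len(W[0])
    dW = [[dy[i] * x[j] for j in range(cols)] for i in range(rows)]          # part 1
    dx = [sum(W[i][j] * dy[i] for i in range(rows)) for j in range(cols)]    # part 2
    return dW, dx

W  = [[1, 2],
      [3, 4]]
x  = [5, 6]        # input data of this layer
dy = [1, 1]        # output data gradient from the layer above
dW, dx = backward_fc(W, x, dy)
print(dW)  # [[5, 6], [5, 6]]
print(dx)  # [4, 6]
```

dW drives this layer's weight update, while dx is passed on as the output data gradient of the next layer in the reverse order.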
• The output data gradient used by a layer's inverse operation can come from:
• the input data gradient of this layer at a later moment (corresponding to the case of a recurrent neural network);
• a layer can simultaneously obtain the output data gradient from a plurality of the above sources.
  • the gradient of the weights of the layers is calculated.
• The first input buffer and the second input buffer of the device are used to store the weights of the layer and the weight gradient, respectively, and the weight gradient is then used to update the weights in the operation unit;
  • the operations mentioned above are all operations of a layer in a neural network.
• The implementation process is as follows: in the forward operation, after the forward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output data calculated in the operation unit as the input data of the next layer (or performs certain operations on the output data before taking it as the input data of the next layer), and at the same time replaces the weights with the weights of the next layer; in the reverse operation, after the reverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input data gradient calculated in the operation unit as the output data gradient of the next layer and performs the operation, and at the same time replaces the weights with the weights of the next layer.
• The tensor-multiply-tensor operation is performed as shown in FIG. 1a. The tensor is the same as the data block described above, and may be a matrix, a vector, a three-dimensional data block, a four-dimensional data block, or a higher-dimensional data block.
• An operation of multiplying a matrix by a matrix: when the forward operation indicated by the first operation instruction is a matrix-multiply-matrix operation, the input data is the first matrix of the matrix-multiply-matrix operation, and the weight is the second matrix of the matrix-multiply-matrix operation.
• Step S401b: the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit saves the received data in its on-chip buffer and/or register;
• Optionally, the data of the matrix S is processed data. Specifically, the main processing circuit enables the first mapping circuit to process the matrix S, thereby obtaining the processed matrix S and the first mask matrix associated with the matrix S; or the first mapping circuit of the main processing circuit processes the matrix S according to the pre-stored first mask matrix associated with the matrix S to obtain the processed matrix S.
• Correspondingly, each row of data in the processed matrix S, together with the identification data of the corresponding row in the first mask matrix, is sent by the control circuit to one or more of the K basic processing circuits.
• When the main processing circuit sends data to the basic processing circuits, only the data in the processed matrix S whose absolute value is greater than the preset threshold, or the non-zero data, may be sent to the basic processing circuits, so as to reduce the amount of data transmission.
• For example, the control circuit of the main processing circuit distributes one row of the matrix S to each of the M basic processing circuits; optionally, the identification data of the corresponding row in the first mask matrix is also sent along with that row.
• For example, the control circuit of the main processing circuit distributes one or more rows of data in the matrix S to each basic processing circuit; optionally, the identification data of the corresponding rows in the first mask matrix is also sent along with those rows;
• For example, the Mi rows of the matrix S are distributed to the i-th basic processing circuit, and the set of these Mi rows is called Ai; the calculation to be performed by the i-th basic processing circuit is shown in FIG. 2e.
• Each basic processing circuit, for example the i-th basic processing circuit, stores the received matrix Ai distributed by the main processing circuit in the register and/or on-chip buffer of the i-th basic processing circuit; this has the advantages of reducing the amount of subsequent data transmission, improving calculation efficiency, and reducing power consumption.
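A small sketch of step S401b (plain Python, assumed structure, not the circuit): the rows of S are dealt to K circuits without repetition, and circuit i caches its own row set Ai locally.

```python
# Hypothetical model of S401b: distribute the rows of S over K basic
# processing circuits; circuit i caches its row set Ai locally.

def distribute_rows(S, K):
    caches = [[] for _ in range(K)]       # per-circuit on-chip buffer model
    for idx, row in enumerate(S):
        caches[idx % K].append(row)       # each row goes to exactly one circuit
    return caches

S = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
A = distribute_rows(S, K=2)
print(A[0])  # [[1, 2], [5, 6], [9, 10]]  -> A1
print(A[1])  # [[3, 4], [7, 8]]           -> A2
```

Round-robin dealing is just one valid choice; as stated above, any grouping that does not allocate a row twice works.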
  • Step S402b the control circuit of the main processing circuit transmits each part of the matrix P to each basic processing circuit in a broadcast manner;
  • the data (parts) of the matrix P can be processed data.
  • the main processing circuit enables the first mapping circuit to process the matrix P, thereby obtaining the processed matrix P and a second mask matrix associated with the matrix P.
  • the first mapping circuit of the main processing circuit processes the matrix P according to a pre-stored second mask matrix associated with the matrix P to obtain the processed matrix P.
  • the data of the processed matrix P (i.e., each part), together with the corresponding identification data in the second mask matrix, is sent by the control circuit to one or more of the K basic processing circuits.
  • when the main processing circuit sends data to the basic processing circuits, specifically, only the data in the processed matrix P whose absolute value is greater than a preset threshold, or only the non-zero data, may be sent, so as to reduce the amount of data transmitted.
  • each part of the matrix P may be broadcast only once to the register or on-chip buffer of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the matrix P obtained this time, completing the inner product operation corresponding to each row of the matrix Ai; multiplexing in this embodiment means that the basic processing circuit reuses data repeatedly during calculation, for example using the data of the matrix P many times.
  • the control circuit of the main processing circuit may broadcast the parts of the matrix P to the register or on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit does not multiplex the data of the matrix P obtained each time, completing the inner product operations corresponding to each row of the matrix Ai in stages;
  • the control circuit of the main processing circuit may broadcast the parts of the matrix P to the register or on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit partially multiplexes the data of the matrix P obtained each time, completing the inner product operation corresponding to each row of the matrix Ai;
  • each of the basic processing circuits calculates an inner product of the data of the matrix Ai and the data of the matrix P;
  • Step S403b: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits them back to the main processing circuit.
  • in step S403b, the inner product operator circuit of the basic processing circuit needs to calculate the inner product of the data of the matrix S and the data of the matrix P.
  • the basic processing circuit receives the data of the processed matrix S together with the corresponding identification data associated in the first mask matrix, and also receives the data of the processed matrix P.
  • the basic processing circuit enables the second mapping circuit to process the data of the received matrix P according to the identification data in the received first mask matrix to obtain the data of the processed matrix P.
  • the basic processing circuit enables the inner product operator circuit to perform an inner product operation on the received data in the processed matrix S and the processed matrix P data to obtain a result of the inner product operation.
  • the basic processing circuit receives the data of the processed matrix P together with the corresponding identification data associated in the second mask matrix, and also receives the data of the processed matrix S.
  • the basic processing circuit enables the second mapping circuit to process the data of the received matrix S according to the identification data in the received second mask matrix to obtain the data of the processed matrix S.
  • the basic processing circuit enables the inner product operator circuit to perform an inner product operation on the received data of the processed matrix P and the processed data in the matrix S to obtain a result of the inner product operation.
  • the basic processing circuit receives the data of the processed matrix S together with the corresponding identification data associated in the first mask matrix, and also receives the data of the processed matrix P together with the corresponding identification data associated in the second mask matrix.
  • the basic processing circuit enables the second mapping circuit to obtain a relationship identification matrix from the identification data in the received first mask matrix and the identification data in the received second mask matrix; it then uses the identification data in the relationship identification matrix to process the received data of the matrix S and the data of the matrix P respectively, obtaining the data of the processed matrix S and the data of the processed matrix P.
  • the inner product operator circuit is enabled to perform an inner product operation on the data in the processed matrix S and the processed matrix P data to obtain a result of the inner product operation.
  • the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, the matrix P, and the second identification matrix associated with the matrix P; at this time, the second mapping circuit can be enabled to obtain the relationship identification matrix using Bi and the second identification matrix, and the relationship identification matrix is then used to process the matrix Ai and the matrix P, simultaneously or separately, to obtain the processed matrix Ai and the processed matrix P.
  • the inner product operator circuit is enabled to perform an inner product operation on the processed matrix Ai and the processed matrix P.
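The role of the relationship identification matrix can be sketched numerically (an illustration under the assumption that masks are 0/1 lists; the names are hypothetical): it is the elementwise AND of the two masks, and only positions retained by both operands contribute to the inner product:

```python
def inner_product_with_relation_mask(a_row, mask_a, p_col, mask_p):
    # Relationship identification matrix: positions kept by BOTH masks.
    rel = [x & y for x, y in zip(mask_a, mask_p)]
    # Multiplications at eliminated positions are skipped entirely.
    return sum(a * p for a, p, r in zip(a_row, p_col, rel) if r)
```

For example, with a row [1, 0, 3, 4] masked as [1, 0, 1, 1] and a column [2, 5, 0, 1] masked as [1, 1, 0, 1], only positions 0 and 3 are multiplied, giving 1*2 + 4*1 = 6.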
  • the basic processing circuit may accumulate the partial sum obtained from each inner product operation and transmit it back to the main processing circuit;
  • alternatively, the partial sums obtained from the inner product operations performed by each basic processing circuit may be stored in a register and/or on-chip buffer of the basic processing circuit and transmitted back to the main processing circuit after accumulation is complete;
  • FIG. 6b provides a method for implementing a matrix-multiply-vector operation, which may specifically include:
  • Step S401: the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit saves the received distribution data in its on-chip buffer and/or register;
  • the data of the matrix S is processed data.
  • the main processing circuit enables the first mapping circuit to process the matrix S, thereby obtaining the processed matrix S and a first mask matrix associated with the matrix S.
  • the first mapping circuit of the main processing circuit processes the matrix S according to a pre-stored first mask matrix associated with the matrix S to obtain the processed matrix S.
  • each row of data in the processed matrix S, together with the corresponding identification data in the first mask matrix, is sent by the control circuit to one or more of the K basic processing circuits.
  • when the main processing circuit sends data to the basic processing circuits, only the data in the processed matrix S whose absolute value is greater than the preset threshold, or only the non-zero data, may be sent, so as to reduce the amount of data transmitted.
  • the set of rows of the processed matrix S distributed to the i-th basic processing circuit is Ai, comprising Mi rows in total; correspondingly, the identification matrix Bi corresponding to Ai is also distributed, where Bi is a part of the first mask matrix comprising Mi rows or more in total.
  • the control circuit of the main processing circuit distributes one row of the matrix S to each of the K basic processing circuits; optionally, the identification data corresponding to that row is also sent.
  • the control circuit of the main processing circuit distributes one or more rows of data in the matrix S to each of the basic processing circuits.
  • optionally, the identification data corresponding to those one or more rows in the first mask matrix is also sent;
  • the set of rows of S distributed to the i-th basic processing circuit is Ai, comprising Mi rows in total; Fig. 2c shows the calculation to be performed on the i-th basic processing circuit.
  • the received distribution data, such as the matrix Ai, may be stored in a register and/or on-chip buffer of the i-th basic processing circuit.
  • Step S402 the control circuit of the main processing circuit transmits the parts in the vector P to the K basic processing circuits in a broadcast manner;
  • the data (parts) of the vector P may be processed data.
  • the main processing circuit enables the first mapping circuit to process the vector P to obtain a processed vector P and a second mask matrix associated with the vector P.
  • the first mapping circuit of the main processing circuit processes the vector P according to a pre-stored second mask matrix associated with the vector P to obtain the processed vector P.
  • the data of the processed vector P (i.e., each part), together with the corresponding identification data in the second mask matrix, is sent by the control circuit to one or more of the K basic processing circuits.
  • when the main processing circuit sends data to the basic processing circuits, specifically, only the data of the processed vector P whose absolute value is greater than a preset threshold, or only the non-zero data, may be sent, so as to reduce the amount of data transmitted.
  • the control circuit of the main processing circuit may broadcast each part of the vector P only once to the register or on-chip buffer of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the vector P obtained this time, completing the inner product operation corresponding to each row of the matrix Ai.
  • the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit does not multiplex the data of the vector P obtained each time, completing the inner product operations corresponding to each row of the matrix Ai in stages; the advantage is that the amount of vector P data transmitted in a single transfer inside the basic processing circuit is reduced, the capacity of the buffer and/or register of the basic processing circuit can be reduced, execution efficiency is improved, transmission power consumption is reduced, and cost is reduced.
  • the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit partially multiplexes the data of the vector P obtained each time, completing the inner product operation corresponding to each row of the matrix Ai; the advantage is that the amount of data transmitted from the main processing circuit to the basic processing circuit is reduced, the amount of data transmitted inside the basic processing circuit is also reduced, execution efficiency is improved, and transmission power consumption is reduced.
  • Step S403: the inner product operator circuits of the K basic processing circuits calculate the inner product of the data of the matrix S and the vector P; for example, the i-th basic processing circuit calculates the inner product of the data of the matrix Ai and the data of the vector P;
  • the basic processing circuit receives the data of the processed matrix S together with the corresponding identification data associated in the first mask matrix, and also receives the data of the processed vector P.
  • the basic processing circuit enables the second mapping circuit to process the data of the received vector P according to the identification data in the received first mask matrix to obtain the data of the processed vector P.
  • the basic processing circuit enables the inner product operator circuit to perform an inner product operation on the received data in the processed matrix S and the processed vector P data to obtain a result of the inner product operation.
  • the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, and the vector P; at this time, the second mapping circuit can be enabled to process the vector P using Bi to obtain the processed vector P; the inner product operator circuit then performs an inner product operation on the matrix Ai and the processed vector P.
  • the basic processing circuit receives the data of the processed vector P together with the corresponding identification data associated in the second mask matrix, and also receives the data of the processed matrix S.
  • the basic processing circuit enables the second mapping circuit to process the data of the received matrix S according to the identification data in the received second mask matrix to obtain the data of the processed matrix S.
  • the basic processing circuit enables the inner product operator circuit to perform an inner product operation on the received data of the processed vector P and the processed data in the matrix S to obtain a result of the inner product operation.
  • the i-th basic processing circuit receives the matrix Ai, the processed vector P, and the second identification matrix associated with the vector P; at this time, the second mapping circuit can be enabled to process Ai using the second identification matrix to obtain the processed matrix Ai; the inner product operator circuit is then enabled to perform an inner product operation on the processed matrix Ai and the processed vector P.
  • the basic processing circuit receives the data of the processed matrix S together with the corresponding identification data associated in the first mask matrix, and also receives the data of the processed vector P together with the corresponding identification data associated in the second mask matrix.
  • the basic processing circuit enables the second mapping circuit to obtain a relationship identification matrix from the identification data in the received first mask matrix and the identification data in the received second mask matrix; it then uses the identification data in the relationship identification matrix to process the received data of the matrix S and the data of the vector P respectively, obtaining the data of the processed matrix S and the data of the processed vector P.
  • the inner product operator circuit is enabled to perform an inner product operation on the data in the processed matrix S and the processed vector P data to obtain a result of the inner product operation.
  • the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, the vector P, and the second identification matrix associated with the vector P; at this time, the second mapping circuit can be enabled to obtain the relationship identification matrix using Bi and the second identification matrix, and the relationship identification matrix is then used to process the matrix Ai and the vector P, simultaneously or separately, to obtain the processed matrix Ai and the processed vector P.
  • the inner product operator circuit is enabled to perform an inner product operation on the processed matrix Ai and the processed vector P.
  • Step S404: the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain accumulated results, and transmit the accumulated results back to the main processing circuit in fixed-point form.
  • each basic processing circuit may transmit the partial sum obtained from each inner product operation back to the main processing circuit for accumulation (a partial sum is a part of the accumulated result; for example, if the accumulated result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3); the advantage is that the amount of calculation inside the basic processing circuit is reduced and the computational efficiency of the basic processing circuit is improved.
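The partial-sum scheme in the F*G example above can be sketched as follows (illustrative only; the chunk length and function name are assumptions):

```python
def inner_product_in_parts(F, G, part_len):
    # The basic processing circuit computes one partial sum per chunk and
    # transmits each back; the main processing circuit adds them together.
    partials = []
    for i in range(0, len(F), part_len):
        partials.append(sum(f * g for f, g in zip(F[i:i+part_len], G[i:i+part_len])))
    return partials, sum(partials)  # partial sums, accumulated result
```

For F = G = [1, 2, 3, 4, 5] and chunks of length 3, the partial sums are 1*1+2*2+3*3 = 14 and 4*4+5*5 = 41, accumulating to 55.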
  • alternatively, the partial sums obtained from the inner product operations performed by each basic processing circuit may be stored in the register and/or on-chip buffer of the basic processing circuit and transmitted back to the main processing circuit after accumulation is complete; the advantage is that the amount of data transmitted between the basic processing circuit and the main processing circuit is reduced, operation efficiency is improved, and data transmission power consumption is reduced.
  • as a further alternative, the partial sums may in some cases be accumulated in the register and/or on-chip buffer of the basic processing circuit and in some cases transmitted to the main processing circuit for accumulation, being transmitted back to the main processing circuit after accumulation is complete; the advantage is that the amount of data transmitted between the basic processing circuit and the main processing circuit is reduced, operation efficiency is improved, data transmission power consumption is reduced, the amount of calculation inside the basic processing circuit is reduced, and the computational efficiency of the basic processing circuit is improved.
  • All or part of the data involved in the neural network training process may be processed data.
  • the processed data is obtained by the first mapping circuit and/or the second mapping circuit as described in the foregoing embodiments, and is not described again here.
  • a data block (i.e., any of the multiple input data blocks or the output data block), or sub-blocks divided from different portions of the same data block, may refer to a processed data block.
  • Figure 4c shows the specific calculation of neural network training for single-layer operation.
  • Figure 4c shows the forward operation of single-layer neural network.
  • the inverse of the single layer neural network is shown by the dashed line in Figure 4c.
  • the forward operation of the layer is performed according to the input data and the weight or parameter to obtain the output data; the output data gradient of the layer is then calculated from the output data according to a preset rule (the preset rule can be set by the manufacturer according to its own needs, and the specific operation steps of the preset rule are not limited here).
  • the inverse operation of the layer of the neural network can then be performed according to the input data, the weight or parameter of the layer, and the output data gradient, obtaining the gradient of the input data and the gradient of the weight or parameter of the layer; the weight gradient obtained by the calculation is used to update the weight or parameter of the layer accordingly, which completes the neural network training of the layer.
  • the data involved in the forward operation or the reverse operation may be the processed data.
  • the technical solution provided by the embodiments of the present application may, according to the inverse operation instruction of the layer, determine whether to enable the relevant mapping circuit (specifically, the first mapping circuit and/or the second mapping circuit) to process the input data and/or the weight, and then perform the layer operation using the processed input data and/or weight.
  • the following describes the structure of neural network training for matrix multiplication and convolution, taking FIG. 7a and FIG. 7b as examples.
  • the calculation mode of the layer shown in FIG. 7a is matrix multiplication
  • the operation mode of the layer shown in FIG. 7b is a convolution operation; assume that the input data and the weight of the layer are both matrices, and for convenience denote the input data as matrix I and the weight as matrix W.
  • a larger dimension can be understood as a larger sum of the number of rows and the number of columns of the matrix I and the matrix W; that is, the matrix I and the matrix W occupy more space in the memory and/or registers, and the amount of calculation is larger; in order to improve data processing efficiency, the matrix I and the matrix W need to be processed before the matrix multiplication operation is performed.
  • the matrix I is a sparse matrix of 1000*1000
  • the matrix W is also a sparse matrix of 1000*1000.
  • the sum of the number of rows and the number of columns is 2000, which is large, and the corresponding amount of calculation is even larger: the matrix multiplication requires 10^9 (i.e., 1000*1000*1000) multiplication operations for the inner products.
  • processing the two sparse matrices can greatly reduce the dimension (i.e., the amount of data) of the matrix I and the matrix W, thereby greatly reducing the amount of data transmitted and calculated, which in turn reduces transmission overhead and computational overhead.
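The scale of the saving can be illustrated with back-of-the-envelope arithmetic (the 1% density figure is an assumption for illustration; the embodiment does not fix a density):

```python
n = 1000
dense_mults = n * n * n        # inner-product multiplications for a dense 1000x1000 product
density = 0.01                 # assumed fraction of entries retained after mapping
# values actually transmitted for the two sparse operands I and W
sparse_values = int(2 * n * n * density)
```

At 1% density, only 20,000 values are transmitted instead of 2,000,000, and the bulk of the 10^9 multiplications can be skipped.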
  • a schematic diagram of the specific structure of multilayer neural network training is shown in FIG. 7c and FIG. 7d.
  • the direction of the dashed arrow shows a reverse operation.
  • the output of the inverse operation is the output data gradient; when the output data gradient is for the last layer of the iterative calculation of the multilayer neural network, the output data gradient is obtained by performing a preset operation on the output data of the last layer of the iterative calculation (the preset operation can be set by the manufacturer according to its own needs, and the specific operation steps of the preset operation are not limited here); if the output data gradient is not for the last layer of the iterative calculation of the multilayer neural network, for example the output data gradient is for the n-th layer of the iterative calculation, then the output data gradient of the n-th layer can be the input data gradient calculated by the inverse operation of the (n+1)-th layer.
  • FIG. 7d can be understood similarly: FIG. 7d can be a schematic diagram of multilayer convolutional neural network training (including forward operation and inverse operation), where the other operations in the figure can represent operations of layers other than the convolution layer, or operations between layers, without limitation.
  • the present disclosure also provides an integrated circuit chip device for performing training of a neural network, the neural network comprising a plurality of layers, the integrated circuit chip device comprising: a processing circuit and an external interface;
  • the external interface is configured to receive a training instruction
  • the processing circuit is configured to determine first layer input data and first layer weight data according to the training instruction, and perform the n layers of forward operations of the neural network using the first layer input data and the first layer weight data to obtain the n-th output result;
  • the processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result, obtain according to the training instruction an n-th inverse operation instruction of the n-th layer inverse operation together with the n-th layer input data and the n-th layer weight group data, and perform the n layers of inverse operations of the neural network according to the n-th inverse operation instruction, the n-th output result gradient, the n-th layer input data, and the n-th layer weight group data to obtain n weight gradients of the n-layer operation;
  • the processing circuit is further configured to update the n weights of the n-layer operation by applying the n weight gradients.
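One training iteration of a single matrix-multiplication layer, as described above (forward operation, output data gradient via a preset rule, inverse operation, weight update), can be sketched in pure software; the identity "preset rule" (dy = y) and the learning rate are assumptions purely for illustration:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def train_layer_once(x, W, lr=0.1):
    y = matmul(x, W)               # forward operation: output data
    dy = y                         # output data gradient via an assumed preset rule
    dW = matmul(transpose(x), dy)  # inverse operation: weight gradient
    dx = matmul(dy, transpose(W))  # inverse operation: input data gradient
    # update the layer's weight with its gradient, completing training of this layer
    W_new = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, dW)]
    return W_new, dx
```

In a multilayer network, dx would be passed on as the output data gradient of the preceding layer, matching the chaining described for FIG. 7c and FIG. 7d.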
  • the present disclosure also discloses a neural network computing device including one or more chips as shown in FIG. 8, configured to acquire data to be processed and control information from other processing devices, perform the specified neural network operations, and pass the execution results to peripheral devices through the I/O interface. Peripheral devices include, for example, cameras, monitors, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one chip as shown in FIG. 8 is included, the chips can be linked and transmit data through a specific structure, for example interconnected via the PCIE bus, to support larger-scale neural network operations. In this case, the chips can share the same control system or have separate control systems; they can share memory, or each accelerator can have its own memory. In addition, the interconnection method can be any interconnection topology.
  • the neural network computing device has high compatibility and can be connected to various types of servers through a PCIE interface.
  • a method of performing an offset operation using the circuit device
  • the vector operator circuit of the main processing circuit can realize the function of adding two vectors or two matrices
  • the vector operator circuit of the main processing circuit can be used to add a vector to each row of a matrix, or to each column.
  • the matrix may be derived from the result of the matrix multiplication matrix operation performed by the apparatus;
  • the vector may be from a result of the device performing a matrix multiplication vector operation
  • the matrix may be derived from data received externally by the main processing circuit of the device.
  • the vector may be derived from data received externally by the main processing circuit of the device.
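The offset (bias) operation described above amounts to broadcasting a vector across the rows of a matrix; a minimal sketch (the function name is assumed):

```python
def add_bias_rows(M, v):
    # The vector operator circuit adds the bias vector v to every row of M.
    return [[m + b for m, b in zip(row, v)] for row in M]
```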
  • the method of activating the function operation is performed using the circuit device:
  • the activation circuit of the main processing circuit passes each value of the input vector through an activation function (the input of the activation function is a value and the output is also a value), and outputs a value to the corresponding position of the output vector;
  • the activation function can be a piecewise linear function
  • the activation function can be any function that inputs a number and outputs a number.
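The activation step maps each input value through a scalar-to-scalar function; a minimal sketch using a piecewise linear function (ReLU) as the example activation (the default function is an assumption, since the embodiment allows any number-in, number-out function):

```python
def activate(vec, f=lambda x: x if x > 0 else 0.0):
    # Each value of the input vector passes through the activation function,
    # and the result is written to the corresponding output position.
    return [f(x) for x in vec]
```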
  • the source of the input vector is (including but not limited to):
  • the input data comes from the result of the device performing a matrix-multiply-vector operation
  • the input data comes from the result of the device performing a matrix-multiply-matrix operation
  • the input data comes from a calculation result of the main processing circuit of the device
  • the input data comes from the calculation result after the main processing circuit of the device implements the offset operation.
  • GEMM calculation refers to the operation of matrix-matrix multiplication in the BLAS library.
  • auxiliary integers are used as parameters to describe the width and height of the matrices A and B;
  • the main processing circuit can perform data type conversion on the input matrix S and the matrix P before performing the OP operation;
  • the conversion circuit of the main processing circuit performs respective op operations on the input matrix S and the matrix P;
  • op can be a transposition of the matrix; the matrix transposition operation can be implemented using the vector operation function or the data rearrangement function of the main processing circuit (as mentioned above, the main processing circuit has a circuit for data rearrangement);
  • the above op can also be implemented directly by the conversion circuit; for example, for the matrix transposition operation, the op operation is implemented directly by the matrix transposition circuit;
  • the op of a certain matrix may be empty, in which case the op operation is not performed;
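The GEMM operation described above follows the BLAS convention C = alpha*op(A)*op(B) + beta*C; a pure-software sketch (op is modeled here as an optional transpose flag; names are illustrative):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gemm(alpha, A, B, beta, C, op_a=None, op_b=None):
    # op may be a transposition (handled by the conversion circuit) or empty.
    opA = [list(r) for r in zip(*A)] if op_a == "T" else A
    opB = [list(r) for r in zip(*B)] if op_b == "T" else B
    P = matmul(opA, opB)
    return [[alpha * p + beta * c for p, c in zip(pr, cr)] for pr, cr in zip(P, C)]
```

For example, with alpha = 2, beta = 1, B the identity, and C all ones, the result is 2*A plus one in every position.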
  • GEMV calculation refers to the operation of matrix-vector multiplication in the BLAS library.
  • the main processing circuit can perform data type conversion on the input matrix S and the matrix P before performing the OP operation;
  • the conversion circuit of the main processing circuit performs a corresponding op operation on the input matrix S;
  • the op can be a transposition operation of the matrix; the matrix transposition operation is implemented by the matrix transposition circuit of the main processing circuit;
  • the op of the matrix may be empty, in which case the op operation is not performed;
  • the matrix-vector multiplication calculation between the matrix op(S) and the vector P is completed by the matrix multiplication vector calculation method;
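Likewise, GEMV follows the BLAS convention y = alpha*op(S)*P + beta*y; a sketch under the same assumptions as the GEMM illustration:

```python
def gemv(alpha, S, p, beta, y, op_s=None):
    # op(S) may be S transposed or S unchanged; the op may also be empty.
    opS = [list(r) for r in zip(*S)] if op_s == "T" else S
    return [alpha * sum(s * x for s, x in zip(row, p)) + beta * yi
            for row, yi in zip(opS, y)]
```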
  • the vector updater circuit of the main processing circuit is used to implement the weight update function in the neural network training process.
  • the weight update refers to a method of updating the weight using the gradient of the weight.
  • the vector operator circuit of the main processing circuit is used to add and subtract the two vectors of the weight and the weight gradient to obtain an operation result, and the operation result is an update weight.
  • the vector operator circuit of the main processing circuit multiplies or divides the weight and the weight gradient by a number to obtain an intermediate weight and an intermediate weight gradient value; the vector operator circuit then adds and subtracts the intermediate weight and the intermediate weight gradient value to obtain the operation result, and the operation result is the updated weight.
  • a set of momentum values can be calculated using the weight gradient, and the updated weight is then obtained by adding and subtracting the momentum and the weight.
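The momentum-based variant of the weight update can be sketched as follows (the learning rate and momentum coefficient are illustrative assumptions):

```python
def update_weights(w, grad, velocity, lr=0.1, momentum=0.9):
    # A momentum term is computed from the weight gradient (scalar multiply
    # plus vector add), then subtracted from the weight to update it.
    velocity = [momentum * v + lr * g for v, g in zip(velocity, grad)]
    new_w = [wi - vi for wi, vi in zip(w, velocity)]
    return new_w, velocity
```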
  • the reverse operation of the fully connected layer can be divided into two parts: as shown in the figure below, the solid arrow indicates the forward calculation process of the fully connected layer, and the broken line indicates the reverse calculation process of the fully connected layer.
  • the method of performing matrix multiplication operation using the device can complete the inverse operation of the fully connected layer
  • the inverse operation of the convolutional layer can be divided into two parts: the solid arrow in Fig. 9a indicates the forward calculation process of the convolutional layer, and Fig. 9b represents the reverse calculation process of the convolutional layer.
  • the reverse operation of the convolutional layer can be accomplished using the apparatus shown in Figure 1a or the apparatus shown in Figure 1b.
  • a forward operation or a reverse operation actually comprises a plurality of neural network operations, including but not limited to: matrix-multiply-matrix, matrix-multiply-vector, convolution operation, activation operation, and the like;
  • the manner of performing the above operations is described elsewhere in the present disclosure and will not be repeated here.
  • Embodiments of the present disclosure provide a neural network processor board card that can be used in numerous general purpose or dedicated computing system environments or configurations, for example: personal computers (PCs), server computers, handheld or portable devices, tablet devices, smart homes, home appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronics devices, small computers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
  • FIG. 10a is a schematic structural diagram of a neural network processor card according to an embodiment of the present disclosure.
  • the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.
  • the disclosure is not limited to the specific structure of the neural network chip package structure 11.
  • the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
  • the specific form of the neural network chip 111 involved in the disclosure is not limited.
  • the above neural network chip 111 includes, but is not limited to, a chip integrating a neural network processor; the chip may be made of silicon, germanium, quantum materials, molecular materials, or the like.
  • the neural network chip can be packaged according to the actual situation (for example, a harsh environment) and different application requirements, so that most of the chip is enclosed, while the pads on the chip are connected, through conductors such as gold wires, to the outside of the package structure for electrical connection with external circuitry.
  • the specific structure of the neural network chip 111 is not limited in this disclosure; optionally, it may be the device shown in FIG. 1a or FIG. 1b.
  • the present disclosure does not limit the types of the first substrate 13 and the second substrate 113; they may be printed circuit boards (PCB), printed wiring boards (PWB), or other circuit boards. There is also no restriction on the materials used to make the PCB.
  • the second substrate 113 of the present disclosure is used to carry the neural network chip 111, and the neural network chip package structure 11 is obtained by connecting the above-mentioned neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112.
  • the neural network chip 111 is thereby protected, and the neural network chip package structure 11 is further packaged together with the first substrate 13.
  • FCBGAP (Flip Chip Ball Grid Array Package)
  • LQFP (Low-profile Quad Flat Package)
  • HQFP (Quad Flat Package with Heat Sink)
  • FBGA (Fine-Pitch Ball Grid Array Package)
  • Flip Chip packaging is suitable where the requirement on the packaged area is high, or where the design is sensitive to lead inductance and signal transmission time.
  • wire bonding can be used to reduce the cost and increase the flexibility of the package structure.
  • Ball Grid Array packaging can provide more pins with a short average lead length, enabling high-speed signal transmission.
  • alternatively, the package may use Pin Grid Array (PGA), Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA), and so on.
  • optionally, the neural network chip 111 and the second substrate 113 are encapsulated using a Flip Chip Ball Grid Array package.
  • as shown in FIG. 11a, the neural network chip package structure includes a neural network chip 21, a pad 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
  • the pad 22 is connected to the neural network chip 21, and the solder balls 23 are soldered between the pad 22 and the connection points 25 on the second substrate 24 to connect the neural network chip 21 and the second substrate 24, thereby realizing the packaging of the neural network chip 21.
  • the pins 26 are used for connection with a circuit external to the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling transmission of external and internal data and facilitating data processing by the neural network chip 21 or its corresponding neural network processor.
  • the type and number of pins are not limited in this disclosure. Different pin types may be selected according to different packaging technologies, and are arranged according to certain rules.
  • the neural network chip package structure further includes an insulating filler disposed in the gaps between the pad 22, the solder balls 23, and the connection points 25 for preventing interference between solder balls.
  • the material of the insulating filler may be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductance interference and the like.
  • the neural network chip package structure further includes a heat dissipation device for dissipating heat of the neural network chip 21 during operation.
  • the heat dissipation device may be a piece of metal with good thermal conductivity, a heat sink, or a heat-dissipating device such as a fan.
  • the neural network chip package structure 11 includes: a neural network chip 21, a pad 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal housing heat sink 29.
  • the thermal grease 28 and the metal housing heat sink 29 are used to dissipate the heat of the neural network chip 21 during operation.
  • the neural network chip package structure 11 further includes a reinforcing structure that is connected to the pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the pad 22.
  • the reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
  • the present disclosure does not limit the specific form of the first electrical and non-electrical connection device 12; reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be fixed by soldering, or the second substrate 113 and the first substrate 13 may be connected by wiring or by a pluggable connection, the latter facilitating subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
  • optionally, the first substrate 13 includes an interface for a memory unit to expand storage capacity, for example, Synchronous Dynamic Random Access Memory (SDRAM) or Double Data Rate SDRAM (DDR SDRAM); expanding the memory improves the processing capability of the neural network processor.
  • the first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, and the like, used for data transmission between the package structure and external circuits, which can improve operating speed and convenience of operation.
  • the neural network processor is packaged into the neural network chip 111, the neural network chip 111 is packaged into the neural network chip package structure 11, and the neural network chip package structure 11 is packaged into the neural network processor board 10, which exchanges data with an external circuit (for example, a computer motherboard) through an interface on the board (a slot or ferrule); that is, the neural network processor board 10 directly implements the function of the neural network processor while protecting the neural network chip 111.
  • other modules can be added to the neural network processor board 10, which improves the application range and computational efficiency of the neural network processor.
  • the present disclosure discloses an electronic device that includes the neural network processor board 10 or neural network chip package structure 11 described above.
  • Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, video cameras, camcorders, projectors, watches, headphones, mobile storage devices, wearables, vehicles, household appliances, and/or medical equipment.
  • the vehicle includes an airplane, a ship, and/or a car;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood;
  • the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound scanner, and/or an electrocardiograph.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

An integrated circuit chip device and a related product. The integrated circuit chip device comprises a main processing circuit and a plurality of basic processing circuits; the main processing circuit, or at least one of the plurality of basic processing circuits, comprises a compression mapping circuit configured to perform compression processing of data in a neural network operation.

Description

Integrated circuit chip device, board card and related products

Technical field

The present disclosure relates to the field of neural networks, and in particular to an integrated circuit chip device, a board card, and related products.

Background

An Artificial Neural Network (ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the neuron network of the human brain from the perspective of information processing, establishes a simple model, and forms different networks according to different connection modes. In engineering and academia it is often referred to simply as a neural network. A neural network is a computational model consisting of a large number of interconnected nodes (or neurons). Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such operations involve a large amount of computation and high power consumption.

Summary of the invention

Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed and efficiency of a computing device.

In a first aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits. The main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing of data in a neural network operation.

The plurality of basic processing circuits are distributed in an array. Each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column.

The main processing circuit is configured to perform the successive operations in the neural network operation and to transmit data to the basic processing circuits connected to it.

The plurality of basic processing circuits are configured to perform operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.
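The disclosure does not fix the exact compression scheme used by the mapping circuits. One plausible realization for sparse neural-network data is a mask-plus-values encoding, sketched below in NumPy (the scheme and the function names `compress`/`decompress` are assumptions for illustration only):

```python
import numpy as np

# Hypothetical mask-based compression of a sparse data block: keep a boolean
# mask marking the nonzero positions plus the packed nonzero values, so only
# the nonzero elements need to be transmitted and operated on.
def compress(block):
    mask = block != 0
    return mask, block[mask]

# Reconstruct the dense block from the mask and the packed values.
def decompress(mask, values):
    block = np.zeros(mask.shape, dtype=values.dtype)
    block[mask] = values
    return block
```

For a mostly-zero block, the transmitted payload (mask bits plus nonzero values) is much smaller than the dense block, which is the source of the transmission and computation savings described above.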

In a second aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits. The main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing of data in a neural network operation.

The plurality of basic processing circuits are distributed in an array. Each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column.

The main processing circuit is configured to: acquire an input data block, a convolution kernel data block, and a convolution instruction; divide the input data block into vertical data blocks and the convolution kernel data block into horizontal data blocks according to the convolution instruction; determine, according to the operation control of the convolution instruction, to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data blocks and/or the vertical data blocks; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the convolution instruction.

The plurality of basic processing circuits are configured to: determine, according to the operation control of the convolution instruction, whether to start the second mapping circuit to process a second data block; perform operations in the neural network in parallel according to the processed second data block to obtain operation results; and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. The second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and it is associated with the processed first data block.

The main processing circuit is configured to process the operation results to obtain the instruction result of the convolution instruction.
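A convolution instruction of this kind is commonly executed by lowering the convolution to a matrix multiplication: the input window at each output position becomes one column (the "vertical" data) and the flattened kernel becomes a row (the "horizontal" data). A single-channel, stride-1 NumPy sketch, with shapes and the blocking chosen only for illustration and not taken from the disclosure:

```python
import numpy as np

# Convolution lowered to a matrix multiplication (the im2col technique),
# the standard way a matrix-multiply array can execute a convolution.
def conv2d_via_matmul(x, k):
    # x: (H, W) single-channel input, k: (kh, kw) kernel, stride 1, no padding
    H, W = x.shape
    kh, kw = k.shape
    oh, ow = H - kh + 1, W - kw + 1
    # Each output position becomes one column of patch values ("vertical" data).
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # The flattened kernel is a row ("horizontal" data); conv = row @ columns.
    return (k.ravel() @ cols).reshape(oh, ow)
```

With this lowering, distributing the kernel rows and the patch columns among the basic processing circuits reduces the convolution to the same block-distribution pattern used for matrix multiplication.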

In a third aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits. The main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing of data in a neural network operation.

The plurality of basic processing circuits are distributed in an array. Each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column.

The main processing circuit is configured to: acquire an input data block, a weight data block, and a multiplication instruction; divide the input data block into horizontal data blocks and the weight data block into vertical data blocks according to the multiplication instruction; determine, according to the operation control of the multiplication instruction, to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data blocks and/or the vertical data blocks; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the multiplication instruction.

The plurality of basic processing circuits are configured to: determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process a second data block; perform operations in the neural network in parallel according to the processed second data block to obtain operation results; and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. The second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and it is associated with the processed first data block.

The main processing circuit is configured to process the operation results to obtain the instruction result of the multiplication instruction.
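The division into horizontal and vertical data blocks can be pictured as a tiled matrix multiply: each (row-block, column-block) pair is an independent tile that one basic processing circuit could evaluate in parallel. A sequential NumPy sketch of the same decomposition (the block sizes `br`, `bc` are illustrative assumptions):

```python
import numpy as np

# Blocked matrix multiply: the input A is split into horizontal (row) blocks,
# the weights B into vertical (column) blocks. Each (row-block, column-block)
# pair yields one output tile; the tiles are mutually independent, so each
# could be assigned to a different basic processing circuit.
def blocked_matmul(A, B, br, bc):
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(0, m, br):          # one horizontal block of A
        for j in range(0, n, bc):      # one vertical block of B
            # the work a single basic processing circuit would perform
            C[i:i + br, j:j + bc] = A[i:i + br, :] @ B[:, j:j + bc]
    return C
```

The result is identical to the ordinary product; only the order and ownership of the partial computations change.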

In a fourth aspect, an integrated circuit chip device is provided for performing a neural network forward operation, the neural network including n layers. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits. The main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing of data in a neural network operation.

The plurality of basic processing circuits are distributed in an array. Each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column.

The main processing circuit is configured to receive a forward operation instruction and parse the forward operation instruction to obtain the first operation instruction included in the i-th layer of the neural network forward operation, as well as the input data block and the weight data block required by the first operation instruction, where i is an integer greater than or equal to 1 and less than or equal to n; when i is greater than or equal to 2, the input data block is the output data block of the (i-1)-th layer.

The main processing circuit is further configured to: divide the input data block into vertical data blocks and the weight data block into horizontal data blocks according to the first operation instruction; determine, according to the operation control of the first operation instruction, whether to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data blocks and/or the vertical data blocks; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the forward operation instruction.

The plurality of basic processing circuits are configured to: determine, according to the operation control of the first operation instruction, whether to start the second mapping circuit to process a second data block; perform operations in the neural network in parallel according to the processed second data block to obtain operation results; and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. The second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and it is associated with the processed first data block.

The main processing circuit is configured to process the operation results to obtain the instruction result of the first operation instruction, completing the operation of the first operation instruction included in the i-th layer.

In a fifth aspect, an integrated circuit chip device is provided for performing training of a neural network, the neural network including n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits. The main processing circuit includes a first mapping circuit, and at least one of the plurality of basic processing circuits includes a second mapping circuit; both the first mapping circuit and the second mapping circuit are configured to perform compression processing of data in a neural network operation.

The plurality of basic processing circuits are distributed in an array. Each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column.

The integrated circuit chip device is configured to receive a training instruction, determine the first-layer input data and the first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation.

The main processing circuit is further configured to: obtain an n-th output result gradient according to the n-th output result; acquire, according to the training instruction, the n-th reverse operation instruction of the n-th layer reverse operation as well as the n-th layer input data and the n-th layer weight group data required by the n-th reverse operation instruction; divide the n-th output result gradient, the n-th layer input data, and the n-th layer weight group data into vertical data blocks and horizontal data blocks according to the n-th reverse operation instruction; determine, according to the operation control of the n-th reverse operation instruction, to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data blocks and/or the vertical data blocks; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the n-th reverse operation instruction.

The plurality of basic processing circuits are configured to: determine, according to the operation control of the n-th reverse operation instruction, whether to start the second mapping circuit to process a second data block; perform operations in the neural network in parallel according to the processed second data block to obtain operation results; and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. The second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and it is associated with the processed first data block.

The main processing circuit is further configured to process the operation results to obtain the n-th layer weight group gradient and the n-th layer input data gradient, and to update the n-th layer weight group data using the n-th layer weight group gradient.

The integrated circuit chip device is further configured to use the n-th layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer, perform the reverse operations of the remaining n-1 layers to obtain the weight group gradients of those layers, and update the weight group data of each corresponding layer using its weight group gradient, where the weight group data includes at least two weights.
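The training flow just described — an n-layer forward pass, an output-result gradient, then a layer-by-layer reverse pass that produces a weight-group gradient and an input-data gradient at each layer and updates that layer's weights — can be sketched with plain linear layers and a squared-error loss (both simplifications are assumptions for illustration, not taken from the disclosure):

```python
import numpy as np

# One training step over an n-layer stack of linear layers.
def train_step(weights, x, target, lr=0.01):
    activations = [x]
    for W in weights:                      # n-layer forward operation
        activations.append(activations[-1] @ W)
    grad = activations[-1] - target        # n-th output result gradient
    for i in reversed(range(len(weights))):
        dW = activations[i].T @ grad       # layer-i weight-group gradient
        grad = grad @ weights[i].T         # layer-i input-data gradient
        weights[i] -= lr * dW              # update this layer's weight group
    return weights
```

At each layer the input-data gradient of layer i becomes the output-result gradient of layer i-1, exactly the chaining described above.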

In a sixth aspect, a neural network processor board card is provided. The neural network processor board card includes a neural network chip package structure, a first electrical and non-electrical connection device, and a first substrate. The neural network chip package structure includes a neural network chip, a second electrical and non-electrical connection device, and a second substrate; the second substrate carries the neural network chip, and the second substrate is connected to the neural network chip through the second electrical and non-electrical connection device.

The neural network chip includes the integrated circuit chip device provided by any one of the first to fifth aspects.

In a seventh aspect, a neural network computing device is provided, which includes the integrated circuit chip device provided by any one of the first to fifth aspects.

In an eighth aspect, a combined processing device is provided. The combined processing device includes the neural network computing device provided by the seventh aspect, a universal interconnection interface, and a general-purpose processing device.

The neural network computing device is connected to the general-purpose processing device through the universal interconnection interface.

In a ninth aspect, a chip is provided, which integrates the device provided by any one of the first to eighth aspects.

In a tenth aspect, an electronic device is provided, which includes the chip of the ninth aspect.

In an eleventh aspect, a neural network operation method is provided. The method is applied in an integrated circuit chip device that includes the integrated circuit chip device of any one of the first to fifth aspects, the integrated circuit chip device being configured to perform the operations of a neural network.

It can be seen that, in the embodiments of the present disclosure, a compression mapping circuit is provided to compress data blocks before the operations are performed, which saves transmission resources and computing resources; the device therefore has the advantages of low power consumption and a small amount of computation.

Brief description of the drawings

图1a是一种集成电路芯片装置结构示意图。1a is a schematic structural view of an integrated circuit chip device.

图1b是另一种集成电路芯片装置结构示意图。FIG. 1b is a schematic structural view of another integrated circuit chip device.

图1c是一种基础处理电路的结构示意图。Figure 1c is a schematic structural view of a basic processing circuit.

图1d是一种主处理电路的结构示意图。Figure 1d is a schematic diagram of the structure of a main processing circuit.

图2a是一种基础处理电路的使用方法示意图。Figure 2a is a schematic diagram of the use of a basic processing circuit.

图2b是一种主处理电路传输数据示意图。Figure 2b is a schematic diagram of the data transmitted by the main processing circuit.

图2c是矩阵乘以向量的示意图。Figure 2c is a schematic diagram of a matrix multiplied by a vector.

图2d是一种集成电路芯片装置结构示意图。2d is a schematic structural view of an integrated circuit chip device.

图2e是又一种集成电路芯片装置结构示意图。2e is a schematic structural view of still another integrated circuit chip device.

图2f是矩阵乘以矩阵的示意图。Figure 2f is a schematic diagram of a matrix multiplied by a matrix.

图3a为卷积输入数据示意图。Figure 3a is a schematic diagram of convolution input data.

图3b为卷积核示意图。Figure 3b is a schematic diagram of a convolution kernel.

图3c为输入数据的一个三维数据块的运算窗口示意图。Figure 3c is a schematic diagram of the operation window of a three-dimensional data block of input data.

图3d为输入数据的一个三维数据块的另一运算窗口示意图。Figure 3d is a schematic diagram of another operational window of a three-dimensional data block of input data.

图3e为输入数据的一个三维数据块的又一运算窗口示意图。Figure 3e is a schematic diagram of still another operational window of a three-dimensional data block of input data.

图4a是一种神经网络的训练方法示意图。Figure 4a is a schematic diagram of a training method of a neural network.

图4b是一种神经网络的正向运算示意图。Figure 4b is a schematic diagram of the forward operation of a neural network.

图4c是一种神经网络运算的示意图。Figure 4c is a schematic diagram of a neural network operation.

图5a-图5b为本申请实施例提供的两种映射电路的结构示意图。FIG. 5a and FIG. 5b are schematic structural diagrams of two mapping circuits provided by an embodiment of the present application.

图6a为矩阵乘以矩阵的方法流程图。Figure 6a is a flow chart of a method of multiplying a matrix by a matrix.

图6b为矩阵乘以向量的方法流程图。Figure 6b is a flow chart of a method of multiplying a matrix by a vector.

图7a为一种神经网络训练示意图。Figure 7a is a schematic diagram of neural network training.

图7b为另一种神经网络训练示意图。Figure 7b is a schematic diagram of another neural network training.

图7c为神经网络正向与反向运算示意图。Figure 7c is a schematic diagram of the forward and reverse operations of the neural network.

图7d为神经网络训练多层结构示意图。Figure 7d is a schematic diagram of a multi-layer structure of neural network training.

图8为本披露实施例提供的一种神经网络芯片的结构示意图。FIG. 8 is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure.

图9a为本披露提供的一种组合处理装置的结构示意图。FIG. 9a is a schematic structural diagram of a combined processing apparatus provided by the present disclosure.

图9b为本披露提供的另一种组合处理装置的结构示意图。FIG. 9b is a schematic structural diagram of another combined processing apparatus provided by the present disclosure.

图10a为本披露实施例提供的一种神经网络处理器板卡的结构示意图。FIG. 10a is a schematic structural diagram of a neural network processor board card provided by an embodiment of the present disclosure.

图10b为本披露实施例提供的一种神经网络芯片封装结构的结构示意图。FIG. 10b is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure.

图11a为本披露实施例提供的一种神经网络芯片封装结构的示意图；FIG. 11a is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure;

图11b为本披露实施例提供的另一种神经网络芯片封装结构的示意图。FIG. 11b is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.

具体实施方式DETAILED DESCRIPTION

为了使本技术领域的人员更好地理解本披露方案，下面将结合本披露实施例中的附图，对本披露实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本披露一部分实施例，而不是全部的实施例。基于本披露中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本披露保护的范围。In order to enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Apparently, the described embodiments are merely a part, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.

在第一方面提供的装置中，该装置包括：主处理电路以及多个基础处理电路；所述主处理电路包括第一映射电路，所述多个基础处理电路中至少一个电路包括第二映射电路，所述第一映射电路以及所述第二映射电路均用于执行神经网络运算中的各个数据的压缩处理；In the apparatus provided by the first aspect, the apparatus includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and both the first mapping circuit and the second mapping circuit are configured to perform compression processing on each piece of data in a neural network operation;

所述多个基础处理电路呈阵列分布；每个基础处理电路与相邻的其他基础处理电路连接，所述主处理电路连接第1行的n个基础处理电路、第m行的n个基础处理电路以及第1列的m个基础处理电路；The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;

所述主处理电路，用于执行神经网络运算中的各个连续的运算，以及与其相连的所述基础处理电路传输数据；The main processing circuit is configured to perform each successive operation in the neural network operation and to transmit data with the basic processing circuits connected thereto;

所述多个基础处理电路,用于依据传输的数据以并行方式执行神经网络中的运算,并将运算结果通过与所述主处理电路连接的基础处理电路传输给所述主处理电路。The plurality of basic processing circuits are configured to perform an operation in the neural network in a parallel manner according to the transmitted data, and transmit the operation result to the main processing circuit through a basic processing circuit connected to the main processing circuit.

在第一方面提供的装置中，所述主处理电路用于获取待计算的数据块以及运算指令，依据所述运算指令将所述待计算的数据块划分为横向数据块和竖向数据块；将所述横向数据块和预存的所述横向数据块关联的标识数据块进行拆分处理得到多个基本数据块以及所述基本数据块关联的标识数据块；将所述多个基本数据块以及所述多个基本数据块各自关联的标识数据块分发至与其连接的基础处理电路；将所述竖向数据块以及该竖向数据块关联的标识数据块广播至与其连接的基础处理电路。其中，所述标识数据块具体可用直接索引或者步长索引的方式来表示，可选的还可用列表的列表(List of Lists,LIL)、坐标列表(Coordinate list,COO)、压缩稀疏行(Compressed Sparse Row,CSR)、压缩稀疏列(Compressed Sparse Column,CSC)、(ELL Pack,ELL)以及混合(Hybrid,HYB)等方式表示，本申请不做限定。In the apparatus provided by the first aspect, the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; split the horizontal data block and a pre-stored identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks associated with the basic data blocks; distribute the plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks to the basic processing circuits connected thereto; and broadcast the vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto. The identification data block may specifically be represented by a direct index or a stride index; optionally, it may also be represented by a List of Lists (LIL), a Coordinate list (COO), a Compressed Sparse Row (CSR), a Compressed Sparse Column (CSC), an ELL Pack (ELL), a Hybrid (HYB), or the like, which is not limited in this application.
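下面给出一个说明性的示意代码（仅为在所述稀疏表示假设下的草图，并非专利电路的实现，函数名均为本示例自拟），展示标识数据块的直接索引（0/1掩码）表示与上文列举的COO、CSR两种稀疏表示。An illustrative sketch (under the stated sparse-representation assumptions, not an implementation of the patented circuit; function names are ours) showing the direct-index (0/1 mask) representation of an identification data block alongside the COO and CSR representations listed above:

```python
# Illustrative only: three interchangeable ways to represent which entries of a
# data block are retained after compression. Names are ours, not from the patent.

def direct_index_mask(block, threshold):
    """Direct index: 1 where |value| > threshold, 0 otherwise."""
    return [[1 if abs(v) > threshold else 0 for v in row] for row in block]

def to_coo(block, threshold):
    """Coordinate list (COO): (row, col, value) triples of retained entries."""
    return [(i, j, v)
            for i, row in enumerate(block)
            for j, v in enumerate(row)
            if abs(v) > threshold]

def to_csr(block, threshold):
    """Compressed Sparse Row (CSR): values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in block:
        for j, v in enumerate(row):
            if abs(v) > threshold:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

block = [[1.0, 0.0, 0.02],
         [0.0, 0.5, 0.0]]
print(direct_index_mask(block, 0.05))  # [[1, 0, 0], [0, 1, 0]]
print(to_coo(block, 0.05))             # [(0, 0, 1.0), (1, 1, 0.5)]
print(to_csr(block, 0.05))             # ([1.0, 0.5], [0, 1], [0, 1, 2])
```

三种表示携带相同的稀疏位置信息，区别仅在存储布局；直接索引便于逐元素逻辑运算，CSR则更省空间。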

以所述标识数据块用直接索引的方式表示为例，所述标识数据块具体可为是由0和1构成的数据块，其中，0表示数据块中包含的数据(如权值或输入神经元)的绝对值小于或等于第一阈值，1表示数据块中包含的数据(如权值或输入神经元)的绝对值大于第一阈值，第一阈值为用户侧或装置侧自定义随机设置的，例如0.05、0等等。Taking the identification data block represented by a direct index as an example, the identification data block may specifically be a data block composed of 0s and 1s, where 0 indicates that the absolute value of the data (such as a weight or an input neuron) contained in the data block is less than or equal to a first threshold, and 1 indicates that the absolute value of the data (such as a weight or an input neuron) contained in the data block is greater than the first threshold; the first threshold is custom-set on the user side or the device side, for example, 0.05, 0, and so on.

为节省数据传输量、提高数据传输效率，在所述主处理电路向所述基础处理电路发送数据的过程中，具体可将所述多个基本数据块中的目标数据以及所述多个基本数据块各自关联的标识数据块分发至与其连接的基础处理电路；可选的，还可将所述处理后的竖向数据块中的目标数据以及该竖向数据块关联的标识数据块广播至与其连接的基础处理电路。其中，所述目标数据是指数据块中绝对值大于第一阈值的数据，或者是指数据块(这里具体可为处理后的横向数据块或处理后的竖向数据块)中的非0数据。In order to reduce the amount of data transmitted and improve data transmission efficiency, in the process in which the main processing circuit sends data to the basic processing circuits, the target data in the plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks may specifically be distributed to the basic processing circuits connected thereto; optionally, the target data in the processed vertical data block and the identification data block associated with the vertical data block may also be broadcast to the basic processing circuits connected thereto. The target data refers to the data in a data block whose absolute value is greater than the first threshold, or to the non-zero data in a data block (here, specifically, the processed horizontal data block or the processed vertical data block).
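上述“只传目标数据加标识数据块”的节省方式可以用如下示意代码表达（仅为假设性的草图，函数名为本示例自拟，并非专利装置的实现）：发送端仅传送掩码中标为1的数值，接收端按掩码将其还原为稠密数据块。The transmission saving described above can be sketched as follows (a hypothetical sketch, names are ours, not the patented device): the sender transmits only the values flagged 1 in the mask, and the receiver scatters them back into a dense block.

```python
# Illustrative only: transmit target data plus the 0/1 identification block,
# instead of the full dense block. Names are ours, not from the patent.

def compress(block, mask):
    """Keep only the entries flagged 1 in the identification (mask) block."""
    return [v for row, mrow in zip(block, mask)
              for v, m in zip(row, mrow) if m == 1]

def decompress(target_data, mask):
    """Rebuild a dense block: flagged positions take the next target value."""
    it = iter(target_data)
    return [[next(it) if m == 1 else 0.0 for m in mrow] for mrow in mask]

block = [[0.8, 0.01, 0.0],
         [0.0, 0.3, 0.9]]
mask  = [[1, 0, 0],
         [0, 1, 1]]             # positions where |v| > 0.05
sent = compress(block, mask)     # only 3 of the 6 values travel
print(sent)                      # [0.8, 0.3, 0.9]
print(decompress(sent, mask))    # [[0.8, 0.0, 0.0], [0.0, 0.3, 0.9]]
```

稀疏程度越高，节省的传输量越大：这里6个数值只需传送3个数值加6比特掩码。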

相应地，所述基础处理电路用于启动所述第二映射电路根据所述竖向数据块关联的标识数据块和所述基本数据块关联的标识数据块获得连接标识数据块；根据所述连接标识数据块对所述竖向数据块和所述基本数据块进行处理得到处理后的竖向数据块和基本数据块；对所述处理后的竖向数据块和基本数据块执行内积运算得到运算结果，将所述运算结果发送至所述主处理电路；Correspondingly, the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block; and perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit;

所述主处理电路,用于对所述运算结果处理得到所述待计算的数据块以及运算指令的指令结果。The main processing circuit is configured to process the operation result to obtain the data block to be calculated and the instruction result of the operation instruction.

例如，横向数据块为M1行N1列的矩阵，基本数据块为M2行N2列的矩阵，其中M1>M2，N1>N2。相应地，该横向数据块关联的标识数据块同样为M1行N1列的矩阵，该基本数据块关联的标识数据块同样为M2行N2列的矩阵。以基本数据块为2*2的矩阵为例，设为

Figure PCTCN2019076088-appb-000001

第一阈值为0.05，则该基本数据块关联的标识数据块为

Figure PCTCN2019076088-appb-000002

关于第一映射电路和第二映射电路对数据块的处理将在后文进行具体阐述。For example, the horizontal data block is a matrix of M1 rows and N1 columns, and the basic data block is a matrix of M2 rows and N2 columns, where M1>M2 and N1>N2. Correspondingly, the identification data block associated with the horizontal data block is likewise a matrix of M1 rows and N1 columns, and the identification data block associated with the basic data block is likewise a matrix of M2 rows and N2 columns. Taking a 2*2 basic data block as an example, set as shown in Figure PCTCN2019076088-appb-000001 above, with the first threshold being 0.05, the identification data block associated with the basic data block is as shown in Figure PCTCN2019076088-appb-000002. The processing of the data blocks by the first mapping circuit and the second mapping circuit will be described in detail later.

在第一方面提供的装置中，所述主处理电路，用于获取待计算的数据块以及运算指令，依据所述运算指令将所述待计算的数据块划分为横向数据块和竖向数据块；启动所述第一映射电路对所述横向数据块和所述竖向数据块进行处理得到处理后的横向数据块以及该横向数据块关联的标识数据块，处理后的竖向数据块以及该竖向数据块关联的标识数据块；将所述处理后的横向数据块以及该横向数据块关联的标识数据块进行拆分处理得到多个基本数据块以及所述基本数据块各自关联的标识数据块，将所述多个基本数据块以及所述多个基本数据块各自关联的标识数据块分发至与其连接的基础处理电路，将所述竖向数据块以及该竖向数据块关联的标识数据块广播至与其连接的基础处理电路；In the apparatus provided by the first aspect, the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; start the first mapping circuit to process the horizontal data block and the vertical data block to obtain a processed horizontal data block and an identification data block associated with the horizontal data block, as well as a processed vertical data block and an identification data block associated with the vertical data block; split the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distribute the plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks to the basic processing circuits connected thereto; and broadcast the vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;

所述基础处理电路，用于启动所述第二映射电路根据所述竖向数据块关联的标识数据块和所述基本数据块关联的标识数据块获得连接标识数据块；根据所述连接标识数据块对所述竖向数据块和所述基本数据块进行处理得到处理后的竖向数据块和基本数据块；对所述处理后的竖向数据块和基本数据块执行内积运算得到运算结果，将所述运算结果发送至所述主处理电路；The basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block; and perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit;

所述主处理电路,用于对所述运算结果处理得到所述待计算的数据块以及运算指令的指令结果。The main processing circuit is configured to process the operation result to obtain the data block to be calculated and the instruction result of the operation instruction.

在可选实施例中，所述主处理电路，还具体用于将所述竖向数据块或处理后的竖向数据块以及该竖向数据块关联的标识数据块进行拆分处理得到多个部分竖向数据块以及所述多个部分竖向数据块各自关联的标识数据块；将所述多个部分竖向数据块以及所述多个部分竖向数据块各自关联的标识数据块通过一次或多次广播给所述基础处理电路；其中，所述多个部分竖向数据块组合形成所述竖向数据块或处理后的竖向数据块。In an optional embodiment, the main processing circuit is further specifically configured to split the vertical data block or the processed vertical data block, together with the identification data block associated with the vertical data block, to obtain a plurality of partial vertical data blocks and the identification data blocks respectively associated with the plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks and the identification data blocks respectively associated with the plurality of partial vertical data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.

相应地，所述基础处理电路，具体用于启动所述第二映射电路根据所述部分竖向数据块关联的标识数据块以及所述基本数据块关联的标识数据块得到连接标识数据块；根据所述连接标识数据块对所述部分竖向数据块以及所述基本数据块进行处理得到处理后的部分竖向数据块以及处理后的基本数据块；对所述处理后的部分竖向数据块以及所述处理后的基本数据块执行内积运算。Correspondingly, the basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data block to obtain a processed partial vertical data block and a processed basic data block; and perform an inner product operation on the processed partial vertical data block and the processed basic data block.

其中，该连接标识数据块是通过对所述基本数据块关联的标识数据块和所述部分竖向数据块关联的标识数据块进行逐元素与操作而获得的数据块。可选的，该连接标识数据块用于表示两个数据块(具体为基本数据块以及竖向数据块)中对应位置的数据的绝对值均大于第一阈值的数据。具体在后文进行详述。The connection identification data block is a data block obtained by performing an element-wise AND operation on the identification data block associated with the basic data block and the identification data block associated with the partial vertical data block. Optionally, the connection identification data block is used to indicate the positions at which the data in both data blocks (specifically, the basic data block and the vertical data block) have absolute values greater than the first threshold. Details will be described later.
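逐元素与操作得到连接标识数据块、再按其筛选两侧数据参与内积的过程，可用如下示意代码表达（仅为在两掩码同形状假设下的草图，变量名为本示例自拟，并非专利电路的实现）。The element-wise AND that forms the connection identification data block, and the subsequent masked inner product, can be sketched as follows (a sketch assuming the two masks share a shape; names are ours, not the patented circuit):

```python
# Illustrative only: form the connection identification block by element-wise
# AND, then keep only the data positions both operands retain.

def and_masks(mask_a, mask_b):
    """Element-wise AND of two 0/1 identification data blocks."""
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]

def apply_mask(block, mask):
    """Zero out the entries not flagged by the mask."""
    return [[v if m == 1 else 0.0 for v, m in zip(row, mrow)]
            for row, mrow in zip(block, mask)]

basic_mask    = [[1, 0], [1, 1]]
vertical_mask = [[1, 1], [0, 1]]
conn = and_masks(basic_mask, vertical_mask)
print(conn)  # [[1, 0], [0, 1]]

basic    = [[2.0, 0.0], [4.0, 5.0]]
vertical = [[1.0, 3.0], [0.0, 6.0]]
# Inner product over only the positions both operands retain:
inner = sum(b * v
            for rb, rv, rm in zip(apply_mask(basic, conn),
                                  apply_mask(vertical, conn), conn)
            for b, v, m in zip(rb, rv, rm) if m == 1)
print(inner)  # 2.0*1.0 + 5.0*6.0 = 32.0
```

按连接标识数据块筛选后，只有两侧均保留的位置才参与乘加，从而减少计算量。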

例如，横向数据块关联的标识数据块为2*3的矩阵

Figure PCTCN2019076088-appb-000003

部分竖向数据块关联的标识数据块为2*2的矩阵

Figure PCTCN2019076088-appb-000004

则对应获得的连接标识数据块为

Figure PCTCN2019076088-appb-000005

For example, the identification data block associated with the horizontal data block is the 2*3 matrix shown in Figure PCTCN2019076088-appb-000003, the identification data block associated with the partial vertical data block is the 2*2 matrix shown in Figure PCTCN2019076088-appb-000004, and the correspondingly obtained connection identification data block is as shown in Figure PCTCN2019076088-appb-000005.

在第一方面提供的装置中，所述主处理电路，用于获取待计算的数据块以及运算指令，依据所述运算指令将所述待计算的数据块划分为横向数据块和竖向数据块；启动所述第一映射电路对所述横向数据块进行处理得到处理后的横向数据块以及该横向数据块关联的标识数据块，或者启动所述第一映射电路根据预存的所述横向数据块关联的标识数据块对所述横向数据块进行处理得到处理后的横向数据块；将所述处理后的横向数据块以及该横向数据块关联的标识数据块进行拆分处理得到多个基本数据块以及所述基本数据块各自关联的标识数据块，将所述多个基本数据块以及所述多个基本数据块各自关联的标识数据块分发至与其连接的基础处理电路，将所述竖向数据块广播至与其连接的基础处理电路；In the apparatus provided by the first aspect, the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; start the first mapping circuit to process the horizontal data block to obtain a processed horizontal data block and an identification data block associated with the horizontal data block, or start the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block to obtain a processed horizontal data block; split the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distribute the plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks to the basic processing circuits connected thereto; and broadcast the vertical data block to the basic processing circuits connected thereto;

所述基础处理电路，用于启动所述第二映射电路根据所述基本数据块关联的标识数据块对所述竖向数据块进行处理，得到处理后的竖向数据块；对所述处理后的竖向数据块和所述处理后的基本数据块执行内积运算得到运算结果，将所述运算结果发送至所述主处理电路；The basic processing circuit is configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block; and perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit;

所述主处理电路,用于对所述运算结果处理得到指令结果。The main processing circuit is configured to process the operation result to obtain an instruction result.

在可选实施例中，所述主处理电路，还具体用于将所述竖向数据块进行拆分处理得到多个部分竖向数据块；将所述多个部分竖向数据块通过一次或多次广播给所述基础处理电路；其中，所述多个部分竖向数据块组合形成所述竖向数据块或处理后的竖向数据块。In an optional embodiment, the main processing circuit is further specifically configured to split the vertical data block to obtain a plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.

相应地，所述基础处理电路具体用于根据所述基本数据块关联的标识数据块对所述部分竖向数据块进行处理得到处理后的部分竖向数据块；对所述基本数据块以及所述处理后的部分竖向数据块执行内积运算。Correspondingly, the basic processing circuit is specifically configured to process the partial vertical data block according to the identification data block associated with the basic data block to obtain a processed partial vertical data block, and perform an inner product operation on the basic data block and the processed partial vertical data block.

在第一方面提供的装置中，所述主处理电路，用于获取待计算的数据块以及运算指令，依据所述运算指令将所述待计算的数据块划分为横向数据块和竖向数据块；启动所述第一映射电路对所述竖向数据块进行处理，得到处理后的竖向数据块以及该竖向数据块关联的标识数据块，或者启动所述第一映射电路根据预存的所述竖向数据块关联的标识数据块对所述竖向数据块进行处理得到处理后的竖向数据块；对所述横向数据块进行拆分处理得到多个基本数据块；将所述多个基本数据块分发至与其连接的基础处理电路，将所述处理后的竖向数据块以及该竖向数据块关联的标识数据块广播至与其连接的基础处理电路；In the apparatus provided by the first aspect, the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; start the first mapping circuit to process the vertical data block to obtain a processed vertical data block and an identification data block associated with the vertical data block, or start the first mapping circuit to process the vertical data block according to a pre-stored identification data block associated with the vertical data block to obtain a processed vertical data block; split the horizontal data block to obtain a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected thereto; and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;

所述基础处理电路，用于启动所述第二映射电路根据所述竖向数据块关联的标识数据块对所述基本数据块进行处理得到处理后的基本数据块；对所述处理后的竖向数据块和所述处理后的基本数据块执行内积运算得到运算结果，将所述运算结果发送至所述主处理电路；The basic processing circuit is configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block; and perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result, and send the operation result to the main processing circuit;

所述主处理电路,用于对所述运算结果处理得到指令结果。The main processing circuit is configured to process the operation result to obtain an instruction result.

在可选实施例中，所述主处理电路，还具体用于将所述处理后的竖向数据块和该竖向数据块关联的标识数据块进行拆分处理得到多个部分竖向数据块以及所述多个部分竖向数据块关联的标识数据块；将所述多个部分竖向数据块以及所述多个部分竖向数据块各自关联的标识数据块通过一次或多次广播给所述基础处理电路；其中，所述多个部分竖向数据块组合形成所述竖向数据块或处理后的竖向数据块。In an optional embodiment, the main processing circuit is further specifically configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks and the identification data blocks respectively associated with the plurality of partial vertical data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks are combined to form the vertical data block or the processed vertical data block.

相应地，所述基础处理电路具体用于根据所述部分竖向数据块关联的标识数据块对所述基本数据块进行处理得到处理后的基本数据块；对所述处理后的基本数据块以及所述部分竖向数据块执行内积运算。Correspondingly, the basic processing circuit is specifically configured to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and perform an inner product operation on the processed basic data block and the partial vertical data block.

在第一方面提供的装置中，所述主处理电路，具体用于将该竖向数据块(具体可为所述竖向数据块或者处理后的竖向数据块)通过一次广播发送至与其连接的所述基础处理电路。In the apparatus provided by the first aspect, the main processing circuit is specifically configured to send the vertical data block (specifically, the vertical data block or the processed vertical data block) to the basic processing circuits connected thereto in a single broadcast.

在第一方面提供的装置中，所述基础处理电路，具体用于将该基本数据块(同理可为所述基本数据块或处理后的基本数据块)与该竖向数据块执行内积处理得到内积处理结果，将所述内积处理结果累加得到运算结果，将所述运算结果发送至所述主处理电路。In the apparatus provided by the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the basic data block (likewise, the basic data block or the processed basic data block) and the vertical data block to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and send the operation result to the main processing circuit.

在第一方面提供的装置中，所述主处理电路，用于在如所述运算结果为内积处理的结果时，对所述运算结果累加后得到累加结果，将该累加结果排列得到所述待计算的数据块以及运算指令的指令结果。In the apparatus provided by the first aspect, the main processing circuit is configured to, when the operation result is a result of inner product processing, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be calculated and the operation instruction.

在第一方面提供的装置中，所述主处理电路，具体用于将所述竖向数据块分成多个部分竖向数据块，将所述多个部分竖向数据块通过多次广播至所述基础处理电路；所述多个部分竖向数据块组合形成所述竖向数据块。In the apparatus provided by the first aspect, the main processing circuit is specifically configured to divide the vertical data block into a plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks to the basic processing circuits in a plurality of broadcasts; the plurality of partial vertical data blocks are combined to form the vertical data block.

在第一方面提供的装置中，所述基础处理电路，具体用于将该部分竖向数据块(具体可为部分竖向数据块或者处理后的部分竖向数据块)与该基本数据块执行一次内积处理后得到内积处理结果，将所述内积处理结果累加得到部分运算结果，将所述部分运算结果发送至所述主处理电路。上述内积处理具体可以为：如部分广播数据块的元素为矩阵B的前2个元素，即为b10和b11；该基本数据块为输入数据矩阵A的第一行的前2个元素，即a10和a11，则内积=a10*b10+a11*b11。In the apparatus provided by the first aspect, the basic processing circuit is specifically configured to perform inner product processing once on the partial vertical data block (specifically, a partial vertical data block or a processed partial vertical data block) and the basic data block to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main processing circuit. The inner product processing may specifically be: if the elements of the partial broadcast data block are the first two elements of matrix B, namely b10 and b11, and the basic data block consists of the first two elements of the first row of input data matrix A, namely a10 and a11, then the inner product = a10*b10+a11*b11.

又如这里基本数据块以核3*3为例，该部分竖向数据块以3*3矩阵为例，分别对3*3矩阵与核3*3执行对应位置的乘法运算，则对应获得3个内积处理结果，将该3个内积处理结果累加得到部分运算结果。3个内积处理结果Out0(3*3矩阵第0行与核3*3第0行的内积)、Out1(3*3矩阵第1行与核3*3第1行的内积)、Out2(3*3矩阵第2行与核3*3第2行的内积)具体可以为：As another example, taking a 3*3 kernel as the basic data block and a 3*3 matrix as the partial vertical data block, a position-wise multiplication is performed between the 3*3 matrix and the 3*3 kernel, yielding three inner product processing results, which are accumulated to obtain the partial operation result. The three inner product processing results Out0 (the inner product of row 0 of the 3*3 matrix and row 0 of the 3*3 kernel), Out1 (the inner product of row 1 of the 3*3 matrix and row 1 of the 3*3 kernel), and Out2 (the inner product of row 2 of the 3*3 matrix and row 2 of the 3*3 kernel) may specifically be:

Out0=r00*k0[0]+r01*k0[1]+r02*k0[2]Out0=r00*k0[0]+r01*k0[1]+r02*k0[2]

Out1=r10*k1[0]+r11*k1[1]+r12*k1[2]Out1=r10*k1[0]+r11*k1[1]+r12*k1[2]

Out2=r20*k2[0]+r21*k2[1]+r22*k2[2]Out2=r20*k2[0]+r21*k2[1]+r22*k2[2]

其中，r00的r表示部分竖向数据块，00表示第0行第0列的元素。Here, the r in r00 denotes the partial vertical data block, and 00 denotes the element in row 0, column 0.

k0[0]的k表示基本数据块，0[0]表示第0行的第0列元素；The k in k0[0] denotes the basic data block, and 0[0] denotes the element in row 0, column 0;

部分运算结果=Out0+Out1+Out2。Partial operation result = Out0+Out1+Out2.
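上式可用如下示意代码验证（数值为本示例自拟的假设值，仅说明三次行内积与累加的计算方式，并非专利装置的实现）。The formulas above can be reproduced with the following sketch (the values are hypothetical, chosen for illustration only; this is not the patented device):

```python
# The three row-wise inner products Out0..Out2 and their accumulation,
# computed on hypothetical values for the 3*3 partial vertical data block r
# and the 3*3 kernel k.

r = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1]]

outs = [sum(r[i][j] * k[i][j] for j in range(3)) for i in range(3)]
# Out0 = 1*1 + 2*0 + 3*1 = 4
# Out1 = 4*0 + 5*1 + 6*0 = 5
# Out2 = 7*1 + 8*0 + 9*1 = 16
partial_result = sum(outs)     # 部分运算结果 = Out0 + Out1 + Out2
print(outs, partial_result)    # [4, 5, 16] 25
```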

在第一方面提供的装置中，所述基础处理电路，具体用于复用n次该部分竖向数据块，执行该部分竖向数据块与该n个基本数据块的内积运算得到n个部分处理结果，将n个部分处理结果分别累加后得到n个部分运算结果，将所述n个部分运算结果发送至主处理电路，所述n为大于等于2的整数。In the apparatus provided by the first aspect, the basic processing circuit is specifically configured to reuse the partial vertical data block n times, perform inner product operations between the partial vertical data block and the n basic data blocks to obtain n partial processing results, accumulate the n partial processing results respectively to obtain n partial operation results, and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.

这里基本数据块以p个核3*3为例，该部分竖向数据块以3*3矩阵为例，复用p次该3*3矩阵，分别与p个核3*3执行对应位置乘法，每次运算对应得到3个内积结果，组成一组内积运算结果；将p组中每组的3个内积结果累加，得到p个部分运算结果。Here, taking p 3*3 kernels as the basic data blocks and a 3*3 matrix as the partial vertical data block, the 3*3 matrix is reused p times to perform position-wise multiplications with the p 3*3 kernels respectively; each operation yields three inner product results, which form one group of inner product operation results, and accumulating the three inner product results of each of the p groups yields p partial operation results.
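“复用一次广播的数据块、对p个核分别求一组内积并累加”的过程可用如下示意代码表达（数值与函数名均为本示例自拟的假设，仅为草图）。The reuse of one broadcast block against p kernels can be sketched as follows (values and names are hypothetical; a sketch only):

```python
# Illustrative only: one broadcast 3*3 block is reused against p kernels; each
# kernel yields one group of three inner product results, accumulated into one
# of the p partial operation results.

def partial_results(block, kernels):
    results = []
    for kern in kernels:                                 # block reused p times
        group = [sum(b * w for b, w in zip(brow, krow))  # 3 row inner products
                 for brow, krow in zip(block, kern)]
        results.append(sum(group))                       # accumulate the group
    return results

block = [[1, 1, 1],
         [2, 2, 2],
         [3, 3, 3]]
kernels = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],   # hypothetical kernel 0
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],   # hypothetical kernel 1
]
print(partial_results(block, kernels))  # [6, 18]
```

这样同一份广播数据只需传输一次即可服务p个核的运算，进一步减少了广播开销。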

在第一方面提供的装置中,所述主处理电路包括:主寄存器或主片上缓存电路;In the apparatus provided by the first aspect, the main processing circuit includes: a main register or a main on-chip buffer circuit;

所述基础处理电路包括:基本寄存器或基本片上缓存电路。The basic processing circuit includes a basic register or a basic on-chip buffer circuit.

在第一方面提供的装置中，所述主处理电路包括：向量运算器电路、算数逻辑单元电路、累加器电路、矩阵转置电路、直接内存存取电路、第一映射电路或数据重排电路中的一种或任意组合。In the apparatus provided by the first aspect, the main processing circuit includes one or any combination of: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a first mapping circuit, or a data rearrangement circuit.

在第一方面提供的装置中，所述基础处理电路，还具体用于将该竖向数据块和基本数据块转发给其他基础处理电路以先进行数据处理再执行内积运算得到运算结果，将所述运算结果发送至所述主处理电路；In the apparatus provided by the first aspect, the basic processing circuit is further specifically configured to forward the vertical data block and the basic data block to other basic processing circuits, which first perform data processing and then perform an inner product operation to obtain an operation result, and send the operation result to the main processing circuit;

所述主处理电路,用于对所述运算结果处理得到所述待计算的数据块以及运算指令的指令结果。The main processing circuit is configured to process the operation result to obtain the data block to be calculated and the instruction result of the operation instruction.

在第一方面提供的装置中,所述数据块可用张量表示,其具体可为:向量、矩阵、三维数据块、四维数据块以及n维数据块中一种或任意组合。In the apparatus provided by the first aspect, the data block may be represented by a tensor, which may specifically be: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.

在第一方面提供的装置中,如所述运算指令为乘法指令,所述主处理电路确定乘数数据块为竖向数据块,被乘数数据块为横向数据块;In the apparatus provided by the first aspect, the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is a vertical data block, and the multiplicand data block is a horizontal data block;

或如所述运算指令为卷积指令,所述主处理电路确定卷积输入数据块为竖向数据块,卷积核为横向数据块。Or if the operation instruction is a convolution instruction, the main processing circuit determines that the convolution input data block is a vertical data block, and the convolution kernel is a horizontal data block.

在第六方面提供的方法中，所述神经网络的运算包括：卷积运算、矩阵乘矩阵运算、矩阵乘向量运算、偏置运算、全连接运算、GEMM运算、GEMV运算、激活运算中的一种或任意组合。In the method provided by the sixth aspect, the operation of the neural network includes one or any combination of: a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.

本发明涉及的运算指令包括但不限于卷积指令、乘法指令、正向运算指令以及训练指令等,上文涉及的內积运算具体可为该运算指令所指示的运算。例如,当运算指令为卷积指令时,上文的內积运算为卷积运算。当运算指令为乘法指令时,上文的內积运算为乘法运算。当运算指令为正向运算指令时,上文的內积运算为正向运算。当运算指令为训练指令时,上文的內积运算为反向运算。The operation instructions according to the present invention include, but are not limited to, a convolution instruction, a multiplication instruction, a forward operation instruction, and a training instruction. The inner product operation referred to above may specifically be an operation indicated by the operation instruction. For example, when the operation instruction is a convolution instruction, the inner product operation above is a convolution operation. When the operation instruction is a multiplication instruction, the inner product operation above is a multiplication operation. When the operation instruction is a forward operation instruction, the inner product operation above is a forward operation. When the operation instruction is a training instruction, the inner product operation above is a reverse operation.

Specifically, when the operation instruction is a convolution instruction, in the apparatus provided in the first aspect, the main processing circuit is configured to: obtain an input data block, a convolution kernel data block, and a convolution instruction; divide, according to the convolution instruction, the input data block into the vertical data block and the convolution kernel data block into the horizontal data block; determine, according to the operation control of the convolution instruction, to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send, according to the convolution instruction, the processed first data block to at least one of the basic processing circuits connected to the main processing circuit;

The plurality of basic processing circuits are configured to: determine, according to the operation control of the convolution instruction, whether to start the second mapping circuit to process a second data block; execute the operations of the neural network in parallel on the processed second data block to obtain an operation result; and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines it has received from the main processing circuit, and the second data block is associated with the processed first data block;

The main processing circuit is configured to process the operation result to obtain the instruction result of the convolution instruction.
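The data flow above — split the horizontal data block into basic data blocks and distribute them, broadcast the vertical data block, have each basic processing circuit compute inner products, then assemble the partial results — can be sketched as a behavioural software model. This is a sketch under assumed shapes, not the chip's implementation; `tiled_multiply` is a name invented for illustration:

```python
import numpy as np

def tiled_multiply(horizontal, vertical, num_basic_circuits):
    """Main-circuit model: split the horizontal data block into basic
    data blocks (one group of rows per basic processing circuit),
    broadcast the vertical data block to all of them, and assemble
    the partial results."""
    basic_blocks = np.array_split(horizontal, num_basic_circuits, axis=0)
    # Each "basic processing circuit" computes its inner products; the
    # hardware would do this in parallel, here it is simulated in a loop.
    partial_results = [block @ vertical for block in basic_blocks]
    return np.vstack(partial_results)

kernel = np.arange(12.0).reshape(4, 3)   # horizontal data block (distributed)
inputs = np.arange(6.0).reshape(3, 2)    # vertical data block (broadcast)
assert np.allclose(tiled_multiply(kernel, inputs, 2), kernel @ inputs)
```

Because the vertical data block is broadcast whole, each basic processing circuit only needs its own row group of the horizontal data block to produce its slice of the result.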

In the apparatus provided in the first aspect, when the first data block includes the horizontal data block and the vertical data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining a processed horizontal data block with its associated identification data block and a processed vertical data block with its associated identification data block; split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to the main processing circuit;

The basic processing circuit is specifically configured to: start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block, obtaining a processed vertical data block and a processed basic data block; perform a convolution operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit;

The main processing circuit is configured to process the operation result to obtain the instruction result.
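One plausible reading of the identification data blocks is as sparsity masks: the first mapping circuit records which elements of a data block are non-zero, and the second mapping circuit intersects the two masks (the connection identification data block) so that only positions that are non-zero in both operands take part in the inner product. A minimal sketch under that assumption (`make_mask` and `masked_inner_product` are invented names):

```python
import numpy as np

def make_mask(block):
    # First-mapping-circuit model: mark the non-zero positions of a
    # data block (its "identification data block" under this reading).
    return block != 0

def masked_inner_product(vertical, v_mask, basic, b_mask):
    # Second-mapping-circuit model: the connection identification mask
    # keeps positions non-zero in BOTH operands; the inner product then
    # runs only over those surviving positions.
    connection = v_mask & b_mask
    return float(np.sum(vertical[connection] * basic[connection]))

v = np.array([0.0, 2.0, 3.0, 0.0])   # vertical data block
b = np.array([5.0, 7.0, 0.0, 9.0])   # basic data block
result = masked_inner_product(v, make_mask(v), b, make_mask(b))
assert result == 14.0   # only index 1 survives the connection mask: 2 * 7
```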

In the apparatus provided in the first aspect, when the first data block includes the horizontal data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block, obtaining a processed horizontal data block and its associated identification data block, or start the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block, obtaining a processed horizontal data block; split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the vertical data block to the basic processing circuits connected to the main processing circuit;

The basic processing circuit is specifically configured to: start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, obtaining a processed vertical data block; perform a convolution operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.

In an optional embodiment, the main processing circuit is further specifically configured to: split the vertical data block, or the processed vertical data block together with its associated identification data block, into a plurality of partial vertical data blocks and the identification data block associated with each partial vertical data block; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.

Correspondingly, the basic processing circuit is specifically configured to: start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data block, obtaining a processed partial vertical data block and a processed basic data block; and perform a convolution operation on the processed partial vertical data block and the processed basic data block.
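Broadcasting the vertical data block as several partial vertical data blocks amounts to accumulating partial inner products round by round on each basic processing circuit. A sketch of that accumulation (round count and the function name `broadcast_in_rounds` are assumptions of this illustration):

```python
import numpy as np

def broadcast_in_rounds(vertical, basic, num_rounds):
    """Model: the main circuit splits the vertical data block into
    partial vertical data blocks and broadcasts one per round; the
    basic processing circuit accumulates partial inner products."""
    vertical_parts = np.array_split(vertical, num_rounds)
    basic_parts = np.array_split(basic, num_rounds)
    accumulator = 0.0
    for v_part, b_part in zip(vertical_parts, basic_parts):
        accumulator += float(v_part @ b_part)   # one broadcast round
    return accumulator

v = np.array([1.0, 2.0, 3.0, 4.0])   # full vertical data block
b = np.array([5.0, 6.0, 7.0, 8.0])   # basic data block held locally
assert broadcast_in_rounds(v, b, 2) == float(v @ b)   # 70.0 either way
```

Splitting the broadcast trades bandwidth per round for more rounds; the accumulated result is identical to a single full broadcast.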

In the apparatus provided in the first aspect, when the first data block includes the vertical data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the vertical data block, obtaining a processed vertical data block and its associated identification data block, or start the first mapping circuit to process the vertical data block according to a pre-stored identification data block associated with the vertical data block, obtaining a processed vertical data block; split the horizontal data block into a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to the main processing circuit;

The basic processing circuit is specifically configured to: start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block, obtaining a processed basic data block; perform an inner product operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.

When the operation instruction is a multiplication instruction, in the apparatus provided in the first aspect, the main processing circuit is configured to: obtain an input data block, a weight data block, and a multiplication instruction; divide, according to the multiplication instruction, the input data block into the horizontal data block and the weight data block into the vertical data block; determine, according to the operation control of the multiplication instruction, to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send, according to the multiplication instruction, the processed first data block to at least one of the basic processing circuits connected to the main processing circuit. The plurality of basic processing circuits are configured to: determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process a second data block; execute the operations of the neural network in parallel on the processed second data block to obtain an operation result; and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines it has received from the main processing circuit, and the second data block is associated with the processed first data block. The main processing circuit is configured to process the operation result to obtain the instruction result of the multiplication instruction.

In the apparatus provided in the first aspect, when the first data block includes the horizontal data block and the vertical data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining a processed horizontal data block with its associated identification data block and a processed vertical data block with its associated identification data block; split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to the main processing circuit;

The basic processing circuit is specifically configured to: start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block, obtaining a processed vertical data block and a processed basic data block; perform a multiplication operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit;

The main processing circuit is configured to process the operation result to obtain the instruction result.

In the apparatus provided in the first aspect, when the first data block includes the horizontal data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block, obtaining a processed horizontal data block and its associated identification data block, or start the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block, obtaining a processed horizontal data block; split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the vertical data block to the basic processing circuits connected to the main processing circuit;

The basic processing circuit is specifically configured to: start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, obtaining a processed vertical data block; perform a multiplication operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.

In an optional embodiment, the main processing circuit is further specifically configured to: split the vertical data block, or the processed vertical data block together with its associated identification data block, into a plurality of partial vertical data blocks and the identification data block associated with each partial vertical data block; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.

Correspondingly, the basic processing circuit is specifically configured to: start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data block, obtaining a processed partial vertical data block and a processed basic data block; and perform a multiplication operation on the processed partial vertical data block and the processed basic data block.

When the operation instruction is a forward operation instruction, in the apparatus provided in the first aspect, the main processing circuit is configured to receive a forward operation instruction and parse it to obtain the first operation instruction included in the i-th layer of the neural network forward operation, together with the input data block and weight data block required by the first operation instruction, where i is an integer greater than or equal to 1 and less than or equal to n; if i is greater than or equal to 2, the input data block is the output data block of the (i-1)-th layer;

The main processing circuit is further configured to: divide, according to the first operation instruction, the input data block into the vertical data block and the weight data block into the horizontal data block; determine, according to the operation control of the first operation instruction, whether to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send, according to the forward operation instruction, the processed first data block to at least one of the basic processing circuits connected to the main processing circuit;

The plurality of basic processing circuits are configured to: determine, according to the operation control of the first operation instruction, whether to start the second mapping circuit to process a second data block; execute the operations of the neural network in parallel on the processed second data block to obtain an operation result; and transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines it has received from the main processing circuit, and the second data block is associated with the processed first data block;

The main processing circuit is configured to process the operation result to obtain the instruction result of the first operation instruction, completing the operation of the first operation instruction included in the i-th layer.
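The layer indexing convention above — for i ≥ 2, the input data block of layer i is the output data block of layer i-1 — is a plain feed-forward chain. A sketch with an assumed per-layer operation (a bare matrix multiply; real layers would also apply bias, activation, and so on, and `forward_n_layers` is an invented name):

```python
import numpy as np

def forward_n_layers(first_input, weight_blocks):
    """Run an n-layer forward operation: the output data block of layer
    i-1 becomes the input data block of layer i."""
    data = first_input
    for weights in weight_blocks:   # layer i = 1 .. n
        data = data @ weights       # the layer's first operation instruction
    return data

x = np.ones((1, 3))
w1 = np.ones((3, 2))
w2 = np.ones((2, 1))
out = forward_n_layers(x, [w1, w2])
assert out.shape == (1, 1) and out[0, 0] == 6.0
```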

In the apparatus provided in the first aspect, when the first data block includes the horizontal data block and the vertical data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining a processed horizontal data block with its associated identification data block and a processed vertical data block with its associated identification data block; split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to the main processing circuit;

The basic processing circuit is specifically configured to: start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block, obtaining a processed vertical data block and a processed basic data block; perform a forward operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit; the forward operation includes, but is not limited to, one of, or any combination of: a convolution operation (that is, an inner product operation), a product operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation;

The main processing circuit is configured to process the operation result to obtain the instruction result.

In the apparatus provided in the first aspect, when the first data block includes the horizontal data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block, obtaining a processed horizontal data block and its associated identification data block, or start the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block, obtaining a processed horizontal data block; split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and the identification data block associated with each basic data block; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the vertical data block to the basic processing circuits connected to the main processing circuit;

The basic processing circuit is specifically configured to: start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, obtaining a processed vertical data block; perform a forward operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.

In an optional embodiment, the main processing circuit is further specifically configured to: split the vertical data block, or the processed vertical data block together with its associated identification data block, into a plurality of partial vertical data blocks and the identification data block associated with each partial vertical data block; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.

Correspondingly, the basic processing circuit is specifically configured to: start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data block, obtaining a processed partial vertical data block and a processed basic data block; and perform a forward operation on the processed partial vertical data block and the processed basic data block.

In the apparatus provided in the first aspect, when the first data block includes the vertical data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the vertical data block, obtaining a processed vertical data block and its associated identification data block, or start the first mapping circuit to process the vertical data block according to a pre-stored identification data block associated with the vertical data block, obtaining a processed vertical data block; split the horizontal data block into a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to the main processing circuit;

The basic processing circuit is specifically configured to: start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block, obtaining a processed basic data block; perform a forward operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.

In an optional embodiment, the main processing circuit is further specifically configured to: split the processed vertical data block and its associated identification data block into a plurality of partial vertical data blocks and the identification data block associated with each partial vertical data block; and broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.

Correspondingly, the basic processing circuit is specifically configured to process the basic data block according to the identification data block associated with the partial vertical data block, obtaining a processed basic data block, and to perform a forward operation on the processed basic data block and the partial vertical data block.

In the apparatus provided in the first aspect, the operations of the i-th layer further include one of, or any combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
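These additional per-layer operations compose naturally after the inner-product stage. A minimal sketch of a bias operation followed by an activation operation (ReLU is assumed here purely for illustration, and `layer_postprocess` is an invented name):

```python
import numpy as np

def layer_postprocess(matmul_result, bias):
    # Bias operation followed by an activation operation (ReLU assumed):
    # applied element-wise to the inner-product stage's output.
    return np.maximum(matmul_result + bias, 0)

r = np.array([[-2.0, 3.0]])            # output of the inner-product stage
post = layer_postprocess(r, np.array([1.0, 1.0]))
assert np.array_equal(post, np.array([[0.0, 4.0]]))
```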

当运算指令为训练指令时,在第一方面提供的装置中,所述集成电路芯片装置,用于接收训练指令,依据该训练指令确定第一层输入数据和第一层权值组数据,对第一层输入数据和第一层权值组数据执行神经网络的n层正向运算得到正向运算的第n输出结果;When the operation instruction is a training instruction, in the apparatus provided by the first aspect, the integrated circuit chip device is configured to receive a training instruction, and determine first layer input data and first layer weight group data according to the training instruction, The first layer of input data and the first layer of weight group data perform an n-th layer forward operation of the neural network to obtain an nth output result of the forward operation;

所述主处理电路,还用于依据所述第n输出结果得到第n输出结果梯度,依据所述训练指令获取第n层反向运算的第n反向运算指令以及所述第n反向运算指令所需的第n层输入数据以及第n层权值组数据;依据所述第n反向运算指令将所述第n输出结果梯度、第n层输入数据以及第n层权值组数据划分为竖向数据块和横向数据块;依据所述第n反向运算指令的运算控制确定启动第一映射电路对第一数据块进行处理,得到处理后的第一数据块;所述第一数据块包括所述横向数据块和/或所述竖向数据块;依据所述第n反向运算指令将处理后的第一数据块发送至与所述主处理电路相连的基础处理电路中的至少一个基础处理电路;The main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and obtain an nth reverse operation nth inverse operation instruction and the nth inverse operation according to the training instruction The nth layer input data and the nth layer weight group data required by the instruction; dividing the nth output result gradient, the nth layer input data, and the nth layer weight group data according to the nth reverse operation instruction a vertical data block and a horizontal data block; determining, according to the operation control of the nth reverse operation instruction, starting the first mapping circuit to process the first data block, and obtaining the processed first data block; the first data The block includes the horizontal data block and/or the vertical data block; transmitting the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the nth reverse operation instruction a basic processing circuit;

the plurality of basic processing circuits are configured to determine, according to the operation control of the n-th reverse operation instruction, whether to start the second mapping circuit to process a second data block, perform operations in the neural network in parallel according to the processed second data block to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit, where the second data block is the data block that the basic processing circuit determines to receive from the main processing circuit, and the second data block is associated with the processed first data block;

the main processing circuit is further configured to process the operation results to obtain an n-th-layer weight group gradient and an n-th-layer input data gradient, and to update the n-th-layer weight group data by applying the n-th-layer weight group gradient;

the integrated circuit chip device is further configured to use the n-th-layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer to perform reverse operations for the remaining n-1 layers, obtaining the weight group gradients of those n-1 layers, and to update the weight group data of each corresponding layer by applying its weight group gradient, where the weight group data comprises at least two weights.

In the apparatus provided by the first aspect, when the first data block comprises the horizontal data block and the vertical data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block and the vertical data block to obtain a processed horizontal data block together with an identification data block associated with the horizontal data block, and a processed vertical data block together with an identification data block associated with the vertical data block; split the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distribute the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected to the main processing circuit;

the basic processing circuit is specifically configured to: start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block; perform a reverse operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit, where the reverse operation includes, but is not limited to, one or any combination of: a convolution operation (that is, an inner product operation), a product operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation;
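The derivation of the connection identification data block from the two identification data blocks can be read as an element-wise AND of the two 0/1 masks, after which only the positions marked by the combined mask take part in the reverse operation. The following is a minimal Python sketch under that reading; the function names are illustrative and not taken from the disclosure:

```python
def connection_mask(mask_a, mask_b):
    # A position survives only if it is marked significant in the
    # identification masks of BOTH operand blocks.
    return [[a & b for a, b in zip(ra, rb)]
            for ra, rb in zip(mask_a, mask_b)]

def apply_mask(block, mask):
    # Zero the positions the combined mask excludes, so the reverse
    # operation only multiplies the surviving value pairs.
    return [[v if m else 0 for v, m in zip(rv, rm)]
            for rv, rm in zip(block, mask)]
```

With the combined mask, the basic processing circuit multiplies only the pairs where both masks are 1, which is what makes the compression pay off in the inner product.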

the main processing circuit is configured to process the operation result to obtain the instruction result.

In the apparatus provided by the first aspect, when the first data block comprises the horizontal data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block to obtain a processed horizontal data block and an identification data block associated with the horizontal data block, or start the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block to obtain a processed horizontal data block; split the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distribute the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the vertical data block to the basic processing circuits connected to the main processing circuit;

the basic processing circuit is specifically configured to: start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block; perform a reverse operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.

In an optional embodiment, the main processing circuit is further specifically configured to split the vertical data block, or the processed vertical data block and the identification data block associated with the vertical data block, to obtain a plurality of partial vertical data blocks and the identification data blocks respectively associated with the plurality of partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks and their respectively associated identification data blocks to the basic processing circuits in one or more broadcasts, where the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.

Correspondingly, the basic processing circuit is specifically configured to: start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; process the partial vertical data block and the basic data block according to the connection identification data to obtain a processed partial vertical data block and a processed basic data block; and perform a reverse operation on the processed partial vertical data block and the processed basic data block.

In the apparatus provided by the first aspect, when the first data block comprises the vertical data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the vertical data block to obtain a processed vertical data block and an identification data block associated with the vertical data block, or start the first mapping circuit to process the vertical data block according to a pre-stored identification data block associated with the vertical data block to obtain a processed vertical data block; split the horizontal data block to obtain a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected to the main processing circuit;

the basic processing circuit is specifically configured to: start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block; perform a reverse operation on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.

In an optional embodiment, the main processing circuit is further specifically configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the plurality of partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks and their respectively associated identification data blocks to the basic processing circuits in one or more broadcasts, where the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.

Correspondingly, the basic processing circuit is specifically configured to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and to perform a reverse operation on the processed basic data block and the partial vertical data block.

In the apparatus provided by the first aspect, the reverse operation of the n layers further comprises one or any combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.

In the apparatus provided by the first aspect, the n-th output result gradient is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;

the n-th-layer input data may be represented by a tensor, which may specifically be one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;

the weight group data of the n layers may be represented by tensors, which may specifically be one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.

Referring to FIG. 1a, FIG. 1a shows an integrated circuit chip device provided by the present disclosure. The integrated circuit chip device comprises a main processing circuit and a plurality of basic processing circuits arranged in an array (an m*n array), where m and n are integers greater than or equal to 1 and at least one of m and n is greater than or equal to 2. Among the plurality of basic processing circuits distributed in the m*n array, each basic processing circuit is connected to its adjacent basic processing circuits, and the main processing circuit is connected to k of the plurality of basic processing circuits, where the k basic processing circuits may be: the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column. In the integrated circuit chip device shown in FIG. 1a, the main processing circuit includes a first mapping circuit configured to compress data to obtain processed data and identification data, the identification data indicating whether the absolute value of a data value is greater than a first threshold. Further, the main processing circuit may send only the processed data (specifically, the data whose absolute value is greater than the first threshold) and the identification data associated with that data to the basic processing circuits. The advantage is that the amount of data sent to the basic processing circuits for processing is reduced, which improves the data processing rate. The first threshold is a user-defined or device-defined setting, for example 0.05 or 0.5, and is not limited here.

For example, suppose the input data of the main processing circuit is the matrix data block shown in Figure PCTCN2019076088-appb-000006. After processing by the first mapping circuit, the processed matrix data block shown in Figure PCTCN2019076088-appb-000007 is obtained, and the identification data block associated with the matrix data block is the one shown in Figure PCTCN2019076088-appb-000008. The specific processing of the first mapping circuit will be described in detail later.

Correspondingly, when the main processing circuit distributes data to the basic processing circuits, it may send only the two values 1 and 0.5 rather than all eight values of the processed matrix data block; it also sends the identification data block associated with the matrix data block to the basic processing circuit, so that the basic processing circuit can determine, from the received identification data block and the two received values (1 and 0.5), the positions of these two values in the original matrix data block. That is, the basic processing circuit can restore the matrix data block processed in the main processing circuit according to the received identification data block and the received data.
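The compress-then-restore exchange described above can be sketched as follows. This is a minimal Python illustration rather than the circuit implementation; the function names and the default threshold value are assumptions made for the example. The first mapping circuit keeps only the values whose absolute value exceeds the first threshold together with a 0/1 identification mask, and the basic processing circuit rebuilds the block from the mask and the surviving values:

```python
def compress(block, threshold=0.05):
    # First mapping circuit: keep only values with |v| > threshold,
    # and record a 0/1 identification mask of the kept positions.
    mask = [[1 if abs(v) > threshold else 0 for v in row] for row in block]
    values = [v for row, mrow in zip(block, mask)
              for v, m in zip(row, mrow) if m]
    return values, mask

def restore(values, mask):
    # Basic processing circuit: put each received value back at the
    # position its mask bit marks; every other position becomes 0.
    it = iter(values)
    return [[next(it) if m else 0 for m in mrow] for mrow in mask]
```

With a 2x4 block holding only the values 1 and 0.5, just those two values and the eight mask bits travel to the basic processing circuit instead of all eight values.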

At least one of the plurality of basic processing circuits (that is, some or all of the basic processing circuits) may include a second mapping circuit. Specifically, some of the plurality of basic processing circuits may include the second mapping circuit; for example, in an optional scheme, the k basic processing circuits may be configured with second mapping circuits, so that each of the n basic processing circuits can be responsible for the compression processing step on the data of the m basic processing circuits of its column. This arrangement improves operation efficiency and reduces power consumption: since the n basic processing circuits of the first row are the first to receive the data sent by the main processing circuit, compressing the received data there reduces the amount of computation in the subsequent basic processing circuits and the amount of data transmitted to them. Likewise, configuring second mapping circuits for the m basic processing circuits of the first column has the advantages of a small amount of computation and low power consumption. In addition, with this structure the main processing circuit can adopt a dynamic data transmission strategy; for example, the main processing circuit broadcasts data to the m basic processing circuits of the first column and distributes data to the n basic processing circuits of the first row. The specific processing of the second mapping circuit will be described in detail later.

The main processing circuit is configured to perform the successive operations in the neural network operation and to transmit data with the basic processing circuits connected to it; the successive operations include, but are not limited to, accumulation operations, ALU operations, activation operations, and the like.

The plurality of basic processing circuits are configured to perform operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. The operations performed in parallel include, but are not limited to, inner product operations, matrix or vector multiplication operations, and the like.

The main processing circuit may include a data sending circuit, a data receiving circuit, or interfaces; the data sending circuit may integrate a horizontal data distribution circuit and a vertical data distribution circuit, though in practical applications the two may also be provided separately. Horizontal data is data that needs to be sent to each basic processing circuit along the row direction; as shown in FIG. 1a, the horizontal data is sent to the basic processing circuits of any one or more of the m rows. Vertical data is data that needs to be sent selectively to some of the basic processing circuits along the column direction. Specifically, in a convolution operation, the convolution input data needs to be sent to all the basic processing circuits, so it is vertical data, while the convolution kernels need to be sent selectively to some of the basic processing circuits, so the convolution kernels are horizontal data. Which basic processing circuits the horizontal data is sent to may be determined by the main processing circuit according to the load and other allocation schemes. For either vertical or horizontal data, the data may be sent to each basic processing circuit by broadcasting. (In practical applications, the horizontal/vertical data may be sent to each basic processing circuit by a single broadcast or by multiple broadcasts; the embodiments of the present disclosure do not limit the number of broadcasts.) Optionally, the main processing circuit may also selectively send the above horizontal/vertical data to some of the basic processing circuits.

The main processing circuit (as shown in FIG. 1d) may include registers and/or an on-chip cache circuit, and may further include a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and other circuits. In practical applications, other circuits such as a conversion circuit (for example, a matrix transposition circuit), a data rearrangement circuit, or an activation circuit may also be added to the main processing circuit.

Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit, and may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner product operator circuit, the vector operator circuit, and the accumulator circuit may each be integrated circuits, or they may be separately provided circuits.

Optionally, the accumulator circuits of the n basic processing circuits of the m-th row may perform the accumulation step of the inner product operation, because each basic processing circuit of the m-th row can receive the product results of all the basic processing circuits of its column. Performing the accumulation of the inner product operation with the n basic processing circuits of the m-th row allocates the computing resources effectively and has the advantage of saving power. This technical scheme is especially suitable when m is large.
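The division of labour described above, in which each circuit of a column contributes a partial product and only the m-th-row circuit performs the accumulation, can be modelled as follows. This is an illustrative Python sketch, not the hardware behaviour; slicing the vectors into equal chunks per circuit is an assumption made for the example:

```python
def column_inner_product(x, y, m):
    # Each of the m circuits in a column multiplies its slice of the
    # two vectors; only the m-th-row circuit sums the partials, so
    # the accumulation hardware is exercised once per column.
    n = len(x)
    chunk = (n + m - 1) // m
    partials = [sum(a * b for a, b in zip(x[i*chunk:(i+1)*chunk],
                                          y[i*chunk:(i+1)*chunk]))
                for i in range(m)]
    return sum(partials)  # performed by the row-m basic processing circuit
```

Whatever the column height m, the result equals the plain inner product; only where the additions happen changes.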

The circuits that perform the data compression may be assigned by the main processing circuit, either explicitly or implicitly. In the explicit manner, the main processing circuit may issue a special indication or instruction: when a basic processing circuit receives the special indication or instruction, it determines to perform data compression; when it does not, it determines not to perform data compression. The assignment may also be implicit: for example, when a basic processing circuit receives sparse data (that is, data containing zeros, or data in which the number of values smaller than a preset threshold exceeds a preset amount) and determines that an inner product operation needs to be performed, it compresses the sparse data. For the explicit configuration, the special instruction or indication may carry a decrementing sequence whose value is decreased by 1 each time it passes through a basic processing circuit; each basic processing circuit reads the value of the decrementing sequence, performs data compression if the value is greater than zero, and does not perform data compression if the value is equal to or less than zero. This setting is configured according to the basic processing circuits allocated in the array. For example, for the m basic processing circuits of the i-th column, if the main processing circuit needs the first five basic processing circuits to perform data compression, it issues a special instruction carrying a decrementing sequence whose initial value may be 5. The value is decreased by 1 at each basic processing circuit, so it is 1 at the fifth basic processing circuit and 0 at the sixth; the sixth basic processing circuit therefore no longer performs the data compression. In this way, the main processing circuit can dynamically configure which circuits execute the data compression and how many times it is executed.
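The decrementing-sequence mechanism can be modelled as a counter handed down the column: each circuit compresses while the value it reads is still positive, then passes the decremented counter on. A minimal sketch in Python; the function name is illustrative and not from the disclosure:

```python
def compression_schedule(initial, num_circuits):
    # The counter starts at `initial` and loses 1 at every circuit.
    # A circuit performs compression only if the value it reads is > 0.
    decisions = []
    counter = initial
    for _ in range(num_circuits):
        decisions.append(counter > 0)
        counter -= 1
    return decisions
```

With an initial value of 5 in a column of 7 circuits, the first five circuits compress and the sixth and seventh do not, matching the example above.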

An embodiment of the present disclosure provides an integrated circuit chip device including one main processing circuit (which may also be called a main unit) and a plurality of basic processing circuits (which may also be called basic units). The structure of the embodiment is shown in FIG. 1b, where the dashed box is the internal structure of the neural network operation device, the gray-filled arrows indicate the data transmission paths between the main processing circuit and the basic processing circuit array, and the hollow arrows indicate the data transmission paths between the (adjacent) basic processing circuits within the array. The length and width of the basic processing circuit array may differ, that is, the values of m and n may be different or, of course, the same; the present disclosure does not limit these specific values.

The circuit structure of a basic processing circuit is shown in FIG. 1c. The dashed box in the figure indicates the boundary of the basic processing circuit; the thick arrows crossing the dashed box indicate the data input and output channels (an arrow pointing into the dashed box is an input channel, one pointing out of it is an output channel); the rectangular boxes inside the dashed box indicate the storage unit circuits (registers and/or on-chip caches), including input data 1, input data 2, the multiplication or inner product result, and the accumulation data; the diamond boxes indicate the operator circuits, including the multiplication or inner product operator and the adder.

In this embodiment, the neural network operation device includes one main processing circuit and 16 basic processing circuits (the 16 basic processing circuits are merely for illustration; other numbers may be used in practical applications).

In this embodiment, each basic processing circuit has two data input interfaces and two data output interfaces. In the following description of this example, the horizontal input interface (the horizontal arrow pointing to the unit in FIG. 1b) is referred to as input 0, and the vertical input interface (the vertical arrow pointing to the unit in FIG. 1b) is referred to as input 1; each horizontal data output interface (the horizontal arrow pointing away from the unit in FIG. 1b) is referred to as output 0, and each vertical data output interface (the vertical arrow pointing away from the unit in FIG. 1b) is referred to as output 1.

The data input interfaces and data output interfaces of each basic processing circuit may be connected to different units, including the main processing circuit and other basic processing circuits.

In this example, input 0 of the four basic processing circuits 0, 4, 8, and 12 (numbered as shown in FIG. 1b) is connected to the data output interfaces of the main processing circuit.

In this example, input 1 of the four basic processing circuits 0, 1, 2, and 3 is connected to the data output interfaces of the main processing circuit.

In this example, output 1 of the four basic processing circuits 12, 13, 14, and 15 is connected to the data input interfaces of the main processing circuit.

In this example, the connections between the output interfaces of the basic processing circuits and the input interfaces of the other basic processing circuits are shown in FIG. 1b and are not enumerated one by one.

Specifically, if the output interface S1 of unit S is connected to the input interface P1 of unit P, then unit P can receive, from its P1 interface, the data that unit S sends to its S1 interface.

This embodiment includes one main processing circuit, which is connected to an external device (with both input interfaces and output interfaces); some of the data output interfaces of the main processing circuit are connected to the data input interfaces of some of the basic processing circuits, and some of the data input interfaces of the main processing circuit are connected to the data output interfaces of some of the basic processing circuits.

Method of using the integrated circuit chip device

The data involved in the usage methods provided by this disclosure may be data that has undergone compression processing. It should be noted that the data in this application may be the input neurons or weights of a neural network, which may specifically be matrix data or vector data, etc.; this application does not limit them. That is, the data or data blocks described below in this application may be input neurons or weights of a neural network, embodied in the form of matrices or vectors.

The data compression processing involved in this application is performed specifically in the first mapping circuit and the second mapping circuit described above. It should be understood that, since a neural network is an algorithm with high computational load and high memory access, the more weights there are, the larger both the amount of computation and the amount of memory access become. In particular, for small weights (for example, weights equal to 0 or smaller than a set value), compressing these small-weight data is necessary to increase the computation rate and reduce overhead. In practical applications, data compression is most effective when applied to sparse neural networks, for example by reducing the data-computation workload, reducing data overhead, and increasing the data-computation rate.

Taking the input data as an example, specific embodiments of the data compression processing are described below. The input data includes, but is not limited to, at least one input neuron and/or at least one weight.

In the first embodiment:

After the first mapping circuit receives the first input data (specifically, a data block to be computed sent by the main processing circuit, such as a horizontal data block or a vertical data block), the first mapping circuit may process the first input data to obtain the processed first input data and the identification (mask) data associated with the first input data; the mask data indicates whether the absolute value of the first input data is greater than a first threshold, such as 0.5, 0, and so on.

Specifically, when the absolute value of the first input data is greater than the first threshold, the input data is retained; otherwise, the first input data is deleted or set to 0. For example, the input matrix data block is

Figure PCTCN2019076088-appb-000009

With the first threshold being 0.05, the processed matrix data block obtained after processing by the first mapping circuit is

Figure PCTCN2019076088-appb-000010

and the identification data block (also called the mask matrix) associated with this matrix data block is

Figure PCTCN2019076088-appb-000011
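The thresholding step described above can be sketched in software as follows. This is an illustrative model only, not the disclosed hardware: the function name `first_mapping` is ours, and the 2x2 block values are hypothetical, chosen so that the surviving values are 1, 0.06 and 0.5 as in the example below.

```python
import numpy as np

FIRST_THRESHOLD = 0.05  # the "first threshold" from the text

def first_mapping(block, threshold=FIRST_THRESHOLD):
    """Sketch of the first mapping circuit: keep entries whose absolute
    value exceeds the threshold, zero the rest, and emit a 0/1 mask."""
    mask = (np.abs(block) > threshold).astype(np.int8)
    processed = block * mask  # sub-threshold entries become 0
    return processed, mask

# Hypothetical input block (values assumed for illustration).
block = np.array([[1.0, 0.02],
                  [0.06, 0.5]])
processed, mask = first_mapping(block)
print(processed.tolist())  # [[1.0, 0.0], [0.06, 0.5]]
print(mask.tolist())       # [[1, 0], [1, 1]]
```

The mask carries one bit per element, which is what makes it cheap to transmit alongside the surviving values.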

Further, to reduce the amount of data transmitted, when the main processing circuit distributes data to the basic processing circuits connected to it, it may send only the target data in the processed matrix data block (in this example, 1, 0.06 and 0.5) together with the identification data block associated with that matrix data block. In a specific implementation, the main processing circuit may distribute the target data in the processed matrix data block to the basic processing circuits according to a set rule, for example in row order or in column order; this application does not limit the rule. Correspondingly, after receiving the target data and the associated identification data block, a basic processing circuit restores them to the processed matrix data block according to the set rule (for example, row order). In this example, from the received data (1, 0.06 and 0.5) and the identification data block

Figure PCTCN2019076088-appb-000012

the basic processing circuit can determine that the corresponding matrix data block (that is, the matrix data block processed by the first mapping circuit in the main processing circuit) is

Figure PCTCN2019076088-appb-000013
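The restoration step on the basic-processing-circuit side can be sketched as below; the mask values are hypothetical (the actual mask matrix is in the figure above), and filling in row-major order stands in for the "set rule" of row order.

```python
import numpy as np

def restore(target_data, mask):
    """Sketch: a basic processing circuit rebuilds the processed block
    from the compressed target values and the mask, filling the kept
    positions in row order (the 'set rule' in the text)."""
    out = np.zeros(mask.shape)
    out[mask.astype(bool)] = target_data  # row-major fill of kept slots
    return out

mask = np.array([[1, 0], [1, 1]])        # hypothetical mask block
restored = restore([1.0, 0.06, 0.5], mask)
print(restored.tolist())  # [[1.0, 0.0], [0.06, 0.5]]
```

Only the surviving values plus one bit per element cross the interconnect, yet the full processed block is recoverable.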

In this embodiment of the invention, the first input data may be a horizontal data block and/or a vertical data block.

Correspondingly, the second mapping circuit may process the second input data using the identification data associated with the first input data, thereby obtaining the processed second input data, where the second input data is different from the first input data. For example, when the first input data is at least one weight, the second input data may be at least one input neuron; or, when the first input data is at least one input neuron, the second input data may be at least one weight.

In this embodiment of the invention, the second input data is different from the first input data, and the second input data may be any one of the following: a horizontal data block, a basic data block, a vertical data block, or a partial vertical data block.

For example, when the first input data is a horizontal data block, the second input data is a partial vertical data block. Assume the second input data is the matrix data block

Figure PCTCN2019076088-appb-000014

Correspondingly, using the mask matrix from the example above,

Figure PCTCN2019076088-appb-000015

the processed partial vertical data block obtained after processing is

Figure PCTCN2019076088-appb-000016

Since in practical applications the matrix data blocks involved in the input data have large dimensions, the examples here are illustrative only and are not limiting.
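The second mapping circuit's screening of a *different* block with the first block's mask can be sketched as a simple element-wise product; the mask and the partial vertical block values here are hypothetical stand-ins for the figures above.

```python
import numpy as np

def second_mapping(other_block, mask):
    """Sketch of the second mapping circuit: screen a different data
    block using the mask derived from the first input data."""
    return other_block * mask

mask = np.array([[1, 0], [1, 1]])              # hypothetical mask from the first block
vertical = np.array([[0.7, 0.9], [0.2, 0.4]])  # hypothetical partial vertical block
print(second_mapping(vertical, mask).tolist())  # [[0.7, 0.0], [0.2, 0.4]]
```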

In the second embodiment:

The first mapping circuit may be used to process the first input data and the second input data to obtain the processed first input data together with first identification (mask) data associated with the first input data, and the processed second input data together with second identification (mask) data associated with the second input data. The first mask data and the second mask data indicate whether the absolute values of the first and second input data, respectively, are greater than a second threshold, which is custom-set on the user side or the device side, for example 0.05, 0, and so on.

The processed first input data or second input data may be input data that has been compressed, or input data that has not yet been compressed. For example, suppose the first input data is a horizontal data block, such as the matrix data block in the example above:

Figure PCTCN2019076088-appb-000017

After processing by the first mapping circuit, the processed horizontal data block is obtained; the processed horizontal data block here may be the original matrix data block

Figure PCTCN2019076088-appb-000018

or the compressed matrix data block

Figure PCTCN2019076088-appb-000019

It should be understood that, to reduce the amount of data transmitted and to improve data-processing efficiency in the basic processing circuits, the processed input data in this application (such as a processed basic data block or partial vertical data block) is preferably the compressed data. Preferably, the data sent by the main processing circuit to a basic processing circuit is specifically the target data in the processed input data, where the target data may be the data whose absolute value is greater than a preset threshold, or the non-zero data, and so on.

Correspondingly, in the basic processing circuit, the second mapping circuit may obtain connection identification data from the first identification data associated with the first input data and the second identification data associated with the second input data. The connection identification data indicates the data whose absolute values are greater than a third threshold in both the first input data and the second input data, where the third threshold is custom-set on the user side or the device side, such as 0.05, 0, and so on. Further, the second mapping circuit may process the received first input data and second input data respectively according to the connection identification data, thereby obtaining the processed first input data and the processed second input data.

For example, the first input data is the matrix data block

Figure PCTCN2019076088-appb-000020

and the second input data block is likewise a matrix data block,

Figure PCTCN2019076088-appb-000021

After processing by the first mapping circuit, the first identification data block associated with the first input data is obtained:

Figure PCTCN2019076088-appb-000022

together with the processed first input data block

Figure PCTCN2019076088-appb-000023

Correspondingly, the second identification data block associated with the second input data is obtained:

Figure PCTCN2019076088-appb-000024

and the processed second input data block is

Figure PCTCN2019076088-appb-000025

Correspondingly, to increase the data-transmission rate, the main processing circuit may send to the basic processing circuit only the target data 1, 0.06 and 0.5 in the processed first input data block, together with the first identification data block associated with the first input data block; at the same time, it sends the target data 1, 1.1, 0.6, 0.3 and 0.5 in the processed second input data block, together with the second identification data block associated with the second input data block.

Correspondingly, after receiving the above data, the basic processing circuit may use the second mapping circuit to perform an element-wise AND operation on the first identification data block and the second identification data block, obtaining the connection identification data block

Figure PCTCN2019076088-appb-000026

The second mapping circuit then uses this connection identification data block to process the processed first input data block and the processed second input data block respectively, so that the processed first input data block becomes

Figure PCTCN2019076088-appb-000027

and the processed second input data block becomes

Figure PCTCN2019076088-appb-000028

Here, the basic processing circuit may determine, from the first identification data block and the target data of the received first data block, the first data block to which that target data corresponds (that is, the first data block as processed by the first mapping circuit); correspondingly, from the second identification data block and the target data of the received second data block, it determines the second data block to which that target data corresponds (that is, the second data block as processed by the first mapping circuit). Then, after the second mapping circuit obtains the connection identification data block, it performs element-wise AND operations between the connection identification data block and the determined first data block and the determined second data block respectively, to obtain the first data block and the second data block as processed by the second mapping circuit.
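The connection-mask step can be sketched as an element-wise AND of the two identification blocks, followed by screening both operand blocks with the result. All matrix values below are hypothetical (the real ones are in the figures above); the point is that after screening, an inner product only ever touches positions where both operands survived.

```python
import numpy as np

def connection_mask(mask1, mask2):
    """Element-wise AND of the two identification data blocks, as done
    by the second mapping circuit in the basic processing circuit."""
    return mask1 & mask2

# Hypothetical 2x2 masks for the two input blocks.
m1 = np.array([[1, 0], [1, 1]])
m2 = np.array([[1, 1], [0, 1]])
conn = connection_mask(m1, m2)
print(conn.tolist())  # [[1, 0], [0, 1]]

# Both restored blocks are then screened with the connection mask.
a = np.array([[1.0, 0.0], [0.06, 0.5]])   # hypothetical first block
b = np.array([[0.3, 1.1], [0.0, 0.6]])    # hypothetical second block
print((a * conn).tolist())  # [[1.0, 0.0], [0.0, 0.5]]
print((b * conn).tolist())  # [[0.3, 0.0], [0.0, 0.6]]
```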

In the third embodiment:

In this embodiment, no first mapping circuit is provided in the main processing circuit, but the main processing circuit may send third input data, together with pre-stored third identification data associated with that third input data, to the basic processing circuits connected to it. A second mapping circuit is provided in the basic processing circuit. A specific embodiment of the data compression performed by the second mapping circuit is described below.

It should be understood that the third input data includes, but is not limited to, a basic data block, a partial vertical data block, a vertical data block, and the like. Likewise, in a neural network processor, the third input data may also be at least one weight and/or at least one input neuron; this application does not limit it.

In the second mapping circuit, the second mapping circuit may process the third input data according to the third identification data associated with the received third input data, thereby obtaining the processed third input data, so that related arithmetic operations, such as inner product operations, can subsequently be performed on the processed third input data.

For example, the third input data received by the second mapping circuit is the matrix data block

Figure PCTCN2019076088-appb-000029

and the correspondingly pre-stored third identification data block (also called a mask matrix data block) associated with this third input data is

Figure PCTCN2019076088-appb-000030

Further, the second mapping circuit processes the third input data block according to the third identification data block, and the processed third input data block is specifically

Figure PCTCN2019076088-appb-000031

In addition, the input neurons and output neurons mentioned in the embodiments of the present invention do not refer to the neurons in the input layer and the neurons in the output layer of the entire neural network. Rather, for any two adjacent layers of neurons in the neural network, the neurons in the lower layer of the network's feedforward operation are the input neurons, and the neurons in the upper layer of the feedforward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers, with K = 1, 2, 3, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer, and the neurons in that layer are the above input neurons; the (K+1)-th layer is called the output layer, and the neurons in that layer are the above output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.

In the fourth embodiment:

In this embodiment, no mapping circuit is provided in the main processing circuit; a first mapping circuit and a second mapping circuit are provided in the basic processing circuit. For the data processing of the first mapping circuit and the second mapping circuit, refer to the first to third embodiments described above; details are not repeated here.

Optionally, there is also a fifth embodiment. In the fifth embodiment, no mapping circuit is provided in the basic processing circuit; the first mapping circuit and the second mapping circuit are both provided in the main processing circuit. For the data processing of the first mapping circuit and the second mapping circuit, refer to the first to third embodiments described above; details are not repeated here. That is, the main processing circuit completes the compression of the data and sends the processed input data to the basic processing circuits, so that the basic processing circuits perform the corresponding arithmetic operations using the processed input data (specifically, the processed neurons and processed weights).

Specific structural diagrams of the mapping circuits involved in this application are described below. FIGS. 5a and 5b show two possible mapping circuits. The mapping circuit shown in FIG. 5a includes comparators and selectors; this application does not limit their numbers. FIG. 5a shows one comparator and two selectors. The comparator is used to determine whether the input data satisfies a preset condition, which may be custom-set on the user side or the device side, for example that the absolute value of the input data described above is greater than or equal to a preset threshold. If the preset condition is satisfied, the comparator determines that the input data is allowed to be output, and the identification data associated with that input data is 1; otherwise, it determines that the input data is not output, or defaults the input data to 0, and correspondingly the identification data associated with that input data is 0. That is, after the comparator stage, the identification data associated with the input data is known.

Further, after the comparator has judged the input data against the preset condition, the obtained identification data may be input into a selector, so that the selector uses the identification data to decide whether to output the corresponding input data, that is, to obtain the processed input data.

As shown in FIG. 5a, taking the input data being a matrix data block as an example, the comparator can judge each element of the matrix data block against the preset condition, thereby obtaining the identification data block (mask matrix) associated with that matrix data block. Further, in the first selector, the identification data block may be used to screen the matrix data block, retaining the data in the matrix data block whose absolute values are greater than or equal to the preset threshold (that is, which satisfy the preset condition) and deleting the remaining data, so as to output the processed matrix data block. Optionally, in the second selector, the identification data block may also be used to process other input data (for example, a second matrix data block), for example by an element-wise AND operation, so as to retain the data in the second matrix data block whose absolute values are greater than or equal to the preset threshold and output the processed second matrix data block.
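A behavioral sketch of the FIG. 5a comparator-plus-selector path is given below. The function names and the 0.05 threshold walkthrough are illustrative assumptions, not the actual circuit; each element passes through the comparator (producing its flag bit) and then through a selector gated by that flag.

```python
def comparator(x, threshold):
    """FIG. 5a comparator sketch: emit the 1-bit identification flag."""
    return 1 if abs(x) >= threshold else 0

def selector(x, flag):
    """FIG. 5a selector sketch: pass the value through only when its
    flag is 1; otherwise output 0 (i.e. do not output the value)."""
    return x if flag else 0.0

# Hypothetical element-by-element walk over a block, threshold 0.05.
block = [1.0, 0.02, 0.06, 0.5]
flags = [comparator(x, 0.05) for x in block]
kept = [selector(x, f) for x, f in zip(block, flags)]
print(flags)  # [1, 0, 1, 1]
print(kept)   # [1.0, 0.0, 0.06, 0.5]
```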

It should be understood that, corresponding to the first and second embodiments above, the specific structure of the first mapping circuit may include at least one comparator and at least one selector, for example the comparator and the first selector of FIG. 5a in the example above; the specific structure of the second mapping circuit may include one or more selectors, for example the second selector of FIG. 5a in the example above.

FIG. 5b shows a schematic structural diagram of another mapping circuit. As shown in FIG. 5b, this mapping circuit includes selectors; their number is not limited and may be one or more. Specifically, a selector is used to select among the input data according to the identification data associated with that input data, outputting the data in the input data whose absolute values are greater than or equal to a preset threshold and deleting (not outputting) the remaining data, thereby obtaining the processed input data.

Taking the input data being a matrix data block as an example, the matrix data block and the identification data block associated with it are input into the mapping circuit; the selector can select among the matrix data block according to the identification data block, outputting the data whose absolute values are greater than or equal to the threshold (0 in this example) and not outputting the remaining data, thereby outputting the processed matrix data block.

It should be understood that the structure shown in FIG. 5b can be applied to the second mapping circuit in the third embodiment above; that is, the specific structure of the second mapping circuit in the third embodiment may include at least one selector. Likewise, the first mapping circuit and the second mapping circuit designed into the main processing circuit and the basic processing circuits may be cross-combined from, or split into, the functional components shown in FIGS. 5a and 5b; this application does not limit this.

Based on the foregoing embodiments, the operations to be completed in the main processing circuit and the basic processing circuits are described in detail below; they can be carried out as follows:

The main processing circuit first enables the first mapping circuit to process the first input data, obtaining the processed first input data and the first identification data associated with the first input data, and then transmits the processed first input data and the associated first identification data to the basic processing circuits for computation. For example, the main processing circuit may process the data to be computed (such as a horizontal data block or vertical data block) before transmitting it to the basic processing circuits. The advantage is that the bit width of the transmitted data and the total number of bits transmitted are reduced, and the basic processing circuits also execute data operations of smaller bit width more efficiently and with lower power consumption.

A basic processing circuit enables the second mapping circuit to process the received second input data using the first identification data, obtaining the processed second input data, and then performs the related arithmetic operations on the processed first and second input data. For example, when a basic processing circuit receives second input data transmitted by the main processing circuit (such as sparse data or a vertical data block), it first compresses it and then computes, improving computational efficiency and reducing power consumption.

Optionally, the main processing circuit may first transmit the first input data (such as a basic data block), the first identification data associated with the first input data, the second input data (such as a partial vertical data block), and the second identification data associated with the second input data to the basic processing circuits for computation.

Correspondingly, after receiving the data, a basic processing circuit may first enable the second mapping circuit to obtain the connection identification data block from the first identification data and the second identification data, and then use the connection identification data to process the first input data and the second input data. The arithmetic operations on the processed first and second input data can further be completed in the basic processing circuit. The benefits are a reduced amount of data computation, improved computational efficiency, and reduced power consumption.

Optionally, the first identification data associated with the first input data and the second identification data associated with the second input data sent by the main processing circuit are either pre-stored in the main processing circuit or obtained by the main processing circuit enabling the first mapping circuit on the first/second input data; this application does not limit this.

Method of using the basic processing circuits (as shown in FIG. 2a):

The main processing circuit receives the input data to be computed from outside the device;

Optionally, the main processing circuit processes the data using the various arithmetic circuits of this unit, such as the vector arithmetic circuit, the inner-product operator circuit, and the accumulator circuit;

The main processing circuit sends data through its data output interfaces to the basic processing circuit array (the set of all basic processing circuits is called the basic processing circuit array), as shown in FIG. 2b;

Here, one way of sending data is to send the data directly to some of the basic processing circuits, that is, a multi-broadcast mode;

Another way of sending data is to send different data to different basic processing circuits, that is, a distribution mode;

The basic processing circuit array computes on the data;

A basic processing circuit performs computation after receiving the input data;

Optionally, after receiving data, a basic processing circuit transmits that data out through the data output interface of this unit (to other basic processing circuits that did not receive data directly from the main processing circuit);

Optionally, a basic processing circuit transmits a computation result (an intermediate result or a final result) out through its data output interface;

The main processing circuit receives the output data returned from the basic processing circuit array;

Optionally, the main processing circuit continues to process the data received from the basic processing circuit array (for example, accumulation or activation operations);

When the main processing circuit has finished processing, it transmits the processing result from a data output interface to the outside of the device.
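The distribute/compute/collect flow above can be modeled as a toy software simulation, shown here for a matrix-times-vector workload. The split into four base circuits, the example matrix, and the function name `run` are assumptions for illustration; the real device streams data through hardware interfaces rather than Python lists.

```python
import numpy as np

def run(matrix_S, vector_P, num_base=4):
    """Toy simulation of the usage flow: the main processing circuit
    distributes rows of S to base circuits, each base circuit computes
    inner products, and the main circuit collects the results."""
    rows = np.array_split(matrix_S, num_base)  # distribution mode
    partials = [r @ vector_P for r in rows]    # base-circuit compute
    return np.concatenate(partials)            # main circuit collects

S = np.arange(8.0).reshape(4, 2)  # hypothetical 4x2 matrix
P = np.array([1.0, 2.0])
print(run(S, P).tolist())  # [2.0, 8.0, 14.0, 20.0]
```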

The circuit device is used to complete a tensor-times-tensor operation, where a tensor is the same as the data blocks described above and may be any one or a combination of a matrix, a vector, a three-dimensional data block, a four-dimensional data block, and a higher-dimensional data block. FIGS. 2c and 2f show specific implementations of the matrix-times-vector and matrix-times-matrix operations, respectively.

The circuit device is used to complete a matrix-times-vector operation. (A matrix-times-vector operation may take the inner product of each row of the matrix with the vector and arrange these results into a vector in the order of the corresponding rows.)

The following describes the computation of the multiplication of a matrix S of size M rows by L columns with a vector P of length L, as shown in FIG. 2c.

此方法用到所述神经网络计算装置的全部或者一部分基础处理电路,假设用到了K个基础处理电路;This method uses all or a portion of the basic processing circuit of the neural network computing device, assuming that K basic processing circuits are used;

主处理电路将矩阵S的部分或全部行中的数据发送到K个基础处理电路中的每个基础处理电路；The main processing circuit transmits data from part or all of the rows of the matrix S to each of the K basic processing circuits;

在一种可选的方案中，主处理电路的控制电路将矩阵S中某行的数据每次发送一个数或者一部分数给某个基础处理电路；(例如，对于每次发送一个数，可以为对于某一个基础处理电路，第1次发送第3行第1个数，第2次发送第3行数据中的第2个数，第3次发送第3行的第3个数……，或者对于每次发送一部分数，第1次发送第3行前两个数(即第1、2个数)，第二次发送第3行第3和第4个数，第三次发送第3行第5和第6个数……;)In an optional solution, the control circuit of the main processing circuit sends the data of a certain row of the matrix S, one number or a part of the numbers at a time, to a certain basic processing circuit. (For example, when sending one number at a time: for a given basic processing circuit, the 1st number of the 3rd row is sent the first time, the 2nd number of the 3rd row the second time, the 3rd number of the 3rd row the third time, and so on. Or, when sending a part of the numbers at a time: the first two numbers of the 3rd row (i.e., the 1st and 2nd numbers) are sent the first time, the 3rd and 4th numbers of the 3rd row the second time, the 5th and 6th numbers of the 3rd row the third time, and so on.)

在一种可选的方案中，主处理电路的控制电路将矩阵S中某几行的数据每次各发送一个数或者一部分数给某个基础处理电路；(例如，对于某一个基础处理电路，第1次发送第3,4,5行每行的第1个数，第2次发送第3,4,5行每行的第2个数，第3次发送第3,4,5行每行的第3个数……，或者第1次发送第3,4,5行每行前两个数，第二次发送第3,4,5行每行第3和第4个数，第三次发送第3,4,5行每行第5和第6个数……。)In an optional solution, the control circuit of the main processing circuit sends the data of certain rows of the matrix S, one number or a part of the numbers from each row at a time, to a certain basic processing circuit. (For example, for a given basic processing circuit, the 1st numbers of each of rows 3, 4 and 5 are sent the first time, the 2nd numbers of each of rows 3, 4 and 5 the second time, the 3rd numbers of each of rows 3, 4 and 5 the third time, and so on. Or, the first two numbers of each of rows 3, 4 and 5 are sent the first time, the 3rd and 4th numbers of each of rows 3, 4 and 5 the second time, the 5th and 6th numbers of each of rows 3, 4 and 5 the third time, and so on.)

主处理电路的控制电路将向量P中的数据逐次发送到第0个基础处理电路;The control circuit of the main processing circuit sequentially transmits the data in the vector P to the 0th basic processing circuit;

第0个基础处理电路接收到向量P的数据之后,将该数据发送给与其相连接的下一个基础处理电路,即基础处理电路1;After receiving the data of the vector P, the 0th basic processing circuit sends the data to the next basic processing circuit connected thereto, that is, the basic processing circuit 1;

具体的，有些基础处理电路不能直接从主处理电路处获得计算所需的所有的数据，例如，图2d中的基础处理电路1，只有一个数据输入接口与主处理电路相连，所以只能直接从主处理电路获得矩阵S的数据，而向量P的数据就需要依靠基础处理电路0输出给基础处理电路1，同理，基础处理电路1收到数据后也要继续把向量P的数据输出给基础处理电路2。Specifically, some basic processing circuits cannot obtain all the data required for the calculation directly from the main processing circuit. For example, the basic processing circuit 1 in FIG. 2d has only one data input interface connected to the main processing circuit, so it can only obtain the data of the matrix S directly from the main processing circuit, while the data of the vector P must be output to it by the basic processing circuit 0. Similarly, after receiving the data, the basic processing circuit 1 must continue to output the data of the vector P to the basic processing circuit 2.

每一个基础处理电路对接收到的数据进行运算,该运算包括但不限于:内积运算、乘法运算、加法运算等等;Each of the basic processing circuits performs operations on the received data, including but not limited to: inner product operations, multiplication operations, addition operations, and the like;

在一种可选方案中，基础处理电路每次计算一组或多组两个数据的乘法，然后将结果累加到寄存器和/或片上缓存上；In an optional solution, the basic processing circuit calculates the multiplication of one or more groups of two numbers at a time, and then accumulates the result into a register and/or an on-chip cache;

在一种可选方案中，基础处理电路每次计算一组或多组两个向量的内积，然后将结果累加到寄存器和/或片上缓存上；In an optional solution, the basic processing circuit calculates the inner product of one or more groups of two vectors at a time, and then accumulates the result into a register and/or an on-chip cache;

基础处理电路计算出结果后,将结果从数据输出接口传输出去(即传输给与其连接的其他基础处理电路);After the basic processing circuit calculates the result, the result is transmitted from the data output interface (ie, transmitted to other basic processing circuits connected thereto);

在一种可选方案中,该计算结果可以是内积运算的最终结果或中间结果;In an alternative, the result of the calculation may be the final result or an intermediate result of the inner product operation;

基础处理电路接收到来自其他基础处理电路的计算结果之后,将该数据传输给与其相连接的其他基础处理电路或者主处理电路;After receiving the calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or main processing circuits connected thereto;

主处理电路接收到各个基础处理电路内积运算的结果,将该结果处理得到最终结果(该处理可以为累加运算或激活运算等等)。The main processing circuit receives the result of the inner product operation of each basic processing circuit, and processes the result to obtain a final result (the processing may be an accumulation operation or an activation operation, etc.).
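The final processing at the main processing circuit described above can be modeled as a minimal sketch. The function names and the ReLU activation below are illustrative assumptions, not mandated by the device; the accumulation and the optional activation are the operations named in the text.

```python
def finalize(partial_results, activation=None):
    # The main processing circuit accumulates the inner-product results
    # returned by the basic processing circuits and may optionally apply
    # an activation operation to obtain the final result.
    total = sum(partial_results)
    return activation(total) if activation is not None else total

def relu(x):
    # An assumed activation function, for illustration only.
    return x if x > 0 else 0
```

For example, `finalize([1, 2, 3])` yields `6`, and `finalize([-5, 2], relu)` yields `0`.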

采用上述计算装置实现矩阵乘向量方法的实施例:An embodiment of implementing a matrix multiplication vector method using the above computing device:

在一种可选方案中,方法所用到的多个基础处理电路按照如下图2d或者图2e所示的方式排列;In an alternative, the plurality of basic processing circuits used in the method are arranged in the manner as shown in FIG. 2d or FIG. 2e as follows;

如图2c所示，主处理电路可分别获取矩阵S和矩阵P各自对应的mask矩阵(即前文所述的标识数据/标识数据块)。具体的，该矩阵S和矩阵P各自对应的mask矩阵可以是预先存储在主处理电路中的高速存储器中；也可是主处理电路启用第一映射电路分别根据矩阵S和矩阵P获得的各自对应的mask矩阵。主处理单元的控制电路将矩阵S的M行数据分成K组，分别由第i个基础处理电路负责第i组(该组数据中行的集合记为Ai)的运算；相应地，主处理单元的控制电路同样也会将矩阵S对应的第一mask矩阵的M行数据分成K组，并和矩阵S被划分为K组后新形成的矩阵一起发送给相应的基础处理电路，以在该基础处理电路中完成相关数据的运算操作。As shown in FIG. 2c, the main processing circuit can obtain the mask matrices corresponding to the matrix S and the matrix P respectively (i.e., the identification data/identification data blocks described above). Specifically, the mask matrices corresponding to the matrix S and the matrix P may be pre-stored in a high-speed memory in the main processing circuit; or the main processing circuit may enable the first mapping circuit to obtain the corresponding mask matrices from the matrix S and the matrix P respectively. The control circuit of the main processing unit divides the M rows of data of the matrix S into K groups, with the i-th basic processing circuit responsible for the operation on the i-th group (the set of rows in that group is denoted Ai). Correspondingly, the control circuit of the main processing unit also divides the M rows of data of the first mask matrix corresponding to the matrix S into K groups and sends them, together with the matrices newly formed after the matrix S is divided into K groups, to the corresponding basic processing circuits, so that the operation on the related data is completed in those basic processing circuits.

此处对M行数据进行分组的方法是任意不会重复分配的分组方式;The method of grouping M rows of data here is any grouping method that does not repeatedly allocate;

在一种可选方案中,采用如下分配方式:将第j行分给第j%K(%为取余数运算)个基础处理电路;In an alternative, the following allocation mode is adopted: the jth row is allocated to the j%K (% is a remainder operation) basic processing circuit;

在一种可选方案中,对于不能平均分组的情况也可以先对一部分行平均分配,对于剩下的行以任意方式分配。In an alternative, for a case where the grouping cannot be averaged, a part of the lines may be equally distributed first, and the remaining lines may be allocated in an arbitrary manner.
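The j % K allocation described in the first option above can be sketched as follows. This is an illustrative sketch only; `assign_rows` is a hypothetical helper name.

```python
def assign_rows(num_rows, num_circuits):
    # Allocate row j to basic processing circuit j % num_circuits, so
    # that every row is assigned exactly once and no row is repeated.
    groups = [[] for _ in range(num_circuits)]
    for j in range(num_rows):
        groups[j % num_circuits].append(j)
    return groups
```

For example, distributing 5 rows over 2 circuits gives `[[0, 2, 4], [1, 3]]`.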

主处理电路的控制电路每次将矩阵S中部分或全部行中的数据依次发送给对应的基础处理电路；相应地，控制电路还会将矩阵S中这几行数据所对应在第一mask矩阵中的标识数据一起发送给对应的基础处理电路。The control circuit of the main processing circuit sends the data of part or all of the rows of the matrix S to the corresponding basic processing circuits in turn; correspondingly, the control circuit also sends, together with that data, the identification data in the first mask matrix corresponding to those rows of the matrix S to the corresponding basic processing circuits.

例如，矩阵S为50*50的矩阵数据块，主处理电路可将矩阵S分为10个小矩阵，每个小矩阵的尺寸大小为5*50，则主处理电路可将第1个小矩阵S0(5行50列)以及该小矩阵S0关联的标识数据块(5行50列)一起发送给第1个基础处理电路，以在第1个基础处理电路中完成相关数据的运算处理。For example, if the matrix S is a 50*50 matrix data block, the main processing circuit may divide the matrix S into 10 small matrices, each of size 5*50; the main processing circuit may then send the 1st small matrix S0 (5 rows and 50 columns), together with the identification data block (5 rows and 50 columns) associated with the small matrix S0, to the 1st basic processing circuit, so that the arithmetic processing of the related data is completed in the 1st basic processing circuit.

在一种可选方案中，主处理电路的控制电路每次向第i个基础处理电路发送其负责的第i组数据Mi中的一行数据中的一个或多个数据，该第i组数据Mi可以是矩阵S中的数据，也可以是该矩阵S对应的第一mask矩阵中的数据；In an optional solution, the control circuit of the main processing circuit sends, each time, one or more pieces of data from one row of the i-th group of data Mi for which the i-th basic processing circuit is responsible; the i-th group of data Mi may be data of the matrix S, or data of the first mask matrix corresponding to the matrix S;

在一种可选方案中，主处理电路的控制电路每次向第i个基础处理电路发送其负责的第i组数据Mi中的部分或全部行中的每行的一个或多个数据；In an optional solution, the control circuit of the main processing circuit sends, each time, one or more pieces of data from each of part or all of the rows of the i-th group of data Mi for which the i-th basic processing circuit is responsible;

主处理电路的控制电路将向量P中的数据依次向第1个基础处理电路发送；相应地，主处理电路的控制电路可将向量P关联的第二mask矩阵中的数据也一起依次发送给第1个基础处理电路；The control circuit of the main processing circuit sends the data of the vector P to the 1st basic processing circuit in turn; correspondingly, the control circuit of the main processing circuit may also send the data of the second mask matrix associated with the vector P to the 1st basic processing circuit in turn;

在一种可选方案中，主处理电路的控制电路每次可以发送向量P或者向量P关联的第二mask矩阵中的一个或多个数据；In an optional solution, the control circuit of the main processing circuit may send, each time, one or more pieces of data of the vector P or of the second mask matrix associated with the vector P;

第i个基础处理电路接收到向量P或者第二mask矩阵的数据之后还可发送给与其相连的第i+1个基础处理电路;After receiving the data of the vector P or the second mask matrix, the i-th basic processing circuit may also send the data to the i+1th basic processing circuit connected thereto;

每个基础处理电路接收到来自矩阵S中某一行或者某几行中的一个或多个数据以及来自向量P的一个或多个数据后,进行运算(包括但不限于乘法或加法);Each basic processing circuit receives one or more data from a certain row or rows of the matrix S and one or more data from the vector P, and performs operations (including but not limited to multiplication or addition);

具体实现中，每个基础处理电路接收到矩阵S中的数据以及该数据在第一mask矩阵中关联的第一标识数据、向量P中的数据以及该数据在第二mask矩阵中关联的第二标识数据后；可先根据第一标识数据和第二标识数据获得连接标识数据；然后利用该连接标识数据决定是否对矩阵S中的数据和向量P中的数据执行相关运算操作。该连接标识数据是通过对第一标识数据和第二标识数据进行与操作所获得的，其可为0或1，1表示矩阵S中某个位置的数据和向量P中同一位置的数据均为绝对值大于预设阈值的数据；反之，0表示矩阵S中同一位置的数据和/或向量P中同一位置的数据为绝对值小于或等于预设阈值的数据。In a specific implementation, after each basic processing circuit receives data of the matrix S together with the first identification data associated with that data in the first mask matrix, and data of the vector P together with the second identification data associated with that data in the second mask matrix, it may first obtain connection identification data from the first identification data and the second identification data, and then use the connection identification data to decide whether to perform the related arithmetic operation on the data of the matrix S and the data of the vector P. The connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that the data at a given position in the matrix S and the data at the same position in the vector P both have absolute values greater than the preset threshold; conversely, 0 indicates that the data at that position in the matrix S and/or the data at the same position in the vector P has an absolute value less than or equal to the preset threshold.

即是，每个基础处理电路启动第二映射电路根据矩阵S的第一mask矩阵和向量P的第二mask矩阵选取同一位置中标识数据为1对应在矩阵S和向量P中的数据执行相关运算操作，例如乘法、加法操作等等。也即是，利用第一mask矩阵和第二mask矩阵对应来选取矩阵S和向量P中相同位置上绝对值大于预设阈值的数据执行相关运算操作，如乘法操作。That is, each basic processing circuit enables the second mapping circuit to select, according to the first mask matrix of the matrix S and the second mask matrix of the vector P, the data of the matrix S and the vector P at the positions where the identification data is 1, and to perform the related arithmetic operations on that data, such as multiplication, addition, and so on. In other words, the first mask matrix and the second mask matrix are used to select the data at the same positions in the matrix S and the vector P whose absolute values are greater than the preset threshold, and the related arithmetic operation, such as a multiplication operation, is performed on that data.
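The AND-based selection described above can be sketched as follows. This is an illustrative Python sketch under the stated 0/1 mask convention; the function names are hypothetical and not part of the claimed device.

```python
def connection_mask(first_ids, second_ids):
    # AND the first and second identification data element-wise: the
    # result is 1 only where both operands exceeded the preset
    # absolute-value threshold.
    return [a & b for a, b in zip(first_ids, second_ids)]

def masked_inner_product(s_row, p, first_ids, second_ids):
    # Multiply-accumulate only over positions whose connection
    # identification data is 1, skipping below-threshold positions.
    keep = connection_mask(first_ids, second_ids)
    return sum(a * b for a, b, k in zip(s_row, p, keep) if k)
```

For example, with `first_ids = [1, 0, 1]` and `second_ids = [1, 1, 1]`, only positions 0 and 2 contribute to the inner product.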

例如，基础处理电路接收到矩阵S中的某两行的数据为矩阵S 0：
Figure PCTCN2019076088-appb-000032
对应的该矩阵S 0关联的第一mask矩阵为：
Figure PCTCN2019076088-appb-000033
接收到向量P中的某几个数据为向量P 0=[1 0.01 1.1 0.6] T，该向量P 0关联的第二mask向量为[1 0 1 1] T；进一步地，基础处理电路可启用第二映射电路先对
Figure PCTCN2019076088-appb-000034
和[1 0 1 1] T进行逐元素与操作，获得连接mask矩阵：
Figure PCTCN2019076088-appb-000035
进一步利用该连接mask矩阵对接收的矩阵S 0和向量P 0进行处理，从而获得处理后的矩阵：
Figure PCTCN2019076088-appb-000036
和处理后的向量P 0=[1 0 0 0.6] T，以便基础处理电路针对处理后的矩阵S 0和处理后的向量P 0执行相关的运算操作。For example, suppose the basic processing circuit receives the data of two rows of the matrix S as the matrix S 0 (Figure PCTCN2019076088-appb-000032), whose associated first mask matrix is shown in Figure PCTCN2019076088-appb-000033, and receives some data of the vector P as the vector P 0 = [1 0.01 1.1 0.6] T , whose associated second mask vector is [1 0 1 1] T . The basic processing circuit may then enable the second mapping circuit to first perform an element-wise AND operation on the first mask matrix (Figure PCTCN2019076088-appb-000034) and [1 0 1 1] T to obtain the connection mask matrix (Figure PCTCN2019076088-appb-000035), and then use the connection mask matrix to process the received matrix S 0 and vector P 0 , thereby obtaining the processed matrix S 0 (Figure PCTCN2019076088-appb-000036) and the processed vector P 0 = [1 0 0 0.6] T , so that the basic processing circuit performs the related arithmetic operations on the processed matrix S 0 and the processed vector P 0 .

在一种可选方案中，每个基础处理电路中若接收的数据(具体可为待计算的数据块，如矩阵S或向量P中某几行/列的数据以及对应在mask矩阵中的标识数据)的数据量超过预设阈值时，该基础处理电路将不再接收新的输入数据，如主处理电路将后续发送的矩阵S或向量P某几行/列的数据以及该数据对应在mask矩阵中的标识数据等等，直至基础处理电路中拥有足够的缓存/存储空间，再接收主处理电路新发送的数据。In an optional solution, if the amount of data received by a basic processing circuit (specifically, the data blocks to be calculated, such as the data of certain rows/columns of the matrix S or of the vector P, together with the corresponding identification data in the mask matrices) exceeds a preset threshold, the basic processing circuit no longer receives new input data, such as the data of certain rows/columns of the matrix S or of the vector P, and the corresponding identification data in the mask matrices, that the main processing circuit would subsequently send, until the basic processing circuit has sufficient cache/storage space, after which it again receives data newly sent by the main processing circuit.

在一种可选方案中，基础处理电路每次计算一组或多组两个数据的乘法，然后将结果累加到寄存器和/或片上缓存上；In an optional solution, the basic processing circuit calculates the multiplication of one or more groups of two numbers at a time, and then accumulates the result into a register and/or an on-chip cache;

在一种可选方案中，基础处理电路每次计算一组或多组两个向量的内积，然后将结果累加到寄存器和/或片上缓存上；In an optional solution, the basic processing circuit calculates the inner product of one or more groups of two vectors at a time, and then accumulates the result into a register and/or an on-chip cache;

在一种可选方案中，基础处理电路接收到的数据也可以是中间结果，保存在寄存器和/或片上缓存上；In an optional solution, the data received by the basic processing circuit may also be an intermediate result, stored in a register and/or an on-chip cache;

基础处理电路将本地的计算结果传输给与其相连接的下一个基础处理电路或者主处理电路;The basic processing circuit transmits the local calculation result to the next basic processing circuit or main processing circuit connected thereto;

在一种可选方案中，对应于图2d的结构，只有每列的最后一个基础处理电路的输出接口与主处理电路相连接的，这种情况下，只有最后一个基础处理电路可以直接将本地的计算结果传输给主处理电路，其他基础处理电路的计算结果都要传递给自己的下一个基础处理电路，下一个基础处理电路传递给下下个基础处理电路直至全部传输给最后一个基础处理电路，最后一个基础处理电路将本地的计算结果以及接收到的本列的其他基础处理电路的结果执行累加计算得到中间结果，将中间结果发送至主处理电路；当然还可以为：最后一个基础处理电路将本列的其他基础电路的结果以及本地的处理结果直接发送给主处理电路。In an optional solution, corresponding to the structure of FIG. 2d, only the output interface of the last basic processing circuit of each column is connected to the main processing circuit. In this case, only the last basic processing circuit can transmit its local calculation result directly to the main processing circuit; the calculation results of the other basic processing circuits must each be passed to the next basic processing circuit, which passes them on to the one after it, until everything has been transmitted to the last basic processing circuit. The last basic processing circuit performs an accumulation over its local calculation result and the results received from the other basic processing circuits of the column to obtain an intermediate result, and sends the intermediate result to the main processing circuit. Alternatively, the last basic processing circuit may send the results of the other basic circuits of the column and its local processing result directly to the main processing circuit.

在一种可选方案中，对应于图2e的结构，每一个基础处理电路都有与主处理电路相连接的输出接口，这种情况下，每一个基础处理电路都直接将本地的计算结果传输给主处理电路；In an optional solution, corresponding to the structure of FIG. 2e, each basic processing circuit has an output interface connected to the main processing circuit; in this case, each basic processing circuit transmits its local calculation result directly to the main processing circuit;

基础处理电路接收到其他基础处理电路传递过来的计算结果之后,传输给与其相连接的下一个基础处理电路或者主处理电路。After receiving the calculation result transmitted by other basic processing circuits, the basic processing circuit transmits to the next basic processing circuit or main processing circuit connected thereto.

主处理电路接收到M个内积运算的结果,作为矩阵乘向量的运算结果。The main processing circuit receives the result of the M inner product operations as the operation result of the matrix multiplication vector.

使用所述电路装置完成矩阵乘矩阵运算；The matrix-times-matrix operation is performed using the circuit device;

下面描述计算尺寸是M行L列的矩阵S和尺寸是L行N列的矩阵P的乘法的运算，(矩阵S中的每一行与矩阵P的每一列长度相同，如图2f所示)The operation of multiplying a matrix S of size M rows by L columns with a matrix P of size L rows by N columns is described below. (Each row of the matrix S is the same length as each column of the matrix P, as shown in FIG. 2f.)
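The reference computation being distributed can be sketched as a plain matrix product. This illustrative Python sketch ignores the circuit distribution and the masks; the function name `matmul` is a hypothetical helper.

```python
def matmul(S, P):
    # S is M x L and P is L x N; entry (i, j) of the result is the inner
    # product of row i of S with column j of P.
    m, l = len(S), len(S[0])
    n = len(P[0])
    return [[sum(S[i][k] * P[k][j] for k in range(l))
             for j in range(n)]
            for i in range(m)]
```

For example, `matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` yields `[[19, 22], [43, 50]]`.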

本方法使用所述装置如图1b所示的实施例进行说明;The method is described using the apparatus as shown in the embodiment of Figure 1b;

主处理电路的第一映射电路获取矩阵S和矩阵P各自对应的标识mask矩阵，例如启动第一映射电路分别对矩阵S和矩阵P进行处理以获得该矩阵S对应的第一mask矩阵以及该矩阵P对应的第二mask矩阵；The first mapping circuit of the main processing circuit obtains the identification mask matrices corresponding to the matrix S and the matrix P respectively; for example, the first mapping circuit is enabled to process the matrix S and the matrix P respectively to obtain the first mask matrix corresponding to the matrix S and the second mask matrix corresponding to the matrix P;

主处理电路的控制电路将矩阵S的部分或全部行中的数据发送到通过横向数据输入接口直接与主处理电路相连的基础处理电路(例如，图1b中最上方的灰色填充的竖向数据通路)；同时，控制电路还会将对应在第一mask矩阵中的部分或全部行中的标识数据发送到与其连接的基础处理电路中。例如，控制电路将矩阵S中的前两行数据以及该前两行数据对应在第一mask矩阵中的前两行标识数据一起发送到与主处理电路相连的基础电路中。The control circuit of the main processing circuit sends the data of part or all of the rows of the matrix S to the basic processing circuits directly connected to the main processing circuit through the horizontal data input interfaces (for example, the gray-filled vertical data paths at the top of FIG. 1b); at the same time, the control circuit also sends the corresponding identification data of part or all of the rows of the first mask matrix to the basic processing circuits connected to it. For example, the control circuit sends the first two rows of data of the matrix S, together with the corresponding first two rows of identification data of the first mask matrix, to the basic circuits connected to the main processing circuit.

在一种可选方案中，主处理电路的控制电路将矩阵S中某行的数据每次发送一个数或者一部分数给某个基础处理电路；(例如，对于某一个基础处理电路，第1次发送第3行第1个数，第2次发送第3行数据中的第2个数，第3次发送第3行的第3个数……，或者第1次发送第3行前两个数，第二次发送第3行第3和第4个数，第三次发送第3行第5和第6个数……;)In an optional solution, the control circuit of the main processing circuit sends the data of a certain row of the matrix S, one number or a part of the numbers at a time, to a certain basic processing circuit. (For example, for a given basic processing circuit, the 1st number of the 3rd row is sent the first time, the 2nd number of the 3rd row the second time, the 3rd number of the 3rd row the third time, and so on; or the first two numbers of the 3rd row are sent the first time, the 3rd and 4th numbers of the 3rd row the second time, the 5th and 6th numbers of the 3rd row the third time, and so on.)

相应地，控制电路同时还将与矩阵S中该行对应在第一mask矩阵中的标识数据每次发送一个或一部分标识数据给某个基础处理电路。Correspondingly, the control circuit also sends, one or a part at a time, the identification data in the first mask matrix corresponding to that row of the matrix S to the same basic processing circuit.

在一种可选方案中，主处理电路的控制电路将矩阵S中某几行的数据以及对应在第一mask矩阵中对应几行的标识数据每次各发送一个数或者一部分数给某个基础处理电路；(例如，对于某一个基础处理电路，第1次发送第3,4,5行每行的第1个数，第2次发送第3,4,5行每行的第2个数，第3次发送第3,4,5行每行的第3个数……，或者第1次发送第3,4,5行每行前两个数，第二次发送第3,4,5行每行第3和第4个数，第三次发送第3,4,5行每行第5和第6个数……;)In an optional solution, the control circuit of the main processing circuit sends the data of certain rows of the matrix S, together with the identification data of the corresponding rows of the first mask matrix, one number or a part of the numbers from each row at a time, to a certain basic processing circuit. (For example, for a given basic processing circuit, the 1st numbers of each of rows 3, 4 and 5 are sent the first time, the 2nd numbers of each of rows 3, 4 and 5 the second time, the 3rd numbers of each of rows 3, 4 and 5 the third time, and so on; or the first two numbers of each of rows 3, 4 and 5 are sent the first time, the 3rd and 4th numbers of each of rows 3, 4 and 5 the second time, the 5th and 6th numbers of each of rows 3, 4 and 5 the third time, and so on.)

主处理电路的控制电路将矩阵P中的部分或全部列中的数据发送到通过竖向数据输入接口直接与主处理电路相连的那些基础处理电路(例如，图1b中基础处理电路阵列左侧的灰色填充的横向数据通路)；同时，控制电路还会将对应在第二mask矩阵中的部分或全部列中的标识数据发送到与其连接的基础处理电路中。例如，控制电路将矩阵P中的前两列数据以及该前两列数据对应在第二mask矩阵中的前两列标识数据一起发送到与主处理电路相连的基础电路中。The control circuit of the main processing circuit sends the data of part or all of the columns of the matrix P to those basic processing circuits directly connected to the main processing circuit through the vertical data input interfaces (for example, the gray-filled horizontal data paths on the left side of the basic processing circuit array in FIG. 1b); at the same time, the control circuit also sends the corresponding identification data of part or all of the columns of the second mask matrix to the basic processing circuits connected to it. For example, the control circuit sends the first two columns of data of the matrix P, together with the corresponding first two columns of identification data of the second mask matrix, to the basic circuits connected to the main processing circuit.

在一种可选方案中，主处理电路的控制电路将矩阵P中某列的数据每次发送一个数或者一部分数给某个基础处理电路；(例如，对于某一个基础处理电路，第1次发送第3列第1个数，第2次发送第3列数据中的第2个数，第3次发送第3列的第3个数……，或者第1次发送第3列前两个数，第二次发送第3列第3和第4个数，第三次发送第3列第5和第6个数……;)相应地，控制电路同时还将与矩阵P中该列对应在第二mask矩阵中对应列的标识数据每次发送一个或一部分标识数据给某个基础处理电路。In an optional solution, the control circuit of the main processing circuit sends the data of a certain column of the matrix P, one number or a part of the numbers at a time, to a certain basic processing circuit. (For example, for a given basic processing circuit, the 1st number of the 3rd column is sent the first time, the 2nd number of the 3rd column the second time, the 3rd number of the 3rd column the third time, and so on; or the first two numbers of the 3rd column are sent the first time, the 3rd and 4th numbers of the 3rd column the second time, the 5th and 6th numbers of the 3rd column the third time, and so on.) Correspondingly, the control circuit also sends, one or a part at a time, the identification data of the corresponding column of the second mask matrix to the same basic processing circuit.

在一种可选方案中，主处理电路的控制电路将矩阵P中某几列的数据以及对应在第二mask矩阵中对应几列的标识数据每次各发送一个数或者一部分数给某个基础处理电路；(例如，对于某一个基础处理电路，第1次发送第3,4,5列每列的第1个数，第2次发送第3,4,5列每列的第2个数，第3次发送第3,4,5列每列的第3个数……，或者第1次发送第3,4,5列每列前两个数，第二次发送第3,4,5列每列第3和第4个数，第三次发送第3,4,5列每列第5和第6个数……;)In an optional solution, the control circuit of the main processing circuit sends the data of certain columns of the matrix P, together with the identification data of the corresponding columns of the second mask matrix, one number or a part of the numbers from each column at a time, to a certain basic processing circuit. (For example, for a given basic processing circuit, the 1st numbers of each of columns 3, 4 and 5 are sent the first time, the 2nd numbers of each of columns 3, 4 and 5 the second time, the 3rd numbers of each of columns 3, 4 and 5 the third time, and so on; or the first two numbers of each of columns 3, 4 and 5 are sent the first time, the 3rd and 4th numbers of each of columns 3, 4 and 5 the second time, the 5th and 6th numbers of each of columns 3, 4 and 5 the third time, and so on.)

基础处理电路接收到矩阵S的数据以及矩阵S关联的第一mask矩阵的标识数据之后，将该数据(具体可为矩阵S的数据以及该数据对应在第一mask矩阵中的标识数据)通过其横向的数据输出接口传输给其相连接下一个基础处理电路(例如，图1b中基础处理电路阵列中间的白色填充的横向的数据通路)；基础处理电路接收到矩阵P的数据后，将该数据通过其竖向的数据输出接口传输给与其相连接的下一个基础处理电路(例如，图1b中基础处理电路阵列中间的白色填充的竖向的数据通路)；After receiving the data of the matrix S and the identification data of the first mask matrix associated with the matrix S, the basic processing circuit transmits this data (specifically, the data of the matrix S and the corresponding identification data in the first mask matrix) through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in FIG. 1b); after receiving the data of the matrix P, the basic processing circuit transmits this data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in FIG. 1b);

每一个基础处理电路对接收到的数据进行运算；具体的，每个基础处理电路接收到矩阵S中某一行或几行的数据以及该数据对应在第一mask矩阵中关联的第一标识数据、矩阵P中某一列或几列的数据以及该数据对应在第二mask矩阵中关联的第二标识数据后；可先根据第一标识数据和第二标识数据获得连接标识数据；然后利用该连接标识数据决定是否对矩阵S中的数据和矩阵P中的数据执行相关运算操作。该连接标识数据是通过对第一标识数据和第二标识数据进行与操作所获得的，其可为0或1，1表示矩阵S中某个位置的数据和矩阵P中同一位置的数据均为绝对值大于预设阈值的数据；反之，0表示矩阵S中某一位置的数据和/或矩阵P中同一位置的数据为绝对值小于或等于预设阈值的数据。具体可参见前述实施例所述，这里不再赘述。Each basic processing circuit performs operations on the received data. Specifically, after each basic processing circuit receives data of one or several rows of the matrix S together with the first identification data associated with that data in the first mask matrix, and data of one or several columns of the matrix P together with the second identification data associated with that data in the second mask matrix, it may first obtain connection identification data from the first identification data and the second identification data, and then use the connection identification data to decide whether to perform the related arithmetic operation on the data of the matrix S and the data of the matrix P. The connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that the data at a given position in the matrix S and the data at the same position in the matrix P both have absolute values greater than the preset threshold; conversely, 0 indicates that the data at that position in the matrix S and/or the data at the same position in the matrix P has an absolute value less than or equal to the preset threshold. For details, refer to the foregoing embodiments, which are not repeated here.

即是，每个基础处理电路启动第二映射电路根据矩阵S的第一mask矩阵和矩阵P的第二mask矩阵选取同一位置中标识数据为1的数据执行相关运算操作，例如乘法、加法操作等等。That is, each basic processing circuit enables the second mapping circuit to select, according to the first mask matrix of the matrix S and the second mask matrix of the matrix P, the data at the positions where the identification data is 1, and to perform the related arithmetic operations on that data, such as multiplication, addition, and so on.
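For the matrix-times-matrix case, a single output entry under the mask mechanism above can be sketched as follows. This is an illustrative sketch; the function name is hypothetical and the 0/1 mask convention is as stated in the text.

```python
def masked_entry(s_row, p_col, s_row_ids, p_col_ids):
    # Compute one entry of S x P, multiplying and accumulating only at
    # positions where the AND of the first and second identification
    # data is 1 (i.e., both operands exceed the preset threshold).
    return sum(a * b
               for a, b, ma, mb in zip(s_row, p_col, s_row_ids, p_col_ids)
               if ma & mb)
```

For example, with row mask `[1, 1, 0]` and column mask `[1, 0, 1]`, only the first position contributes to the entry.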

在一种可选方案中，每个基础处理电路中若接收的数据(具体可为待计算的数据块，如矩阵S或矩阵P中某几行/列的数据以及对应在mask矩阵中的标识数据)的数据量超过预设阈值时，该基础处理电路将不再接收新的输入数据，如主处理电路将后续发送的矩阵S或矩阵P某几行/列的数据以及该数据对应在mask矩阵中的标识数据等等，直至基础处理电路中拥有足够的缓存/存储空间，再接收主处理电路新发送的数据。In an optional solution, if the amount of data received by a basic processing circuit (specifically, the data blocks to be calculated, such as the data of certain rows/columns of the matrix S or of the matrix P, together with the corresponding identification data in the mask matrices) exceeds a preset threshold, the basic processing circuit no longer receives new input data, such as the data of certain rows/columns of the matrix S or of the matrix P, and the corresponding identification data in the mask matrices, that the main processing circuit would subsequently send, until the basic processing circuit has sufficient cache/storage space, after which it again receives data newly sent by the main processing circuit.

在一种可选方案中，基础处理电路每次计算一组或多组两个数据的乘法，然后将结果累加到寄存器和/或片上缓存上；In an optional solution, the basic processing circuit calculates the multiplication of one or more groups of two numbers at a time, and then accumulates the result into a register and/or an on-chip cache;

在一种可选方案中，基础处理电路每次计算一组或多组两个向量的内积，然后将结果累加到寄存器和/或片上缓存上；In an optional solution, the basic processing circuit calculates the inner product of one or more groups of two vectors at a time, and then accumulates the result into a register and/or an on-chip cache;

基础处理电路计算出结果后,可以将结果从数据输出接口传输出去;After the basic processing circuit calculates the result, the result can be transmitted from the data output interface;

在一种可选方案中,该计算结果可以是内积运算的最终结果或中间结果;In an alternative, the result of the calculation may be the final result or an intermediate result of the inner product operation;

具体地，如果该基础处理电路有直接与主处理电路相连接的输出接口则从该接口传输结果，如果没有，则向着能够直接向主处理电路输出的基础处理电路的方向输出结果(例如，图1b中，最下面一行基础处理电路将其输出结果直接输出给主处理电路，其他基础处理电路从竖向的输出接口向下传输运算结果)。Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output in the direction of the basic processing circuits that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits transmit their operation results downward through the vertical output interfaces).

基础处理电路接收到来自其他基础处理电路的计算结果之后,将该数据传输给与其相连接的其他基础处理电路或者主处理电路;After receiving the calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or main processing circuits connected thereto;

向着能够直接向主处理电路输出的方向输出结果(例如，图1b中，最下面一行基础处理电路将其输出结果直接输出给主处理电路，其他基础处理电路从竖向的输出接口向下传输运算结果)；The result is output in the direction of the circuits that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits transmit their operation results downward through the vertical output interfaces);

主处理电路接收到各个基础处理电路内积运算的结果,即可得到输出结果。The main processing circuit receives the result of the inner product operation of each basic processing circuit, and the output result is obtained.

“矩阵乘矩阵”方法的实施例:Example of a "matrix multiply matrix" method:

方法用到按照如图1b所示方式排列的基础处理电路阵列;The method uses a basic processing circuit array arranged in the manner shown in Figure 1b;

主处理电路的第一映射电路获取矩阵S和矩阵P各自对应的标识mask矩阵,例如启动第一映射电路分别对矩阵S和矩阵P进行处理以获得该矩阵S对应的第一mask矩阵以及该矩阵P对应的第二mask矩阵,可选的,还可得到处理后的矩阵S和矩阵P,假设处理后的矩阵S有h行,处理后的矩阵P有w列。The first mapping circuit of the main processing circuit acquires the identification (mask) matrices corresponding to the matrix S and the matrix P. For example, the first mapping circuit is started to process the matrix S and the matrix P respectively, obtaining a first mask matrix corresponding to the matrix S and a second mask matrix corresponding to the matrix P; optionally, the processed matrix S and the processed matrix P may also be obtained. Assume the processed matrix S has h rows and the processed matrix P has w columns.

主处理电路的控制电路将矩阵S的h行数据分成h组,分别由第i个基础处理电路负责第i组(该组数据中行的集合记为Hi)的运算;同时,控制电路还会将数据对应在第一mask矩阵中的部分或全部行中的标识数据发送到与其连接的基础处理电路中。例如,控制电路将矩阵S中的前两行数据以及该前两行数据对应在第一mask矩阵中的前两行标识数据一起发送到与主处理电路相连的基础电路中。The control circuit of the main processing circuit divides the h rows of data of the matrix S into h groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of rows in this group is denoted Hi); meanwhile, the control circuit also sends the identification data of the corresponding part or all of the rows of the first mask matrix to the basic processing circuits connected to it. For example, the control circuit sends the first two rows of data of the matrix S, together with the corresponding first two rows of identification data of the first mask matrix, to the basic circuits connected to the main processing circuit.

此处对h行数据进行分组的方法是任意不会重复分配的分组方式;The h rows of data may be grouped here in any manner that does not assign a row more than once;

在一种可选方案中,采用如下分配方式:主处理电路的控制电路将第j行分给第j%h个基础处理电路;In an alternative, the following allocation is adopted: the control circuit of the main processing circuit assigns the j-th row to the (j % h)-th basic processing circuit;

在一种可选方案中,对于不能平均分组的情况也可以先对一部分行平均分配,对于剩下的行以任意方式分配。In an alternative, when the rows cannot be divided evenly, a portion of the rows may first be distributed evenly, and the remaining rows allocated in an arbitrary manner.
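The modulo-based row assignment described above can be sketched as follows. This is a simplified Python model with an illustrative function name, not part of the disclosed device: row j of the matrix S is handled by basic processing circuit j % h, so no row is assigned twice.

```python
def assign_rows(num_rows, h):
    """Group row indices among h basic processing circuits so that
    no row is assigned more than once (the j % h scheme above)."""
    groups = [[] for _ in range(h)]
    for j in range(num_rows):
        groups[j % h].append(j)  # circuit (j % h) handles row j
    return groups

groups = assign_rows(10, 4)
# Every row appears in exactly one group.
assert sorted(r for g in groups for r in g) == list(range(10))
```

The same scheme applies symmetrically to the w columns of the matrix P with h replaced by w.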

主处理电路的控制电路将矩阵P的w列数据分成w组,分别由第i个基础处理电路负责第i组(该组数据中列的集合记为Wi)的运算;相应地,控制电路同时还将与矩阵P中该列对应在第二mask矩阵中对应列的标识数据每次发送一个或一部分标识数据给某个基础处理电路。The control circuit of the main processing circuit divides the w columns of data of the matrix P into w groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of columns in this group is denoted Wi); correspondingly, the control circuit also sends, one number or a part at a time, the identification data of the corresponding columns of the second mask matrix to the corresponding basic processing circuit.

此处对W列数据进行分组的方法是任意不会重复分配的分组方式;The method of grouping the W column data here is any grouping method that does not repeatedly allocate;

在一种可选方案中,采用如下分配方式:主处理电路的控制电路将第j列分给第j%w个基础处理电路;In an alternative, the following allocation is adopted: the control circuit of the main processing circuit assigns the j-th column to the (j % w)-th basic processing circuit;

在一种可选方案中,对于不能平均分组的情况也可以先对一部分列平均分配,对于剩下的列以任意方式分配。In an alternative, for a case where the grouping cannot be averaged, a part of the columns may be equally distributed first, and the remaining columns may be allocated in an arbitrary manner.

主处理电路的控制电路将矩阵S的部分或全部行中的数据发送到基础处理电路阵列中每行的第一个基础处理电路;The control circuit of the main processing circuit transmits data in part or all of the rows of the matrix S to the first basic processing circuit of each row in the basic processing circuit array;

在一种可选方案中,主处理电路的控制电路每次向基础处理电路阵列中第i行的第一个基础处理电路发送其负责的第i组数据Hi中的一行数据中的一个或多个数据;同时采用相同方法可将第i组数据Hi对应在mask矩阵中的标识数据也发送给第一基础处理电路;In an alternative, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th row of the basic processing circuit array, one or more data of one row of the i-th group of data Hi for which it is responsible; in the same manner, the identification data of the mask matrix corresponding to the i-th group of data Hi may also be sent to that first basic processing circuit;

在一种可选方案中,主处理电路的控制电路每次向基础处理电路阵列中第i行的第一个基础处理电路发送其负责的第i组数据Hi中的部分或全部行中的每行的一个或多个数据;同时采用相同方法可将第i组数据Hi对应在mask矩阵中的标识数据也发送给第一基础处理电路;In an alternative, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th row of the basic processing circuit array, one or more data of each of part or all of the rows of the i-th group of data Hi for which it is responsible; in the same manner, the identification data of the mask matrix corresponding to the i-th group of data Hi may also be sent to that first basic processing circuit;

主处理电路的控制电路将矩阵P的部分或全部列中的数据发送到基础处理电路阵列中每列的第一个基础处理电路;同时,控制电路还会将对应在第二mask矩阵中的部分或全部列中的标识数据发送到与其连接的基础处理电路中。例如,控制电路将矩阵P中的前两列数据以及该前两列数据对应在第二mask矩阵中的前两列标识数据一起发送到与主处理电路相连的基础电路中。The control circuit of the main processing circuit sends the data in part or all of the columns of the matrix P to the first basic processing circuit of each column of the basic processing circuit array; meanwhile, the control circuit also sends the identification data of the corresponding part or all of the columns of the second mask matrix to the basic processing circuits connected to it. For example, the control circuit sends the first two columns of data of the matrix P, together with the corresponding first two columns of identification data of the second mask matrix, to the basic circuits connected to the main processing circuit.

在一种可选方案中,主处理电路的控制电路每次向基础处理电路阵列中第i列的第一个基础处理电路发送其负责的第i组数据Wi中的一列数据中的一个或多个数据;In an alternative, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th column of the basic processing circuit array, one or more data of one column of the i-th group of data Wi for which it is responsible;

在一种可选方案中,主处理电路的控制电路每次向基础处理电路阵列中第i列的第一个基础处理电路发送其负责的第i组数据Wi中的部分或全部列中的每列的一个或多个数据;In an alternative, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th column of the basic processing circuit array, one or more data of each of part or all of the columns of the i-th group of data Wi for which it is responsible;

基础处理电路接收到矩阵S的数据之后,将该数据通过其横向的数据输出接口传输给其相连接下一个基础处理电路(例如,图1b中基础处理电路阵列中间的白色填充的横向的数据通路);基础处理电路接收到矩阵P的数据后,将该数据通过其竖向的数据输出接口传输给与其相连接的下一个基础处理电路(例如,图1b中基础处理电路阵列中间的白色填充的竖向的数据通路);After receiving data of the matrix S, a basic processing circuit transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in FIG. 1b); after receiving data of the matrix P, a basic processing circuit transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in FIG. 1b);

每一个基础处理电路对接收到的数据进行运算;具体的,每个基础处理电路接收到矩阵S中某一行或几行的数据以及该数据对应在第一mask矩阵中关联的第一标识数据、矩阵P中某一列或几列的数据以及该数据对应在第二mask数据中关联的第二标识数据后;可先根据第一标识数据和第二标识数据获得连接标识数据;然后利用该连接标识数据决定是否对矩阵S中的数据和矩阵P中的数据执行相关运算操作。该连接标识数据是通过对第一标识数据和第二标识数据进行与操作所获得的,其可为0或1,1表示矩阵S中某个位置的数据和矩阵P中同一位置的数据均为绝对值大于预设阈值的数据;反之,0表示矩阵S中某一位置的数据和/或矩阵P中同一位置的数据为绝对值小于或等于预设阈值的数据。具体可参见前述实施例所述,这里不再赘述。Each basic processing circuit operates on the received data. Specifically, after receiving the data of one or several rows of the matrix S together with the associated first identification data in the first mask matrix, and the data of one or several columns of the matrix P together with the associated second identification data in the second mask matrix, a basic processing circuit may first obtain connection identification data from the first identification data and the second identification data, and then use the connection identification data to decide whether to perform the relevant operation on the data of the matrix S and the data of the matrix P. The connection identification data is obtained by performing an AND operation on the first identification data and the second identification data, and may be 0 or 1: 1 indicates that the data at a given position in the matrix S and the data at the same position in the matrix P both have absolute values greater than a preset threshold; conversely, 0 indicates that the data at a given position in the matrix S and/or the data at the same position in the matrix P has an absolute value less than or equal to the preset threshold. For details, refer to the foregoing embodiments, which are not repeated here.

即是,每个基础处理电路启动第二映射电路根据矩阵S的第一mask矩阵和矩阵P的第二mask矩阵选取同一位置中标识数据为1的数据执行相关运算操作,例如乘法、加法操作等等。That is, each basic processing circuit starts its second mapping circuit, selects, according to the first mask matrix of the matrix S and the second mask matrix of the matrix P, the data whose identification data at the same position is 1, and performs the relevant operations on them, such as multiplication, addition, and so on.
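The mask-and-select scheme above can be sketched as a simplified Python model (illustrative names, not the circuit itself): the connection identification data is the element-wise AND of the two mask vectors, and only positions whose connection flag is 1 contribute to the inner product.

```python
def make_mask(vec, threshold):
    """1 where |x| > threshold, else 0 (the identification/mask data)."""
    return [1 if abs(x) > threshold else 0 for x in vec]

def masked_inner_product(s_row, p_col, s_mask, p_mask):
    # Connection identification = AND of the two masks; only positions
    # where both operands exceed the threshold are multiplied and accumulated.
    acc = 0
    for a, b, ma, mb in zip(s_row, p_col, s_mask, p_mask):
        if ma & mb:
            acc += a * b
    return acc

s_row = [0.0, 2.0, 3.0, 0.0]
p_col = [5.0, 0.0, 4.0, 1.0]
sm = make_mask(s_row, 0.0)   # [0, 1, 1, 0]
pm = make_mask(p_col, 0.0)   # [1, 0, 1, 1]
assert masked_inner_product(s_row, p_col, sm, pm) == 12.0  # only index 2 survives
```

The multiplications skipped by the connection flag are exactly those where at least one operand was compressed away, which is the point of the mask-based compression.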

在一种可选方案中,每个基础处理电路中若接收的数据(具体可为待计算的数据块,如矩阵S或矩阵P中某几行/列的数据以及对应在mask矩阵中的标识数据)的数据量超过预设阈值时,该基础处理电路将不再接收新的输入数据,如主处理电路将后续发送的矩阵S或矩阵P某几行/列的数据以及该数据对应在mask矩阵中的标识数据等等,直至基础处理电路中拥有足够的缓存/存储空间,再接收主处理电路新发送的数据。In an alternative, if the amount of data received by a basic processing circuit (specifically, the data blocks to be calculated, such as the data of several rows/columns of the matrix S or the matrix P and the corresponding identification data in the mask matrices) exceeds a preset threshold, the basic processing circuit stops receiving new input data, such as the data of further rows/columns of the matrix S or the matrix P and the corresponding identification data in the mask matrices that the main processing circuit would subsequently send, until the basic processing circuit has sufficient buffer/storage space, after which it resumes receiving newly sent data from the main processing circuit.

在一种可选方案中,基础处理电路每次计算一组或多组两个数据的乘法,然后将结果累加到寄存器和/或片上缓存上;In an alternative, the basic processing circuit calculates the multiplication of one or more pairs of data at a time, and then accumulates the result in a register and/or an on-chip cache;

在一种可选方案中,基础处理电路每次计算一组或多组两个向量的内积,然后将结果累加到寄存器和/或片上缓存上;In an alternative, the basic processing circuit calculates the inner product of one or more pairs of vectors at a time, and then accumulates the result in a register and/or an on-chip cache;

基础处理电路计算出结果后,可以将结果从数据输出接口传输出去;After the basic processing circuit calculates the result, the result can be transmitted from the data output interface;

在一种可选方案中,该计算结果可以是内积运算的最终结果或中间结果;In an alternative, the result of the calculation may be the final result or an intermediate result of the inner product operation;

具体地,如果该基础处理电路有直接与主处理电路相连接的输出接口则从该接口传输结果,如果没有,则向着能够直接向主处理电路输出的基础处理电路的方向输出结果(例如,最下面一行基础处理电路将其输出结果直接输出给主处理电路,其他基础处理电路从竖向的输出接口向下传输运算结果)。Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output toward the basic processing circuits that can output directly to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits transfer their operation results downward through the vertical output interfaces).

基础处理电路接收到来自其他基础处理电路的计算结果之后,将该数据传输给与其相连接的其他基础处理电路或者主处理电路;After receiving the calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or main processing circuits connected thereto;

向着能够直接向主处理电路输出的方向输出结果(例如,最下面一行基础处理电路将其输出结果直接输出给主处理电路,其他基础处理电路从竖向的输出接口向下传输运算结果);The result is output toward the direction that can output directly to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits transfer their operation results downward through the vertical output interfaces);

主处理电路接收到各个基础处理电路内积运算的结果,即可得到输出结果。The main processing circuit receives the result of the inner product operation of each basic processing circuit, and the output result is obtained.

以上描述中使用的“横向”,“竖向”等词语只是为了表述图1b所示的例子,实际使用只需要区分出每个单元的“横向”“竖向”接口代表两个不同的接口即可。The words "horizontal" and "vertical" used in the above description merely describe the example shown in FIG. 1b; in actual use, it is only necessary that the "horizontal" and "vertical" interfaces of each unit represent two different interfaces.

使用所述电路装置完成全连接运算:Using the circuit arrangement to complete the full join operation:

如果全连接层的输入数据是一个向量(即神经网络的输入是单个样本的情况),则以全连接层的权值矩阵作为矩阵S,输入向量作为向量P,使用所述装置执行矩阵乘以向量的方法运算;If the input data of the fully connected layer is a vector (i.e., the case where the input of the neural network is a single sample), the weight matrix of the fully connected layer is taken as the matrix S and the input vector as the vector P, and the operation is performed using the matrix-multiply-vector method of the device;

如果全连接层的输入数据是一个矩阵(即神经网络的输入是多个样本的情况),则以全连接层的权值矩阵作为矩阵S,输入数据作为矩阵P,或者以全连接层的权值矩阵作为矩阵P,输入数据作为矩阵S,按照所述装置的矩阵乘以矩阵执行运算;If the input data of the fully connected layer is a matrix (i.e., the case where the input of the neural network is multiple samples), the weight matrix of the fully connected layer is taken as the matrix S and the input data as the matrix P, or the weight matrix of the fully connected layer is taken as the matrix P and the input data as the matrix S, and the operation is performed using the matrix-multiply-matrix method of the device;
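The mapping of a fully connected layer onto the two matrix methods above can be sketched numerically as follows. This is a toy Python model with illustrative names; it only shows that a single sample reduces to matrix-times-vector and a batch to matrix-times-matrix.

```python
def fully_connected(weights, inputs):
    """weights: h x k weight matrix (matrix S).
    inputs: k-vector (single sample, vector P) or k x w matrix (batch, matrix P)."""
    if isinstance(inputs[0], list):  # batch of samples: matrix-multiply-matrix
        k, w = len(inputs), len(inputs[0])
        return [[sum(row[t] * inputs[t][c] for t in range(k)) for c in range(w)]
                for row in weights]
    # single sample: matrix-multiply-vector
    return [sum(wv * xv for wv, xv in zip(row, inputs)) for row in weights]

W = [[1, 2], [3, 4]]                                      # weight matrix S
assert fully_connected(W, [1, 1]) == [3, 7]               # single sample
assert fully_connected(W, [[1, 0], [0, 1]]) == [[1, 2], [3, 4]]  # batch = identity
```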

使用所述电路装置完成卷积运算:The convolution operation is performed using the circuit device:

下面描述卷积运算,下面的图中一个方块表示一个数据,输入数据用图3a表示(N个样本,每个样本有C个通道,每个通道的特征图的高为H,宽为W),权值也即卷积核用图3b表示(有M个卷积核,每个卷积核有C个通道,高和宽分别为KH和KW)。对于输入数据的N个样本,卷积运算的规则都是一样的,下面解释在一个样本上进行卷积运算的过程,在一个样本上,M个卷积核中的每一个都要进行同样的运算,每个卷积核运算得到一张平面特征图,M个卷积核最终计算得到M个平面特征图,(对一个样本,卷积的输出是M个特征图),对于一个卷积核,要在一个样本的每一个平面位置进行内积运算,然后沿着H和W方向进行滑动,例如,图3c表示一个卷积核在输入数据的一个样本中右下角的位置进行内积运算的对应图;图3d表示卷积的位置向左滑动一格和图3e表示卷积的位置向上滑动一格。The convolution operation is described below. In the following figures, one block represents one datum. The input data is shown in Figure 3a (N samples, each with C channels, each channel's feature map having height H and width W); the weights, i.e., the convolution kernels, are shown in Figure 3b (M convolution kernels, each with C channels, with height KH and width KW). The rule of the convolution operation is the same for all N samples of the input data. The following explains the process of performing convolution on one sample: on one sample, each of the M convolution kernels performs the same operation, and each convolution kernel operation yields one planar feature map, so the M convolution kernels finally compute M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product operation is performed at each planar position of the sample, sliding along the H and W directions. For example, Figure 3c shows the position at which a convolution kernel performs the inner product operation at the lower right corner of one sample of the input data; Figure 3d shows the convolution position sliding one cell to the left, and Figure 3e shows the convolution position sliding one cell upward.
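The sliding inner-product just described can be modeled by a naive convolution loop. This is a Python sketch over nested lists for a single sample, stride 1 and no padding, with dimensions named as in Figures 3a and 3b; it illustrates the arithmetic only, not how the device streams the data.

```python
def conv2d_single_sample(x, kernels):
    """x: C x H x W input sample; kernels: M x C x KH x KW weights.
    Returns M x (H-KH+1) x (W-KW+1) feature maps (stride 1, no padding)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    M, KH, KW = len(kernels), len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for m in range(M):                      # one planar feature map per kernel
        fmap = []
        for i in range(H - KH + 1):         # slide along H
            row = []
            for j in range(W - KW + 1):     # slide along W
                acc = 0                     # inner product at this position
                for c in range(C):
                    for ki in range(KH):
                        for kj in range(KW):
                            acc += x[c][i + ki][j + kj] * kernels[m][c][ki][kj]
                row.append(acc)
            fmap.append(row)
        out.append(fmap)
    return out

x = [[[1, 2], [3, 4]]]        # C=1, H=W=2
k = [[[[1, 0], [0, 1]]]]      # M=1, KH=KW=2
assert conv2d_single_sample(x, k) == [[[5]]]  # 1*1 + 4*1
```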

本方法使用所述装置如图1b所示的实施例进行说明;The method is described using the apparatus as shown in the embodiment of Figure 1b;

主处理电路的第一映射电路可将权值的部分或全部卷积核中的数据进行处理,得到对应的mask数据以及处理后的权值数据(即是处理后权值的部分或全部卷积核中的数据)。The first mapping circuit of the main processing circuit may process the data in part or all of the convolution kernels of the weights to obtain the corresponding mask data and the processed weight data (that is, the data in part or all of the convolution kernels of the processed weights).

主处理电路的控制电路将权值的部分或全部卷积核中的数据(该数据可为原来的权值数据或者处理后的权值数据)发送到通过横向数据输入接口直接与主处理电路相连的那些基础处理电路(例如,图1b中最上方的灰色填充的竖向数据通路);同时,控制电路将与该数据对应关联的mask数据也一起发送给与主处理电路连接的基础处理电路中;The control circuit of the main processing circuit sends the data in part or all of the convolution kernels of the weights (the data may be the original weight data or the processed weight data) to those basic processing circuits directly connected to the main processing circuit through the horizontal data input interfaces (for example, the gray-filled vertical data paths at the top of FIG. 1b); meanwhile, the control circuit also sends the mask data associated with this data to the basic processing circuits connected to the main processing circuit;

在一种可选方案中,主处理电路的控制电路将权值中某个卷积核的数据每次发送一个数或者一部分数给某个基础处理电路;(例如,对于某一个基础处理电路,第1次发送第3行第1个数,第2次发送第3行数据中的第2个数,第3次发送第3行的第3个数……,或者第1次发送第3行前两个数,第二次发送第3行第3和第4个数,第三次发送第3行第5和第6个数……;)同时,控制电路将该权值中某个卷积核对应的mask数据也采用上述每次发送一个数或一部分数据给上述基础处理电路;In an alternative, the control circuit of the main processing circuit sends the data of one convolution kernel of the weights to a basic processing circuit one number, or a part of the numbers, at a time (for example, for a given basic processing circuit, the 1st number of the 3rd row is sent the 1st time, the 2nd number of the 3rd row the 2nd time, the 3rd number of the 3rd row the 3rd time, ..., or the first two numbers of the 3rd row are sent the 1st time, the 3rd and 4th numbers of the 3rd row the 2nd time, the 5th and 6th numbers of the 3rd row the 3rd time, ...); meanwhile, the control circuit sends the mask data corresponding to that convolution kernel of the weights to the same basic processing circuit in the same way, one number or a part of the data at a time;

在一种可选方案中另一种情况是,主处理电路的控制电路将权值中某几个卷积核的数据每次各发送一个数或者一部分数给某个基础处理电路;(例如,对于某一个基础处理电路,第1次发送第3,4,5行每行的第1个数,第2次发送第3,4,5行每行的第2个数,第3次发送第3,4,5行每行的第3个数……,或者第1次发送第3,4,5行每行前两个数,第二次发送第3,4,5行每行第3和第4个数,第三次发送第3,4,5行每行第5和第6个数……;)相应地,控制电路将与该权值中某几个卷积核所对应关联的mask数据也采用上述相同的方法每次发送一个数或一部分数据给那个基础处理电路;In another alternative, the control circuit of the main processing circuit sends the data of several convolution kernels of the weights to a basic processing circuit one number each, or a part of the numbers each, at a time (for example, for a given basic processing circuit, the 1st numbers of each of the 3rd, 4th and 5th rows are sent the 1st time, the 2nd numbers of each of the 3rd, 4th and 5th rows the 2nd time, the 3rd numbers of each of the 3rd, 4th and 5th rows the 3rd time, ..., or the first two numbers of each of the 3rd, 4th and 5th rows the 1st time, the 3rd and 4th numbers of each of the 3rd, 4th and 5th rows the 2nd time, the 5th and 6th numbers of each of the 3rd, 4th and 5th rows the 3rd time, ...); correspondingly, the control circuit sends the mask data associated with those convolution kernels of the weights to that basic processing circuit in the same way, one number or a part of the data at a time;

主处理电路的控制电路把输入数据按照卷积的位置进行划分,主处理电路的控制电路将输入数据中的部分或全部卷积位置中的数据发送到通过竖向数据输入接口直接与主处理电路相连的那些基础处理电路(例如,图1b中基础处理电路阵列左侧的灰色填充的横向数据通路);相应地,控制电路同样也会按照卷积的位置对于所述输入数据关联的mask数据进行划分,相应地控制电路同时也会将所述输入数据中的部分或全部卷积位置中的数据所对应的mask数据也一起发送给与主处理电路电性连接的基础处理电路中;The control circuit of the main processing circuit divides the input data according to the convolution positions, and sends the data at part or all of the convolution positions of the input data to those basic processing circuits directly connected to the main processing circuit through the vertical data input interfaces (for example, the gray-filled horizontal data paths on the left side of the basic processing circuit array in FIG. 1b); correspondingly, the control circuit also divides the mask data associated with the input data according to the convolution positions, and likewise sends the mask data corresponding to the data at part or all of the convolution positions of the input data to the basic processing circuits electrically connected to the main processing circuit;

在一种可选方案中,主处理电路的控制电路将输入数据中某个卷积位置的数据以及与该数据对应关联的mask数据每次发送一个数或者一部分数给某个基础处理电路;(例如,对于某一个基础处理电路,第1次发送第3列第1个数,第2次发送第3列数据中的第2个数,第3次发送第3列的第3个数……,或者第1次发送第3列前两个数,第二次发送第3列第3和第4个数,第三次发送第3列第5和第6个数……;)In an alternative, the control circuit of the main processing circuit sends the data at one convolution position of the input data, together with the mask data associated with it, to a basic processing circuit one number, or a part of the numbers, at a time (for example, for a given basic processing circuit, the 1st number of the 3rd column is sent the 1st time, the 2nd number of the 3rd column the 2nd time, the 3rd number of the 3rd column the 3rd time, ..., or the first two numbers of the 3rd column the 1st time, the 3rd and 4th numbers of the 3rd column the 2nd time, the 5th and 6th numbers of the 3rd column the 3rd time, ...);

在一种可选方案中另一种情况是,主处理电路的控制电路将输入数据中某几个卷积位置的数据以及与该数据对应关联的mask数据每次各发送一个数或者一部分数给某个基础处理电路;(例如,对于某一个基础处理电路,第1次发送第3,4,5列每列的第1个数,第2次发送第3,4,5列每列的第2个数,第3次发送第3,4,5列每列的第3个数……,或者第1次发送第3,4,5列每列前两个数,第二次发送第3,4,5列每列第3和第4个数,第三次发送第3,4,5列每列第5和第6个数……;)In another alternative, the control circuit of the main processing circuit sends the data at several convolution positions of the input data, together with the mask data associated with it, to a basic processing circuit one number each, or a part of the numbers each, at a time (for example, for a given basic processing circuit, the 1st numbers of each of the 3rd, 4th and 5th columns are sent the 1st time, the 2nd numbers of each of the 3rd, 4th and 5th columns the 2nd time, the 3rd numbers of each of the 3rd, 4th and 5th columns the 3rd time, ..., or the first two numbers of each of the 3rd, 4th and 5th columns the 1st time, the 3rd and 4th numbers of each of the 3rd, 4th and 5th columns the 2nd time, the 5th and 6th numbers of each of the 3rd, 4th and 5th columns the 3rd time, ...);

基础处理电路接收到权值的数据(具体可为权值中卷积核的数据(简称权值数据)或者与该权值数据对应关联的mask数据)之后,将该数据通过其横向的数据输出接口传输给其相连接下一个基础处理电路(例如,图1b中基础处理电路阵列中间的白色填充的横向的数据通路);基础处理电路接收到数据(该数据可为主处理电路发送的输入数据以及该输入数据关联的标识mask数据)后,将该数据通过其竖向的数据输出接口传输给与其相连接的下一个基础处理电路(例如,图1b中基础处理电路阵列中间的白色填充的竖向的数据通路);After receiving the weight data (specifically, the data of a convolution kernel of the weights (weight data for short) or the mask data associated with that weight data), a basic processing circuit transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in FIG. 1b); after receiving data (which may be the input data sent by the main processing circuit and the identification (mask) data associated with that input data), a basic processing circuit transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in FIG. 1b);

具体的,主处理电路的控制电路可将输入数据以及该输入数据关联的mask数据一起发送给基础处理电路,基础处理电路接收该输入数据以及该输入数据关联的mask数据;Specifically, the control circuit of the main processing circuit may send the input data and the mask data associated with the input data to the basic processing circuit, and the basic processing circuit receives the input data and the mask data associated with the input data;

每一个基础处理电路对接收到的数据进行运算;具体的,基础处理电路可启用第二映射电路根据输入数据关联的mask数据以及权值数据关联的mask数据(即权值中卷积核所关联的mask数据)得到连接标识数据;再利用连接标识数据选择输入数据以及权值数据中绝对值大于预设阈值的数据进行乘法运算;Each basic processing circuit operates on the received data. Specifically, a basic processing circuit may enable its second mapping circuit to obtain connection identification data from the mask data associated with the input data and the mask data associated with the weight data (i.e., the mask data associated with the convolution kernels of the weights), and then use the connection identification data to select, from the input data and the weight data, the data whose absolute values are greater than a preset threshold for multiplication;

在一种可选方案中,每个基础处理电路中若接收的数据(具体可为待计算的数据块,如权值中卷积核中的数据以及该数据关联的mask数据、输入数据或者该输入数据关联的mask数据)的数据量超过预设阈值时,该基础处理电路将不再接收新的输入数据,如主处理电路将后续发送的权值中某几个卷积核中的数据以及该数据对应关联的mask数据等等,直至基础处理电路中拥有足够的缓存/存储空间,再接收主处理电路新发送的数据。In an alternative, if the amount of data received by a basic processing circuit (specifically, the data blocks to be calculated, such as the data of a convolution kernel of the weights and its associated mask data, or the input data and its associated mask data) exceeds a preset threshold, the basic processing circuit stops receiving new input data, such as the data of further convolution kernels of the weights and the corresponding associated mask data that the main processing circuit would subsequently send, until the basic processing circuit has sufficient buffer/storage space, after which it resumes receiving newly sent data from the main processing circuit.

在一种可选方案中,基础处理电路每次计算一组或多组两个数据的乘法,然后将结果累加到寄存器和/或片上缓存上;In an alternative, the base processing circuit calculates a multiplication of one or more sets of two data at a time, and then accumulates the result on a register and/or an on-chip buffer;

在一种可选方案中,基础处理电路每次计算一组或多组两个向量的内积,然后将结果累加到寄存器和/或片上缓存上;In an alternative, the base processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result on the register and/or the on-chip buffer;

基础处理电路计算出结果后,可以将结果从数据输出接口传输出去;After the basic processing circuit calculates the result, the result can be transmitted from the data output interface;

在一种可选方案中,该计算结果可以是内积运算的最终结果或中间结果;In an alternative, the result of the calculation may be the final result or an intermediate result of the inner product operation;

具体地,如果该基础处理电路有直接与主处理电路相连接的输出接口则从该接口传输结果,如果没有,则向着能够直接向主处理电路输出的基础处理电路的方向输出结果(例如,图1b中,最下面一行基础处理电路将其输出结果直接输出给主处理电路,其他基础处理电路从竖向的输出接口向下传输运算结果)。Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output toward the basic processing circuits that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits transfer their operation results downward through the vertical output interfaces).

基础处理电路接收到来自其他基础处理电路的计算结果之后,将该数据传输给与其相连接的其他基础处理电路或者主处理电路;After receiving the calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or main processing circuits connected thereto;

向着能够直接向主处理电路输出的方向输出结果(例如,最下面一行基础处理电路将其输出结果直接输出给主处理电路,其他基础处理电路从竖向的输出接口向下传输运算结果);The result is output toward the direction that can output directly to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits transfer their operation results downward through the vertical output interfaces);

主处理电路接收到各个基础处理电路内积运算的结果,即可得到输出结果。The main processing circuit receives the result of the inner product operation of each basic processing circuit, and the output result is obtained.

如图4a所示,神经网络训练的步骤包括:As shown in Figure 4a, the steps of neural network training include:

一个(多层)神经网络中的各层依次执行正向运算;Each layer in a (multi-layer) neural network performs a forward operation in sequence;

按照相反的层的顺序依次执行反向运算得到权值梯度;The reverse operations are performed in the reverse layer order to obtain the weight gradients;

用计算得到的权值的梯度去更新正向运算的权值;The calculated weight gradients are used to update the weights used in the forward operation;

这就是神经网络的训练的依次迭代,整个训练过程需要重复执行(即多次迭代计算)这个过程多次。This is one iteration of the training of the neural network; the whole training process requires this process to be executed repeatedly (i.e., multiple iterations of calculation).
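The iteration described above (forward pass layer by layer, backward pass in reverse layer order, weight update from the weight gradients) can be sketched abstractly as follows. This is a toy Python model with a hypothetical one-weight layer and a squared-error loss, chosen only to make the loop runnable; it is not the device's instruction flow.

```python
class LinearLayer:
    """Toy 1-D layer y = w * x, used only to illustrate the training loop."""
    def __init__(self, w):
        self.weights = [w]
    def forward(self, x):
        return self.weights[0] * x
    def backward(self, inp, grad_out):
        # weight gradient = grad_out * input; gradient passed below = grad_out * w
        return [grad_out * inp], grad_out * self.weights[0]

def train_step(layers, x, target, lr):
    # Forward: each layer in order computes its output from its input.
    acts = [x]
    for layer in layers:
        acts.append(layer.forward(acts[-1]))
    # Backward in reverse layer order: each layer yields its weight gradient
    # and the gradient handed to the layer below.
    grad = acts[-1] - target                  # gradient of 0.5*(y - target)^2
    for layer, inp in zip(reversed(layers), reversed(acts[:-1])):
        wgrad, grad = layer.backward(inp, grad)
        # Update: apply the weight gradient to this layer's weights.
        layer.weights = [w - lr * g for w, g in zip(layer.weights, wgrad)]
    return acts[-1]

net = [LinearLayer(0.5), LinearLayer(0.5)]
for _ in range(200):                          # repeated iterations of training
    y = train_step(net, 1.0, 2.0, 0.1)
assert abs(y - 2.0) < 1e-3                    # the toy network fits the target
```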

参阅图1a所示的集成电路芯片装置,所述装置用于执行的神经网络的训练,该神经网络包含n层,所述n取值范围为大于等于2的整数,其特征在于,所述集成电路芯片装置包括:主处理电路以及多个基础处理电路;所述主处理电路包括第一映射电路,所述多个基础处理电路中至少一个电路包括第二映射电路,所述第一映射电路以及所述第二映射电路均用于执行神经网络运算中的各个数据的压缩处理;Referring to the integrated circuit chip device shown in FIG. 1a, the device is used to perform training of a neural network, the neural network comprising n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and both the first mapping circuit and the second mapping circuit are used to perform compression processing of the respective data in the neural network operation;

所述多个基础处理电路呈阵列分布;每个基础处理电路与相邻的其他基础处理电路连接,所述主处理电路连接第1行的n个基础处理电路、第m行的n个基础处理电路以及第1列的m个基础处理电路;The plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column;

所述集成电路芯片装置,用于接收训练指令,依据该训练指令确定第一层输入数据和第一层权值组数据,对第一层输入数据和第一层权值组数据执行神经网络的n层正向运算得到正向运算的第n输出结果;The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an n-th output result of the forward operation;

所述主处理电路,还用于依据所述第n输出结果得到第n输出结果梯度,依据所述训练指令获取第n层反向运算的第n反向运算指令以及所述第n反向运算指令所需的第n层输入数据以及第n层权值组数据;依据所述第n反向运算指令将所述第n输出结果梯度、第n层输入数据以及第n层权值组数据划分为竖向数据块和横向数据块;依据所述第n反向运算指令的运算控制确定启动第一映射电路对第一数据块进行处理,得到处理后的第一数据块;所述第一数据块包括所述横向数据块和/或所述竖向数据块;依据所述第n反向运算指令将处理后的第一数据块发送至与所述主处理电路相连的基础处理电路中的至少一个基础处理电路;The main processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result; obtain, according to the training instruction, an n-th reverse operation instruction of the n-th layer reverse operation, as well as the n-th layer input data and the n-th layer weight group data required by the n-th reverse operation instruction; divide the n-th output result gradient, the n-th layer input data and the n-th layer weight group data into vertical data blocks and horizontal data blocks according to the n-th reverse operation instruction; determine, according to the operation control of the n-th reverse operation instruction, to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the n-th reverse operation instruction;

所述多个基础处理电路,用于依据所述第n反向运算指令的运算控制确定是否启动第二映射电路对第二数据块进行处理,依据处理后的第二数据块以并行方式执行神经网络中的运算得到运算结果,并将该运算结果通过与所述主处理电路连接的基础处理电路传输给所述主处理电路;所述第二数据块为所述基础处理电路确定的接收所述主处理电路发送的数据块,所述第二数据块与所述处理后的第一数据块关联;The plurality of basic processing circuits are configured to determine, according to the operation control of the n-th reverse operation instruction, whether to start the second mapping circuit to process a second data block, perform the operations in the neural network in parallel according to the processed second data block to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is the data block that the basic processing circuit determines to receive from the main processing circuit, and the second data block is associated with the processed first data block;

The main processing circuit is further configured to process the operation result to obtain an nth-layer weight group gradient and an nth-layer input data gradient, and to update the nth-layer weight group data using the nth-layer weight group gradient;

The integrated circuit chip device is further configured to use the nth-layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer, perform the backward operations of the remaining n-1 layers to obtain the weight group gradients of those n-1 layers, and update the weight group data of each corresponding layer using its weight group gradient, the weight group data comprising at least two weights.

As shown in Fig. 4b, in a forward operation of a neural network provided by an embodiment of the present disclosure, each layer uses its own input data and weights to compute the corresponding output data according to the operation rule specified by the type of the layer.

The forward operation of a neural network (also called inference) is the process of handling the input data of each layer, layer by layer, and obtaining the output data after a certain amount of computation. It has the following characteristics:

Input of a layer:

the input of a layer may be the input data of the neural network;

the input of a layer may be the output of another layer;

the input of a layer may be the output of this layer at the previous time step (the case of a recurrent neural network);

a layer may obtain input from several of the above input sources simultaneously.

Output of a layer:

the output of a layer may serve as the output result of the neural network;

the output of a layer may be the input of another layer;

the output of a layer may be the input of this layer at the next time step (the case of a recurrent neural network);

the output of a layer may output results to several of the above output destinations.

Specifically, the types of operations of the layers in the neural network include, but are not limited to, the following:

a convolution layer (i.e., performing a convolution operation);

a fully connected layer (i.e., performing a fully connected operation);

a normalization (regularization) layer, including an LRN (Local Response Normalization) layer, a BN (Batch Normalization) layer, and other such types;

a pooling layer;

an activation layer, including but not limited to the following types: a Sigmoid layer, a ReLU layer, a PReLU layer, a LeakyReLU layer, and a Tanh layer.

The backward operation of each layer needs to perform two parts of computation: one part uses the output data gradient, which may be sparsely represented, and the input data, which may be sparsely represented, to compute the gradient of the weights (used to update the weights of this layer in the "weight update" step); the other part uses the output data gradient, which may be sparsely represented, and the weights, which may be sparsely represented, to compute the input data gradient (used as the output data gradient of the next layer in the backward operation so that that layer can perform its own backward operation).

The backward operation propagates gradients in the order opposite to the forward operation, starting from the last layer.
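The two parts of the backward computation described above can be made concrete with a purely illustrative software sketch for a fully connected layer y = x·W; the layer type, shapes, and variable names are chosen for illustration only and are not a description of the claimed circuits:

```python
import numpy as np

# Hypothetical fully connected layer: y = x @ W
x = np.array([[1.0, 2.0]])          # input data, shape (batch, in_dim)
W = np.array([[0.5, -1.0],
              [2.0, 0.0]])          # weight group data, shape (in_dim, out_dim)
grad_y = np.array([[1.0, 1.0]])     # output data gradient from the layer behind

# Part 1: weight gradient, from the input data and the output data gradient;
# this is what the "weight update" step consumes.
grad_W = x.T @ grad_y

# Part 2: input data gradient, from the weights and the output data gradient;
# this is passed backward as the previous layer's output data gradient.
grad_x = grad_y @ W.T
```

With the numbers above, grad_W collects the outer product of input and output gradient, while grad_x projects the output gradient back through the weights.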

In an alternative, the output data gradient obtained by the backward computation of a layer may come from:

the gradient returned by the final loss function (also called cost function) of the neural network;

the input data gradient of another layer;

the input data gradient of this layer at the previous time step (the case of a recurrent neural network);

a layer may obtain output data gradients from several of the above sources simultaneously.

After the backward operation of the neural network has been performed, the gradients of the weights of the layers have been computed. In this step, the first input buffer and the second input buffer of the device are used to store the weights of the layer and the gradients of those weights, respectively, and the weights are then updated in the operation unit using the weight gradients.

The operations mentioned above are all operations of a single layer of a neural network. For a multi-layer neural network, the implementation is as follows. In the forward operation, after the execution of one layer of the artificial neural network is completed, the operation instruction of the next layer takes the output data computed in the operation unit as the input data of the next layer (or performs certain operations on that output data before using it as the input data of the next layer), and at the same time replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of one layer of the artificial neural network is completed, the operation instruction of the next layer takes the input data gradient computed in the operation unit as the output data gradient of the next layer (or performs certain operations on that input data gradient before using it as the output data gradient of the next layer), and at the same time replaces the weights with the weights of the next layer. This is shown in Fig. 4c, in which the dashed arrows indicate the backward operation, the solid arrows indicate the forward operation, and the labels below each figure indicate the meaning of that figure.
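The layer-by-layer handover described above, in which each layer's output (or input data gradient) becomes the next layer's input (or output data gradient) while the weights are swapped accordingly, can be sketched as follows; this is an illustrative software model of the data flow only, not the claimed device, and the two-layer network is a hypothetical example:

```python
import numpy as np

def forward(layers, x):
    """Forward pass: each layer's output becomes the next layer's input,
    and the weights are replaced with the next layer's weights."""
    activations = [x]
    for W in layers:
        x = x @ W                  # this layer's operation
        activations.append(x)      # output feeds the next layer
    return activations

def backward(layers, activations, grad_out):
    """Backward pass: each layer's input data gradient becomes the
    previous layer's output data gradient."""
    grads_W = []
    for W, a in zip(reversed(layers), reversed(activations[:-1])):
        grads_W.append(a.T @ grad_out)   # weight gradient for this layer
        grad_out = grad_out @ W.T        # becomes next output data gradient
    return list(reversed(grads_W))

layers = [np.ones((2, 3)), np.ones((3, 1))]
acts = forward(layers, np.ones((1, 2)))
grads = backward(layers, acts, np.ones((1, 1)))
```

The backward loop walks the layers in the order opposite to the forward pass, matching the gradient propagation order described above.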

The following describes how the device shown in Fig. 1a performs a tensor-by-tensor operation. A tensor is the same as the data block described above, and may be any one of, or a combination of, a matrix, a vector, a three-dimensional data block, a four-dimensional data block, and a higher-dimensional data block. Figs. 2c and 2f show specific implementations of the matrix-times-vector and matrix-times-matrix operations, respectively.

Referring to Fig. 2f, which shows a matrix-times-matrix operation: when the forward operation indicated by the first operation instruction is a matrix-times-matrix operation, the input data is the first matrix of the matrix-times-matrix operation and the weights are the second matrix of the matrix-times-matrix operation.

Referring to Fig. 6a, the matrix-times-matrix operation is performed using the device shown in Fig. 1a.

The following describes the computation of the product of a matrix S of size M rows by L columns and a matrix P of size L rows by N columns (each row of the matrix S is as long as each column of the matrix P). The neural network computing device has K basic processing circuits:

Step S401b: the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit stores the received data in its on-chip cache and/or registers.

In an optional scheme, the data of the matrix S is processed data. Specifically, the main processing circuit starts the first mapping circuit to process the matrix S, obtaining the processed matrix S and a first identification (mask) matrix associated with the matrix S. Alternatively, the first mapping circuit of the main processing circuit processes the matrix S according to a pre-stored first mask matrix associated with the matrix S to obtain the processed matrix S. The control circuit then sends each row of data of the processed matrix S, together with the identification data associated with that row in the first mask matrix, to one or more of the K basic processing circuits. When the main processing circuit sends data to a basic processing circuit, it may specifically send only the data of the processed matrix S whose absolute value is greater than a preset threshold, or only the non-zero data, so as to reduce the amount of data transmitted.
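The processing performed by the first mapping circuit can be modelled as follows; this is an illustrative sketch only, and the threshold value, the zeroing of filtered elements, and the function names are assumptions made for illustration, not a description of the claimed hardware:

```python
import numpy as np

def first_mapping(S, threshold=0.0):
    """Produce a mask matrix marking the significant elements of S
    (absolute value above the threshold) and a processed S in which
    the remaining elements are zeroed."""
    mask = (np.abs(S) > threshold).astype(np.int8)  # first mask matrix
    return S * mask, mask

def rows_to_send(S_proc, mask):
    """For each row, keep only the retained values together with that
    row's mask bits, reducing the amount of data transmitted."""
    return [(S_proc[i][mask[i] == 1], mask[i]) for i in range(S_proc.shape[0])]

S = np.array([[0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])
S_proc, mask = first_mapping(S)
packets = rows_to_send(S_proc, mask)
```

Each packet pairs a row's compressed data with its mask bits, mirroring how a row of the processed matrix S and its associated identification data are sent together.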

In an alternative, if the number of rows M of S satisfies M <= K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M basic processing circuits; optionally, it also sends the identification data of the corresponding row of the first identification matrix.

In an alternative, if the number of rows M of S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the matrix S to each basic processing circuit. Optionally, it also sends the identification data of the corresponding row or rows of the first identification matrix.

Mi rows of S are distributed to the ith basic processing circuit; the set of these Mi rows is called Ai. Fig. 2e shows the computation to be performed on the ith basic processing circuit.

In an alternative, in each basic processing circuit, for example the ith basic processing circuit:

the matrix Ai received from the main processing circuit is stored in the registers and/or on-chip cache of the ith basic processing circuit; the advantage is that the amount of subsequent data transmission is reduced, computational efficiency is improved, and power consumption is lowered.

Step S402b: the control circuit of the main processing circuit transmits the parts of the matrix P to each basic processing circuit by broadcasting.

In an alternative, the data (the parts) of the matrix P may be processed data. Specifically, the main processing circuit starts the first mapping circuit to process the matrix P, obtaining the processed matrix P and a second identification (mask) matrix associated with the matrix P. Alternatively, the first mapping circuit of the main processing circuit processes the matrix P according to a pre-stored second mask matrix associated with the matrix P to obtain the processed matrix P. The control circuit then sends the data (i.e., the parts) of the processed matrix P, together with the identification data associated with that data in the second mask matrix, to one or more of the K basic processing circuits. When the main processing circuit sends data to a basic processing circuit, it may specifically send only the data of the processed matrix P whose absolute value is greater than a preset threshold, or only the non-zero data, so as to reduce the amount of data transmitted.

In an alternative, the parts of the matrix P may be broadcast only once to the registers or on-chip caches of the basic processing circuits, and the ith basic processing circuit fully reuses the data of the matrix P obtained in this single broadcast, completing the inner product operations corresponding to every row of the matrix Ai. "Reuse" in this embodiment specifically means repeated use by the basic processing circuit during computation; for example, reuse of the data of the matrix P may mean using the data of the matrix P multiple times.

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the matrix P to the registers or on-chip caches of the basic processing circuits multiple times, and the ith basic processing circuit does not reuse the data of the matrix P obtained in each broadcast, completing the inner product operations corresponding to the rows of the matrix Ai in separate batches.

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the matrix P to the registers or on-chip caches of the basic processing circuits multiple times, and the ith basic processing circuit partially reuses the data of the matrix P obtained in each broadcast, completing the inner product operations corresponding to every row of the matrix Ai.

In an alternative, each basic processing circuit, for example the ith basic processing circuit, computes the inner products of the data of the matrix Ai and the data of the matrix P.

Step S403b: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits them back to the main processing circuit.
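The overall data flow of steps S401b to S403b can be modelled in software as follows; this is an illustrative sketch with K simulated basic processing circuits, in which the round-robin row assignment is merely one possible distribution and the mask/mapping circuits are omitted for clarity:

```python
import numpy as np

def matmul_on_k_circuits(S, P, K):
    """Distribute the rows of S over K basic processing circuits
    (step S401b), broadcast P to all of them (step S402b), let each
    circuit compute the inner products for its rows, and gather the
    accumulated results back at the main processing circuit (step S403b)."""
    M = S.shape[0]
    # step S401b: circuit i holds the row set Ai (round-robin assignment)
    row_sets = [list(range(i, M, K)) for i in range(K)]
    result = np.zeros((M, P.shape[1]))
    for i in range(K):
        Ai = S[row_sets[i]]             # rows held by circuit i
        partial = Ai @ P                # inner products with the broadcast P
        result[row_sets[i]] = partial   # transmitted back to the main circuit
    return result

S = np.arange(6.0).reshape(3, 2)        # M=3 rows, L=2 columns
P = np.arange(4.0).reshape(2, 2)        # L=2 rows, N=2 columns
out = matmul_on_k_circuits(S, P, K=2)
```

Whatever the row distribution, the gathered result equals the ordinary product S·P; only the amount of data moved between the main and basic processing circuits changes.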

Optionally, before step S403b, the inner product operator of the basic processing circuit needs to compute the inner products of the data of the matrix S and the matrix P. Specifically, the following implementations exist.

In one implementation, the basic processing circuit receives the data of the processed matrix S together with the identification data associated with that data in the first mask matrix, and also receives the data of the processed matrix P. Accordingly, the basic processing circuit starts the second mapping circuit to process the received data of the matrix P according to the identification data of the received first mask matrix, obtaining the processed data of the matrix P. The basic processing circuit then starts the inner product operator circuit to perform an inner product operation on the received data of the processed matrix S and the processed data of the matrix P, obtaining the result of the inner product operation.

In one implementation, the basic processing circuit receives the data of the processed matrix P together with the identification data associated with that data in the second mask matrix, and also receives the data of the processed matrix S. Accordingly, the basic processing circuit starts the second mapping circuit to process the received data of the matrix S according to the identification data of the received second mask matrix, obtaining the processed data of the matrix S. The basic processing circuit then starts the inner product operator circuit to perform an inner product operation on the received data of the processed matrix P and the processed data of the matrix S, obtaining the result of the inner product operation.

In one implementation, the basic processing circuit receives the data of the processed matrix S together with the identification data associated with that data in the first mask matrix, and also receives the data of the processed matrix P together with the identification data associated with that data in the second mask matrix. Accordingly, the basic processing circuit starts the second mapping circuit to derive a relation identification matrix from the identification data of the received first mask matrix and the identification data of the second mask matrix, and then uses the identification data of the relation identification matrix to process the received data of the matrix S and the received data of the matrix P, obtaining the processed data of the matrix S and the processed data of the matrix P. The inner product operator circuit is then started to perform an inner product operation on the processed data of the matrix S and the processed data of the matrix P, obtaining the result of the inner product operation. For example, the ith basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, the matrix P, and the second identification matrix associated with the matrix P; it may then start the second mapping circuit to derive a relation identification matrix from Bi and the second identification matrix, use the relation identification matrix to process the matrix Ai and the matrix P, simultaneously or separately, obtaining the processed matrix Ai and the processed matrix P, and then start the inner product operator circuit to perform an inner product operation on the processed matrix Ai and the processed matrix P.
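One natural way to realize the relation identification data is an element-wise AND of the two masks, so that the inner product runs only over positions retained by both operands; the following sketch illustrates this, with the AND combination and the names used here being assumptions made for illustration only:

```python
import numpy as np

def masked_inner_product(row_s, mask_s, col_p, mask_p):
    """Combine the two mask vectors into a relation mask and compute the
    inner product only over the positions kept by both operands."""
    relation = mask_s & mask_p               # relation identification data
    kept = relation == 1
    return np.dot(row_s[kept], col_p[kept])  # inner product on retained data

row_s  = np.array([1.0, 0.0, 3.0, 4.0])      # one row of the processed S
mask_s = np.array([1, 0, 1, 1])              # its first-mask bits
col_p  = np.array([2.0, 5.0, 0.0, 1.0])      # one column of the processed P
mask_p = np.array([1, 1, 0, 1])              # its second-mask bits
r = masked_inner_product(row_s, mask_s, col_p, mask_p)
```

Because a position filtered out by either mask contributes zero to the product anyway, the masked inner product equals the full inner product while touching fewer elements.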

In an alternative, the basic processing circuit may transmit each partial sum obtained by an inner product operation back to the main processing circuit for accumulation there;

In an alternative, the partial sums obtained by the inner product operations performed by each basic processing circuit may instead be stored in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation has finished;

In an alternative, the partial sums obtained by the inner product operations performed by each basic processing circuit may, in some cases, be stored in the registers and/or on-chip cache of the basic processing circuit for accumulation there, and, in other cases, be transmitted to the main processing circuit for accumulation, being transmitted back to the main processing circuit after the accumulation has finished.
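The difference between these accumulation alternatives is where the partial sums are added up, trading local storage in the basic processing circuit against transmissions back to the main processing circuit; both orders yield the same result, as the following illustrative sketch shows:

```python
def accumulate_on_circuit(partials):
    """Alternative: the basic processing circuit accumulates its partial
    inner-product sums locally (register/on-chip cache) and transmits a
    single accumulated value back to the main processing circuit."""
    acc = 0.0
    for p in partials:
        acc += p        # local accumulation; one transmission at the end
    return acc

def accumulate_on_main(partials):
    """Alternative: every partial sum is transmitted back immediately and
    the main processing circuit performs the accumulation."""
    transmitted = list(partials)    # one transmission per partial sum
    return sum(transmitted)

parts = [1.5, 2.5, -1.0]
```

Local accumulation minimizes transmissions at the cost of an accumulator per circuit; accumulating at the main processing circuit needs no local accumulator but transmits each partial sum separately.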

Referring to Fig. 2c, which is a schematic diagram of a matrix-times-vector operation: when the forward operation indicated by the first operation instruction is a matrix-times-vector operation, the input data is the first matrix of the matrix-times-vector operation and the weights are the vector of the matrix-times-vector operation. Referring to Fig. 6b, which provides a method of implementing the matrix-times-vector operation, the method may specifically include:

Step S401: the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit stores the received distributed data in its on-chip cache and/or registers.

In an optional scheme, the data of the matrix S is processed data. Specifically, the main processing circuit starts the first mapping circuit to process the matrix S, obtaining the processed matrix S and a first identification (mask) matrix associated with the matrix S. Alternatively, the first mapping circuit of the main processing circuit processes the matrix S according to a pre-stored first mask matrix associated with the matrix S to obtain the processed matrix S. The control circuit then sends each row of data of the processed matrix S, together with the identification data associated with that row in the first mask matrix, to one or more of the K basic processing circuits. When the main processing circuit sends data to a basic processing circuit, it may specifically send only the data of the processed matrix S whose absolute value is greater than a preset threshold, or only the non-zero data, so as to reduce the amount of data transmitted. For example, the set of rows of the processed matrix S distributed to the ith basic processing circuit is Ai, comprising Mi rows; correspondingly, the identification matrix Bi associated with Ai is distributed at the same time, Bi being a part of the first mask matrix and comprising a number of rows greater than or equal to Mi.

In an alternative, if the number of rows M of the matrix S satisfies M <= K, the control circuit of the main processing circuit distributes one row of the matrix S to each of the K basic processing circuits; optionally, it also sends the identification data of the corresponding row of the first identification matrix.

In an alternative, if the number of rows M of the matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the matrix S to each basic processing circuit. Optionally, it also sends the identification data of the corresponding row or rows of the first identification matrix.

The set of rows of S distributed to the ith basic processing circuit is Ai, comprising Mi rows. Fig. 2c shows the computation to be performed on the ith basic processing circuit.

In an alternative, in each basic processing circuit, for example the ith basic processing circuit, the received distributed data, for example the matrix Ai, may be stored in the registers and/or on-chip cache of the ith basic processing circuit; the advantage is that the amount of subsequent transmission of the distributed data is reduced, computational efficiency is improved, and power consumption is lowered.

Step S402: the control circuit of the main processing circuit transmits the parts of the vector P to the K basic processing circuits by broadcasting.

In an alternative, the data (the parts) of the vector P may be processed data. Specifically, the main processing circuit starts the first mapping circuit to process the vector P, obtaining the processed vector P and a second identification (mask) matrix associated with the vector P. Alternatively, the first mapping circuit of the main processing circuit processes the vector P according to a pre-stored second mask matrix associated with the vector P to obtain the processed vector P. The control circuit then sends the data (i.e., the parts) of the processed vector P, together with the identification data associated with that data in the second mask matrix, to one or more of the K basic processing circuits. When the main processing circuit sends data to a basic processing circuit, it may specifically send only the data of the processed vector P whose absolute value is greater than a preset threshold, or only the non-zero data, so as to reduce the amount of data transmitted.

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the vector P only once to the registers or on-chip caches of the basic processing circuits, and the ith basic processing circuit fully reuses the data of the vector P obtained in this single broadcast, completing the inner product operations corresponding to every row of the matrix Ai. The advantage is that the amount of data transmission caused by repeatedly sending the vector P from the main processing circuit to the basic processing circuits is reduced, execution efficiency is improved, and transmission power consumption is lowered.

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the vector P to the registers or on-chip caches of the basic processing circuits multiple times, and the ith basic processing circuit does not reuse the data of the vector P obtained in each broadcast, completing the inner product operations corresponding to the rows of the matrix Ai in separate batches. The advantage is that the amount of vector P data moved in a single transfer inside a basic processing circuit is reduced, the capacity of the basic processing circuit's cache and/or registers can be reduced, execution efficiency is improved, transmission power consumption is lowered, and cost is reduced.

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the vector P to the registers or on-chip caches of the basic processing circuits multiple times, and the ith basic processing circuit partially reuses the data of the vector P obtained in each broadcast, completing the inner product operations corresponding to every row of the matrix Ai. The advantage is that the amount of data transmitted from the main processing circuit to the basic processing circuits is reduced, the amount of data moved inside the basic processing circuits is also reduced, execution efficiency is improved, and transmission power consumption is lowered.

Step S403: the inner product operator circuits of the K basic processing circuits compute the inner products of the data of the matrix S and the vector P; for example, the ith basic processing circuit computes the inner products of the data of the matrix Ai and the data of the vector P.
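Analogously to the matrix-times-matrix case, the data flow of steps S401 to S403 for matrix-times-vector can be modelled as follows; this is an illustrative sketch with K simulated basic processing circuits, with the round-robin row assignment chosen for illustration and the mask processing omitted for clarity:

```python
import numpy as np

def matvec_on_k_circuits(S, p, K):
    """Rows of S are distributed over K basic processing circuits
    (step S401); the vector p is broadcast to all of them (step S402);
    each circuit computes the inner products for its rows (step S403)."""
    M = S.shape[0]
    out = np.zeros(M)
    for i in range(K):
        rows = list(range(i, M, K))   # row set Ai held by circuit i
        out[rows] = S[rows] @ p       # inner product of each row of Ai with p
    return out

S = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
p = np.array([1.0, -1.0])
y = matvec_on_k_circuits(S, p, K=2)
```

As before, the gathered result equals the ordinary product S·p regardless of how the rows are split among the circuits.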

In one implementation, the basic processing circuit receives the data of the processed matrix S together with the identification data associated with that data in the first mask matrix, and also receives the data of the processed vector P. Accordingly, the basic processing circuit starts the second mapping circuit to process the received data of the vector P according to the identification data of the received first mask matrix, obtaining the processed data of the vector P. The basic processing circuit then starts the inner product operator circuit to perform an inner product operation on the received data of the processed matrix S and the processed data of the vector P, obtaining the result of the inner product operation. For example, the ith basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, and the vector P; it may then start the second mapping circuit to process the vector P using Bi, obtaining the processed vector P, and then start the inner product operator circuit to perform an inner product operation on the matrix Ai and the processed vector P.

In one implementation, the basic processing circuit receives the data of the processed vector P together with the identification data associated with that data in the second mask matrix, and also receives the data of the processed matrix S. Accordingly, the basic processing circuit starts the second mapping circuit to process the received data of the matrix S according to the identification data of the received second mask matrix, obtaining the processed data of the matrix S. The basic processing circuit then starts the inner product operator circuit to perform an inner product operation on the received data of the processed vector P and the processed data of the matrix S, obtaining the result of the inner product operation. For example, the ith basic processing circuit receives the matrix Ai, the processed vector P, and the second identification matrix associated with the vector P; it may then start the second mapping circuit to process Ai using the second identification matrix, obtaining the processed matrix Ai, and then start the inner product operator circuit to perform an inner product operation on the processed matrix Ai and the processed vector P.

In a specific implementation, the basic processing circuit receives the data of the processed matrix S together with the associated identification data from the first mask matrix, and also receives the data of the processed vector P together with the associated identification data from the second mask matrix. Accordingly, the basic processing circuit enables the second mapping circuit to derive a relation identification matrix from the identification data of the first mask matrix and the identification data of the second mask matrix, and then uses the identification data of the relation identification matrix to process the received data of the matrix S and of the vector P, obtaining the data of the processed matrix S and of the processed vector P. The inner-product operator circuit is then enabled to perform an inner-product operation on the data of the processed matrix S and the data of the processed vector P, obtaining the inner-product result. For example, the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with Ai, the vector P, and the second identification matrix associated with the vector P; it can then enable the second mapping circuit to derive the relation identification matrix from Bi and the second identification matrix, use that relation identification matrix to process the matrix Ai and the vector P simultaneously or separately to obtain the processed matrix Ai and the processed vector P, and finally enable the inner-product operator circuit to perform an inner-product operation on the processed matrix Ai and the processed vector P.
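As a plain-Python illustration of the mask scheme above (a sketch of the semantics, not the hardware implementation), identification data can be modeled as 0/1 lists: the relation mask is the elementwise AND of the two masks, and the inner product accumulates only the positions both masks mark as significant. All names below are illustrative.

```python
# Illustrative sketch: mask (identification) data marks which entries of a
# data block are significant (1) and which were pruned (0). The relation
# mask is the elementwise AND of the two masks; only positions significant
# in BOTH operands contribute to the inner product.

def relation_mask(mask_s_row, mask_p):
    """Elementwise AND of a row mask of S and the mask of P."""
    return [a & b for a, b in zip(mask_s_row, mask_p)]

def masked_inner_product(s_row, p, rel_mask):
    """Inner product over positions the relation mask marks significant."""
    return sum(s * x for s, x, m in zip(s_row, p, rel_mask) if m)

# Example: one row of matrix Ai, its mask Bi, vector P and P's mask.
s_row  = [2.0, 0.0, 3.0, 5.0]
mask_s = [1,   0,   1,   1]      # first identification (mask) data
p      = [4.0, 7.0, 0.0, 1.0]
mask_p = [1,   1,   0,   1]      # second identification (mask) data

rel = relation_mask(mask_s, mask_p)           # [1, 0, 0, 1]
result = masked_inner_product(s_row, p, rel)  # 2*4 + 5*1 = 13.0
```

The zero positions never enter the multiply-accumulate loop, which is the source of the computation and transfer savings the text describes.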

Step S404: the accumulator circuits of the K basic processing circuits accumulate the results of the inner-product operations to obtain accumulated results, and transmit the accumulated results back to the main processing circuit as fixed-point data.

In an optional scheme, the partial sum obtained each time a basic processing circuit performs an inner-product operation (a partial sum is a part of the accumulated result; for example, if the accumulated result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3) can be transmitted back to the main processing circuit for accumulation. The advantage is that the amount of computation inside the basic processing circuit is reduced, improving the operating efficiency of the basic processing circuit.

In an optional scheme, the partial sum obtained each time a basic processing circuit performs an inner-product operation can instead be kept in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation is finished. The advantage is that the amount of data transferred between the basic processing circuit and the main processing circuit is reduced, improving operating efficiency and lowering data-transfer power consumption.

In an optional scheme, the partial sums may in some cases be kept in the registers and/or on-chip cache of the basic processing circuit for accumulation, and in other cases transmitted to the main processing circuit for accumulation, with the result transmitted back to the main processing circuit after the accumulation is finished. This combines both advantages: the amount of data transferred between the basic processing circuit and the main processing circuit is reduced, operating efficiency is improved, data-transfer power consumption is lowered, the amount of computation inside the basic processing circuit is reduced, and the operating efficiency of the basic processing circuit is improved.
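The two placements of partial-sum accumulation described above can be sketched as follows. This is a plain-Python illustration rather than the circuit itself, and the chunk size is an arbitrary assumption; both placements produce the same final result.

```python
# Sketch of the two accumulation placements: partial sums streamed back to
# the main processing circuit per chunk, versus accumulated locally in the
# basic processing circuit and transferred once at the end.

def basic_circuit_inner_products(f, g, chunk):
    """Yield one partial sum per chunk of the inner product f . g."""
    for i in range(0, len(f), chunk):
        yield sum(a * b for a, b in zip(f[i:i + chunk], g[i:i + chunk]))

f = [1.0, 2.0, 3.0, 4.0, 5.0]
g = [5.0, 4.0, 3.0, 2.0, 1.0]

# Placement 1: each partial sum is sent to the main circuit and accumulated there.
main_accumulator = 0.0
for partial in basic_circuit_inner_products(f, g, chunk=2):
    main_accumulator += partial          # one transfer per partial sum

# Placement 2: accumulate locally (register/on-chip cache), transfer once.
local_accumulator = sum(basic_circuit_inner_products(f, g, chunk=2))

assert main_accumulator == local_accumulator == 35.0
```

Placement 1 minimizes local computation; placement 2 minimizes transfers, matching the trade-off stated in the text.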

Neural network training method

All or part of the data involved in the neural network training process may be processed data, obtained by the first mapping circuit and/or the second mapping circuit as described in the foregoing embodiments, which is not repeated here.

It should be noted that at different moments of the training process (specifically, different iteration counts or the initialization moment), in different stages of the training process (that is, the forward or backward operation), in different layers, for different data blocks within the same layer (that is, multiple input data blocks and output data blocks), or for sub-data blocks obtained by dividing the same data block into different parts, the data blocks concerned may all be processed data blocks.

A practical example is used below to illustrate a specific implementation of neural network training. Fig. 4c is a computation diagram of single-layer neural network training: the solid lines in Fig. 4c show the forward operation of the single-layer network, and the dashed lines show its backward operation. Specifically, the forward operation of the layer is first performed on the input data and the weights or parameters to obtain the output data, and a preset-rule operation is then applied to the output data (the preset rule may be set by the manufacturer as needed; its specific computation steps are not limited here) to obtain the output data gradient of the layer. Next, the backward operation of the layer can be performed on the input data, the weights or parameters, and the output data gradient of the layer to obtain the input data gradient and the gradients of the weights or parameters; the computed weight or parameter gradients can then be used to update the weights or parameters of the layer, completing the training of this layer of the network.
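The single-layer cycle of Fig. 4c (forward operation, preset-rule operation, backward operation, weight update) can be sketched as follows. Scalars stand in for the data blocks, and a squared-error rule is assumed for the preset rule, which the text deliberately leaves unspecified.

```python
# Minimal sketch of one single-layer training step: forward, output gradient
# via an assumed squared-error preset rule, backward, then weight update.

def train_single_layer(x, w, target, lr):
    y = x * w                     # forward operation: output data
    out_grad = y - target         # preset-rule operation -> output data gradient
    in_grad = out_grad * w        # backward: input data gradient
    w_grad = out_grad * x         # backward: weight (parameter) gradient
    w_new = w - lr * w_grad       # update the layer's weight with its gradient
    return w_new, in_grad

w = 0.5
for _ in range(100):
    w, _ = train_single_layer(x=2.0, w=w, target=6.0, lr=0.05)

# w converges toward 3.0, since 2.0 * 3.0 == 6.0
assert abs(w - 3.0) < 1e-3
```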

In a specific implementation, the data involved in the forward or backward operation may be processed data. Taking the forward operation as an example, the technical scheme provided by the embodiments of this application may determine, according to the backward operation instruction of the layer, whether to enable the relevant mapping circuits (specifically the first mapping circuit and/or the second mapping circuit) to process the input data and/or the weights, and then perform the layer's operation using the processed input data and/or weights. For the principles of this data processing, refer to the relevant descriptions in the foregoing embodiments, which are not repeated here. It should be understood that performing neural network operations on the processed data can greatly reduce the transfer overhead between the computing units; in addition, for the computing units, data with fewer bits occupies less storage space and requires less computation, so both the computation overhead and the storage overhead can be reduced.

Taking Fig. 7a and Fig. 7b as examples, structural diagrams of neural network training for matrix multiplication and for convolution are given below. The layer operation shown in Fig. 7a is matrix multiplication, and the layer operation shown in Fig. 7b is a convolution operation. Assume that both the input data and the weights of the layer are matrices; for convenience of description, take the matrix I as the input data and the matrix W as the weights, where output data = matrix I * matrix W. Suppose both the matrix I and the matrix W are sparse matrices of large dimensions; a sparse matrix here means a matrix containing many entries whose absolute value is less than or equal to a preset threshold, or many entries equal to 0. "Large dimensions" can be understood to mean that the sum of the numbers of columns and rows of the matrix I and the matrix W is large, so that the matrix I and the matrix W can be considered to occupy a large amount of memory and/or register space and to involve a large amount of computation. If conventional matrix multiplication were applied in this case, the amount of data computation would be large; to improve data-processing efficiency, the matrix I and the matrix W should first be processed, and the matrix multiplication performed afterwards.

For example, if the matrix I is a 1000*1000 sparse matrix and the matrix W is also a 1000*1000 sparse matrix, the sum of the numbers of columns and rows is 2000, which is large, and the corresponding amount of computation is even larger: the multiplications in the inner-product operations of the matrix-times-matrix computation number 10^9. For this technical scheme, since the matrix I and the matrix W are so large, it is impossible to transfer and compute all the data at once, so the same data may be transferred and computed multiple times. If the data-processing method is applied at this point to process the two sparse matrices, the dimensions (that is, the amount of data) of the matrix I and the matrix W can be reduced to a large extent, greatly reducing the amount of data transferred and computed, and thus reducing the transfer overhead and the computation overhead.
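A minimal sketch of the compression idea, assuming a simple threshold-based pruning (the actual mapping-circuit behavior may differ): after processing, only the surviving values plus a 0/1 mask need to be stored and transferred, instead of the full dense block.

```python
# Sketch: entries with |value| <= threshold are pruned; only the surviving
# values and a 0/1 identification mask travel to the basic processing circuits.

def compress(row, threshold=0.0):
    mask = [1 if abs(v) > threshold else 0 for v in row]
    values = [v for v, m in zip(row, mask) if m]
    return mask, values

row = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.3]
mask, values = compress(row, threshold=0.5)

assert mask == [0, 1, 0, 0, 1, 0, 0, 0]
assert values == [1.5, -2.0]   # 2 values transferred instead of 8
```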

Figs. 7c and 7d show specific structural diagrams of multi-layer neural network training. As shown in Fig. 7c, the direction of the dashed arrows indicates a backward operation. For the backward operation, its output is the output data gradient. When the output data gradient belongs to the last layer of the iterative computation of the multi-layer neural network, the output data gradient is obtained by applying a preset operation to the output data of the last layer of this iteration (the preset operation may be set by the manufacturer as needed; its specific computation steps are not limited here). If the output data gradient does not belong to the last layer of the iterative computation, for example it belongs to the n-th layer of this iteration, then the output data gradient of the n-th layer may be the input data gradient computed by the backward operation of the (n+1)-th layer. Fig. 7d can be understood similarly; Fig. 7d may specifically be a diagram of multi-layer convolutional neural network training (including the forward operation and the backward operation), and the other operations in the figure may represent layers other than the convolution layer, or operations between layers, which are not limited.

The present disclosure also provides an integrated circuit chip device for performing the training of a neural network, the neural network comprising multiple layers, the integrated circuit chip device comprising: a processing circuit and an external interface;

the external interface, configured to receive a training instruction;

the processing circuit, configured to determine the first-layer input data and the first-layer weight data according to the training instruction, and to perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight data to obtain an n-th output result;

the processing circuit, further configured to obtain an n-th output result gradient from the n-th output result; to obtain, according to the training instruction, an n-th backward operation instruction of the n-th-layer backward operation, together with the n-th-layer input data and the n-th-layer weight group data required by the n-th backward operation instruction; and to perform the n-layer backward operation of the neural network according to the n-th backward operation instruction, the n-th output result gradient, the n-th-layer input data and the n-th-layer weight group data, obtaining the n weight gradients of the n-layer operation;

the processing circuit, further configured to apply the n weight gradients to update the n weights of the n-layer operation.
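The duties of the processing circuit listed above amount to the following schematic training step: an n-layer forward pass, derivation of the output gradient, a layer-by-layer backward pass collecting one weight gradient per layer, and a weight update. The scalar "layers" and squared-error output gradient are assumptions purely for illustration.

```python
# Schematic n-layer training pass. Each layer is a plain scalar
# multiplication so that the forward/backward bookkeeping stays visible.

def train_step(x, weights, target, lr):
    # n-layer forward operation
    activations = [x]
    for w in weights:
        activations.append(activations[-1] * w)
    # n-th output result gradient (squared-error preset operation assumed)
    grad = activations[-1] - target
    # n-layer backward operation: layer i's output gradient is the input
    # data gradient computed by layer i+1
    weight_grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        weight_grads[i] = grad * activations[i]
        grad = grad * weights[i]          # input data gradient of layer i
    # apply the n weight gradients to update the n weights
    return [w - lr * g for w, g in zip(weights, weight_grads)]

weights = [0.8, 1.1]
for _ in range(200):
    weights = train_step(x=1.0, weights=weights, target=2.0, lr=0.1)

product = weights[0] * weights[1]
assert abs(product - 2.0) < 1e-3   # the 2-layer network learns to output 2.0
```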

The present disclosure also discloses a neural network computing device, which comprises one or more chips as shown in Fig. 8, configured to acquire data to be computed and control information from other processing devices and to perform the specified neural network operations, with the execution results passed to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces and servers. When more than one chip as shown in Fig. 8 is included, the chips can be linked and transfer data through a specific structure, for example interconnected over a PCIE bus, to support larger-scale neural network operations. In this case the chips may share a single control system or each have its own control system, and may share memory or each accelerator may have its own memory. In addition, any interconnect topology may be used. Optionally, the neural network computing device has high compatibility and can be connected to various types of servers through a PCIE interface.

A method of performing a bias-addition operation using the circuit device:

The vector operator circuit of the main processing circuit can implement the function of adding two vectors or two matrices;

The vector operator circuit of the main processing circuit can implement the function of adding a vector to every row, or to every column, of a matrix.

In an optional scheme, the matrix may come from the result of a matrix-times-matrix operation performed by the device;

In an optional scheme, the vector may come from the result of a matrix-times-vector operation performed by the device;

In an optional scheme, the matrix may come from data received externally by the main processing circuit of the device;

In an optional scheme, the vector may come from data received externally by the main processing circuit of the device;

The data sources include, but are not limited to, the above.
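The two bias-add patterns above can be sketched in plain Python (a semantic illustration of what the vector operator circuit implements, not the circuit itself):

```python
# Adding a vector to every row, or to every column, of a matrix.

def add_to_rows(matrix, vec):
    """Add vec to each row (vec length == number of columns)."""
    return [[m + v for m, v in zip(row, vec)] for row in matrix]

def add_to_cols(matrix, vec):
    """Add vec to each column (vec length == number of rows)."""
    return [[m + vec[i] for m in row] for i, row in enumerate(matrix)]

M = [[1, 2, 3],
     [4, 5, 6]]

assert add_to_rows(M, [10, 20, 30]) == [[11, 22, 33], [14, 25, 36]]
assert add_to_cols(M, [100, 200]) == [[101, 102, 103], [204, 205, 206]]
```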

A method of performing an activation function operation using the circuit device:

Using the activation circuit of the main processing circuit, a vector is input and the activation vector of that vector is computed;

In an optional scheme, the activation circuit of the main processing circuit passes each value of the input vector through an activation function (the input of the activation function is a number and its output is also a number) and computes a value output to the corresponding position of the output vector;

In an optional scheme, the activation function may be: y = max(m, x), where x is the input value, y is the output value, and m is a constant;

In an optional scheme, the activation function may be: y = tanh(x), where x is the input value and y is the output value;

In an optional scheme, the activation function may be: y = sigmoid(x), where x is the input value and y is the output value;

In an optional scheme, the activation function may be a piecewise linear function;

In an optional scheme, the activation function may be any function that takes a number as input and outputs a number.
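The listed activation options, applied elementwise to a vector as the activation circuit would, can be sketched as:

```python
import math

# Elementwise activation of a vector: each input value passes through an
# activation function whose input and output are single numbers.

def activate(vec, fn):
    """Pass each value of the input vector through the activation function."""
    return [fn(x) for x in vec]

relu_like = lambda x, m=0.0: max(m, x)          # y = max(m, x); ReLU when m = 0
sigmoid   = lambda x: 1.0 / (1.0 + math.exp(-x))

v = [-2.0, 0.0, 2.0]
assert activate(v, relu_like) == [0.0, 0.0, 2.0]
assert activate(v, math.tanh)[1] == 0.0
assert abs(activate(v, sigmoid)[1] - 0.5) < 1e-12
```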

In an optional scheme, the sources of the input vector include (but are not limited to):

an external data source of the device;

In an optional scheme, the input data comes from the result of a matrix-times-vector operation performed by the device;

In an optional scheme, the input data comes from the result of a matrix-times-matrix operation performed by the device;

a computation result of the main processing circuit of the device;

In an optional scheme, the input data comes from the computation result obtained after the main processing circuit of the device performs bias addition.

A method of implementing BLAS (Basic Linear Algebra Subprograms) using the device:

GEMM computation refers to the matrix-matrix multiplication operation in a BLAS library. The usual form of this operation is: C = alpha*op(S)*op(P) + beta*C, where S and P are the two input matrices, C is the output matrix, and alpha and beta are scalars; op represents some operation on the matrix S or P, and in addition some auxiliary integers serve as parameters describing the widths and heights of the matrices S and P;

The steps of implementing the GEMM computation using the device are:

Before performing the op operations, the main processing circuit may convert the data types of the input matrix S and the matrix P;

The conversion circuit of the main processing circuit performs the respective op operations on the input matrix S and the matrix P;

In an optional scheme, op may be the matrix transpose operation; the matrix transpose can be implemented using the vector operation function of the main processing circuit or its data-rearrangement function (as mentioned above, the main processing circuit has a data-rearrangement circuit). In practical applications, the op can also be implemented directly by the conversion circuit; for example, for the matrix transpose operation, the op is implemented directly by the matrix transpose circuit;

In an optional scheme, the op of a given matrix may be empty, in which case the op operation is not performed;

The matrix multiplication between op(S) and op(P) is completed using the matrix-times-matrix computation method;

The arithmetic logic circuit of the main processing circuit multiplies each value in the result of op(S)*op(P) by alpha;

In an optional scheme, when alpha is 1, the multiplication by alpha is not performed;

The arithmetic logic circuit of the main processing circuit implements the computation of beta*C;

In an optional scheme, when beta is 1, the multiplication by beta is not performed;

The arithmetic logic circuit of the main processing circuit implements the step of adding the corresponding positions of the matrices alpha*op(S)*op(P) and beta*C;

In an optional scheme, when beta is 0, the addition is not performed;
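The GEMM steps above (apply op, multiply, scale by alpha, scale C by beta, add corresponding positions) can be sketched in plain Python. This mirrors the semantics only, not the circuit partitioning; an empty op is modeled as `None`.

```python
# GEMM sketch: C = alpha * op(S) * op(P) + beta * C.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def gemm(alpha, s, p, beta, c, op_s=None, op_p=None):
    s = op_s(s) if op_s else s          # empty op: no operation performed
    p = op_p(p) if op_p else p
    prod = matmul(s, p)                 # op(S) * op(P)
    return [[alpha * prod[i][j] + beta * c[i][j]
             for j in range(len(prod[0]))] for i in range(len(prod))]

S = [[1, 2],
     [3, 4]]
P = [[5, 6],
     [7, 8]]
C = [[1, 1],
     [1, 1]]

# C = 2 * S * P + 1 * C
out = gemm(2, S, P, 1, C)
assert out == [[39, 45], [87, 101]]
```

The alpha == 1, beta == 1 and beta == 0 short-cuts described in the text simply skip the corresponding scaling or addition steps.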

GEMV computation refers to the matrix-vector multiplication operation in a BLAS library. The usual form of this operation is: C = alpha*op(S)*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op represents some operation on the matrix S;

The steps of implementing the GEMV computation using the device are:

Before performing the op operation, the main processing circuit may convert the data types of the input matrix S and the vector P;

The conversion circuit of the main processing circuit performs the corresponding op operation on the input matrix S;

In an optional scheme, op may be the matrix transpose operation, implemented by the matrix transpose circuit of the main processing circuit;

In an optional scheme, the op of a given matrix may be empty, in which case the op operation is not performed;

The matrix-vector multiplication between the matrix op(S) and the vector P is completed using the matrix-times-vector computation method;

The arithmetic logic circuit of the main processing circuit multiplies each value in the result of op(S)*P by alpha;

In an optional scheme, when alpha is 1, the multiplication by alpha is not performed;

The arithmetic logic circuit of the main processing circuit implements the computation of beta*C;

In an optional scheme, when beta is 1, the multiplication by beta is not performed;

The arithmetic logic circuit of the main processing circuit implements the step of adding the corresponding positions of alpha*op(S)*P and beta*C;

In an optional scheme, when beta is 0, the addition is not performed;
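The GEMV steps are analogous to the GEMM steps; a minimal plain-Python sketch of the semantics:

```python
# GEMV sketch: C = alpha * op(S) * P + beta * C, with S a matrix,
# P the input vector and C the output vector (op again optional).

def gemv(alpha, s, p, beta, c, op_s=None):
    s = op_s(s) if op_s else s                                  # empty op: skipped
    prod = [sum(x * y for x, y in zip(row, p)) for row in s]    # op(S) * P
    return [alpha * mv + beta * cv for mv, cv in zip(prod, c)]

S = [[1, 2],
     [3, 4]]
out = gemv(alpha=2, s=S, p=[1, 1], beta=0, c=[9, 9])  # beta == 0: C is ignored
assert out == [6, 14]
```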

A method of updating weights:

The vector operator circuit of the main processing circuit implements the weight-update function in the neural network training process; specifically, weight update refers to the method of updating the weights using the gradients of the weights.

In an optional scheme, the vector operator circuit of the main processing circuit performs addition and subtraction on the two vectors, the weights and the weight gradients, to obtain a computation result, and that result is the updated weights.

In an optional scheme, the vector operator circuit of the main processing circuit first multiplies or divides the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight-gradient values, and then performs addition and subtraction on the intermediate weights and intermediate weight-gradient values to obtain a computation result, and that result is the updated weights.

In an optional scheme, a set of momenta can first be computed using the gradients of the weights, and addition and subtraction then performed on the momenta and the weights to obtain the updated weights.
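The update variants above can be sketched as follows; the learning rate and momentum coefficient are illustrative assumptions, not values fixed by the text.

```python
# Weight-update sketches: a plain gradient step (add/subtract of the weight
# and weight-gradient vectors, optionally pre-scaled), and a momentum
# variant where a momentum term is formed from the gradient first.

def sgd_update(w, g, lr=1.0):
    """Scaled gradient step: subtract lr * gradient from the weights."""
    return [wi - lr * gi for wi, gi in zip(w, g)]

def momentum_update(w, g, velocity, lr=0.1, mu=0.9):
    """Form momenta from the gradients, then combine them with the weights."""
    velocity = [mu * v + gi for v, gi in zip(velocity, g)]
    w = [wi - lr * v for wi, v in zip(w, velocity)]
    return w, velocity

w = [1.0, 2.0]
g = [0.5, -0.5]
assert sgd_update(w, g, lr=1.0) == [0.5, 2.5]

vel = [0.0, 0.0]
w2, vel = momentum_update(w, g, vel, lr=0.1, mu=0.9)
assert vel == [0.5, -0.5]
assert abs(w2[0] - 0.95) < 1e-12 and abs(w2[1] - 2.05) < 1e-12
```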

A method of implementing the backward operation of a fully connected layer

The backward operation of the fully connected layer can be divided into two parts: in the figure, the solid arrows indicate the forward computation of the fully connected layer, and the dashed part indicates its backward computation.

As can be seen from the figure, the backward operation of the fully connected layer can be completed using the device's method of performing matrix multiplication;
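For a fully connected layer computed as Y = X * W, the backward pass reduces to the two matrix multiplications sketched below (a standard result; the figure referenced in the text is not reproduced here): the weight gradient is X^T * dY and the input gradient is dY * W^T.

```python
# Fully connected backward pass expressed as matrix multiplications.

def transpose(m):
    return [list(c) for c in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(r, c)) for c in bt] for r in a]

X  = [[1.0, 2.0]]          # batch of 1, 2 input features
W  = [[3.0], [4.0]]        # 2 x 1 weights, so Y = X * W is 1 x 1
dY = [[1.0]]               # gradient flowing back into the layer

dW = matmul(transpose(X), dY)   # weight gradient: X^T * dY
dX = matmul(dY, transpose(W))   # input gradient:  dY * W^T

assert dW == [[1.0], [2.0]]
assert dX == [[3.0, 4.0]]
```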

A method of implementing the backward operation of a convolution layer

The backward operation of the convolution layer can be divided into two parts: in Fig. 9a, the solid arrows indicate the forward computation of the convolution layer, and Fig. 9b shows the backward computation of the convolution layer.

For the backward operation of the convolution layer shown in Fig. 9a and Fig. 9b, the device shown in Fig. 1a or the device shown in Fig. 1b can be used to complete the backward operation of the convolution layer. Performing the forward or backward operation in fact involves multiple neural network operations, including but not limited to one of, or any combination of, matrix-times-matrix, matrix-times-vector, convolution and activation operations; the manner of these operations is as described elsewhere in this disclosure and is not repeated here.

An embodiment of the present disclosure provides a neural network processor board card, which can be used in numerous general-purpose or dedicated computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, smart homes, home appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronics, network personal computers (PCs), minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.

请参照图10a,图10a为本披露实施例提供的一种神经网络处理器板卡的结构示意图。如图10a所示,上述神经网络处理器板卡10包括神经网络芯片封装结构11、第一电气及非电气连接装置12和第一基板(substrate)13。Please refer to FIG. 10a. FIG. 10a is a schematic structural diagram of a neural network processor card according to an embodiment of the present disclosure. As shown in FIG. 10a, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.

本披露对于神经网络芯片封装结构11的具体结构不作限定,可选的,如图10b所示,上述神经网络芯片封装结构11包括:神经网络芯片111、第二电气及非电气连接装置112、第二基板113。The disclosure is not limited to the specific structure of the neural network chip package structure 11. Alternatively, as shown in FIG. 10b, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a first Two substrates 113.

本披露所涉及的神经网络芯片111的具体形式不作限定,上述的神经网络芯片111包含但不限于将神经网络处理器集成的神经网络晶片,上述晶片可以由硅材料、锗材料、量子材料或分子材料等制成。根据实际情况(例如:较严苛的环境)和不同的应用需求可将上述神经网络晶片进行封装,以使神经网络晶片的大部分被包裹住,而将神经网络晶片上的引脚通过金线等导体连到封装结构的外边,用于和更外层进行电路连接。The specific form of the neural network chip 111 involved in the disclosure is not limited. The above neural network chip 111 includes, but is not limited to, a neural network chip integrated with a neural network processor, and the above silicon wafer may be made of silicon material, germanium material, quantum material or molecule. Made of materials, etc. The neural network wafer can be packaged according to actual conditions (for example, a more severe environment) and different application requirements, so that most of the neural network wafer is wrapped, and the pins on the neural network wafer are passed through the gold wire. The conductors are connected to the outside of the package structure for electrical connection to the outer layer.

本披露对于神经网络芯片111的具体结构不作限定,可选的,请参照图1a或图1b所示的装置。The specific structure of the neural network chip 111 is not limited in this disclosure. Alternatively, please refer to the device shown in FIG. 1a or 1b.

本披露对于第一基板13和第二基板113的类型不做限定,可以是印制电路板(printed circuit board,PCB)或(printed wiring board,PWB),还可能为其它电路板。对PCB的制作材料也不做限定。The present disclosure is not limited to the types of the first substrate 13 and the second substrate 113, and may be a printed circuit board (PCB) or a printed wiring board (PWB), and may be other circuit boards. There are no restrictions on the materials used to make the PCB.

本披露所涉及的第二基板113用于承载上述神经网络芯片111,通过第二电气及非电气连接装置112将上述的神经网络芯片111和第二基板113进行连接得到的神经网络芯片封装结构11,用于保护神经网络芯片111,便于将神经网络芯片封装结构11与第一基板13进行进一步封装。The second substrate 113 of the present disclosure is used to carry the neural network chip 111, and the neural network chip package structure 11 is obtained by connecting the above-mentioned neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112. The neural network chip 111 is protected to further encapsulate the neural network chip package structure 11 and the first substrate 13.

The specific packaging method of the second electrical and non-electrical connection device 112 and the structure corresponding to that packaging method are not limited; a suitable packaging method can be selected and simply improved according to actual conditions and different application requirements, for example: Flip Chip Ball Grid Array Package (FCBGAP), Low-profile Quad Flat Package (LQFP), Quad Flat Package with Heat sink (HQFP), Quad Flat Non-lead Package (QFN), or Fine-pitch Ball Grid Array (FBGA).

Flip Chip is suitable for cases with demanding requirements on the packaged area, or with sensitivity to the inductance of the leads and the signal transmission time. Alternatively, Wire Bonding can be used, reducing cost and increasing the flexibility of the package structure.

Ball Grid Array can provide more pins, and the average lead length of the pins is short, enabling high-speed signal transmission; the package can also be replaced by a Pin Grid Array (PGA), Zero Insertion Force (ZIF) socket, Single Edge Contact Connection (SECC), Land Grid Array (LGA), or the like.

Optionally, the neural network chip 111 and the second substrate 113 are packaged using a Flip Chip Ball Grid Array; a schematic diagram of a specific neural network chip package structure is shown in FIG. 11a. As shown in FIG. 11a, the neural network chip package structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.

The pads 22 are connected to the neural network chip 21, and solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, connecting the neural network chip 21 to the second substrate 24 and thereby realizing the packaging of the neural network chip 21.

The pins 26 connect to circuitry external to the package structure (for example, the first substrate 13 of the neural network processor board 10) and enable the transmission of external and internal data, facilitating the processing of data by the neural network chip 21 or by the neural network processor corresponding to the neural network chip 21. The type and number of pins are not limited in this disclosure; different pin forms can be selected according to the packaging technology and arranged according to certain rules.

Optionally, the neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23, and the connection points 25 to prevent interference between solder balls. The insulating filler may be made of silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.

Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating the heat generated by the neural network chip 21 during operation. The heat dissipation device may be a piece of metal with good thermal conductivity, a heat sink, or a cooler such as a fan.

For example, as shown in FIG. 11b, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal-housing heat sink 29. The thermal grease 28 and the metal-housing heat sink 29 dissipate the heat generated by the neural network chip 21 during operation.

Optionally, the neural network chip package structure 11 further includes a reinforcing structure that is connected to the pads 22 and embedded in the solder balls 23 to enhance the connection strength between the solder balls 23 and the pads 22. The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.

The specific form of the first electrical and non-electrical device 12 is likewise not limited in this disclosure; refer to the description of the second electrical and non-electrical device 112. That is, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or by plugging and unplugging, which facilitates subsequent replacement of the first substrate 13 or the neural network chip package structure 11.

Optionally, the first substrate 13 includes interfaces for memory units that expand the storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR), and the like; expanding the memory improves the processing capability of the neural network processor.

The first substrate 13 may further include a Peripheral Component Interconnect-Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, and the like, used for data transmission between the package structure and external circuits, which can improve the operation speed and the convenience of operation.

The neural network processor is packaged as the neural network chip 111, the neural network chip 111 is packaged as the neural network chip package structure 11, and the neural network chip package structure 11 is packaged as the neural network processor board 10, which exchanges data with external circuits (for example, a computer motherboard) through an interface (slot or ferrule) on the board. That is, the function of the neural network processor is implemented directly by using the neural network processor board 10, which also protects the neural network chip 111. Moreover, other modules can be added to the neural network processor board 10, extending the range of application and improving the computational efficiency of the neural network processor.

In one embodiment, the present disclosure discloses an electronic device that includes the neural network processor board 10 or the neural network chip package structure 11 described above.

Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.

The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.

The specific embodiments described above further illustrate the purpose, technical solutions, and beneficial effects of this disclosure in detail. It should be understood that the above are merely specific embodiments of this disclosure and are not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of this disclosure shall fall within its scope of protection.

Claims (23)

1. An integrated circuit chip device, characterized in that the integrated circuit chip device comprises: a main processing circuit and a plurality of basic processing circuits; the main processing circuit comprises a first mapping circuit, at least one of the plurality of basic processing circuits comprises a second mapping circuit, and the first mapping circuit and the second mapping circuit are each configured to perform compression processing of data in a neural network operation;

the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;

the main processing circuit is configured to perform the successive operations in the neural network operation and to transmit data to the basic processing circuits connected to it;

the plurality of basic processing circuits are configured to perform the operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.
2. The integrated circuit chip device according to claim 1, characterized in that:

the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction, the horizontal data block being a data block distributed in the horizontal direction to the basic processing circuits connected to the main processing circuit, and the vertical data block being a data block distributed in the vertical direction to the basic processing circuits connected to the main processing circuit; to activate the first mapping circuit to process the horizontal data block and the vertical data block to obtain a processed horizontal data block with its associated identification data block and a processed vertical data block with its associated identification data block; to split the processed horizontal data block and its associated identification data block to obtain a plurality of basic data blocks and the identification data block associated with each basic data block; to distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to it; and to broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to it;

the basic processing circuit is configured to activate the second mapping circuit to obtain a connection identification data block from the identification data block associated with the vertical data block and the identification data block associated with the basic data block; to process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block; to perform the operation indicated by the operation instruction on the processed vertical data block and basic data block to obtain an operation result; and to send the operation result to the main processing circuit;

the main processing circuit is configured to process the operation result to obtain the instruction result of the data block to be calculated and the operation instruction.
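Outside the claim language itself, the distribute-and-broadcast dataflow of this claim can be illustrated with a minimal Python sketch. All function names, the row-wise split granularity, and the round-robin distribution scheme are illustrative assumptions, and the mask-based compression of the mapping circuits is omitted here to isolate the dataflow: the horizontal data block is split into basic data blocks distributed over the basic processing circuits, the vertical data block is broadcast to all of them, and the main processing circuit rearranges the returned inner-product results into the instruction result.

```python
import numpy as np

def main_circuit_matmul(weights, inputs, n_circuits=4):
    """Emulate the claimed dataflow for a matrix-vector multiply:
    the horizontal block (weights) is split row-wise into basic data
    blocks and distributed round-robin over the basic processing
    circuits; the vertical block (inputs) is broadcast to all of them."""
    # split: one basic data block per weight row (illustrative granularity)
    basic_blocks = list(weights)
    # distribute: circuit k receives blocks k, k+n, k+2n, ...
    per_circuit = [basic_blocks[k::n_circuits] for k in range(n_circuits)]
    # each basic processing circuit computes its inner products "in parallel"
    partials = {k: [float(np.dot(b, inputs)) for b in blocks]
                for k, blocks in enumerate(per_circuit)}
    # main processing circuit rearranges the results into the instruction result
    out = np.empty(len(basic_blocks))
    for k, vals in partials.items():
        out[k::n_circuits] = vals
    return out

W = np.arange(12.0).reshape(6, 2)
x = np.array([1.0, 2.0])
print(main_circuit_matmul(W, x))   # same values as the dense product W @ x
```

The round-robin assignment stands in for the physical array topology; any partition that covers all basic data blocks would produce the same instruction result.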
3. The integrated circuit chip device according to claim 1, characterized in that:

the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; to activate the first mapping circuit to process the horizontal data block to obtain a processed horizontal data block and its associated identification data block, or to activate the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block to obtain a processed horizontal data block; to split the processed horizontal data block and its associated identification data block to obtain a plurality of basic data blocks and the identification data block associated with each basic data block; to distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to it; and to broadcast the vertical data block to the basic processing circuits connected to it;

the basic processing circuit is configured to activate the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block to obtain a processed vertical data block; to perform the operation indicated by the operation instruction on the processed vertical data block and the processed basic data block to obtain an operation result; and to send the operation result to the main processing circuit;

the main processing circuit is configured to process the operation result to obtain the instruction result of the data block to be calculated and the operation instruction.
4. The integrated circuit chip device according to claim 1, characterized in that:

the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a horizontal data block and a vertical data block according to the operation instruction; to activate the first mapping circuit to process the vertical data block to obtain a processed vertical data block and its associated identification data block, or to activate the first mapping circuit to process the vertical data block according to a pre-stored identification data block associated with the vertical data block to obtain a processed vertical data block; to split the horizontal data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the basic processing circuits connected to it; and to broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to it;

the basic processing circuit is configured to activate the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block; to perform the operation indicated by the operation instruction on the processed vertical data block and the processed basic data block to obtain an operation result; and to send the operation result to the main processing circuit;

the main processing circuit is configured to process the operation result to obtain the instruction result of the data block to be calculated and the operation instruction.

5. The integrated circuit chip device according to any one of claims 1-4, characterized in that the data block to be calculated comprises at least one weight and/or at least one input neuron.

6. The integrated circuit chip device according to claim 5, characterized in that the identification data block is a matrix data block composed of 0s and 1s, where 0 indicates that the absolute value of the weight or of the input neuron is less than or equal to a first threshold, and 1 indicates that the absolute value of the weight or of the input neuron is greater than the first threshold.

7. The integrated circuit chip device according to claim 6, characterized in that the connection identification data block is obtained by an element-wise AND operation on the identification data block associated with the vertical data block and the identification data block associated with the basic data block.
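The identification and connection data blocks defined in claims 6 and 7 amount to threshold masks combined with an element-wise AND. A minimal sketch follows (not part of the claims; the threshold value and the function names are illustrative assumptions, since the claims leave the first threshold unspecified):

```python
import numpy as np

FIRST_THRESHOLD = 0.5  # illustrative value for the "first threshold"

def identification_block(data):
    """Per claim 6: a 0/1 matrix, 1 where |value| > threshold, 0 otherwise."""
    return (np.abs(data) > FIRST_THRESHOLD).astype(np.uint8)

def connection_block(mask_a, mask_b):
    """Per claim 7: element-wise AND of the two identification data blocks."""
    return mask_a & mask_b

weights = np.array([[0.9, 0.1], [-0.7, 0.4]])
neurons = np.array([[0.6, 0.0], [0.0, 2.0]])
wm, nm = identification_block(weights), identification_block(neurons)
conn = connection_block(wm, nm)
print(conn)  # 1 only where BOTH operands exceed the threshold in magnitude
```

Positions where the connection mask is 0 contribute nothing to an inner product, which is what allows the mapping circuits to skip them during compression.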
8. The integrated circuit chip device according to any one of claims 2-4, characterized in that:

the basic processing circuit is specifically configured to perform inner product processing on the basic data block and the vertical data block to obtain an inner product result, to accumulate the inner product result to obtain an operation result, and to send the operation result to the main processing circuit;

the main processing circuit is configured, when the operation result is the result of inner product processing, to accumulate the operation results to obtain an accumulated result, and to arrange the accumulated results to obtain the instruction result of the data block to be calculated and the operation instruction.

9. The integrated circuit chip device according to any one of claims 2-4, characterized in that:

the main processing circuit is specifically configured to divide the vertical data block into a plurality of partial vertical data blocks and to broadcast the plurality of partial vertical data blocks to the basic processing circuits over multiple broadcasts, the plurality of partial vertical data blocks combining to form the vertical data block;

the basic processing circuit is specifically configured to activate the second mapping circuit to process the partial vertical data block according to the identification data block associated with the basic data block to obtain a processed partial vertical data block, and to perform an inner product operation on the basic data block and the processed partial vertical data block.
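The multi-round broadcast and accumulation of claims 8 and 9 can be sketched as follows. The names and the part length are illustrative, and the mask-based processing is omitted to isolate the accumulation behaviour: the vertical data block arrives in parts over several broadcast rounds, and the basic processing circuit accumulates the per-round inner products into a single operation result.

```python
import numpy as np

def broadcast_in_parts(vertical, part_len):
    """Main circuit side: split the vertical data block into partial
    vertical data blocks and yield them over several broadcast rounds."""
    for start in range(0, len(vertical), part_len):
        yield start, vertical[start:start + part_len]

def basic_circuit(basic_block, vertical, part_len):
    """Basic circuit side: accumulate the per-round inner-product
    results into one operation result before returning it."""
    acc = 0.0
    for start, part in broadcast_in_parts(vertical, part_len):
        acc += float(np.dot(basic_block[start:start + len(part)], part))
    return acc

basic = np.array([1.0, 2.0, 3.0, 4.0])
vert = np.array([5.0, 6.0, 7.0, 8.0])
print(basic_circuit(basic, vert, part_len=2))  # 70.0, same as the full dot product
```

Because addition is associative, the part length only changes how many broadcast rounds occur, not the accumulated result.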
10. The integrated circuit chip device according to any one of claims 2-4, characterized in that:

the main processing circuit is specifically configured to divide the processed vertical data block and its associated identification data block into a plurality of partial vertical data blocks and the identification data blocks associated with the partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks and their associated identification data blocks to the basic processing circuits over multiple broadcasts, the plurality of partial vertical data blocks combining to form the vertical data block;

the basic processing circuit is specifically configured to activate the second mapping circuit to obtain a connection identification data block from the identification data block associated with the basic data block and the identification data block associated with the partial vertical data block; to process the basic data block and the partial vertical data block according to the connection identification data block to obtain a processed basic data block and processed partial broadcast data; and to perform an inner product operation on the processed basic data block and the processed partial vertical data block;

or, the basic processing circuit is specifically configured to activate the second mapping circuit to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and to perform the operation indicated by the operation instruction on the processed basic data block and the partial vertical data block.

11. The integrated circuit chip device according to any one of claims 2-4, characterized in that:

the basic processing circuit is specifically configured to perform one inner product processing on the partial vertical data block and the basic data block to obtain an inner product result, to accumulate the inner product result to obtain a partial operation result, and to send the partial operation result to the main processing circuit; or,

the basic processing circuit is specifically configured to reuse the partial vertical data block n times, performing the operations of the partial vertical data block with n basic data blocks to obtain n partial processing results, to accumulate the n partial processing results respectively to obtain n partial operation results, and to send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.

12. The integrated circuit chip device according to any one of claims 2-4, characterized in that the operation instruction is a convolution instruction,
the main processing circuit being configured to acquire an input data block, a convolution kernel data block, and the convolution instruction; to divide the input data block into vertical data blocks and the convolution kernel data block into horizontal data blocks according to the convolution instruction; to determine, according to the operation control of the convolution instruction, to activate the first mapping circuit to process a first data block to obtain a processed first data block, the first data block comprising the horizontal data block and/or the vertical data block; and to send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the convolution instruction;

the plurality of basic processing circuits being configured to determine, according to the operation control of the convolution instruction, whether to activate the second mapping circuit to process a second data block; to perform the operations in the neural network in parallel according to the processed second data block to obtain an operation result; and to transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit, the second data block being a data block that the basic processing circuit determines to receive from the main processing circuit and being associated with the processed first data block;

the main processing circuit being configured to process the operation result to obtain the instruction result of the convolution instruction.

13. The integrated circuit chip device according to any one of claims 2-4, characterized in that the operation instruction is a multiplication instruction,

the main processing circuit being configured to acquire an input data block, a weight data block, and the multiplication instruction; to divide the input data block into horizontal data blocks and the weight data block into vertical data blocks according to the multiplication instruction; to determine, according to the operation control of the multiplication instruction, to activate the first mapping circuit to process a first data block to obtain a processed first data block, the first data block comprising the horizontal data block and/or the vertical data block; and to send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the multiplication instruction;

the plurality of basic processing circuits being configured to determine, according to the operation control of the multiplication instruction, whether to activate the second mapping circuit to process a second data block; to perform the operations in the neural network in parallel according to the processed second data block to obtain an operation result; and to transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit, the second data block being a data block that the basic processing circuit determines to receive from the main processing circuit and being associated with the processed first data block;

the main processing circuit being configured to process the operation result to obtain the instruction result of the multiplication instruction.

14. The integrated circuit chip device according to any one of claims 2-4, characterized in that the operation instruction is a forward operation instruction,
the main processing circuit being configured to receive the forward operation instruction and to parse it to obtain the first operation instruction contained in the i-th layer of the neural network forward operation, together with the input data block and weight data block required by the first operation instruction, where i is an integer greater than or equal to 1 and less than or equal to n, and where, when i is greater than or equal to 2, the input data block is the output data block of the (i-1)-th layer;

the main processing circuit being further configured to divide the input data block into vertical data blocks and the weight data block into horizontal data blocks according to the first operation instruction; to determine, according to the operation control of the first operation instruction, whether to activate the first mapping circuit to process a first data block to obtain a processed first data block, the first data block comprising the horizontal data block and/or the vertical data block; and to send the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the forward operation instruction;

the plurality of basic processing circuits being configured to determine, according to the operation control of the first operation instruction, whether to activate the second mapping circuit to process a second data block; to perform the operations in the neural network in parallel according to the processed second data block to obtain an operation result; and to transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit, the second data block being a data block that the basic processing circuit determines to receive from the main processing circuit and being associated with the processed first data block;

the main processing circuit being configured to process the operation result to obtain the instruction result of the first operation instruction, completing the operation of the first operation instruction contained in the i-th layer.

15. The integrated circuit chip device according to any one of claims 2-4, characterized in that the operation instruction is a training instruction,

the integrated circuit chip device being configured to receive the training instruction, to determine first-layer input data and first-layer weight group data according to the training instruction, and to perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation;
The n-th layer forward operation obtains the nth output result of the forward operation; 所述主处理电路,还用于依据所述第n输出结果得到第n输出结果梯度,依据所述训练指令获取第n层反向运算的第n反向运算指令以及所述第n反向运算指令所需的第n层输入数据以及第n层权值组数据;依据所述第n反向运算指令将所述第n输出结果梯度、第n层输入数据以及第n层权值组数据划分为竖向数据块和横向数据块;依据所述第n反向运算指令的运算控制确定启动第一映射电路对第一数据块进行处理,得到处理后的第一数据块;所述第一数据块包括所述横向数据块和/或所述竖向数据块;依据所述第n反向运算指令将处理后的第一数据块发送至与所述主处理电路相连的基础处理电路中的至 少一个基础处理电路;The main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and obtain an nth reverse operation nth inverse operation instruction and the nth inverse operation according to the training instruction The nth layer input data and the nth layer weight group data required by the instruction; dividing the nth output result gradient, the nth layer input data, and the nth layer weight group data according to the nth reverse operation instruction a vertical data block and a horizontal data block; determining, according to the operation control of the nth reverse operation instruction, starting the first mapping circuit to process the first data block, and obtaining the processed first data block; the first data The block includes the horizontal data block and/or the vertical data block; transmitting the processed first data block to at least one of the basic processing circuits connected to the main processing circuit according to the nth reverse operation instruction a basic processing circuit; 所述多个基础处理电路,用于依据所述第n反向运算指令的运算控制确定是否启动第二映射电路对第二数据块进行处理,依据处理后的第二数据块以并行方式执行神经网络中的运算得到运算结果,并将该运算结果通过与所述主处理电路连接的基础处理电路传输给所述主处理电路;所述第二数据块为所述基础处理电路确定的接收所述主处理电路发送的数据块,所述第二数据块与所述处理后的第一数据块关联;The plurality of basic processing circuits are configured to determine, according to the operation control of the nth reverse operation instruction, whether to start the second mapping circuit to process the second data block, and execute the nerve in parallel according to the processed second data block An operation in the network obtains an operation result, and transmits the 
operation result to the main processing circuit through a basic processing circuit connected to the main processing circuit; the second data block is determined by the basic processing circuit to receive the a data block sent by the main processing circuit, the second data block being associated with the processed first data block; 所述主处理电路,还用于对该运算结果进行处理得到第n层权值组梯度和第n层输入数据梯度,应用所述第n层权值组梯度对第n层权值组数据进行更新;The main processing circuit is further configured to process the operation result to obtain an nth layer weight group gradient and an nth layer input data gradient, and apply the nth layer weight group gradient to the nth layer weight group data. Update 所述集成电路芯片装置,还用于将第n层输入数据梯度作为第n-1层的第n-1输出结果梯度执行n-1层反向运算得到n-1层权值组梯度,应用n-1层权值组梯度更新对应层的权值组数据,所述权值组数据包括至少二个权值。The integrated circuit chip device is further configured to perform an n-1 layer inverse operation on the nth layer input data gradient as an n-1th output gradient of the n-1th layer to obtain an n-1 layer weight group gradient, and apply The n-1 layer weight group gradient updates the weight group data of the corresponding layer, and the weight group data includes at least two weights. 
The integrated circuit chip device according to any one of claims 12-15, wherein, when the first data block comprises a horizontal data block and a vertical data block,

the main processing circuit is specifically configured to activate the first mapping circuit to process the horizontal data block and the vertical data block, obtaining a processed horizontal data block with its associated identification data block and a processed vertical data block with its associated identification data block; split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and the identification data blocks associated with the respective basic data blocks; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to the main processing circuit;

the basic processing circuit is specifically configured to activate the second mapping circuit to obtain a connection identification data block from the identification data block associated with the vertical data block and the identification data block associated with the basic data block; process the vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block; perform the operation indicated by the operation instruction on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.

The integrated circuit chip device according to any one of claims 12-15, wherein, when the first data block comprises a horizontal data block,

the main processing circuit is specifically configured to activate the first mapping circuit to process the horizontal data block, obtaining a processed horizontal data block and its associated identification data block, or to activate the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block, obtaining a processed horizontal data block; split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and the identification data blocks associated with the respective basic data blocks; distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the vertical data block to the basic processing circuits connected to the main processing circuit;

the basic processing circuit is specifically configured to activate the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, obtaining a processed vertical data block; perform the operation indicated by the operation instruction on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.
The integrated circuit chip device according to any one of claims 12-15, wherein, when the first data block comprises a vertical data block,

the main processing circuit is specifically configured to activate the first mapping circuit to process the vertical data block, obtaining a processed vertical data block and its associated identification data block, or to activate the first mapping circuit to process the vertical data block according to a pre-stored identification data block associated with the vertical data block, obtaining a processed vertical data block; split the horizontal data block into a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected to the main processing circuit; and broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to the main processing circuit;

the basic processing circuit is specifically configured to activate the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block, obtaining a processed basic data block; perform the operation indicated by the operation instruction on the processed vertical data block and the processed basic data block to obtain an operation result; and send the operation result to the main processing circuit.
A neural network processor board card, wherein the neural network processor board card comprises: a neural network chip package structure, a first electrical and non-electrical connection device, and a first substrate; the neural network chip package structure comprises a neural network chip, a second electrical and non-electrical connection device, and a second substrate, the second substrate carrying the neural network chip and being connected to the neural network chip through the second electrical and non-electrical connection device;

the neural network chip comprises the integrated circuit chip device according to any one of claims 1-18.

The neural network processor board card according to claim 19, wherein the neural network chip package structure further comprises a heat dissipation device.

The neural network processor board card according to claim 19, wherein the package structure of the neural network chip package structure is any one of the following: a flip-chip ball grid array package, a thin quad flat package, a quad flat package with heat sink, a leadless quad flat package, or a fine-pitch quad flat package.

A chip, wherein the chip integrates the device according to any one of claims 1-19.

A smart device, wherein the smart device comprises the chip according to claim 22.
PCT/CN2019/076088 2018-02-27 2019-02-25 Integrated circuit chip device, board card and related product Ceased WO2019165946A1 (en)

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
CN201810164844.8A CN110197275B (en) 2018-02-27 2018-02-27 Integrated circuit chip devices and related products
CN201810164331.7A CN110197269B (en) 2018-02-27 2018-02-27 Integrated circuit chip devices and related products
CN201810161819.4A CN110197263B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product
CN201810164843.3A CN110197274B (en) 2018-02-27 2018-02-27 Integrated circuit chip devices and related products
CN201810164331.7 2018-02-27
CN201810161819.4 2018-02-27
CN201810161886.6 2018-02-27
CN201810161820.7 2018-02-27
CN201810161820.7A CN110197264B (en) 2018-02-27 2018-02-27 Neural network processor boards and related products
CN201810161886.6A CN110197265B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product
CN201810164843.3 2018-02-27
CN201810164844.8 2018-02-27

Publications (1)

Publication Number Publication Date
WO2019165946A1

Family

ID=67805195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076088 Ceased WO2019165946A1 (en) 2018-02-27 2019-02-25 Integrated circuit chip device, board card and related product

Country Status (1)

Country Link
WO (1) WO2019165946A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000039658A2 (en) * 1998-12-30 2000-07-06 Irvine Sensors Corporation Neural processing module with input architectures that make maximal use of a weighted synapse array
WO2014085975A1 (en) * 2012-12-04 2014-06-12 中国科学院半导体研究所 Dynamically reconfigurable multistage parallel single-instruction multi-data array processing system
CN107003989A (en) * 2014-12-19 2017-08-01 英特尔公司 Method and apparatus for distributed and collaborative computing in artificial neural networks
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN107578095A (en) * 2017-09-01 2018-01-12 中国科学院计算技术研究所 Neural Network Computing Device and Processor Containing the Computing Device

Cited By (7)

Publication number Priority date Publication date Assignee Title
WO2021070006A1 (en) * 2019-10-11 2021-04-15 International Business Machines Corporation Hybrid data-model parallelism for efficient deep learning
CN114424214A (en) * 2019-10-11 2022-04-29 国际商业机器公司 Hybrid data-model parallelism for efficient deep learning
GB2604060A (en) * 2019-10-11 2022-08-24 Ibm Hybrid data-model parallelism for efficient deep learning
JP2022552803A (en) * 2019-10-11 2022-12-20 インターナショナル・ビジネス・マシーンズ・コーポレーション HYBRID DATA-MODEL PARALLEL PROCESSING METHOD, SYSTEM AND PROGRAM
US11556450B2 (en) 2019-10-11 2023-01-17 International Business Machines Corporation Hybrid data-model parallelism for efficient deep learning
JP7497946B2 (en) 2019-10-11 2024-06-11 インターナショナル・ビジネス・マシーンズ・コーポレーション Hybrid data-model parallel processing method, system, and program
GB2604060B (en) * 2019-10-11 2024-10-16 Ibm Hybrid data-model parallelism for efficient deep learning

Similar Documents

Publication Publication Date Title
US11748605B2 (en) Integrated circuit chip device
US12217162B2 (en) Integrated circuit chip apparatus
TWI793225B (en) Method for neural network training and related product
CN111105033B (en) Neural network processor board card and related products
CN111242294B (en) Integrated circuit chip device and related products
TWI767097B (en) Integrated circuit chip apparatus and related product
CN110197264B (en) Neural network processor boards and related products
TWI793224B (en) Integrated circuit chip apparatus and related product
WO2019165946A1 (en) Integrated circuit chip device, board card and related product
CN110197267B (en) Neural network processor boards and related products
CN109978152B (en) Integrated circuit chip device and related product
TWI768160B (en) Integrated circuit chip apparatus and related product
CN109977071A (en) Neural network processor board and Related product
WO2019165940A1 (en) Integrated circuit chip apparatus, board card and related product
CN109961137B (en) Integrated circuit chip device and related product
CN109978130A (en) Integrated circuit chip device and Related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19760275

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19760275

Country of ref document: EP

Kind code of ref document: A1