HK1261499B - Depth concatenation using a matrix computation unit
Description
Technical Field
This specification relates to performing neural network computations in hardware.
Background
Neural networks are machine learning models that employ one or more layers of the model to generate an output, e.g., a classification, for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as an input to one or more other layers in the network, i.e., one or more other hidden layers or the output layer of the network. Each layer of the network generates an output from the received input in accordance with current values of a respective set of parameters.
Some neural networks include depth concatenation layers, which are layers that receive as input two or more tensors (i.e., two or more multi-dimensional matrices) that are outputs of other layers in the neural network and concatenate the input tensors along a depth dimension. In particular, each input tensor has two spatial dimensions x and y and a depth dimension z. By concatenating two input tensors (one having dimensions x₁ by y₁ by z₁ and the other having dimensions x₁ by y₁ by z₂) along the depth dimension z, the depth concatenation layer generates an output tensor having dimensions x₁ by y₁ by (z₁ + z₂). The output tensor can then be used as input by another layer of the neural network.
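For reference, the result that a depth concatenation layer must produce can be written in a few lines of NumPy (an illustrative sketch only, not part of this specification; the point of the techniques described below is to reproduce this result on hardware that has no native concatenation instruction):

```python
import numpy as np

x1, y1, z1, z2 = 3, 3, 3, 4
a = np.random.rand(x1, y1, z1)   # first input tensor, x1 by y1 by z1
b = np.random.rand(x1, y1, z2)   # second input tensor, x1 by y1 by z2

# Concatenate along the depth dimension z (axis 2).
out = np.concatenate([a, b], axis=2)
assert out.shape == (x1, y1, z1 + z2)
```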
Disclosure of Invention
This specification describes techniques for performing a concatenation of two tensors along a depth dimension using a matrix computation unit. These techniques generally involve receiving a request to process, on an integrated circuit for performing neural network computations, network inputs to a neural network that includes a depth concatenation layer. The integrated circuit includes a matrix computation unit that performs vector-matrix multiplication in hardware but cannot perform depth concatenation operations directly in hardware. Instead, a neural network processing system generates instructions that, when executed by the integrated circuit, cause the integrated circuit to perform operations in hardware that use the matrix computation unit to generate an output that satisfies the specification of the depth concatenation layer.
The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. An output that satisfies the specification of a depth concatenation layer can be generated in hardware by an application specific integrated circuit even though the integrated circuit cannot directly perform depth concatenation operations in hardware. By generating the satisfying output in hardware on the integrated circuit, an inference for a neural network that includes a depth concatenation layer can be processed without passing data back to a host, i.e., without performing a portion of the computation off-chip, even though the integrated circuit does not directly support the depth concatenation operation. In other words, all of the computation required to compute the output of the depth concatenation layer occurs on the application specific integrated circuit. In particular, the integrated circuit can compute the output of the depth concatenation layer in hardware on-chip by performing matrix multiplications on the depth vectors from the two input tensors using shift matrices as described in this specification. This allows inferences for such neural networks to be processed efficiently without modifying the hardware architecture of the integrated circuit. In particular, the system can efficiently process neural network inferences without adding depth concatenation hardware to the special-purpose circuit or adding shift support to the vector unit of the special-purpose circuit. That is, processing delays that would be caused by needing to perform a portion of the computation off-chip, in software, or both are avoided.
The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
FIG. 1 illustrates an example neural network processing system.
FIG. 2 illustrates an example application specific integrated circuit.
FIG. 3 is a flow diagram of an example process for generating instructions that cause an application specific integrated circuit to generate an output tensor for a depth concatenation layer.
FIG. 4 is a flow diagram of an example process for concatenating two tensors along a depth dimension.
FIG. 5 is a flow diagram of another example process for concatenating two tensors along a depth dimension.
FIG. 6 is a flow diagram of yet another example process for concatenating two tensors along a depth dimension.
Fig. 7 shows an example of a depth concatenation calculation requiring a single shift matrix.
Fig. 8 shows an example of a depth concatenation calculation requiring two shift matrices.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
Fig. 1 illustrates an example neural network processing system 100.
The neural network processing system 100 is an example of a system implemented as one or more computers in one or more locations in which the systems, components, and techniques described below can be implemented.
The neural network processing system 100 is a system that performs neural network computations using an application specific integrated circuit 110. The integrated circuit 110 is an application specific integrated circuit for performing neural network computations and includes a matrix computation unit 120 that performs vector-matrix multiplication in hardware. An example of such an application specific integrated circuit is described in more detail below with reference to FIG. 2.
In particular, the neural network processing system 100 receives a request to implement a neural network on the application specific integrated circuit 110, implements the neural network on the application specific integrated circuit 110, and, once a given neural network is implemented, processes inputs to the neural network using the application specific integrated circuit 110 to generate neural network inferences.
That is, the neural network processing system 100 can receive a request that specifies a neural network architecture for the neural network that is to be used to process inputs. The neural network architecture defines the number and configuration of the layers in the neural network and the values of the parameters for each layer that has parameters.
To implement a neural network on the application specific integrated circuit 110, the neural network processing system 100 includes a neural network implementation engine 150 implemented as one or more computer programs on one or more computers in one or more physical locations.
The neural network implementation engine 150 generates instructions that, when executed by the integrated circuit 110, cause the integrated circuit 110 to perform operations specified by the neural network to generate a neural network output from the received neural network input.
Once instructions have been generated by the neural network implementation engine 150 and provided to the integrated circuit 110, the neural network processing system 100 may receive neural network inputs and may process the neural network inputs using the neural network by causing the integrated circuit 110 to execute the generated instructions.
However, some neural networks include one or more incompatible neural network layers. The term "incompatible neural network layer" as used in this specification refers to a neural network layer that specifies operations that cannot be performed directly in hardware by the integrated circuit 110. To implement these neural networks on an integrated circuit, the neural network implementation engine 150 generates instructions that, when executed by the integrated circuit 110, cause the integrated circuit 110 to generate outputs of incompatible neural network layers by performing in hardware operations that differ from those specified by the neural network layers but result in generation of layer outputs that meet the specifications of the incompatible neural network layers, i.e., the same layer outputs as would have been generated by directly performing the operations specified by the layers.
In particular, some neural networks include depth concatenation layers. A depth concatenation layer is a layer that receives as input two or more tensors (i.e., two or more multi-dimensional matrices) that are outputs of other layers in the neural network and concatenates the input tensors along the depth dimension. In particular, each input tensor has two spatial dimensions x and y and a depth dimension z. By concatenating one tensor having dimensions x₁ by y₁ by z₁ and another tensor having dimensions x₁ by y₁ by z₂ along the depth dimension z, the depth concatenation layer generates an output tensor having dimensions x₁ by y₁ by (z₁ + z₂). The output tensor can then be used as input by another layer of the neural network.
Examples of neural networks that may be implemented on the integrated circuit 110 and that include one or more depth concatenation layers are the image recognition neural networks described in Christian Szegedy, Sergey Ioffe, "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," available at https://static.
Other examples of neural networks that include depth concatenation layers are Long Short-Term Memory (LSTM) neural networks, such as those described in "Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling" by Haşim Sak, Andrew Senior, and Françoise Beaufays, available at http://193.6.4.39/~czap/letoltes/is14/is2014/pdf/author/is141304.pdf.
Because the main hardware unit that performs matrix operations on the integrated circuit 110 is the matrix computation unit 120, the integrated circuit cannot directly perform the depth concatenation operation in hardware.
To implement a neural network that includes a depth concatenation layer on the integrated circuit 110, the neural network implementation engine 150 generates instructions that, when executed by the integrated circuit 110 during processing of a network input by the neural network, cause the integrated circuit 110 to perform other operations in hardware to generate an output tensor that satisfies the specification of the depth concatenation neural network layer using the matrix computation unit 120. These instructions and other operations are described in more detail below with reference to FIGS. 3 and 4.
Although this specification describes the circuit used to execute the neural network as being an application specific integrated circuit, the techniques described in this specification can be performed on any circuit that includes a matrix computation unit, e.g., an FPGA, an ASIC, a GPU, and so on.
FIG. 2 illustrates an example application specific integrated circuit 200 for performing neural network computations.
Integrated circuit 200 includes a host interface 202. The host interface 202 may receive instructions that include parameters for a neural network computation. The parameters may include at least one or more of the following: how many layers should be processed, data identifying the corresponding sets of weight inputs for each layer, an initial set of activation inputs, i.e., the input to the neural network from which the inference is to be computed, the corresponding input and output sizes of each layer, and so on. When the neural network being processed includes a depth concatenation layer, the parameters include one or more shift weight matrices for the depth concatenation layer and one or more modified identity weight matrices for the depth concatenation layer. The shift weight matrices and the modified identity weight matrices are described in more detail below with reference to FIGS. 3, 4, and 5.
The host interface 202 may send the instructions to a sequencer 206, which converts the instructions into low-level control signals that control the circuit to perform the neural network computation. In some implementations, the control signals regulate the flow of data in the circuit 200, e.g., how the sets of weight inputs and the sets of activation inputs flow through the circuit 200. The sequencer 206 may send the control signals to a unified buffer 208, a matrix multiplication unit 212, and a vector calculation unit 214. In some implementations, the sequencer 206 also sends control signals to a direct memory access engine 204 and a dynamic memory 210.
The host interface 202 may send the set of weight inputs and the set of initial activation inputs to the direct memory access engine 204. The direct memory access engine 204 may store the set of activation inputs at the unified buffer 208.
In some implementations, the direct memory access engine 204 stores the sets of weights to dynamic memory 210, which can be a memory unit. In some implementations, the dynamic memory is located off of the circuit. When the neural network being processed includes a depth concatenation layer, the direct memory access engine 204 stores the one or more shift weight matrices for the depth concatenation layer in the dynamic memory 210 and, in some implementations, stores the one or more modified identity weight matrices for the depth concatenation layer in the dynamic memory 210.
The unified buffer 208 is a memory buffer. It can be used to store the sets of activation inputs from the direct memory access engine 204 and the outputs of the vector calculation unit 214. The direct memory access engine 204 can also read the outputs of the vector calculation unit 214 from the unified buffer 208.
When instructed to do so, the dynamic memory 210 and unified buffer 208 may send the weight input set and the activation input set, respectively, to the matrix multiplication unit 212.
In general, the matrix multiplication unit 212 can be any unit that performs vector-matrix multiplication in hardware. In some implementations, the matrix multiplication unit 212 is a two-dimensional systolic array. In this case, the matrix multiplication unit 212 can perform multiple vector-matrix multiplications in parallel, i.e., perform a matrix-matrix multiplication. The matrix multiplication unit 212 can also be a one-dimensional systolic array or other circuitry that can perform mathematical operations, e.g., multiplication and addition.
The matrix multiplication unit 212 can process the weight inputs and the activation inputs and provide a vector of outputs to the vector calculation unit 214. In some cases, the matrix multiplication unit 212 sends the output vector to the unified buffer 208, which sends the output vector to the vector calculation unit 214 or back to the matrix multiplication unit 212 if the current neural network layer does not apply an activation function. For example, once the depth concatenation output of a depth concatenation layer has been generated, the matrix multiplication unit 212 may send the output to the unified buffer 208 instead of to the vector calculation unit 214 because the depth concatenation layer does not apply an activation function. In some other cases, even though the depth concatenation layer does not apply an activation function, the matrix multiplication unit 212 sends the output to the vector calculation unit 214, and the vector calculation unit 214 applies an identity activation function to the output before routing the output back to the unified buffer 208, i.e., without modifying the output.
The vector calculation unit 214 can process the output vectors and store the processed output vectors to the unified buffer 208. For example, the vector calculation unit 214 can apply a non-linear function to the outputs of the matrix computation unit, e.g., to a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 214 generates normalized values, pooled values, or both. The vectors of processed outputs can be used as activation inputs to the matrix multiplication unit 212, e.g., for use in a subsequent layer in the neural network.
An example implementation of the integrated circuit 200, and in particular of the matrix multiplication unit 212, that allows the matrix multiplication unit 212 to perform vector-matrix multiplication in hardware is described in more detail in U.S. patent application No. 14/844,524, entitled "Neural Network Processor," filed on September 3, 2015, the entire contents of which are hereby incorporated by reference.
FIG. 3 is a flow diagram of an example process 300 for generating instructions that cause an application specific integrated circuit to generate an output tensor for a depth concatenation layer. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations and programmed appropriately in accordance with this specification. For example, a suitably programmed neural network processing system, e.g., the neural network processing system 100 of FIG. 1, can perform the process 300.
The system receives a request to implement a neural network on an application specific integrated circuit, i.e., a request to process inputs of the neural network using the application specific integrated circuit to generate outputs (step 302).
In particular, the neural network to be implemented on the integrated circuit includes a depth concatenation layer that specifies a concatenation of two tensors along the depth dimension. For example, the depth concatenation layer may specify that a tensor having dimensions x₁ by y₁ by z₁ and a tensor having dimensions x₁ by y₁ by z₂ are to be concatenated to generate an output tensor having dimensions x₁ by y₁ by (z₁ + z₂).
The system generates one or more shift weight matrices for use in performing the concatenation specified by the depth concatenation layer (step 304).
A shift matrix is a matrix that, when multiplied with an input vector, generates an output vector in which the positions of one or more entries of the input vector are shifted while one or more other entries of the input vector have been replaced with zeros. Generally, a shift matrix as described in this specification is used as the right-hand matrix in a matrix multiplication operation to move the values of an input vector to desired positions in the output vector. In some other implementations, however, a differently constructed shift matrix can be used as the left-hand matrix in the matrix multiplication operation to achieve the same result.
In particular, because the matrix computation unit of the integrated circuit performs matrix multiplication in hardware, the matrix computation unit has a maximum vector length (max). The maximum vector length is the maximum length of a vector that the matrix computation unit can multiply by a matrix in one pass, i.e., without dividing the vector into multiple inputs to the matrix computation unit. For example, if the matrix computation unit is a one- or two-dimensional systolic array, the maximum vector length is equal to the number of columns in the unit or to the number of rows in the unit.
For tensors having a depth less than or equal to max, the system stores the tensor as a collection of depth vectors of length max, one for each spatial location in the tensor.
A spatial location is a pair of (x, y) spatial coordinates, i.e., such that all entries at all depth dimensions that share the same (x, y) spatial coordinates in the tensor are at the same spatial location. The depth vector for a given spatial location is the vector that includes all of the entries in the tensor at the given spatial location. If the tensor has a depth z that is less than max, the last max - z entries of each depth vector are filled with zero or garbage values (i.e., values that may be used for other purposes but are not related to and should not affect the depth concatenation operation).
For tensors with depths exceeding max, the system represents each depth vector as a number of max length vectors. Each of these max length vectors that make up part of the depth vector will be referred to as a block vector in this specification.
For a given spatial location in a tensor having depth z, the depth vector for that location is represented as ceiling(z/max) block vectors, where ceiling(x) is the smallest integer greater than or equal to x. The first floor(z/max) block vectors (where floor(x) is the largest integer less than or equal to x) each store the values of a corresponding max depth dimensions of the depth vector for the spatial location, i.e., the first block vector stores the entries at the spatial location in the first max depth dimensions, the second block vector stores the entries at the spatial location in the second max depth dimensions, and so on. The first z - floor(z/max)·max entries of the last block vector are the entries at the last z - floor(z/max)·max depth dimensions, and any remaining entries are filled with zero or garbage values.
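This block-vector layout can be sketched as follows (an illustrative NumPy helper, assuming zero padding; the function name is an assumption, not from this specification):

```python
import numpy as np

def to_block_vectors(depth_vector, max_len, pad_value=0.0):
    # Represent a depth vector of length z as ceiling(z / max_len)
    # block vectors of length max_len, padding the last block.
    z = len(depth_vector)
    n_blocks = -(-z // max_len)              # ceiling(z / max_len)
    padded = np.full(n_blocks * max_len, pad_value)
    padded[:z] = depth_vector
    return padded.reshape(n_blocks, max_len)

blocks = to_block_vectors(np.arange(7.0), max_len=5)
# blocks[0] == [0, 1, 2, 3, 4]; blocks[1] == [5, 6, 0, 0, 0]
```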
In some cases, the depth dimension (z₁ + z₂) of the output tensor will not exceed the maximum vector length, and each depth vector of the output tensor is a single block vector. In these cases, the system generates a single shift weight matrix for the depth concatenation operation.
If the depth dimension of the output tensor does exceed the maximum vector length and the depth vector is represented by multiple block vectors, the system may need to generate more than one shift weight matrix for the depth concatenation operation.
The system also generates one or more modified identity weight matrices for the depth concatenation layer. A modified identity weight matrix is a matrix that has ones along a portion of the main diagonal and zeros for all other entries.
In general, a shift matrix that shifts entries starting from the j-th position of an input vector so that they start from the i-th position of the output vector is a max by max matrix that is all zeros except for a diagonal of ones starting from the j-th entry of the i-th column of the matrix.
The shift matrices and the modified identity weight matrices are described in more detail below with reference to FIGS. 4 to 8.
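Under these conventions (right multiplication, v @ S, with 0-based positions), both kinds of matrices can be built as in the following illustrative NumPy sketch; the constructor names are assumptions, not from this specification:

```python
import numpy as np

def shift_matrix(max_len, src, dst, count):
    # max_len-by-max_len matrix that, under right multiplication
    # (v @ S), moves `count` entries starting at input position `src`
    # so that they start at output position `dst`; all other output
    # entries are zero.
    s = np.zeros((max_len, max_len))
    s[np.arange(count) + src, np.arange(count) + dst] = 1.0
    return s

def modified_identity(max_len, keep):
    # Ones along only the first `keep` entries of the main diagonal:
    # v @ M keeps the first `keep` entries of v and zeros the rest.
    m = np.zeros((max_len, max_len))
    m[np.arange(keep), np.arange(keep)] = 1.0
    return m

v = np.array([10., 20., 30., 0., 0.])
print(v @ shift_matrix(5, src=0, dst=2, count=3))  # [0. 0. 10. 20. 30.]
print(v @ modified_identity(5, keep=2))            # [10. 20. 0. 0. 0.]
```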
The system stores the one or more shift weight matrices and the one or more modified identity weight matrices for the depth concatenation layer in a memory accessible to the application specific integrated circuit (step 306). For example, the system can send the generated matrix or matrices to the host interface of the application specific integrated circuit for storage in a dynamic memory accessible to the circuit.
The system generates instructions that, when executed by the integrated circuit during processing of a network input by the neural network, cause the integrated circuit to use the generated matrices to generate an output tensor that satisfies the specification of the depth concatenation neural network layer (step 308). In particular, the system generates instructions that, when executed, cause the integrated circuit to perform the processes 400, 500, or 600 described below with reference to FIGS. 4, 5, and 6, respectively.
FIG. 4 is a flow diagram of an example process 400 for concatenating two inputs along the depth dimension. The process 400 is performed in hardware by an application specific integrated circuit that includes a hardware matrix computation unit, e.g., the application specific integrated circuit 110 of FIG. 1.
The integrated circuit receives two inputs to be concatenated (step 402). Each input is a depth vector at a given spatial location from a respective one of the tensors to be depth concatenated, and each input is made up of one or more block vectors of size max. That is, one input is the depth vector at the spatial location from one input tensor, and the other input is the depth vector at the spatial location from the other input tensor.
In some cases, i.e., when the number of depth dimensions in either or both of the tensors is not a multiple of max, the last block vector for each spatial location in either or both of the tensors includes padding values, i.e., zero or garbage values that have been added to the block vector but are not part of the input tensor. The entries of a block vector other than the padding entries will be referred to as non-padding entries.
The integrated circuit identifies, e.g., based on a control signal, the first block vector in the first or second input that needs to be modified as part of the depth concatenation (the "first block vector to be modified") (step 404). The first block vector that needs to be modified is the first block vector that includes one or more padding entries.
For each block vector preceding the first block vector to be modified, the integrated circuit moves the block vector unmodified into the output of the matrix computation unit and then moves the block vector out of that output as an output block vector of the concatenation operation (step 406). Moving a vector into the output of the matrix computation unit is described below with reference to FIGS. 5 to 8.
The integrated circuit moves the non-padding entries of the first block vector to be modified into the output of the matrix computation unit (step 408).
That is, the integrated circuit multiplies the first block vector to be modified by a modified identity matrix to move a block vector that has the non-padding entries of the first block vector and zeros as its remaining entries into the output of the matrix computation unit. Multiplying a vector by a modified identity matrix is described below with reference to FIGS. 5 to 8.
The integrated circuit moves a shifted block vector into the output using an appropriate shift matrix to sum the shifted block vector and the block vector currently in the output, and then moves the sum of the block vectors out of the output as an output block vector of the concatenation operation (step 410).
For the first iteration of step 410 that is performed during the depth concatenation, the block vector currently in the output is the first block vector to be modified.
The shifted block vector is a block vector that has zeros as its first (max - n) entries and the first n entries of the next block vector as its remaining entries, where n is the number of padding entries in the block vector currently in the output, and the next block vector is the next block vector to be operated on when the block vectors in the inputs are arranged in order starting with the first block vector in the first input and ending with the last block vector in the second input.
In the following, generating shifted block vectors using a shift matrix and summing the vectors will be described in more detail with reference to fig. 5-8.
The integrated circuit moves another shifted block vector into the output using another appropriate shift matrix (step 412).
The other shifted block vector is a block vector that has any remaining non-padding entries of the next block vector as its first entries and padding as its remaining entries.
The integrated circuit continues to perform steps 410 and 412 until there are no next block vectors remaining, i.e., until after all input block vectors have been operated on.
The system can perform the process 400 for each spatial location in the input tensors to concatenate the two input tensors in depth.
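The loop structure of the process 400 can be emulated in software using only vector-matrix multiplications and accumulation, as in the following illustrative NumPy sketch (the helper names are assumptions, not from this specification):

```python
import numpy as np

def shift(src, dst, count, m):
    # m-by-m matrix moving `count` entries from input position `src`
    # to output position `dst` under right multiplication (v @ S).
    # With src == dst == 0 this is a modified identity matrix.
    s = np.zeros((m, m))
    s[np.arange(count) + src, np.arange(count) + dst] = 1.0
    return s

def depth_concat(blocks, valid_counts, m):
    # `blocks`: block vectors of length m, in order across both inputs;
    # `valid_counts[i]`: number of non-padding entries in blocks[i].
    out_blocks, acc, offset = [], np.zeros(m), 0
    for v, n in zip(blocks, valid_counts):
        taken = 0
        while taken < n:
            count = min(n - taken, m - offset)
            acc = acc + v @ shift(taken, offset, count, m)
            offset += count
            taken += count
            if offset == m:              # output block is full: move it out
                out_blocks.append(acc)
                acc, offset = np.zeros(m), 0
    if offset:
        out_blocks.append(acc)           # final, partially filled block
    return out_blocks

d1 = np.array([1., 2., 3., 0., 0.])      # depth 3, padded to max = 5
d2 = np.array([4., 5., 6., 7., 0.])      # depth 4, padded to max = 5
print(depth_concat([d1, d2], [3, 4], 5)) # [1 2 3 4 5], [6 7 0 0 0]
```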
FIG. 5 is a flow diagram of another example process 500 for concatenating two tensors along a depth dimension. The process 500 is performed in hardware by an application specific integrated circuit that includes a hardware matrix computation unit, e.g., the application specific integrated circuit 110 of FIG. 1.
In particular, the process 500 is an example of a process that is performed to concatenate two tensors when the concatenated tensor has a depth dimension that does not exceed the maximum vector length of the matrix computation unit, i.e., when each output depth vector can be stored as a single block vector.
The integrated circuit accesses a shift weight matrix for the depth concatenation from a memory accessible to the integrated circuit, e.g., from the dynamic memory 210 of FIG. 2 (step 502). In some implementations, the integrated circuit also accesses a modified identity weight matrix for the depth concatenation from the memory.
The integrated circuit moves a first depth vector for a given spatial location in the first input tensor into the output of the matrix computation unit (step 504). The first depth vector for the given spatial location is a vector that includes all of the entries in the first input tensor at the given spatial location and has fill values as any remaining values of the first depth vector.
For example, the integrated circuit can move each entry of the first depth vector to a respective summing register of a set of registers that stores the outputs of the multiplications performed by the matrix computation unit.
To move the first depth vector into the output, the integrated circuit can multiply the first depth vector by the modified identity weight matrix for the depth concatenation using the matrix computation unit, resulting in the first depth vector being stored in the output of the matrix computation unit. The modified identity weight matrix is a matrix that has ones as the first z₁ entries of the main diagonal and zeros for all other entries.
The integrated circuit multiplies a second depth vector for the given spatial location in the second input tensor by the shift weight matrix for the depth concatenation (step 506) to generate a shifted second depth vector. The second depth vector for the given spatial location is a vector that includes all of the entries in the second input tensor at the given spatial location and has fill values as any remaining values of the second depth vector.
Because of the structure of the shift weight matrix, the resulting shifted second depth vector is a vector with max entries in which the first z₁ entries are zero, the next z₂ entries are the entries of the second depth vector for the spatial location, and any remaining entries are zero.
The integrated circuit adds the first depth vector and the shifted second depth vector to generate a concatenated depth vector (step 508). For example, the system can add each entry of the shifted second depth vector to the corresponding entry of the first depth vector by moving the entries of the shifted second depth vector into the summing registers that store the corresponding entries of the first depth vector.
The integrated circuit can perform steps 504-508 for each spatial location in the input tensors to generate the output of the depth concatenation layer.
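A minimal sketch of steps 504-508, assuming zero padding and the dimensions of the example of FIG. 7 below (z₁ = 3, z₂ = 4, so the concatenated depth of 7 fits in a single block vector):

```python
import numpy as np

z1, z2 = 3, 4
L = z1 + z2                                    # 7, within the maximum vector length
d1 = np.array([1., 2., 3., 0., 0., 0., 0.])    # first depth vector, zero-padded
d2 = np.array([4., 5., 6., 7., 0., 0., 0.])    # second depth vector, zero-padded

identity_m = np.zeros((L, L))
identity_m[:z1, :z1] = np.eye(z1)              # ones on the first z1 diagonal entries
shift_m = np.zeros((L, L))
shift_m[np.arange(z2), np.arange(z2) + z1] = 1.0   # moves entries 0..z2-1 to z1..z1+z2-1

out = d1 @ identity_m                          # step 504: move first vector into output
out += d2 @ shift_m                            # steps 506-508: add shifted second vector
assert np.array_equal(out, [1, 2, 3, 4, 5, 6, 7])
```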
FIG. 6 is a flow diagram of yet another example process 600 for concatenating two tensors along the depth dimension. The process 600 is performed in hardware by an application specific integrated circuit that includes a hardware matrix computation unit, e.g., the application specific integrated circuit 110 of FIG. 1.
In particular, the process 600 is an example of a process that is performed when the first tensor has a depth dimension that is less than the maximum vector length of the matrix computation unit but, when the two tensors are concatenated, the concatenated tensor has a depth dimension that exceeds the maximum vector length. That is, the first depth vector is the first block vector that needs to be modified in the process 400 of FIG. 4.
The integrated circuit accesses shift weight matrices for the depth concatenation from a memory accessible to the integrated circuit, e.g., from the unified buffer 208 of FIG. 2 (step 602). In some implementations, the integrated circuit also accesses a modified identity weight matrix for the depth concatenation from the memory.
The integrated circuit moves a first depth vector for a given spatial location in the first input tensor into the output of the matrix computation unit (step 604). In this example, the first depth vector is a vector with max entries in which the first z₁ entries are the entries of the first input tensor at the spatial location and the remaining entries are zero.
For example, the integrated circuit can move each entry of the first depth vector into a respective summing register of a set of registers that stores the outputs of the multiplications performed by the matrix computation unit, by multiplying the first depth vector by a modified identity weight matrix that has dimensions max by max and that includes ones along the main diagonal up to and including the z₁-th entry of the z₁-th column, with zeros for all other entries.
The integrated circuit multiplies a first block vector of a second depth vector for the given spatial location in the second input tensor by a first shift matrix for the depth concatenation (step 606) to generate a first partially shifted depth vector.
Because of the structure of the first shift matrix, the first partially shifted depth vector for the given spatial location is a vector with max entries in which the first z₁ entries are zero and the next (max - z₁) entries are the first (max - z₁) entries of the first block vector of the second depth vector for the spatial location.
The integrated circuit adds the first depth vector and the first partially shifted depth vector to generate an intermediate concatenated depth vector (step 608). For example, the system can add each entry of the first partially shifted depth vector to the corresponding entry of the first depth vector by moving the entries of the first partially shifted depth vector into the summing registers that store the corresponding entries of the first depth vector. The intermediate concatenated depth vector is a vector with max entries in which the first z₁ entries are the entries of the first depth vector and the next (max - z₁) entries are the first (max - z₁) entries of the second depth vector for the spatial location.
The integrated circuit multiplies the second depth vector for the given spatial location in the second input tensor by a second shift matrix for the depth concatenation (step 610) to generate a second partially shifted depth vector.
Because of the structure of the second shift matrix, the second partially shifted depth vector for the given spatial location is a vector with max entries in which the first (z₁ + z₂ - max) entries are the last (z₁ + z₂ - max) entries of the second depth vector for the spatial location and the remaining entries are zero or garbage values.
The integrated circuit stores the second partially shifted depth vector and the intermediate concatenated depth vector as the representation of the concatenated depth vector for the spatial location (step 612). For example, the integrated circuit can store the second partially shifted depth vector and the intermediate concatenated depth vector in predetermined locations in the unified buffer that are identified in the instructions as the locations that will store the two vectors representing the concatenated depth vector for the spatial location.
The integrated circuit can perform the process 600 for each spatial location in the input tensors to generate the output of the depth concatenation layer.
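A minimal sketch of the two-shift-matrix case of steps 604-612, assuming zero padding and the dimensions of the example of FIG. 8 below (max = 5, z₁ = 3, z₂ = 4):

```python
import numpy as np

max_len, z1, z2 = 5, 3, 4
d1 = np.array([1., 2., 3., 0., 0.])        # first depth vector, padded to max
d2 = np.array([4., 5., 6., 7., 0.])        # second depth vector, padded to max

identity_m = np.zeros((max_len, max_len))
identity_m[:z1, :z1] = np.eye(z1)
fit = max_len - z1                         # entries of d2 that fit in block 1 (here 2)
rem = z1 + z2 - max_len                    # entries that overflow into block 2 (here 2)
s1 = np.zeros((max_len, max_len))
s1[np.arange(fit), np.arange(fit) + z1] = 1.0    # first shift matrix (step 606)
s2 = np.zeros((max_len, max_len))
s2[np.arange(rem) + fit, np.arange(rem)] = 1.0   # second shift matrix (step 610)

block1 = d1 @ identity_m + d2 @ s1         # steps 604-608: [1, 2, 3, 4, 5]
block2 = d2 @ s2                           # steps 610-612: [6, 7, 0, 0, 0]
```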
Fig. 7 shows an example of a depth concatenation calculation 700 that requires a single shift matrix.
In the simplified example of FIG. 7, a first input tensor having dimensions 3 × 3 × 3 is to be depth concatenated with a second input tensor having dimensions 3 × 3 × 4 to produce a 3 × 3 × 7 output tensor, and the maximum vector length that the matrix computation unit can process at one time is 8, so the depth of the concatenated output tensor is smaller than the maximum vector length.
In part (a) of the calculation, the integrated circuit operates on a first depth vector from the first input tensor and a corresponding second depth vector from the second input tensor. Specifically, the first depth vector has the 3 entries at a given spatial location in the first input tensor as its first 3 entries and 0 as its remaining entries, while the second depth vector has the 4 entries at the given spatial location in the second input tensor as its first 4 entries and 0 as its remaining entries.
In the example of fig. 7, the first and second depth vectors have been padded with zeros, but in other examples, one or both of the depth vectors may be padded with garbage data.
To perform part (a) of the calculation, the integrated circuit multiplies the first depth vector by the modified identity weight matrix using the matrix computation unit to generate another instance of the first depth vector in the output of the matrix computation unit, i.e., to move the first depth vector into the output. In the example of FIG. 7, the modified identity matrix has 1's along the first three entries of the main diagonal and 0's as the last four entries of the main diagonal. However, because the depth vector is padded with zeros rather than garbage values, the modified identity weight matrix could instead have other values as the last four entries of the main diagonal and in any other entries that are multiplied only by padding entries of the first depth vector during the multiplication.
The integrated circuit then multiplies the second depth vector by the shift matrix to generate a shifted second depth vector. The shift matrix is a 7 × 7 matrix whose entries are zero except for a diagonal of ones starting with the 1st entry of the 4th column and ending with the 4th entry of the 7th column. Because of the multiplication by the shift matrix, the shifted second depth vector has 0 as its first 3 entries and the 4 entries of the second depth vector as its next 4 entries. As with the modified identity matrix, because the depth vector is padded with zeros rather than garbage values, the shift matrix could instead have values other than 0 in the entries that are multiplied only by padding entries of the second depth vector during the multiplication.
In part (b) of the calculation, the integrated circuit adds the first depth vector and the shifted second depth vector, i.e., by moving the shifted second depth vector into the output while the first depth vector is in the output, to generate a concatenated depth vector that has the entries of the first depth vector as its first 3 entries and the entries of the second depth vector as its next 4 entries.
The integrated circuit can perform the example calculation for each spatial location in the input tensors to generate a corresponding concatenated depth vector for each spatial location.
Fig. 8 shows an example of a depth concatenation calculation 800 requiring two shift matrices.
In the simplified example of FIG. 8, a first input tensor having dimensions 3 × 3 × 3 is to be depth concatenated with a second input tensor having dimensions 3 × 3 × 4 to produce a 3 × 3 × 7 output tensor, and the maximum vector length that the matrix computation unit can process at one time is 5, so the depth of the concatenated output tensor is greater than the maximum vector length.
In parts (a) and (c) of the computation, the integrated circuit operates on a first depth vector from the first input tensor and on two instances of a corresponding second depth vector from the second input tensor. Specifically, the first depth vector has the 3 entries at a given spatial location in the first input tensor as its first 3 entries and 0 as its remaining entries, and the second depth vector has the 4 entries at the given spatial location in the second input tensor as its first 4 entries and 0 as its remaining entries. In the example of FIG. 8, the first and second depth vectors have been padded with zeros, but in other examples one or both of the depth vectors could be padded with garbage data.
To perform part (a) of the calculation, the integrated circuit multiplies the first depth vector by the modified identity weight matrix using the matrix computation unit to generate another instance of the first depth vector in the output of the matrix computation unit, i.e., to move the first depth vector into the output.
The integrated circuit then multiplies the second depth vector by the first shift matrix to generate a first partially shifted depth vector. The first shift matrix is a 5 × 5 matrix whose entries are zero except for a diagonal of ones starting with the 1st entry of the 4th column and ending with the 2nd entry of the 5th column. Because of the multiplication by the shift matrix, the first partially shifted depth vector has 0 as its first 3 entries and the first 2 entries of the second depth vector as its last 2 entries.
To perform part (b) of the calculation, the integrated circuit then adds the first depth vector and the first partially shifted depth vector, i.e., by moving the first partially shifted depth vector into the output while the first depth vector is in the output, to generate a first block of the output depth vector that has the entries of the first depth vector as its first 3 entries and the first 2 entries of the second depth vector as its last 2 entries.
To perform part (c) of the calculation, i.e., to generate the second block of the output depth vector, the integrated circuit multiplies the second depth vector by the second shift matrix to generate a second partially shifted depth vector. The second shift matrix is a 5 × 5 matrix whose entries are zero except for a diagonal of ones starting with the 3rd entry of the 1st column and ending with the 4th entry of the 2nd column. Because of the multiplication by the shift matrix, the second partially shifted depth vector has the last 2 non-padding entries of the second depth vector as its first 2 entries and 0 as its remaining entries.
To perform part (d) of the computation, the integrated circuit stores the first block vector and the second block vector as the representation of the concatenated depth vector for the spatial location, e.g., by storing the two vectors in predetermined locations in the unified buffer that are identified in the instructions as the locations where the two vectors representing the concatenated depth vector for the spatial location are to be stored.
The integrated circuit may perform the example calculation for each spatial location in the input tensor to generate a corresponding concatenated depth vector for each spatial location.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this specification and their structural equivalents), or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program, which may also be referred to or described as a program, software application, app, module, software module, script, or code, may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
A computer suitable for executing a computer program may be based on a general purpose or special purpose microprocessor or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, the computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on the user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component, such as a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a Local Area Network (LAN) and a Wide Area Network (WAN) such as the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server transmits data (e.g., HTML pages) to the user device, e.g., for displaying data to and receiving user input from a user interacting with the device acting as a client. Data generated at the user device, such as results of user interaction, may be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Other examples include methods, systems, and apparatus, including computer programs encoded on computer storage media, for depth concatenation using a matrix computation unit, where one of the methods includes: receiving a request to process a network input to a neural network using an integrated circuit, the neural network including a depth concatenation neural network layer; and generating instructions that, when executed by the integrated circuit, cause the integrated circuit to perform operations including: for each spatial location in a first input tensor to the depth concatenation layer and a second input tensor to the depth concatenation layer: multiplying, using the matrix computation unit, a second depth vector for the spatial location by a shift weight matrix for the depth concatenation layer to generate a shifted second depth vector; and adding the shifted second depth vector and a first input depth vector for the spatial location to generate a concatenated depth vector.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
Claims (20)
1. A method for depth concatenation, comprising:
receiving a request to process a network input to a neural network using an integrated circuit that performs neural network computations in hardware using a matrix computation unit, the neural network comprising a depth concatenation neural network layer that specifies a concatenation, along the depth dimension, of a tensor having dimensions x₁ by y₁ by z₁ and a tensor having dimensions x₁ by y₁ by z₂ to generate an output tensor having dimensions x₁ by y₁ by (z₁ + z₂); and
generating instructions that, when executed by the integrated circuit, cause the integrated circuit to generate, during processing of a network input by the neural network, a layer output tensor that satisfies a specification of the depth concatenation neural network layer by performing operations comprising:
for each spatial location in a first input tensor to the depth concatenation neural network layer having dimensions x₁ by y₁ by z₁ and a second input tensor to the depth concatenation neural network layer having dimensions x₁ by y₁ by z₂:
multiplying, using the matrix computation unit, a second depth vector for the spatial location in the second input tensor by a shift weight matrix for the depth concatenation neural network layer to generate a shifted second depth vector that has zeros as its first z₁ entries and the first z₂ entries of the second depth vector as its next z₂ entries, wherein the second depth vector for the spatial location is a vector that includes all entries in the second input tensor at the spatial location and has fill values as any remaining values of the second depth vector; and
adding the shifted second depth vector and a first input depth vector for the spatial location in the first input tensor to generate a concatenated depth vector that has the first z₁ entries of the first input depth vector as its first z₁ entries and the last z₂ entries of the shifted second depth vector as its last z₂ entries, wherein the first input depth vector for the spatial location is a vector that includes all entries in the first input tensor at the spatial location and has fill values as any remaining values of the first input depth vector.
2. The method of claim 1, the operations further comprising:
moving the first input depth vector to a set of output summing registers of the matrix computation unit; and
wherein adding the shifted second depth vector and the first input depth vector comprises:
moving the shifted second depth vector into the set of output summing registers of the matrix computation unit while the first input depth vector is stored in the set of output summing registers of the matrix computation unit.
3. The method of claim 2, wherein moving the first input depth vector comprises:
multiplying, using the matrix computation unit, the first input depth vector by a modified identity weight matrix for the depth concatenation neural network layer.
4. The method of claim 3, further comprising:
generating the modified identity weight matrix for the depth concatenation neural network layer; and
storing the modified identity weight matrix for the depth concatenation neural network layer in a memory accessible to the integrated circuit.
5. The method of claim 1, further comprising:
generating the shift weight matrix for the depth concatenation neural network layer; and
storing the shift weight matrix for the depth concatenation neural network layer in a memory accessible to the integrated circuit.
6. The method of claim 5, further comprising:
determining that a depth dimension in the output tensor does not exceed a maximum vector length of the matrix computation unit; and
generating the shift weight matrix for the depth concatenation neural network layer in response to determining that the depth dimension of the output tensor does not exceed the maximum vector length of the matrix computation unit.
7. The method of claim 1, wherein the shift weight matrix for the depth concatenation neural network layer is a (z₁ + z₂) by (z₁ + z₂) matrix whose entries are all zeros except for a diagonal of ones starting from the first entry of the (z₁ + 1)-th column of the matrix.
8. A system for deep cascading, comprising one or more computers and one or more storage devices storing first instructions that, when executed by the one or more computers, cause the one or more computers to perform first operations comprising:
receiving a request to process a network input to a neural network using an integrated circuit that performs neural network computations in hardware using a matrix computation unit, the neural network including a deep cascaded neural network layer whose specification is to concatenate, along the depth dimension, a first input tensor having dimensions x₁ by y₁ by z₁ and a second input tensor having dimensions x₁ by y₁ by z₂ to generate an output tensor having dimensions x₁ by y₁ by (z₁+z₂); and
generating second instructions that, when executed by the integrated circuit, cause the integrated circuit, during processing of a network input by the neural network, to generate a layer output tensor that satisfies the specification of the deep cascaded neural network layer by performing second operations comprising:
for each spatial position in a first input tensor to the deep cascaded neural network layer having dimensions x₁ by y₁ by z₁ and in a second input tensor to the deep cascaded neural network layer having dimensions x₁ by y₁ by z₂:
multiplying, using the matrix computation unit, a second depth vector of the spatial position in the second input tensor by a shift weight matrix of the deep cascaded neural network layer to generate a shifted second depth vector that has zeros as its first z₁ entries and the first z₂ entries of the second depth vector as its last z₂ entries, wherein the second depth vector of the spatial position is a vector that includes all entries in the second input tensor at the spatial position and padding values as any remaining values of the second depth vector; and
adding the shifted second depth vector and a first input depth vector of the spatial position in the first input tensor to generate a concatenated depth vector that has the first z₁ entries of the first input depth vector as its first z₁ entries and the last z₂ entries of the shifted second depth vector as its last z₂ entries, wherein the first input depth vector of the spatial position is a vector that includes all entries in the first input tensor at the spatial position and padding values as any remaining values of the first input depth vector.
9. The system of claim 8, the second operations further comprising:
moving the first input depth vector to a set of output summing registers of the matrix computation unit; and
wherein adding the shifted second depth vector and the first input depth vector comprises:
adding the shifted second depth vector into the set of output summing registers of the matrix computation unit while the first input depth vector is stored in the set of output summing registers of the matrix computation unit.
10. The system of claim 9, wherein moving the first input depth vector comprises:
multiplying, using the matrix computation unit, the first input depth vector by a modified identity weight matrix of the deep cascaded neural network layer.
11. The system of claim 10, the first operations further comprising:
generating the modified identity weight matrix for the deep cascaded neural network layer; and
storing the modified identity weight matrix of the deep cascaded neural network layer in a memory accessible to an application specific integrated circuit.
12. The system of claim 8, the first operations further comprising:
generating the shift weight matrix for the deep cascaded neural network layer; and
storing the shift weight matrix of the deep cascaded neural network layer in a memory accessible to an application specific integrated circuit.
13. The system of claim 12, the first operations further comprising:
determining that the depth dimension of the output tensor does not exceed a maximum vector length of the matrix computation unit; and
generating the shift weight matrix for the deep cascaded neural network layer in response to determining that the depth dimension of the output tensor does not exceed the maximum vector length of the matrix computation unit.
14. The system of claim 8, wherein the shift weight matrix of the deep cascaded neural network layer is a matrix whose entries are all zeros except for ones along the diagonal extending to the right from the first entry in the z₂-th column of the matrix.
15. One or more computer-readable storage media encoded with first instructions that, when executed by one or more computers, cause the one or more computers to perform first operations comprising:
receiving a request to process a network input to a neural network using an integrated circuit, the integrated circuit performing neural network computations in hardware using a matrix computation unit, the neural network comprising a deep cascaded neural network layer whose specification is to concatenate, along the depth dimension, a first input tensor having dimensions x₁ by y₁ by z₁ and a second input tensor having dimensions x₁ by y₁ by z₂ to generate an output tensor having dimensions x₁ by y₁ by (z₁+z₂); and
generating second instructions that, when executed by the integrated circuit, cause the integrated circuit, during processing of a network input by the neural network, to generate a layer output tensor that satisfies the specification of the deep cascaded neural network layer by performing second operations comprising:
for each spatial position in a first input tensor to the deep cascaded neural network layer having dimensions x₁ by y₁ by z₁ and in a second input tensor to the deep cascaded neural network layer having dimensions x₁ by y₁ by z₂:
multiplying, using the matrix computation unit, a second depth vector of the spatial position in the second input tensor by a shift weight matrix of the deep cascaded neural network layer to generate a shifted second depth vector that has zeros as its first z₁ entries and the first z₂ entries of the second depth vector as its last z₂ entries, wherein the second depth vector of the spatial position is a vector that includes all entries in the second input tensor at the spatial position and padding values as any remaining values of the second depth vector; and
adding the shifted second depth vector and a first input depth vector of the spatial position in the first input tensor to generate a concatenated depth vector that has the first z₁ entries of the first input depth vector as its first z₁ entries and the last z₂ entries of the shifted second depth vector as its last z₂ entries, wherein the first input depth vector of the spatial position is a vector that includes all entries in the first input tensor at the spatial position and padding values as any remaining values of the first input depth vector.
16. The computer-readable storage medium of claim 15, the second operations further comprising:
moving the first input depth vector to a set of output summing registers of the matrix computation unit; and
wherein adding the shifted second depth vector and the first input depth vector comprises:
adding the shifted second depth vector into the set of output summing registers of the matrix computation unit while the first input depth vector is stored in the set of output summing registers of the matrix computation unit.
17. The computer-readable storage medium of claim 16, wherein moving the first input depth vector comprises:
multiplying, using the matrix computation unit, the first input depth vector by a modified identity weight matrix of the deep cascaded neural network layer.
18. The computer-readable storage medium of claim 17, the first operations further comprising:
generating the modified identity weight matrix for the deep cascaded neural network layer; and
storing the modified identity weight matrix of the deep cascaded neural network layer in a memory accessible to an application specific integrated circuit.
19. The computer-readable storage medium of claim 15, the first operations further comprising:
generating the shift weight matrix for the deep cascaded neural network layer; and
storing the shift weight matrix of the deep cascaded neural network layer in a memory accessible to an application specific integrated circuit.
20. The computer-readable storage medium of claim 15, wherein the shift weight matrix of the deep cascaded neural network layer is a matrix whose entries are all zeros except for ones along the diagonal extending to the right from the first entry in the z₂-th column of the matrix.
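Putting the claim sets together, the following end-to-end software simulation (a sketch under the same assumed conventions as the earlier snippets, not the hardware implementation) loops over every spatial position, moves the first depth vector in via the modified identity matrix, multiply-accumulates the shifted second depth vector, and checks the result against a direct concatenation.

```python
import numpy as np

def depth_concat(first_tensor, second_tensor):
    """Simulate the claimed layer output for two x1-by-y1-by-z tensors."""
    x1, y1, z1 = first_tensor.shape
    _, _, z2 = second_tensor.shape
    n = z1 + z2
    ident = np.zeros((n, n))
    ident[np.arange(z1), np.arange(z1)] = 1.0
    shift = np.zeros((n, n))
    shift[np.arange(z2), z1 + np.arange(z2)] = 1.0
    out = np.zeros((x1, y1, n))
    for x in range(x1):
        for y in range(y1):
            first_vec = np.pad(first_tensor[x, y], (0, z2))    # pad to length n
            second_vec = np.pad(second_tensor[x, y], (0, z1))  # pad to length n
            # Two matrix multiplies accumulated into the same "registers".
            out[x, y] = first_vec @ ident + second_vec @ shift
    return out

a = np.arange(12.).reshape(2, 2, 3)
b = np.arange(8.).reshape(2, 2, 2)
assert np.allclose(depth_concat(a, b), np.concatenate([a, b], axis=-1))
```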
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/452,624 | 2017-03-07 | | |
| US15/624,629 | 2017-06-15 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1261499A1 (en) | 2020-01-03 |
| HK1261499B (en) | 2023-03-31 |