
HK1245462B - Prefetching weights for use in a neural network processor - Google Patents

Prefetching weights for use in a neural network processor

Info

Publication number: HK1245462B (application number HK18104415.9A)
Authority: HK (Hong Kong)
Prior art keywords: weight, cell, register, input, dimension
Other languages: Chinese (zh)
Other versions: HK1245462A1 (en)
Inventor: Jonathan Ross
Original Assignee: Google LLC
Priority claimed from US 14/844,670 (US10049322B2)

Description

Prefetching weights for a neural network processor

Technical Field

This specification relates to computing neural network inferences in hardware.

Background Art

A neural network is a machine learning model that employs one or more layers of models to generate an output, e.g., a classification, for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Some neural networks include one or more convolutional neural network layers. Each convolutional neural network layer has an associated set of kernels. Each kernel includes values established by a neural network model created by a user. In some implementations, kernels identify particular image contours, shapes, or colors. Kernels can be represented as a matrix structure of weight inputs. Each convolutional layer can also process a set of activation inputs. The set of activation inputs can also be represented as a matrix structure.

Some existing systems perform computations for a given convolutional layer in software. For example, the software can apply each kernel for the layer to the set of activation inputs. That is, for each kernel, the software can overlay the kernel, which can be represented multi-dimensionally, over a first portion of the activation inputs, which can be represented multi-dimensionally. The software can then compute a dot product from the overlapped elements. The dot product can correspond to a single activation input, e.g., an activation input element that has an upper-left position in the overlapped multi-dimensional space. For example, using a sliding window, the software can then shift the kernel to overlay a second portion of the activation inputs and compute another dot product corresponding to another activation input. The software can repeatedly perform this process until each activation input has a corresponding dot product. In some implementations, the dot products are input to an activation function, which generates activation values. The activation values can be combined, e.g., pooled, before being sent to a subsequent layer of the neural network.
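The sliding-window procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patent's implementation; the function name and the list-of-lists matrix representation are assumptions made for the example.

```python
# Illustrative sketch of a software "valid" convolution: the kernel (a
# matrix of weight inputs) is overlaid on successive portions of the
# activation inputs, and a dot product is computed from the overlapping
# elements at each position of the sliding window.
def conv2d_valid(activations, kernel):
    kh, kw = len(kernel), len(kernel[0])
    ah, aw = len(activations), len(activations[0])
    output = []
    for i in range(ah - kh + 1):
        row = []
        for j in range(aw - kw + 1):
            # Dot product of the kernel with the window it currently
            # overlays; the result corresponds to the window's
            # upper-left activation element.
            dot = sum(kernel[m][n] * activations[i + m][j + n]
                      for m in range(kh) for n in range(kw))
            row.append(dot)
        output.append(row)
    return output
```

The doubly nested outer loop is exactly the "repeat until each activation input has a corresponding dot product" step, and it is this exhaustive loop structure that the systolic array described later replaces with parallel hardware.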

One way of computing convolution calculations requires numerous matrix multiplications in a large-dimensional space. A processor can compute the matrix multiplications through a brute-force method. For example, although compute-intensive and time-intensive, the processor can repeatedly calculate individual sums and products for the convolution calculations. The degree to which the processor parallelizes the calculations is limited by its architecture.

Summary of the Invention

In general, this specification describes a special-purpose hardware circuit that computes neural network inferences.

In general, one innovative aspect of the subject matter described in this specification can be embodied in a circuit for performing neural network computations for a neural network comprising a plurality of layers, the circuit comprising: a systolic array comprising a plurality of cells; a weight fetcher unit configured to, for each of the plurality of neural network layers, send, for the neural network layer, a plurality of weight inputs to cells along a first dimension of the systolic array; and a plurality of weight sequencer units, each weight sequencer unit coupled to a distinct cell along the first dimension of the systolic array, the plurality of weight sequencer units configured to, for each of the plurality of neural network layers, shift, for the neural network layer, the plurality of weight inputs to cells along a second dimension of the systolic array over a plurality of clock cycles, where each weight input is stored inside a respective cell along the second dimension, and where each cell is configured to compute a product of an activation input and the respective weight input using multiplication circuitry.

Implementations can include one or more of the following features. A value sequencer unit is configured to, for each of the plurality of neural network layers, send a plurality of activation inputs to cells along the second dimension of the systolic array for the neural network layer. The first dimension of the systolic array corresponds to rows of the systolic array, and the second dimension of the systolic array corresponds to columns of the systolic array. Each cell is configured to pass a weight control signal to an adjacent cell, the weight control signal causing circuitry in the adjacent cell to shift or load a weight input for the adjacent cell. Each cell can include: a weight path register configured to store the weight input being shifted to the cell; a weight register coupled to the weight path register; a weight control register configured to determine whether to store the weight input in the weight register; an activation register configured to store the activation input and configured to send the activation input to another activation register in a first adjacent cell along the first dimension; multiplication circuitry coupled to the weight register and the activation register, where the multiplication circuitry is configured to output a product of the weight input and the activation input; summation circuitry coupled to the multiplication circuitry and configured to receive the product and a first partial sum from a second adjacent cell along the second dimension, where the summation circuitry is configured to output a second partial sum of the product and the first partial sum; and a partial sum register coupled to the summation circuitry and configured to store the second partial sum, the partial sum register configured to send the second partial sum to another summation circuitry in a third adjacent cell along the second dimension. Each weight sequencer unit can include: a pause counter corresponding to the weight control register inside the corresponding cell coupled to the weight sequencer unit; and decrement circuitry configured to decrement an input to the weight sequencer unit to generate a decremented output and to send the decremented output to the pause counter. Values in each pause counter are the same, and each weight sequencer unit is configured to load a corresponding weight input into the corresponding distinct cell of the systolic array, where the loading comprises sending the weight input to the multiplication circuitry. The value in each pause counter can reach a predetermined value that causes the plurality of weight sequencer units to pause shifting of the plurality of weight inputs along the second dimension. The systolic array is configured to, for each of the plurality of neural network layers, generate an accumulated output for the neural network layer from each product.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Prefetching weights enables a neural network processor to perform computations more efficiently. The processor can coordinate loading weight inputs into the systolic array using a weight fetcher unit and a weight sequencer unit, thereby removing the need to couple an external memory unit to each cell of the systolic array. The processor can pause, i.e., "freeze," the shifting of the weight inputs to synchronize performing multiple convolution calculations.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of an example method for performing a computation for a given layer of a neural network.

FIG. 2 shows an example neural network processing system.

FIG. 3 shows an example architecture including a matrix computation unit.

FIG. 4 shows an example architecture of a cell inside a systolic array.

FIG. 5 shows an example matrix structure with spatial dimensions and a feature dimension.

FIG. 6 shows an example illustration of how a kernel matrix structure is sent to a systolic array.

FIG. 7 shows an example illustration of weight inputs inside cells after three clock cycles.

FIG. 8 is an example illustration of how control signals cause weight inputs to be shifted or loaded.

Like reference numbers and designations indicate like elements.

DETAILED DESCRIPTION

A neural network having multiple layers can be used to compute inferences. For example, given an input, the neural network can compute an inference for the input. The neural network computes this inference by processing the input through each of the layers of the neural network. In particular, the layers of the neural network are arranged in a sequence, each with a respective set of weights. Each layer receives an input and processes the input in accordance with the set of weights for the layer to generate an output.

Therefore, in order to compute an inference from a received input, the neural network receives the input and processes it through each of the neural network layers in the sequence to generate the inference, with the output from one neural network layer being provided as input to the next neural network layer. Data inputs to a neural network layer, e.g., either the input to the neural network or the outputs of the layer below the given layer in the sequence, can be referred to as activation inputs to the layer.
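The sequential flow above, where each layer's output becomes the activation input of the next layer, can be sketched as a short loop. The per-layer computation here is a plain matrix-vector product, used only as a stand-in for whatever a real layer computes; the names are illustrative.

```python
# Hedged sketch of sequential inference: the activation input is
# processed layer by layer, and each layer's output is fed forward as
# the activation input to the next layer in the sequence.
def run_inference(activation, layer_weights):
    for weights in layer_weights:
        # One layer: generate an output from the activation input in
        # accordance with the layer's set of weights.
        activation = [sum(w * a for w, a in zip(row, activation))
                      for row in weights]
    return activation  # the output of the final layer is the inference
```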

In some implementations, the layers of the neural network are arranged in a directed graph. That is, any particular layer can receive multiple inputs, multiple outputs, or both. The layers of the neural network can also be arranged such that an output of a layer can be sent back as an input to a previous layer.

FIG. 1 is a flow chart of an example process 100 for performing a computation for a given layer of a neural network using a special-purpose hardware circuit. For convenience, the method 100 will be described with respect to a system having one or more circuits that perform the method 100. The method 100 can be performed for each layer of the neural network in order to compute an inference from a received input.

The system receives sets of weight inputs (step 102) and sets of activation inputs (step 104) for the given layer. The sets of weight inputs and the sets of activation inputs can be received from dynamic memory and a unified buffer, respectively, of the special-purpose hardware circuit. In some implementations, both the sets of weight inputs and the sets of activation inputs are received from the unified buffer.

The system generates accumulated values from the weight inputs and the activation inputs using a matrix multiplication unit of the special-purpose hardware circuit (step 106). In some implementations, the accumulated values are dot products of the sets of weight inputs and the sets of activation inputs. That is, for one set of weights, the system can multiply each weight input with each activation input and sum the products together to form an accumulated value. The system can then compute dot products of other sets of weights with other sets of activation inputs.
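As a worked illustration of step 106, the accumulated value for one set of weights is simply the dot product of the weight inputs and the activation inputs. The function name is an illustrative choice:

```python
# Illustrative sketch of forming one accumulated value: multiply each
# weight input with the corresponding activation input and sum the
# products (a dot product).
def accumulate(weight_inputs, activation_inputs):
    total = 0
    for w, a in zip(weight_inputs, activation_inputs):
        total += w * a
    return total
```

For example, `accumulate([1, 2, 3], [4, 5, 6])` yields 4 + 10 + 18 = 32; the same computation would be repeated for each remaining set of weights.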

The system can generate a layer output from the accumulated values (step 108) using a vector computation unit of the special-purpose hardware circuit. In some implementations, the vector computation unit applies an activation function to the accumulated values. The output of the layer can be stored in the unified buffer for use as an input to a subsequent layer in the neural network, or can be used to determine the inference. The system finishes processing the neural network when a received input has been processed through each layer of the neural network to generate the inference for the received input.
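Step 108 can be sketched as applying an activation function elementwise to the vector of accumulated values. ReLU is used here purely as an assumed example, since the specification does not fix a particular activation function:

```python
# Hedged sketch of the vector computation unit's role in step 108:
# apply an activation function (ReLU assumed here) to each accumulated
# value to produce the layer output.
def layer_output(accumulated_values, activation_fn=lambda x: max(x, 0.0)):
    return [activation_fn(v) for v in accumulated_values]
```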

FIG. 2 shows an example special-purpose integrated circuit 200 for performing neural network computations. The system 200 includes a host interface 202. The host interface 202 can receive instructions that include parameters for a neural network computation. The parameters can include at least one or more of the following: how many layers should be processed; corresponding sets of weight inputs for each of the layers; an initial set of activation inputs, i.e., the input to the neural network from which the inference is to be computed; corresponding input and output sizes of each layer; a stride value for the neural network computation; and a type of layer to be processed, e.g., a convolutional layer or a fully connected layer.

The host interface 202 can send the instructions to a sequencer 206, which converts the instructions into low-level control signals that control the circuit to perform the neural network computations. In some implementations, the control signals regulate dataflow in the circuit, e.g., how the sets of weight inputs and the sets of activation inputs flow through the circuit. The sequencer 206 can send the control signals to a unified buffer 208, a matrix computation unit 212, and a vector computation unit 214. In some implementations, the sequencer 206 also sends control signals to a direct memory access engine 204 and dynamic memory 210. In some implementations, the sequencer 206 is a processor that generates clock signals. The sequencer 206 can use timing of the clock signals to send the control signals, at appropriate times, to each component of the circuit 200. In some other implementations, the host interface 202 passes in a clock signal from an external processor.

The host interface 202 can send the sets of weight inputs and the initial set of activation inputs to the direct memory access engine 204. The direct memory access engine 204 can store the sets of activation inputs at the unified buffer 208. In some implementations, the direct memory access engine stores the sets of weights to the dynamic memory 210, which can be a memory unit. In some implementations, the dynamic memory is located off of the circuit.

The unified buffer 208 is a memory buffer. It can be used to store the sets of activation inputs from the direct memory access engine 204 and outputs of the vector computation unit 214. The direct memory access engine 204 can also read the outputs of the vector computation unit 214 from the unified buffer 208.

The dynamic memory 210 and the unified buffer 208 can send the sets of weight inputs and the sets of activation inputs, respectively, to the matrix computation unit 212. In some implementations, the matrix computation unit 212 is a two-dimensional systolic array. The matrix computation unit 212 can also be a one-dimensional systolic array or other circuitry that can perform mathematical operations, e.g., multiplication and addition. In some implementations, the matrix computation unit 212 is a general-purpose matrix processor.

The matrix computation unit 212 can process the weight inputs and the activation inputs and provide a vector of outputs to the vector computation unit 214. In some implementations, the matrix computation unit sends the vector of outputs to the unified buffer 208, which sends the vector of outputs to the vector computation unit 214. The vector computation unit can process the vector of outputs and store a vector of processed outputs to the unified buffer 208. For example, the vector computation unit 214 can apply a non-linear function to outputs of the matrix computation unit, e.g., a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 214 generates normalized values, pooled values, or both. The vector of processed outputs can be used as activation inputs to the matrix computation unit 212, e.g., for use in a subsequent layer in the neural network. The matrix computation unit 212 is described in more detail below with reference to FIG. 3 and FIG. 4.

FIG. 3 shows an example architecture 300 including a matrix computation unit. The matrix computation unit is a two-dimensional systolic array 306. The array 306 includes multiple cells 304. In some implementations, a first dimension 320 of the systolic array 306 corresponds to columns of cells and a second dimension 322 of the systolic array 306 corresponds to rows of cells. The systolic array can have more rows than columns, more columns than rows, or an equal number of columns and rows.

In the illustrated example, value loaders 302 send activation inputs to rows of the array 306 and a weight fetcher interface 308 sends weight inputs to columns of the array 306. In some other implementations, however, activation inputs are transferred to the columns and weight inputs are transferred to the rows of the array 306.

The value loaders 302 can receive the activation inputs from a unified buffer, e.g., the unified buffer 208 of FIG. 2. Each value loader can send a corresponding activation input to a distinct left-most cell of the array 306. The left-most cell can be a cell along a left-most column of the array 306. For example, the value loader 312 can send an activation input to the cell 314. The value loader can also send the activation input to an adjacent value loader, and the activation input can be used at another left-most cell of the array 306. This allows activation inputs to be shifted for use in another particular cell of the array 306.

The weight fetcher interface 308 can receive weight inputs from a memory unit, e.g., the dynamic memory 210 of FIG. 2. The weight fetcher interface 308 can send a corresponding weight input to a distinct top-most cell of the array 306. The top-most cell can be a cell along a top-most row of the array 306. For example, the weight fetcher interface 308 can send weight inputs to cells 314 and 316.

In some implementations, a host interface, e.g., the host interface 202 of FIG. 2, shifts activation inputs throughout the array 306 along one dimension, e.g., to the right, while shifting weight inputs throughout the array 306 along another dimension, e.g., to the bottom. For example, over one clock cycle, the activation input at cell 314 can shift to an activation register in cell 316, which is to the right of cell 314. Similarly, the weight input at cell 316 can shift to a weight register at cell 318, which is below cell 314.

On each clock cycle, each cell can process a given weight input and a given activation input to generate an accumulated output. The accumulated output can also be passed to an adjacent cell along the same dimension as the given weight input. An individual cell is described further below with reference to FIG. 4.

In some implementations, weights and activations are shifted more than one cell during a given clock cycle to transition from one convolution calculation to another.

The accumulated output can be passed along the same column as the weight input, e.g., towards the bottom of the column in the array 306. In some implementations, at the bottom of each column, the array 306 can include accumulator units 310 that store and accumulate each accumulated output from each column when performing calculations with layers having more weight inputs than columns or layers having more activation inputs than rows. In some implementations, each accumulator unit stores multiple parallel accumulations. This is described further below with reference to FIG. 6. The accumulator units 310 can accumulate each accumulated output to generate a final accumulated value. The final accumulated value can be transferred to the vector computation unit. In some other implementations, the accumulator units 310 pass the accumulated values to the vector computation unit without performing any accumulations when processing layers with fewer weight inputs than columns or layers with fewer activation inputs than rows.
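The column-wise accumulation above can be modeled in software. The sketch below is a hedged, arithmetic-only model of a weight-stationary systolic matrix multiply: weights sit in cells, activations stream along rows, and partial sums flow down each column into that column's accumulator. It ignores cycle-by-cycle pipelining, and all names are illustrative.

```python
# Hedged software model of the systolic array's arithmetic: cell (i, j)
# holds weights[i][j]; the partial sum entering a cell from above is
# increased by the cell's weight times the activation streamed into its
# row, and the bottom of each column yields one accumulated output per
# streamed activation vector.
def systolic_matmul(weights, activations):
    rows = len(weights)        # cells along one dimension (rows)
    cols = len(weights[0])     # cells along the other dimension (columns)
    n = len(activations[0])    # number of activation vectors streamed in
    out = [[0.0] * n for _ in range(cols)]
    for j in range(cols):
        for t in range(n):
            partial = 0.0      # partial sum entering the top of column j
            for i in range(rows):
                # Multiply-accumulate in cell (i, j); the result is
                # passed down to the next cell in the column.
                partial += weights[i][j] * activations[i][t]
            out[j][t] = partial  # collected by column j's accumulator
    return out
```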

As activation inputs and weight inputs flow through the circuit, the circuit can "freeze" or pause the flow of a set of weight inputs so that accumulated values are accurately computed. That is, the circuit can pause a set of weight inputs so that a particular set of weight inputs can be applied to a particular set of activation inputs.

In some implementations, weight sequencers 324 configure whether weight inputs shift to adjacent cells. A weight sequencer 326 can receive a control value from a host, e.g., the host interface 202 of FIG. 2, or from an external processor. Each weight sequencer can pass the control value to a corresponding cell in the array 306. In particular, the control value can be stored in a weight control register in the cell, e.g., the weight control register 414 of FIG. 4. The control value can determine whether weight inputs are shifted along a dimension of the array or loaded, which is described below with reference to FIG. 8. A weight sequencer can also send the control value to an adjacent weight sequencer, which can regulate the shifting or loading of a corresponding weight input for its corresponding cell.

In some implementations, the control value is represented as an integer. Each weight sequencer can include a pause counter register that stores the integer. The weight sequencer can also decrement the integer before storing the control value in the pause counter register. After storing the control value in the pause counter register, the weight sequencer can send the integer to an adjacent weight sequencer and to the corresponding cell. For example, each weight sequencer can have decrement circuitry configured to generate a decremented integer from the control value. The decremented integer can be stored in the pause counter register. The stored control values can be used to coordinate a simultaneous pause of shifting across an entire column of the array, which is described further below with reference to FIG. 8.
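The decrement-and-forward chain above can be sketched as follows: each sequencer receives the integer from its neighbor, decrements it, stores the result in its pause counter register, and passes the decremented value on. This is an illustrative model with assumed names, not the circuit itself.

```python
# Hedged model of control-value propagation through a chain of weight
# sequencer units: each unit's decrement circuitry lowers the incoming
# integer by one before it is stored in that unit's pause counter
# register and forwarded to the next unit.
def propagate_pause_counts(control_value, num_sequencers):
    stored = []
    value = control_value
    for _ in range(num_sequencers):
        value -= 1           # decrement circuitry
        stored.append(value)  # pause counter register
    return stored
```

For example, an initial control value of 4 entering a chain of three sequencers leaves stored counts of 3, 2, and 1, so each counter reaches zero (the load condition) on a different, staggered cycle.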

In some implementations, pausing weights in the circuit enables a developer to debug the circuit.

Other methods of pausing weights are possible. For example, instead of passing values inside pause counter registers to adjacent pause counter registers, the control values can be passed using a tree. That is, at a given cell, the signal can be passed to all adjacent cells, and not just one adjacent cell, thereby causing the signal to propagate quickly throughout the systolic array.

FIG. 4 shows an example architecture 400 of a cell inside a systolic array, e.g., the systolic array 306 of FIG. 3.

The cell can include an activation register 406 that stores an activation input. The activation register can receive the activation input from a left adjacent cell, i.e., an adjacent cell located to the left of the given cell, or from a unified buffer, depending on the position of the cell within the systolic array. The cell can include a weight register 402 that stores a weight input. The weight input can be transferred from a top adjacent cell or from a weight fetcher interface, depending on the position of the cell within the systolic array. The cell can also include a sum in register 404. The sum in register 404 can store an accumulated value from the top adjacent cell. Multiplication circuitry 408 can be used to multiply the weight input from the weight register 402 with the activation input from the activation register 406. The multiplication circuitry 408 can output the product to summation circuitry 410.

The summation circuitry can sum the product and the accumulated value from the sum in register 404 to generate a new accumulated value. The summation circuitry 410 can then send the new accumulated value to another sum in register located in a bottom adjacent cell. The new accumulated value can be used as an operand for a summation in the bottom adjacent cell.
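The cell datapath described in the last two paragraphs reduces to a single multiply-accumulate per clock cycle, sketched here with illustrative names:

```python
# Hedged model of one cell's datapath: the multiplication circuitry
# multiplies the stored weight by the stored activation, and the
# summation circuitry adds the product to the accumulated value from
# the sum in register, producing the value passed to the cell below.
def cell_step(weight_register, activation_register, sum_in):
    product = weight_register * activation_register  # multiplication circuitry
    return sum_in + product                          # summation circuitry
```

For example, a cell holding weight 3 that receives activation 2 and an incoming accumulated value of 4 passes 10 to its bottom adjacent cell.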

In some implementations, the cell also includes a general control register. The control register can store a control signal that determines whether the cell should shift either the weight input or the activation input to adjacent cells. In some implementations, shifting the weight input or the activation input takes more than one clock cycle. The control signal can also determine whether the activation input or the weight input is transferred to the multiplication circuitry 408, or can determine whether the multiplication circuitry 408 operates on the activation input or the weight input. The control signal can also be passed to one or more adjacent cells, e.g., using a wire.

在一些实现中,权重被预先转移到权重路径寄存器412。权重路径寄存器412能够例如从顶部的相邻单元格接收权重输入,并且基于控制信号将权重输入传递到权重寄存器402。权重寄存器402能够静态地存储权重输入,使得当在多个时钟周期的过程中、激活输入例如通过激活寄存器406传递到单元格时,权重输入保留在单元格中,不被传递到相邻单元格。因此,能够使用乘法电路408将权重输入应用到多个激活输入,并且相应的累加值能够传递到相邻单元格。In some implementations, the weights are pre-transferred to the weight path register 412. The weight path register 412 can receive weight inputs, for example, from the top neighboring cell, and pass the weight inputs to the weight register 402 based on a control signal. The weight register 402 can statically store the weight inputs so that when activation inputs are passed to the cell, for example, via the activation register 406, over the course of multiple clock cycles, the weight inputs remain in the cell and are not passed to neighboring cells. Thus, the weight inputs can be applied to multiple activation inputs using the multiplication circuit 408, and the corresponding accumulated values can be passed to neighboring cells.

在一些实现中,权重控制寄存器414控制权重输入是否存储在权重寄存器402中。例如,如果权重控制寄存器414存储为0的控制值,则权重寄存器402能够存储由权重路径寄存器412发送的权重输入。在一些实现中,将权重输入存储到权重寄存器402中被称为加载权重输入。一旦加载了权重输入,权重输入就能够被发送到乘法电路408,进行处理。如果权重控制寄存器414存储了非零的控制值,则权重寄存器402能够忽略由权重路径寄存器412发送的权重输入。存储在权重控制寄存器414中的控制值能够被传递到一个以上的相邻单元格,例如对于给定的单元格,控制值能够被发送到位于给定单元格的右侧的单元格中的权重控制寄存器。In some implementations, the weight control register 414 controls whether the weight input is stored in the weight register 402. For example, if the weight control register 414 stores a control value of 0, the weight register 402 can store the weight input sent by the weight path register 412. In some implementations, storing the weight input in the weight register 402 is referred to as loading the weight input. Once the weight input is loaded, the weight input can be sent to the multiplication circuit 408 for processing. If the weight control register 414 stores a non-zero control value, the weight register 402 can ignore the weight input sent by the weight path register 412. The control value stored in the weight control register 414 can be passed to more than one adjacent cell, for example, for a given cell, the control value can be sent to the weight control register in the cell to the right of the given cell.
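The load-versus-ignore behavior of the weight control register 414 amounts to a simple gate. The following sketch assumes the convention stated above (a control value of 0 means load); the function name and signature are illustrative, not part of the described hardware.

```python
def update_weight_register(current_weight, weight_path_value, control_value):
    """Return the new contents of weight register 402 (illustrative model).

    A control value of 0 loads the weight input sent by weight path
    register 412; any non-zero control value leaves the stored weight
    unchanged, i.e., the incoming weight input is ignored.
    """
    if control_value == 0:
        return weight_path_value  # load the new weight input
    return current_weight         # ignore the incoming weight input
```

Under this model, a cell holding weight 7 that receives weight 9 loads it only when the control value is 0.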

单元格还能够将权重输入和激活输入转移到相邻单元格。例如,权重路径寄存器412能够将权重输入发送到底部的相邻单元格中的另一权重路径寄存器。激活寄存器406能够将激活输入发送到右侧的相邻单元格中的另一激活寄存器。因此,权重输入以及激活输入均能够在随后的时钟周期中,被阵列中的其它单元格再次使用。Cells can also transfer weight inputs and activation inputs to adjacent cells. For example, weight path register 412 can send a weight input to another weight path register in the adjacent cell at the bottom. Activation register 406 can send an activation input to another activation register in the adjacent cell on the right. Thus, both the weight input and the activation input can be reused by other cells in the array in subsequent clock cycles.

图5示出具有空间维度和特征维度的示例性矩阵结构500。矩阵结构500能够表示一组激活输入或一组权重输入。在本说明书中,将用于一组激活输入的矩阵结构称为激活矩阵结构,将用于一组权重输入的矩阵结构称为核矩阵结构。矩阵结构500具有三个维度:两个空间维度和一个特征维度。FIG5 shows an exemplary matrix structure 500 having spatial dimensions and a feature dimension. Matrix structure 500 can represent either a set of activation inputs or a set of weight inputs. In this specification, a matrix structure for a set of activation inputs is referred to as an activation matrix structure, and a matrix structure for a set of weight inputs is referred to as a kernel matrix structure. Matrix structure 500 has three dimensions: two spatial dimensions and one feature dimension.

在一些实现中,空间维度对应于一组激活输入的空间或位置。例如,如果神经网络正在处理具有两个维度的图像,则矩阵结构能够具有与图像的空间坐标、即XY坐标相对应的两个空间维度。In some implementations, the spatial dimensions correspond to the space or location of a set of activation inputs. For example, if the neural network is processing an image having two dimensions, the matrix structure can have two spatial dimensions corresponding to the spatial coordinates of the image, i.e., XY coordinates.

特征维度对应于激活输入的特征。各个特征维度能够具有深度级;例如,矩阵结构500具有深度级502、504和506。作为例示,如果矩阵结构500表示作为一组激活输入发送到第一层的3×3×3图像,则图像的X维度和Y维度(3×3)可以是空间维度,而Z维度(3)可以是与R、G和B值相对应的特征维度。即,深度级502可以对应于九个“1”激活输入的特征,例如红色值,深度级504可以对应于九个“2”激活输入的特征,例如绿色值,并且深度级506可以对应于九个“3”激活输入的特征,例如蓝色值。The feature dimensions correspond to features of the activation inputs. Each feature dimension can have a depth level; for example, the matrix structure 500 has depth levels 502, 504, and 506. As an example, if the matrix structure 500 represents a 3×3×3 image sent as a set of activation inputs to the first layer, the X and Y dimensions of the image (3×3) can be spatial dimensions, while the Z dimension (3) can be a feature dimension corresponding to the R, G, and B values. That is, depth level 502 can correspond to features of nine "1" activation inputs, such as red values, depth level 504 can correspond to features of nine "2" activation inputs, such as green values, and depth level 506 can correspond to features of nine "3" activation inputs, such as blue values.
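The 3×3×3 example above can be written down concretely as nested lists, with the feature (Z) dimension indexing the depth levels. This is one possible in-memory encoding for illustration, not the processor's actual storage layout.

```python
# matrix_structure[z][y][x]: z indexes the feature dimension (the depth
# level), while y and x index the two spatial dimensions of the 3x3x3 image.
matrix_structure = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # depth level 502: nine "1" inputs, e.g. red values
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],  # depth level 504: nine "2" inputs, e.g. green values
    [[3, 3, 3], [3, 3, 3], [3, 3, 3]],  # depth level 506: nine "3" inputs, e.g. blue values
]
```

Indexing `matrix_structure[z]` then yields exactly one depth level, which is the unit the processor later sends to a row of the systolic array.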

虽然在图5的实例中,仅图示了特征维度的三个深度级,但是给定的特征维度能够具有大数量的、例如上百个的深度级。类似地,虽然仅示出了一个特征维度,但是给定的矩阵结构能够具有多个特征维度。Although only three depth levels of the feature dimension are illustrated in the example of Figure 5, a given feature dimension can have a large number of depth levels, e.g., hundreds. Similarly, although only one feature dimension is shown, a given matrix structure can have multiple feature dimensions.

为使用矩阵结构500对卷积层进行计算,系统需要将卷积计算转换为二维矩阵乘法。In order to perform calculations on the convolution layer using the matrix structure 500, the system needs to convert the convolution calculation into a two-dimensional matrix multiplication.
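A standard way to express a convolution as a two-dimensional matrix multiplication is the "im2col" transformation: every kernel-sized patch of the input is flattened into one row, so the convolution becomes a matrix product with the flattened kernel. The sketch below is a generic illustration of that idea under those assumptions, not the processor's actual conversion logic.

```python
def conv2d_direct(image, kernel):
    # Reference implementation: valid (no padding, stride 1) 2-D convolution
    # written as an explicit sliding-window dot product.
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(ow)]
            for i in range(oh)]

def im2col(image, kh, kw):
    # Flatten every kh x kw patch into one row; the convolution then
    # reduces to a matrix-vector product with the flattened kernel.
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    return [[image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(oh) for j in range(ow)]
```

Multiplying the patch matrix by the flattened kernel reproduces the direct convolution result element for element.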

图6示出了如何在给定的卷积层处由脉动阵列606处理图5的矩阵结构500的示例性图示。矩阵结构600可以是一组激活输入。一般地,神经网络处理器能够将激活输入(例如,矩阵结构600中的元素)以及权重输入(例如,核A-D 610)分别发送到阵列的行和列。激活输入和权重输入能够分别向脉动阵列的右侧和底部转移,并且必须到达特定位置,例如,特定单元格处的特定寄存器。一旦例如通过检查控制信号而确定这些输入到位,则处理器能够使用存储在单元格中的输入进行计算,以生成给定层的输出。FIG6 shows an exemplary diagram of how the matrix structure 500 of FIG5 is processed by the systolic array 606 at a given convolutional layer. The matrix structure 600 can be a set of activation inputs. Generally, the neural network processor can send activation inputs (e.g., elements in the matrix structure 600) and weight inputs (e.g., kernels A-D 610) to the rows and columns of the array, respectively. The activation inputs and weight inputs can be transferred to the right and bottom of the systolic array, respectively, and must arrive at specific locations, such as specific registers at specific cells. Once these inputs are determined to be in place, for example by examining control signals, the processor can perform calculations using the inputs stored in the cells to generate the output of the given layer.

如上所述,神经网络处理器在将结构600的一部分发送到脉动阵列的行之前,使矩阵结构600“平坦化”。即,神经网络处理器能够分离矩阵结构600的深度层,例如图6的深度层602、604和606,并且将各个深度层发送到不同的单元格。在一些实现中,各个深度层被发送到脉动阵列606的不同行上的单元格。例如,处理器能够将来自第一深度层的激活输入,例如九个“1”激活输入的矩阵,发送到脉动阵列606的第一行处的最左侧的单元格,将第二深度层,例如九个“2”激活输入的矩阵,发送到第二行处的最左侧的单元格,将第三深度层,例如九个“3”激活输入的矩阵,发送到第三行处的最左侧的单元格,等等。As described above, the neural network processor "flattens" the matrix structure 600 before sending a portion of the structure 600 to a row of the systolic array. That is, the neural network processor can separate the depth layers of the matrix structure 600, such as depth layers 602, 604, and 606 of FIG. 6, and send each depth layer to a different cell. In some implementations, each depth layer is sent to a cell on a different row of the systolic array 606. For example, the processor can send activation inputs from a first depth layer, such as a matrix of nine "1" activation inputs, to the leftmost cell at the first row of the systolic array 606, send a second depth layer, such as a matrix of nine "2" activation inputs, to the leftmost cell at the second row, send a third depth layer, such as a matrix of nine "3" activation inputs, to the leftmost cell at the third row, and so on.

给定层可以具有多个核,例如核A-D 610。核A-D 610可以具有维度3×3×10的矩阵结构。处理器可以将各个核矩阵结构发送到脉动阵列606的不同列处的单元格。例如,核A可以发送到第一列中的顶部的单元格,核B可以发送到第二列中的顶部的单元格,等等。A given layer may have multiple cores, such as cores A-D 610. Cores A-D 610 may have a matrix structure of dimensions 3×3×10. The processor may send each core matrix structure to a cell at a different column of the systolic array 606. For example, core A may be sent to the top cell in the first column, core B may be sent to the top cell in the second column, and so on.

当矩阵结构发送到单元格时,矩阵的第一元素可以在一个时钟周期期间存储在单元格中。在下一个时钟周期,下一个元素可以存储在单元格中。存储的第一元素可以转移到相邻的单元格,如以上参考图4所述。输入的转移可以持续进行,直到矩阵结构的所有元素都存储在脉动阵列606中。激活输入和权重输入二者在转移通过给定单元格时均可能花费一个以上的时钟周期。下面将参考图7进一步描述脉动阵列中的输入的转移。When the matrix structure is sent to a cell, the first element of the matrix can be stored in the cell during one clock cycle. In the next clock cycle, the next element can be stored in the cell. The stored first element can be shifted to an adjacent cell, as described above with reference to FIG4. The shifting of inputs can continue until all elements of the matrix structure are stored in the systolic array 606. Both activation inputs and weight inputs can take more than one clock cycle to shift through a given cell. The shifting of inputs in the systolic array will be further described below with reference to FIG7.

图7示出三个时钟周期之后,在示例性3×3脉动阵列的单元格内的权重输入的示例性图示700。各个单元格能够存储权重输入和激活输入,如以上参考图4所述。权重输入能够发送到脉动阵列的不同列,以用于卷积计算,如以上参考图6所述。作为例示,系统将具有1、2和4的权重输入的第一核矩阵结构发送到脉动阵列的第一列。系统将具有3、5和7的权重输入的第二核结构发送到第二列。系统将具有6、8和10的权重的第三核结构发送到第三列。在每个时钟周期之后,权重输入能够在一个维度上转移,例如从顶部向底部转移,同时,激活输入能够在另一个维度上转移(未示出),例如从左侧向右侧转移。FIG7 shows an exemplary diagram 700 of weight inputs within a cell of an exemplary 3×3 systolic array after three clock cycles. Each cell can store weight inputs and activation inputs, as described above with reference to FIG4. The weight inputs can be sent to different columns of the systolic array for convolution computations, as described above with reference to FIG6. As an example, the system sends a first kernel matrix structure with weight inputs of 1, 2, and 4 to the first column of the systolic array. The system sends a second kernel structure with weight inputs of 3, 5, and 7 to the second column. The system sends a third kernel structure with weights of 6, 8, and 10 to the third column. After each clock cycle, the weight inputs can be shifted in one dimension, such as from top to bottom, while the activation inputs can be shifted in another dimension (not shown), such as from left to right.

权重输入能够以交错方式存储在单元格中。即,第一时钟周期702之后的脉动阵列的状态在顶部左侧单元格中显示“1”。“1”代表存储在单元格中的权重输入“1”。在下一个时钟周期704,“1”转移到顶部左侧单元格下方的单元格,并且来自核的另一个权重输入“2”存储在顶部左侧单元格中,以及权重输入“3”存储在第二列的最顶部单元格处。The weight inputs can be stored in cells in an interleaved manner. That is, the state of the systolic array after the first clock cycle 702 shows a "1" in the top left cell. The "1" represents the weight input "1" stored in the cell. In the next clock cycle 704, the "1" is transferred to the cell below the top left cell, and another weight input "2" from the core is stored in the top left cell, and the weight input "3" is stored at the top cell of the second column.

在第三时钟周期706,再次转移各个权重。在第一列中,最底部单元格存储权重输入“1”,权重输入“2”被存储于在上一个周期中存储权重输入“1”的位置,并且权重输入“4”存储在最顶部左侧的单元格中。类似地,在第二列中,“3”向下转移,并且权重输入“5”存储在顶部中间的单元格中。在第三列中,权重输入“6”存储在最顶部右侧的单元格中。In the third clock cycle 706, the weights are shifted again. In the first column, the bottom cell stores the weight input "1," the weight input "2" is stored in the location where the weight input "1" was stored in the previous cycle, and the weight input "4" is stored in the top left cell. Similarly, in the second column, "3" is shifted downward, and the weight input "5" is stored in the top middle cell. In the third column, the weight input "6" is stored in the top right cell.
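The staggered pattern described above can be reproduced with a small simulation that shifts every column down by one cell per clock cycle while feeding each column's next kernel value in at the top. This is an illustrative model only; the per-column feed schedule (column c starts one cycle after column c-1) is inferred from the figure's description.

```python
def shift_weights(feeds, num_rows, num_cycles):
    """Simulate per-cycle top-to-bottom weight shifting (illustrative model).

    feeds[c] is the sequence of weight inputs pushed into column c, with
    column c beginning its feed at clock cycle c + 1, which produces the
    staggered (diagonal) loading pattern shown in the figure.
    """
    num_cols = len(feeds)
    grid = [[None] * num_cols for _ in range(num_rows)]
    for cycle in range(1, num_cycles + 1):
        # Shift every stored weight one row toward the bottom.
        for r in range(num_rows - 1, 0, -1):
            for c in range(num_cols):
                grid[r][c] = grid[r - 1][c]
        # Feed the next weight of each active column into the top row.
        for c in range(num_cols):
            idx = cycle - 1 - c  # column c starts feeding at cycle c + 1
            grid[0][c] = feeds[c][idx] if 0 <= idx < len(feeds[c]) else None
    return grid
```

Running three cycles with feeds of (1, 2, 4), (3, 5), and (6) reproduces the state after clock cycle 706: the first column holds 4, 2, 1 from top to bottom, the second holds 5 and 3, and the third holds 6.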

在一些实现中,判定是否应该转移权重输入的、用于权重输入的控制信号也与权重输入一起转移。In some implementations, a control signal for the weight input that determines whether the weight input should be transferred is also transferred along with the weight input.

激活输入可以以相似的方式在其它维度上转移,例如,从左侧向右侧转移。Activation inputs can be shifted in other dimensions in a similar manner, for example, from left to right.

一旦激活输入和权重输入就位,则处理器可以例如通过使用单元格中的乘法电路和加法电路进行卷积计算,以生成要在矢量计算单元中使用的一组累加值。Once the activation inputs and weight inputs are in place, the processor can perform a convolution calculation, for example, by using the multiplication and addition circuits in the cell to generate a set of accumulated values to be used in the vector computation unit.

虽然以权重输入发送到阵列的列以及激活输入发送到阵列的行来描述了系统,但在一些实现中,权重输入被发送到阵列的行,而激活输入被发送到阵列的列。Although the system is described with weight inputs sent to the columns of the array and activation inputs sent to the rows of the array, in some implementations, weight inputs are sent to the rows of the array and activation inputs are sent to the columns of the array.

图8是控制值如何能够使得权重输入被转移或加载的示例性图示。控制值806能够由主机发送,并且能够由权重定序器存储,如以上参考图3所述。对于每个时钟周期802,图中的值表示分别与脉动阵列的行1-4 804相对应的权重定序器808-814中所存储的控制值。FIG8 is an exemplary illustration of how control values can cause weight inputs to be shifted or loaded. The control values 806 can be sent by a host and can be stored by the weight sequencers, as described above with reference to FIG3. For each clock cycle 802, the values in the figure represent the control values stored in the weight sequencers 808-814, which correspond to rows 1-4 804 of the systolic array, respectively.

在一些实现中,如果给定的权重定序器中的控制值为非零值,则在脉动阵列的对应的单元格中的权重输入将被转移到相邻单元格。如果给定的权重定序器中的控制值是零,则权重输入能够被加载到对应的单元格中,并用于计算与单元格中的激活输入的乘积。In some implementations, if the control value in a given weight sequencer is non-zero, the weight input in the corresponding cell of the systolic array will be transferred to the adjacent cell. If the control value in a given weight sequencer is zero, the weight input can be loaded into the corresponding cell and used to calculate the product with the activation input in the cell.

作为例示,主机能够确定四个权重输入应该在加载之前被转移。在时钟周期0中,主机能够将控制值5发送到权重定序器808,即与行1对应的权重定序器。权重定序器808包括递减电路,该递减电路花费一个时钟周期以基于控制值5输出控制值4。因此,控制值4在随后的时钟周期、即时钟周期1中,被存储在权重定序器808中。As an example, the host can determine that four weight inputs should be shifted before loading. In clock cycle 0, the host can send a control value of 5 to the weight sequencer 808, i.e., the weight sequencer corresponding to row 1. The weight sequencer 808 includes a decrement circuit that takes one clock cycle to output a control value of 4 based on the control value 5. Therefore, the control value of 4 is stored in the weight sequencer 808 in the subsequent clock cycle, i.e., clock cycle 1.

在时钟周期1中,主机将控制值4发送到权重定序器808。因此,在时钟周期2中,权重定序器808例如使用递减电路存储控制值3。在时钟周期1中,权重定序器808能够将控制值4发送到权重定序器810。因此,在时钟周期2中,在利用权重定序器810的递减电路处理控制值4之后,权重定序器810能够存储控制值3。In clock cycle 1, the host sends a control value of 4 to the weight sequencer 808. Thus, in clock cycle 2, the weight sequencer 808 stores a control value of 3 using, for example, a decrement circuit. In clock cycle 1, the weight sequencer 808 can send the control value of 4 to the weight sequencer 810. Thus, in clock cycle 2, after processing the control value of 4 using the decrement circuit of the weight sequencer 810, the weight sequencer 810 can store the control value of 3.

相似地,主机能够分别在时钟周期2、3和4中发送控制值3、2和1。因为当递减控制值时,各个权重定序器808-814中的递减电路产生延迟,所以在各个时钟周期中递减控制值806能够最终使得每个权重定序器存储相同的控制值,即时钟周期4中的控制值1,以及时钟周期5中的控制值0。Similarly, the host can send control values of 3, 2, and 1 in clock cycles 2, 3, and 4, respectively. Because the decrement circuitry in each weight sequencer 808-814 introduces a delay when decrementing the control value, decrementing the control value 806 in each clock cycle can ultimately cause each weight sequencer to store the same control value, i.e., a control value of 1 in clock cycle 4 and a control value of 0 in clock cycle 5.
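The cascade of decremented control values can be checked with a small simulation under the stated assumption that each sequencer's decrement circuit introduces exactly one clock cycle of delay and forwards its stored value downstream. The function below is an illustrative model, not the hardware.

```python
def simulate_sequencers(host_values, num_sequencers, num_cycles):
    """Simulate the chain of weight sequencers (illustrative model).

    stored[i][t] is the control value held by weight sequencer i at the
    start of clock cycle t (None while the pipeline is still filling).
    """
    stored = [[None] * (num_cycles + 1) for _ in range(num_sequencers)]
    for t in range(num_cycles):
        for i in range(num_sequencers):
            # Sequencer 0 is fed by the host; every other sequencer is fed
            # by the value its upstream neighbor held during this cycle.
            upstream = host_values[t] if i == 0 else stored[i - 1][t]
            if upstream is not None:
                # The decrement circuit adds a one-cycle delay.
                stored[i][t + 1] = upstream - 1
    return stored
```

Feeding the host sequence 5, 4, 3, 2, 1 over clock cycles 0 through 4 makes all four sequencers hold 1 at clock cycle 4 and 0 at clock cycle 5, matching the description above.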

在一些实现中,当各个权重定序器输出控制值0时,脉动阵列暂停权重输入的转移,并且在各个单元格中加载权重输入。即,通过加载权重输入,脉动阵列使得权重输入能够用作点积计算中的操作数,从而开始处理神经网络中的层。In some implementations, when each weight sequencer outputs a control value of 0, the systolic array pauses the transfer of the weight input and loads the weight input in each cell. That is, by loading the weight input, the systolic array enables the weight input to be used as an operand in a dot product calculation, thereby starting processing the layer in the neural network.

在一些实现中,在完成计算之后,为了重新开始转移权重,主机将控制值变成非零值,例如在时钟周期7期间发送控制值5。转移过程能够如上所述地参考时钟周期0重复。In some implementations, after the calculations are completed, to restart the transfer of weights, the host changes the control value to a non-zero value, such as sending a control value of 5 during clock cycle 7. The transfer process can be repeated as described above with reference to clock cycle 0.

在一些实现中,控制值以另一偏置、例如1开始。In some implementations, the control value starts with another offset, such as 1.

本说明书中描述的主题和功能操作的实施例能够以数字电路实现,以有形实现的计算机软件或固件实现,以计算机硬件、包括本说明书中公开的结构及其结构等同物实现,或者以它们中的一个或多个的组合实现。本说明书中描述的主题的实施例可以被实现为一个或多个计算机程序,即在有形的非暂时性程序载体上编码的计算机程序指令的一个或多个模块,其用于由数据处理设备执行或者控制数据处理设备的操作。代替地或附加地,程序指令可以被编码在人工生成的传播信号上,例如机器生成的电信号、光信号或电磁信号,该传播信号被生成以对信息进行编码,用于传输到合适的接收机设备,以由数据处理设备执行。计算机存储介质可以是机器可读存储装置、机器可读存储基板、随机或串行存取存储装置,或它们中的一个或多个的组合。Embodiments of the subject matter and functional operations described in this specification can be implemented in digital circuitry, in tangibly implemented computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more thereof. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by a data processing device or for controlling the operation of the data processing device. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to a suitable receiver device for execution by the data processing device. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access storage device, or a combination of one or more thereof.

术语“数据处理设备”包含所有种类的用于处理数据的设备、装置和机器,例如包括可编程处理器、计算机,或多个处理器或计算机。该设备可以包括专用逻辑电路,例如FPGA(现场可编程门阵列),或ASIC(专用集成电路)。除了硬件之外,该设备还可以包括为相关的计算机程序创建执行环境的代码,例如构成处理器固件、协议栈、数据库管理系统、操作系统或者它们中的一个或多个的组合的代码。The term "data processing equipment" encompasses all types of equipment, devices, and machines for processing data, including, for example, a programmable processor, a computer, or multiple processors or computers. The equipment may include dedicated logic circuits, such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). In addition to hardware, the equipment may also include code that creates an execution environment for the associated computer program, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of these.

可以以任何形式的编程语言,包括编译语言或解释语言或者纯粹功能性语言或声明性语言或程序性语言,来编写计算机程序(其也可以被称为程序、软件、软件应用程序、模块、软件模块、脚本或代码),并且该计算机程序可以以任何形式部署,包括作为独立程序或作为模块、组件、子程序或适用于计算环境的其它单元来部署。计算机程序可以但不必需对应于文件系统中的文件。程序可以存储在保存其它程序或数据的文件的一部分中,例如存储在标记语言文档中的一个或多个脚本,可以存储在专用于该程序的单个文件中,或者可以存储在多个协调的文件中,例如存储一个或多个模块、子程序或代码部分的文件。可以将计算机程序部署为在一台计算机上或在多台计算机上执行,所述多台计算机位于一个站点上或在分布在多个站点上并通过通信网络互连。A computer program (which may also be referred to as a program, software, software application, module, software module, script, or code) may be written in any form of programming language, including compiled or interpreted languages or purely functional or declarative or procedural languages, and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for a computing environment. A computer program may, but need not, correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data, such as one or more scripts stored in a markup language document, in a single file dedicated to the program, or in multiple coordinated files, such as files that store one or more modules, subroutines, or code portions. A computer program may be deployed to execute on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communications network.

本说明书中描述的过程和逻辑流程可以由执行一个或多个计算机程序的一个或多个可编程计算机执行,以通过对输入数据进行操作并产生输出来执行功能。所述处理和逻辑流程也可以由专用逻辑电路,例如FPGA(现场可编程门阵列)或ASIC(专用集成电路),来执行,并且装置也可以被实现为专用逻辑电路。The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can be implemented as, special purpose logic circuitry, such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

例如,适用于执行计算机程序的计算机可以基于通用微处理器或专用微处理器或者这两者,或者基于任何其它类型的中央处理单元。一般地,中央处理单元将从只读存储器或随机存取存储器或从这两者接收指令和数据。计算机的基本要素是用于执行或运行指令的中央处理单元,以及用于存储指令和数据的一个或多个存储器件。一般地,计算机还会包括用于存储数据的一个或多个大容量存储设备,例如磁盘、磁光盘或光盘,或者在操作上耦合到该大容量存储设备,以从该大容量存储设备接收数据或者将数据传递到该大容量存储设备,或者二者。然而,计算机不必需具有这样的设备。而且,计算机可以嵌入在另一设备中,例如仅举几例,移动电话、个人数字助理(PDA)、移动音频播放器或移动视频播放器、游戏控制台、全球定位系统(GPS)接收器或便携式存储设备,例如通用串行总线(USB)闪存驱动器。For example, a computer suitable for executing a computer program can be based on a general-purpose or special-purpose microprocessor or both, or on any other kind of central processing unit. Generally, the central processing unit will receive instructions and data from a read-only memory or a random access memory or from both. The essential elements of a computer are a central processing unit for executing or performing instructions, and one or more memory devices for storing instructions and data. Generally, a computer also includes one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or is operatively coupled to such devices to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as, to name a few, a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, such as a universal serial bus (USB) flash drive.

适用于存储计算机程序指令和数据的计算机可读介质包括所有形式的非易失性存储器、介质和存储器件,例如包括:半导体存储器件,例如EPROM、EEPROM和闪速存储器件;磁盘,例如内部硬盘,或可移动磁盘;磁光盘;以及CD ROM盘和DVD-ROM盘。处理器和存储器可由专用逻辑电路来补充,或集成在其中。Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and storage devices, including, for example: semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD ROM and DVD-ROM disks. The processor and memory may be supplemented by, or incorporated in, special purpose logic circuitry.

为提供与用户的交互,本说明书中描述的主题的实施例可以在一计算机上实现,该计算机具有:显示设备,例如CRT(阴极射线管)监视器,或LCD(液晶显示器)监视器,以用于将信息显示给用户;以及键盘和指点设备,例如鼠标或轨迹球,用户可以利用该设备向计算机发送输入。也可以使用其它类型的设备来提供与用户的交互;例如,提供给用户的反馈可以是任何形式的感觉反馈,例如视觉反馈、听觉反馈或触觉反馈;并且可以以任何形式接收来自用户的输入,包括声音输入、语音输入或触觉输入。此外,计算机可以通过向用户使用的设备发送文档或从用户使用的设备接收文档,来与用户进行交互;例如,通过响应于从网络浏览器接收到的请求,将网页发送到用户的客户端设备上的网络浏览器。To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having: a display device, such as a CRT (cathode ray tube) monitor or an LCD (liquid crystal display) monitor, for displaying information to the user; and a keyboard and pointing device, such as a mouse or trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to or receiving documents from a device used by the user; for example, by sending web pages to a web browser on the user's client device in response to requests received from the web browser.

本说明书中描述的主题的实施例能够在包括例如作为数据服务器的后端组件的计算系统中实现;或者在包括中间件组件、例如应用服务器的计算系统中实现;或者在包括前端组件的计算系统中实现,所述前端组件例如是具有图形用户界面或网络浏览器的客户端计算机,用户可以通过该界面或浏览器与本说明书中描述的主题的实现进行交互;或者在包括一个或多个这样的后端组件、中间件组件或前端组件的任何组合的计算系统中实现。系统的组件可以通过任何形式或媒介的数字数据通信、例如通信网络互连。通信网络的实例包括局域网(“LAN”)和广域网(“WAN”),例如因特网。Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, such as a data server; or in a computing system that includes a middleware component, such as an application server; or in a computing system that includes a front-end component, such as a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification; or in a computing system that includes any combination of one or more such back-end components, middleware components, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, such as a communication network. Examples of communication networks include local area networks ("LANs") and wide area networks ("WANs"), such as the Internet.

计算系统可以包括客户端和服务器。客户端和服务器通常互相远离,并且通常通过通信网络交互。客户端与服务器之间的关系借助于在相应的计算机上运行且彼此间具有客户端-服务器关系的计算机程序而产生。Computing systems can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship between client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

虽然本说明书包含许多具体的实现细节,但是这些不应被解释为限制任何发明的或可能要求保护内容的范围,而应被解释为对特定发明的特定实施例的具体特征的描述。本说明书中在分别实施例的上下文中描述的一些特征也可以在单个实施例中以组合方式实现。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适子组合的形式在多个实施例中实现。此外,虽然以上可以将特征描述为以一些组合的方式起作用,并且甚至最初要求这样进行保护,但要求保护的组合中的一个或多个特征在一些情况下可以从组合中去除,并且所要求的组合可以涉及子组合或子组合的变化。Although this specification contains many specific implementation details, these should not be interpreted as limiting the scope of any invention or the content that may be claimed for protection, but should be interpreted as a description of the specific features of the specific embodiments of a particular invention. Some features described in the context of each embodiment in this specification may also be implemented in combination in a single embodiment. Conversely, the various features described in the context of a single embodiment may also be implemented in multiple embodiments individually or in the form of any suitable sub-combination. In addition, although the features may be described above as working in some combinations, and even initially claimed to be protected in this way, one or more features in the claimed combination may be removed from the combination in some cases, and the claimed combination may involve a sub-combination or a change in the sub-combination.

类似地,虽然这些操作在附图中以特定顺序示出,但是这不应被理解为:为了实现期望的结果,需要这些操作以所示特定顺序或按次序来执行,或者需要执行所有所示的操作。在一些情况下,多任务和并行处理可能是有利的。此外,上述实施例中的各种系统模块和组件的分离不应被理解为在所有实施例中都要求这样的分离,而应理解,所述的程序组件和系统通常可以一起集成在单个软件产品中,或打包成多个软件产品。Similarly, although the operations are shown in a particular order in the accompanying drawings, this should not be understood as requiring that the operations be performed in the particular order shown or in sequence, or that all of the operations shown be performed, in order to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. In addition, the separation of various system modules and components in the above-described embodiments should not be understood as requiring such separation in all embodiments. Instead, it should be understood that the program components and systems described can generally be integrated together in a single software product, or packaged into multiple software products.

已经描述了主题的特定实施例。其它的实施例在所附权利要求的范围内。例如,权利要求中记载的动作可以以不同顺序执行,并且仍然实现期望的结果。作为一个示例,附图中所描绘的过程不一定要求所示的特定顺序或次序来实现期望的结果。在一些实现中,多任务和并行处理可能是有利的。Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve the desired results. As an example, the processes depicted in the accompanying figures do not necessarily require the specific order or sequence shown to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.

Claims (20)

1.一种用于对包括多个层的神经网络执行神经网络计算的电路,所述电路包括:1. A circuit for performing neural network computation on a neural network comprising multiple layers, the circuit comprising: 硬件矩阵计算单元,包括用于脉动阵列的电路,所述脉动阵列包括多个单元格,所述多个单元格中的每个单元格包括设置在该单元格内以用于存储从该单元格外部的源接收到的权重输入的权重寄存器;The hardware matrix calculation unit includes circuitry for a pulsating array, the pulsating array comprising a plurality of cells, each of the plurality of cells including a weight register disposed within the cell for storing weight inputs received from a source outside the cell. 用于权重提取单元的硬件电路,对于所述多个神经网络层中的每个神经网络层,所述权重提取单元被配置为:The hardware circuitry for the weight extraction unit is configured, for each of the plurality of neural network layers, as follows: 对于该神经网络层,将多个权重输入发送到沿着所述脉动阵列的第一维度的单元格;以及For this neural network layer, multiple weight inputs are sent to cells along the first dimension of the systolic array; and 用于设置在所述多个单元格中的每个单元格外部的多个权重定序器单元的硬件电路,每个所述权重定序器单元均耦合到沿着所述脉动阵列的所述第一维度的不同的单元格,对于所述多个神经网络层中的每个神经网络层,所述多个权重定序器单元中的每个权重定序器单元被配置为:Hardware circuitry for multiple weight sequencer units disposed outside each of the plurality of cells, each weight sequencer unit being coupled to a different cell along the first dimension of the systolic array, wherein for each of the plurality of neural network layers, each weight sequencer unit is configured as follows: 提供控制值以用于在设置在耦合到所述权重定序器单元的所述不同的单元格内的控制寄存器中存储,所述控制值用于对于该神经网络层在多个时钟周期中将所述多个权重输入转移到沿着所述脉动阵列的第二维度的单元格,其中各个权重输入被存储在使用所述权重寄存器并且沿着所述第二维度的相应的单元格内,并且其中各个所述单元格被配置为使用乘法电路来计算激活输入与相应的权重输入的乘积。Control values are provided for storage in control registers set within the different cells coupled to the weight sequencer unit, the control values being used to transfer the plurality of weight inputs to cells along a second dimension of the systolic array for the neural network layer over multiple clock cycles, wherein each weight input is stored in a corresponding cell using the weight registers and along the second dimension, and wherein each of the cells is configured to use
multiplication circuitry to compute the product of the activation input and the corresponding weight input. 2.根据权利要求1所述的电路,还包括:2. The circuit according to claim 1 further includes: 值定序器单元,对于所述多个神经网络层中的每个神经网络层,所述值定序器单元被配置为:对于该神经网络层,将多个激活输入发送到沿着所述脉动阵列的所述第二维度的单元格。The value sequencer unit is configured, for each of the plurality of neural network layers, to send a plurality of activation inputs to cells along the second dimension of the systolic array for that neural network layer. 3.根据权利要求1所述的电路,其中所述脉动阵列的所述第一维度对应于所述脉动阵列的行,并且其中所述脉动阵列的所述第二维度对应于所述脉动阵列的列。3. The circuit of claim 1, wherein the first dimension of the pulsating array corresponds to a row of the pulsating array, and wherein the second dimension of the pulsating array corresponds to a column of the pulsating array. 4.根据权利要求1所述的电路,其中每个单元格被配置为将权重控制信号传递到相邻的单元格,所述权重控制信号使得相邻单元格中的电路为所述相邻单元格转移或加载权重输入。4. The circuit of claim 1, wherein each cell is configured to pass a weight control signal to an adjacent cell, the weight control signal causing the circuit in the adjacent cell to transfer or load weight input for the adjacent cell. 5.根据权利要求1所述的电路,其中各个单元格包括用于以下的硬件电路:5. The circuit of claim 1, wherein each cell includes hardware circuitry for: 设置在所述单元格内并且耦合到所述权重寄存器的权重路径寄存器,所述权重路径寄存器被配置为存储被转移到所述单元格的权重输入;A weight path register is located within the cell and coupled to the weight register, the weight path register being configured to store weight inputs transferred to the cell; 设置在所述单元格内并且用于至少存储由所述权重定序器提供的控制值或由相邻单元格传递的权重控制信号的权重控制寄存器,所述权重控制寄存器被配置为判定是否将所述权重输入存储在所述权重寄存器中;A weight control register is set within the cell and is used to store at least the control value provided by the weight sequencer or the weight control signal transmitted from the adjacent cell. The weight control register is configured to determine whether to store the weight input in the weight register. 
激活寄存器,所述激活寄存器被设置在所述单元格内并且被配置为存储激活输入,并且被配置为将所述激活输入发送到沿着所述第一维度的第一相邻单元格中的另一激活寄存器;An activation register is set within the cell and configured to store activation input, and configured to send the activation input to another activation register in a first adjacent cell along the first dimension; 乘法电路,所述乘法电路被设置在所述单元格内并且耦合到所述权重寄存器和所述激活寄存器,其中所述乘法电路被配置为输出所述权重输入与所述激活输入的乘积;A multiplication circuit, wherein the multiplication circuit is disposed within the cell and coupled to the weight register and the activation register, wherein the multiplication circuit is configured to output the product of the weight input and the activation input; 加法电路,所述加法电路被设置在所述单元格内并且耦合到所述乘法电路,并且被配置为接收所述乘积以及来自沿着所述第二维度的第二相邻单元格的第一局部和,其中,所述加法电路被配置为输出所述乘积与所述第一局部和的第二局部和;以及An addition circuit, disposed within the cell and coupled to the multiplication circuit, and configured to receive the product and a first partial sum from a second adjacent cell along the second dimension, wherein the addition circuit is configured to output a second partial sum of the product and the first partial sum; and 局部和寄存器,所述局部和寄存器被设置在所述单元格内并且耦合到所述加法电路,并且被配置为存储所述第二局部和,所述局部和寄存器被配置为将所述第二局部和发送到沿着所述第二维度的第三相邻单元格中的另一加法电路。A local sum register, which is located within the cell and coupled to the adder circuit, and configured to store the second local sum, the local sum register being configured to send the second local sum to another adder circuit in a third adjacent cell along the second dimension. 6.根据权利要求5所述的电路,其中,每个权重寄存器单元包括:6. The circuit of claim 5, wherein each weight register unit comprises: 暂停计数器,所述暂停计数器对应于耦合到所述权重定序器单元的对应单元格中的所述权重控制寄存器;以及A pause counter, the pause counter corresponding to the weight control register in the corresponding cell coupled to the weight sequencer unit; and 递减电路,所述递减电路被配置为将到所述权重定序器单元的输入递减,以生成递减的输出,并且将所述递减的输出发送到所述暂停计数器。A decrementing circuit is configured to decrement the input to the weight sequencer unit to generate a decrementing output, and to send the decrementing output to the pause counter. 
7.根据权利要求6所述的电路,其中,各个暂停计数器中的值是相同的,并且各个权重定序器单元被配置为将对应的权重输入加载到所述脉动阵列的对应的不同单元格,其中所述加载包括将所述权重输入发送到所述乘法电路。7. The circuit of claim 6, wherein the values in each pause counter are identical, and each weight sequencer unit is configured to load a corresponding weight input into a corresponding different cell of the pulsation array, wherein the loading includes sending the weight input to the multiplication circuit. 8.根据权利要求6所述的电路,其中各个暂停计数器中的值是不同的,并且各个权重定序器单元被配置为将对应的权重输入转移到沿着所述第二维度的相邻的权重定序器单元。8. The circuit of claim 6, wherein the values in each pause counter are different, and each weight sequencer unit is configured to transfer the corresponding weight input to an adjacent weight sequencer unit along the second dimension. 9.根据权利要求6所述的电路,其中各个暂停计数器中的值达到预定值,以使所述多个权重定序器单元暂停多个权重输入沿着所述第二维度的转移。9. The circuit of claim 6, wherein the values in each pause counter reach predetermined values to cause the plurality of weight sequencer units to pause the transfer of the plurality of weight inputs along the second dimension. 10.根据权利要求1所述的电路,其中,所述脉动阵列被配置为:对于所述多个神经网络层中的每个神经网络层,从各个乘积生成用于所述神经网络层的累加输出。10. The circuit of claim 1, wherein the pulsation array is configured to: for each of the plurality of neural network layers, generate an accumulated output for the neural network layer from the respective product. 11.一种用于对包括多个层的神经网络执行神经网络计算的方法,所述方法包括对于多个神经网络层中的每个神经网络层:11. A method for performing neural network computation on a neural network comprising multiple layers, the method comprising, for each of the multiple neural network layers: 在权重提取单元处向包括用于脉动阵列的电路的硬件矩阵计算单元发送沿着包括多个单元格的所述脉动阵列中的第一维度的单元格的多个权重输入;At the weight extraction unit, multiple weight inputs of cells along the first dimension of the systolic array, which includes circuitry for the systolic array, are sent to a hardware matrix calculation unit. 
在设置在沿着所述脉动阵列的第一维度的所述多个单元格中的每个单元格内的相应权重寄存器中存储所述多个权重输入;以及The plurality of weight inputs are stored in a corresponding weight register within each of the plurality of cells along the first dimension of the pulsating array; and 使用用于多个权重定序器单元中的每个权重定序器单元的硬件电路提供控制值以用于在设置在沿着所述脉动阵列的所述第一维度的特定单元格内的控制寄存器中存储,其中,所述控制值使得在多个时钟周期中将所述多个权重输入转移到沿着所述脉动阵列的第二维度的单元格,其中,每个权重定序器单元被设置在所述多个单元格中的每个单元格外部并且耦合到沿着所述脉动阵列的所述第一维度的不同的单元格,其中每个权重输入被存储在使用所述权重寄存器并且沿着所述第二维度的相应的单元格内,并且其中每个单元格被配置为使用乘法电路计算激活输入与相应的权重输入的乘积。Hardware circuitry for each of the plurality of weight sequencer units provides control values for storage in a control register located in a specific cell along the first dimension of the systolic array, wherein the control values cause the plurality of weight inputs to be transferred to cells along a second dimension of the systolic array over a plurality of clock cycles, wherein each weight sequencer unit is located outside each of the plurality of cells and coupled to a different cell along the first dimension of the systolic array, wherein each weight input is stored using the weight register and in a corresponding cell along the second dimension, and wherein each cell is configured to compute the product of an activation input and a corresponding weight input using multiplication circuitry. 12.根据权利要求11所述的方法,还包括:12. The method of claim 11, further comprising: 在值定序器单元处,将多个激活输入发送到沿着所述神经网络层的所述脉动阵列的所述第二维度的单元格。At the value sequencer unit, multiple activation inputs are sent to cells along the second dimension of the systolic array of the neural network layer. 13.根据权利要求11所述的方法,其中所述脉动阵列的所述第一维度对应于所述脉动阵列的行,并且其中所述脉动阵列的所述第二维度对应于所述脉动阵列的列。13. The method of claim 11, wherein the first dimension of the pulsating array corresponds to a row of the pulsating array, and wherein the second dimension of the pulsating array corresponds to a column of the pulsating array. 14.根据权利要求11所述的方法,还包括:对于各个单元格,将权重控制信号传递到相邻单元格,所述权重控制信号使得所述相邻单元格中的电路为所述相邻单元格转移或加载权重输入。14. 
The method of claim 11, further comprising: for each cell, transmitting a weight control signal to an adjacent cell, the weight control signal causing the circuit in the adjacent cell to transfer or load weight input for the adjacent cell. 15.根据权利要求11所述的方法,其中各个单元格包括用于以下的硬件电路:15. The method of claim 11, wherein each cell includes hardware circuitry for: 设置在所述单元格内并且耦合到所述权重寄存器的权重路径寄存器,所述权重路径寄存器被配置为存储被转移到所述单元格的所述权重输入;A weight path register is located within the cell and coupled to the weight register, the weight path register being configured to store the weight input that is transferred to the cell; 设置在所述单元格内并且用于至少存储由所述权重定序器提供的控制值或由相邻单元格传递的权重控制信号的权重控制寄存器,所述权重控制寄存器被配置为判定是否将所述权重输入存储在所述权重寄存器中;A weight control register is set within the cell and is used to store at least the control value provided by the weight sequencer or the weight control signal transmitted from the adjacent cell. The weight control register is configured to determine whether to store the weight input in the weight register. 
激活寄存器,所述激活寄存器被设置在所述单元格内并且被配置为存储激活输入,并且被配置为将所述激活输入发送到沿着所述第一维度的第一相邻单元格中的另一激活寄存器;An activation register is set within the cell and configured to store activation input, and configured to send the activation input to another activation register in a first adjacent cell along the first dimension; 乘法电路,所述乘法电路被设置在所述单元格内并且耦合到所述权重寄存器和所述激活寄存器,其中,所述乘法电路被配置为输出所述权重输入与所述激活输入的乘积;A multiplication circuit is disposed within the cell and coupled to the weight register and the activation register, wherein the multiplication circuit is configured to output the product of the weight input and the activation input; 加法电路,所述加法电路被设置在所述单元格内并且耦合到所述乘法电路,并且被配置为接收所述乘积以及来自沿着所述第二维度的第二相邻单元格的第一局部和,其中,所述加法电路被配置为输出所述乘积与所述第一局部和的第二局部和;以及An addition circuit, disposed within the cell and coupled to the multiplication circuit, and configured to receive the product and a first partial sum from a second adjacent cell along the second dimension, wherein the addition circuit is configured to output a second partial sum of the product and the first partial sum; and 局部和寄存器,所述局部和寄存器被设置在所述单元格内并且耦合到所述加法电路,并且被配置为存储所述第二局部和,所述局部和寄存器被配置为将所述第二局部和发送到沿着所述第二维度的第三相邻单元格中的另一加法电路。A local sum register, which is located within the cell and coupled to the adder circuit, and configured to store the second local sum, the local sum register being configured to send the second local sum to another adder circuit in a third adjacent cell along the second dimension. 16.根据权利要求15所述的方法,还包括:16. The method of claim 15, further comprising: 在各个权重定序器单元中的递减电路处,将到所述权重定序器单元的相应的输入递减,以生成相应的递减的输出;At the decrement circuit in each weight sequencer unit, the corresponding input to the weight sequencer unit is decremented to generate the corresponding decremented output. 
对于各个权重定序器单元,将所述相应的递减的输出发送到相应的暂停计数器,所述相应的暂停计数器对应于耦合到所述权重定序器单元的对应的单元格中的权重控制寄存器。For each weight sequencer unit, the corresponding decreasing output is sent to the corresponding pause counter, which corresponds to the weight control register in the corresponding cell coupled to the weight sequencer unit. 17.根据权利要求16所述方法,其中,各个暂停计数器中的值是相同的,并且所述方法还包括:在各个权重定序器单元处,将对应的权重输入加载到所述脉动阵列的对应的不同单元格,其中所述加载包括将所述权重输入发送到所述乘法电路。17. The method of claim 16, wherein the values in each pause counter are identical, and the method further comprises: at each weight sequencer unit, loading a corresponding weight input into a corresponding different cell of the pulsation array, wherein loading includes sending the weight input to the multiplication circuit. 18.根据权利要求16所述方法,其中,各个暂停计数器中的值是不同的,并且所述方法还包括:在各个权重定序器单元处,将对应的权重输入转移到沿着所述第二维度的相邻的权重定序器单元中。18. The method of claim 16, wherein the values in each pause counter are different, and the method further comprises: at each weight sequencer unit, transferring the corresponding weight input to an adjacent weight sequencer unit along the second dimension. 19.根据权利要求16所述方法,其中各个暂停计数器中的值达到预定值,以使所述多个权重定序器单元暂停沿着所述第二维度转移所述多个权重输入。19. The method of claim 16, wherein the values in each pause counter reach predetermined values to cause the plurality of weight sequencer units to pause the transfer of the plurality of weight inputs along the second dimension. 20.根据权利要求11所述的方法,还包括:在用于所述多个神经网络层中的每个神经网络层的脉动阵列处,从各个乘积生成用于所述神经网络层的相应的累加输出。20. The method of claim 11, further comprising: generating a corresponding accumulated output for the neural network layer from the respective product at a systolic array for each of the plurality of neural network layers.
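Read together, the claims describe a weight-stationary systolic array: a weight fetcher feeds weight inputs in along the first dimension, control values shift them cell by cell along the second dimension over successive clock cycles until each is latched into a cell's weight register, and during compute each cell multiplies its stored weight by a streaming activation input and adds the result to a partial sum passed along the second dimension. The Python sketch below models that dataflow under simplifying assumptions: all function and variable names are invented for illustration, the pause-counter control logic is reduced to a fixed shift count, and per-cycle activation staggering is omitted. It is an illustrative model, not the patented hardware.

```python
def prefetch_weights(weight_matrix):
    """Shift the rows of weight_matrix into an R x C grid of weight path
    registers, one row per clock cycle, along the second dimension (down the
    columns). The bottom row's weights are fed first, so after R cycles every
    cell's control value says "load" and cell (r, c) latches W[r][c] into its
    weight register."""
    rows, cols = len(weight_matrix), len(weight_matrix[0])
    path = [[0] * cols for _ in range(rows)]      # weight path registers
    for cycle in range(rows):
        for r in range(rows - 1, 0, -1):          # shift every column down one cell
            path[r] = list(path[r - 1])
        # the weight fetcher unit feeds the top edge of the array
        path[0] = list(weight_matrix[rows - 1 - cycle])
    return path                                    # now path[r][c] == weight_matrix[r][c]


def systolic_matvec(weight_matrix, activations):
    """Weight-stationary multiply-accumulate: activations enter along the rows
    (first dimension) and partial sums flow down the columns (second
    dimension), with one multiply and one add per cell."""
    cells = prefetch_weights(weight_matrix)        # weight prefetch phase
    rows, cols = len(cells), len(cells[0])
    partial_sums = [0] * cols                      # partial-sum registers at the bottom edge
    for r in range(rows):
        for c in range(cols):
            # each cell: product of its activation input and stored weight,
            # added to the partial sum arriving from the cell above
            partial_sums[c] += activations[r] * cells[r][c]
    return partial_sums                            # activations^T . W, one value per column
```

Because the weights are resident before activations stream in, the same prefetched weight matrix can be reused across many activation vectors, which is the point of prefetching in this design.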
HK18104415.9A 2015-05-21 2016-04-29 Prefetching weights for use in a neural network processor HK1245462B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201562164981P 2015-05-21 2015-05-21
US62/164,981 2015-05-21
US14/844,670 US10049322B2 (en) 2015-05-21 2015-09-03 Prefetching weights for use in a neural network processor
US14/844,670 2015-09-03
PCT/US2016/029965 WO2016186810A1 (en) 2015-05-21 2016-04-29 Prefetching weights for use in a neural network processor

Publications (2)

Publication Number Publication Date
HK1245462A1 HK1245462A1 (en) 2018-08-24
HK1245462B true HK1245462B (en) 2021-06-25


Similar Documents

Publication Publication Date Title
JP7710018B2 (en) Prefetching weights used in neural network processors
TWI851499B (en) Circuit, method and non-transitory machine-readable storage devices for performing neural network computations
TWI645301B Computing convolutions using a neural network processor
HK1245462B (en) Prefetching weights for use in a neural network processor
HK40043775B (en) Prefetching weights for use in a neural network processor
HK40043775A (en) Prefetching weights for use in a neural network processor
HK40070899A (en) Prefetching weights for use in a neural network processor
HK1245954B (en) Vector computation unit in a neural network processor
HK1245463B (en) Batch processing in a neural network processor