HK1259156B

HK1259156B - Circuit for transposing matrix and for transposing input vector

Info

Publication number: HK1259156B
Application number: HK19101203.0A
Authority: HK
Inventors: 乔纳森‧罗斯; 罗伯特‧大卫‧纳科尔斯; 克里斯托弗‧阿伦‧克拉克; 李展鹏; 格雷戈里‧米歇尔‧索尔森
Original assignee: 谷歌有限责任公司
Priority date: 2017-02-16
Filing date: 2019-01-23
Publication date: 2020-11-20

Description

Circuits for transposing matrices and for transposing input vectors

Technical Field

The present application relates to transposition in matrix-vector processors.

Cross-Reference to Related Applications

This application claims the benefit of U.S. Provisional Application No. 62/459,943, filed February 16, 2017, the contents of which are incorporated herein by reference.

Background Art

This specification relates to computing matrix transposes in hardware.

A matrix transpose is a computation in which a matrix is reflected across its main diagonal, which runs from the upper left (0,0) position to the lower right (n-1,n-1) position, where n is the smaller of the matrix's dimensions. The effect is that the rows of the input matrix are output as the columns of the transposed matrix. That is, for the i-th row and j-th column of the input matrix A, [A^T]_ij = [A]_ji.

Summary of the Invention

In general, this specification describes a dedicated hardware circuit that computes matrix transposes.

In general, one innovative aspect of the subject matter described in this specification can be implemented in a circuit for transposing a matrix. The circuit includes an inversion circuit configured to, for each of one or more diagonals of the matrix, receive the elements of the diagonal in a first vector and generate a second vector that includes the elements of the diagonal in the reverse of their order in the first vector. The circuit also includes a rotation circuit configured to, for each of the one or more diagonals of the matrix, determine a number of positions by which to rotate the elements of the diagonal in the second vector, receive the second vector of elements of the diagonal, and generate a third vector that includes the elements of the diagonal in the order obtained by rotating the elements in the second vector by the determined number of positions.

Implementations may include one or more of the following features. The circuit includes a counting circuit configured to output to the rotation circuit, for each of the one or more diagonals of the matrix, the number of positions by which to rotate the elements of the diagonal in the second vector; the counting circuit is configured to output, for each of the one or more diagonals of the matrix, a value as the number of positions by which to rotate the elements of the diagonal in the second vector, where the initial value output by the counting circuit equals N-1 and N is the width of the rotation circuit; the counting circuit is configured to decrement its output value for each of the one or more diagonals of the matrix and, after the output value is zero for one of the diagonals, to reset the value to the initial value.
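The counting behavior described above can be modeled in software as a minimal sketch. The function name `rotation_counts` and the use of a Python generator are illustrative assumptions; the specification describes a hardware counter, not this code.

```python
from itertools import islice


def rotation_counts(width):
    """Yield width-1, width-2, ..., 0 and then start again at width-1.

    `width` stands for N, the width of the rotation circuit. One value is
    produced per diagonal, mirroring the decrement-and-reset behavior.
    """
    value = width - 1  # initial value N-1
    while True:
        yield value
        # Reset after the value zero has been output; otherwise decrement.
        value = width - 1 if value == 0 else value - 1


# First ten rotation amounts for a rotation circuit of width 4.
first_ten = list(islice(rotation_counts(4), 10))
```

For a width of 4 the sequence repeats with period 4, so one full cycle of the counter covers four consecutive diagonals.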

Implementations may each optionally include one or more of the following features. The matrix is a submatrix of a second matrix; the circuit includes an interleaved memory read circuit configured to, for each of the one or more diagonals of the matrix, access the elements of the diagonal and output them to the inversion circuit as the first vector; the interleaved memory read circuit includes M multiplexers, where M equals the width of the inversion circuit and each multiplexer is configured to output one of a plurality of elements of a column of the matrix; the interleaved memory read circuit is configured to receive a control signal that specifies, for each of the M multiplexers, the input of the multiplexer to provide as the output of the multiplexer; each of the M multiplexers is an N-to-1 multiplexer, where N is the number of elements that can be received by the rotation circuit; the interleaved memory read circuit is configured to receive a first control signal that specifies, for a first one or more of the M multiplexers, the input of the multiplexer to provide as the output of the multiplexer, and to receive a second control signal that specifies, for a second one or more of the M multiplexers, the input of the multiplexer to provide as the output of the multiplexer.
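As a rough software analogy for the multiplexer-based read, the sketch below models each multiplexer as an index into one column and the control signal as a list of select values. The names `mux` and `staggered_read`, the list-of-columns layout, and the wrap-around select rule `(diagonal - c) mod N` are assumptions made for illustration; the text above does not state how the select values are derived.

```python
def mux(inputs, select):
    """Model an N-to-1 multiplexer: forward the input chosen by `select`."""
    return inputs[select]


def staggered_read(columns, diagonal):
    """Return the vector formed by one output per column multiplexer.

    `columns[c]` lists the N elements of column c, top row first.
    Assumed select rule: multiplexer c picks row (diagonal - c) mod N, so
    lanes that fall off the matrix wrap around rather than staying empty.
    """
    n = len(columns[0])
    selects = [(diagonal - c) % n for c in range(len(columns))]  # control signal
    return [mux(columns[c], selects[c]) for c in range(len(columns))]


# 4x4 example in which element (r, c) has the value 10*r + c.
columns = [[10 * r + c for r in range(4)] for c in range(4)]
lanes = staggered_read(columns, diagonal=1)
```

In this example lanes 0 and 1 carry elements (1,0) and (0,1) of the diagonal, while lanes 2 and 3 carry wrapped values that a consumer handling a single 4x4 matrix would ignore.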

Implementations may each optionally include one or more of the following features. The circuit includes an interleaved memory write circuit configured to, for each of the one or more diagonals of the matrix, write the elements of the diagonal in the third vector to memory as a diagonal of a transposed output matrix; the matrix includes two or more matrices stored in memory as a single matrix; the rotation circuit is configured to perform a right rotation of the elements of the diagonal in the second vector by the determined number of positions to generate the third vector; the matrix is stored in a static random access memory accessible to the circuit; for each of the one or more diagonals of the matrix, the elements of the diagonal in the third vector are stored in the static random access memory as a diagonal of the transposed output matrix.

Implementations may each optionally include one or more of the following features. The circuit includes a second rotation circuit configured to, for each of one or more diagonals of a second matrix, determine a number of positions by which to rotate the elements of the diagonal of the second matrix, receive a fourth vector that includes the elements of the diagonal of the second matrix, and generate a fifth vector that includes the elements of the diagonal of the second matrix in the order obtained by rotating the elements in the fourth vector by the determined number of positions; the circuit includes a second counting circuit configured to output to the second rotation circuit, for each of the one or more diagonals of the second matrix, the number of positions by which to rotate the elements of the diagonal of the second matrix in the fourth vector.

Another innovative aspect of the subject matter described in this specification can be implemented in a circuit for transposing an input vector. The circuit includes an inversion circuit configured to, for each of one or more elements of the input vector, receive a first vector that includes the element of the input vector and generate a second vector that includes the elements of the first vector in the reverse of their order in the first vector. The circuit also includes a rotation circuit configured to, for each of the one or more elements of the input vector, determine a number of positions by which to rotate the elements in the second vector, receive the second vector of elements, and generate a third vector that includes the elements of the second vector in the order obtained by rotating the elements in the second vector by the determined number of positions.

Particular embodiments of the subject matter described in this application can be implemented to realize one or more of the following advantages. A transposed output matrix corresponding to the transpose of an input matrix can be generated in hardware by a dedicated hardware circuit. By generating the appropriate output with the dedicated hardware circuit, the matrix transpose calculation can be performed without transferring data back to a host computer (that is, without performing at least a portion of the calculation off-chip or in software). This avoids the processing delays caused by performing the transpose calculation off-chip or in software, where the calculation can be expensive and can require a large number of general-purpose processor (e.g., GPU or CPU) cycles.

Using a hardware circuit specifically designed to perform matrix transpose calculations also allows more efficient processing than systems that perform the calculation in a general-purpose matrix processing hardware circuit (e.g., one that is also configured to perform matrix convolutions or other operations). Implementing the matrix transpose calculation on a dedicated hardware circuit allows the transpose to be handled efficiently without regard to the capabilities or efficiency of other matrix operations, and it preserves the design of the other matrix processing hardware circuits that perform those operations, thereby generally increasing the efficiency of matrix calculations in hardware.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

Brief Description of the Drawings

FIG. 1 shows an example matrix-vector processing system.

FIG. 2 shows an example matrix-vector processing system including a transpose unit.

FIG. 3 shows an example architecture of a transpose unit in a matrix-vector processing system.

FIG. 4 shows an example architecture of an interleaved memory write unit in a matrix-vector processing system.

FIG. 5 is a flowchart of an example method for transposing a matrix using a matrix-vector processing system.

FIGS. 6A-6C show an example of transposing a matrix in a matrix-vector processor.

Like reference numbers and designations in the various drawings indicate like elements.

Detailed Description

A matrix transpose calculation produces an output matrix in which the rows of the input matrix are rewritten as the columns of the output matrix, that is, [A^T]_ij = [A]_ji for the element in the i-th row and j-th column of the input matrix A. Transposing the input matrix therefore effectively reflects it across its main diagonal, which runs from the (0,0) position of the matrix to the (n-1,n-1) position, where n is the smaller of the dimensions of the matrix.
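The definition can be checked with a short software sketch. The function name `transpose` and the nested-list matrix representation are illustrative choices only; this shows the mathematical operation, not the hardware approach described in this specification.

```python
def transpose(a):
    """Return a new matrix t with t[j][i] == a[i][j] for every i, j."""
    rows, cols = len(a), len(a[0])
    # Row j of the output collects column j of the input.
    return [[a[i][j] for i in range(rows)] for j in range(cols)]


# A 2x3 input becomes a 3x2 output; the main diagonal (0,0), (1,1) is fixed.
a = [[1, 2, 3],
     [4, 5, 6]]
a_t = transpose(a)
```

Here n = 2 is the smaller dimension, so the main diagonal consists of the elements at (0,0) and (1,1), which keep their positions in the output.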

Matrix transpose calculations have many practical applications. For example, a matrix transpose may be computed when training a neural network: to backpropagate gradients during training, the transpose of a weight matrix used in a layer of the neural network may be computed. In other examples, a matrix transpose may be performed on an inference computed by a neural network, or on the matrix or vector output of a particular layer of a neural network.

Matrix transpose calculations are used frequently in applications of linear algebra. For example, a matrix transpose is used to compute the dot product of two input matrices A and B, such that A^T B = A·B. The dot product can be used, for example, to compute angles and magnitudes, since A·B = ||A|| ||B|| cos θ. The dot product can also be used to evaluate linear functions of vectors: a linear function that takes a vector A as its argument can be evaluated by computing the dot products between A and a set of vectors that represents the linear function.
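The relations above can be illustrated for the simplest case, in which A and B are treated as column vectors stored as plain Python lists. That restriction, and the helper names `dot`, `angle_between`, and `apply_linear`, are assumptions made for this sketch.

```python
import math


def dot(a, b):
    """Compute a^T b for two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def angle_between(a, b):
    """Recover theta from a . b = ||a|| ||b|| cos(theta)."""
    cos_theta = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def apply_linear(rows, a):
    """Evaluate a linear function given as a set of row vectors."""
    return [dot(row, a) for row in rows]


product = dot([1, 2, 3], [4, 5, 6])
right_angle = angle_between([1, 0], [0, 1])
image = apply_linear([[1, 0], [1, 1]], [2, 3])
```

Each output component of `apply_linear` is one dot product, which is the sense in which a linear function is represented by a set of vectors.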

Matrix transpose calculations may also be performed in image processing applications, for example to flip or rotate an image. A digital image represented as a matrix can be manipulated with transpose calculations to generate a rotated or mirrored version of the image. In signal processing and other fields, matrix transposes are used to implement fast Fourier transform (FFT) algorithms (e.g., when performing a multi-dimensional parallel FFT). Social network and other network analyses can also use matrix transposes to determine the sources of relationships between nodes in a network, or to determine patterns of relationships between nodes. Statistical programming, geographic information systems, and other applications also make frequent use of matrix transposes.
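As one illustration of the image-processing use, a 90-degree rotation can be expressed as a transpose followed by a reversal. The sketch below assumes an image stored as a list of pixel rows; the function names are hypothetical and not taken from this specification.

```python
def transpose(image):
    """Mirror the image across its main diagonal."""
    return [list(row) for row in zip(*image)]


def rotate_clockwise(image):
    """Rotate by 90 degrees clockwise: transpose, then reverse each row."""
    return [row[::-1] for row in transpose(image)]


def rotate_counterclockwise(image):
    """Rotate by 90 degrees counterclockwise: transpose, then reverse the row order."""
    return transpose(image)[::-1]


pixels = [[1, 2],
          [3, 4]]
clockwise = rotate_clockwise(pixels)
counterclockwise = rotate_counterclockwise(pixels)
```

The transpose alone yields the mirror image across the main diagonal; combining it with a row or column reversal yields the two rotations.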

This specification describes a dedicated hardware circuit that processes an input matrix or vector to generate a transposed output matrix (i.e., the transpose of the input matrix or vector).

FIG. 1 shows an example matrix-vector processing system 100. The matrix-vector processing system 100 is an example of a system implemented as one or more computers in one or more locations in which the systems, components, and techniques described below can be implemented.

The matrix-vector processing system 100 is a system that performs matrix or vector calculations using a dedicated hardware circuit 110. The dedicated hardware circuit 110 is an integrated circuit for performing matrix or vector calculations and includes a transpose unit 120 configured to compute matrix transposes in hardware. An example dedicated hardware circuit 110 is described in more detail with reference to FIG. 2.

The matrix-vector processing system 100 receives requests to perform matrix or vector calculations on the dedicated hardware circuit 110, controls the dedicated hardware circuit 110 to perform the calculations, and outputs the results that the dedicated hardware circuit 110 generates. For example, the matrix-vector processing system 100 may receive a request to compute the transpose of an input matrix, implement the matrix transpose calculation on the dedicated hardware circuit 110, and output the resulting transposed matrix in response to the request. The dedicated hardware circuit 110 may be capable of performing additional calculations besides matrix transposes.

To perform matrix or vector calculations on the dedicated hardware circuit 110, the matrix-vector processing system 100 includes a matrix-vector processing engine 150. The matrix-vector processing engine 150 may be implemented as one or more computer programs on one or more computers in one or more physical locations.

The matrix-vector processing engine 150 can generate instructions, provide control signals, or direct data to control the dedicated hardware circuit 110 to perform matrix or vector calculations in response to a request. For example, the matrix-vector processing system 100 may receive a request to compute a matrix or vector function, and the matrix-vector processing engine 150 can determine the specific instructions or control signals for computing the function, or can determine how to direct the data (e.g., corresponding to an input matrix or vector) for the calculation.

Once the matrix-vector processing engine 150 determines how to implement the calculation corresponding to a matrix or vector calculation request, it controls the dedicated hardware circuit 110 to perform the calculation. For example, the matrix-vector processing engine 150 may direct the data for the calculation (such as an input matrix or vector) to the dedicated hardware circuit 110. The matrix-vector processing engine 150 may also send instructions or control signals to the dedicated hardware circuit 110 so that the circuit performs the appropriate calculations on the data it receives from the matrix-vector processing engine 150.

For example, the matrix-vector processing system 100 can receive a request to compute a matrix or vector function. The requested function may be relatively simple (e.g., computing a dot product) or more complex (e.g., a function for backpropagating gradients to train a neural network, or for performing a multi-dimensional parallel FFT that involves computing the transpose of a matrix). The request may also identify or include one or more matrices or vectors for computing the function (i.e., one or more arguments to which the function is applied). The matrix-vector processing engine 150 can receive the request and generate control signals or instructions for computing the function of the input matrices or vectors. The matrix-vector processing engine 150 can also direct the input matrices or vectors to the dedicated hardware circuit 110.

For example, to compute a matrix transpose, the matrix-vector processing engine 150 may provide the dedicated hardware circuit 110 with the input matrix or vector to be transposed, or with a matrix or vector generated as the output of a previous calculation, so that the matrix or vector is provided to the transpose unit 120. The matrix-vector processing engine 150 may also provide the dedicated hardware circuit 110 with a control signal for initiating the transpose calculation on the transpose unit 120. The transpose unit 120 may receive the input matrix or vector and the control signal, perform the transpose calculation in response to the control signal, and output a matrix or vector that is the transpose of the one it received. The transposed matrix output by the transpose unit 120 may be used by the dedicated hardware circuit 110 in other calculations for computing the requested function. The dedicated hardware circuit 110 may provide the output of the requested function, which the matrix-vector processing system 100 can return in response to the request.

FIG. 2 shows an example dedicated hardware circuit 200 for computing matrix transposes. In some implementations, the circuit 200 may include additional components (not shown) for performing other matrix or vector calculations. Those additional components may also use one or more of the components shown in FIG. 2.

The circuit 200 includes a host interface 202. The host interface 202 can receive control signals, instructions, or arguments for a transpose calculation. The arguments can include, for example, the matrix or vector on which the transpose calculation is to be performed. The instructions received by the host interface 202 can include instructions indicating where to store the received arguments so that the circuit 200 can compute the matrix transpose. A control signal received by the host interface 202 can be a signal for initiating the transpose calculation.

In some implementations, the host interface 202 can provide instructions to a sequencer 206, which converts the instructions into low-level control signals that control the circuit 200 to perform the transpose calculation. For example, the control signals generated by the sequencer 206 can regulate the flow of data in the circuit 200 (e.g., where an input matrix or vector should be stored, or how that data should otherwise be directed through the circuit 200). The sequencer 206 can receive an instruction to initiate a transpose calculation on the circuit 200 and can generate a control signal that causes the transpose unit 212 to initiate the calculation.

The sequencer 206 can send control signals to the memory 208 and the transpose unit 212. In some implementations, the sequencer 206 also sends control signals to the direct memory access engine 204. In some implementations, the sequencer 206 is a processor that generates control signals. The sequencer 206 can time the control signals so that they reach the appropriate components of the circuit 200 at the appropriate times. In some examples, the sequencer 206 receives control signals from the host interface 202 that are passed in from outside the circuit 200 (e.g., from the matrix-vector processing engine 150 of FIG. 1), so that the sequencer 206 is not required to generate control signals. In such examples, the sequencer 206 can send the received control signals to the components of the circuit 200 at the appropriate times. Moreover, where control signals are supplied to the circuit 200, the sequencer 206 may be an optional component of the circuit 200; that is, a component outside the circuit 200 (e.g., the matrix-vector processing engine 150) can provide the control signals at the appropriate times to control the circuit 200 to perform the matrix transpose operation.

The host interface 202 can send arguments (e.g., an input matrix or vector) to the direct memory access engine 204. The direct memory access engine 204 can store the arguments in the memory 208.

The memory 208 may be a memory buffer (e.g., a unified buffer) or may be a dynamic memory (e.g., a static random access memory (SRAM)). The memory 208 may be located on the circuit 200 or separate from it. It can be used to store arguments input to the circuit 200 (such as matrices or vectors). The memory 208 can also store the output of the transpose unit 212 (i.e., the transposed output matrix or vector). In some implementations, the direct memory access engine 204 can read from the memory 208, for example to return the result of a matrix transpose from the circuit 200.

The memory 208 can send arguments to the transpose unit 212 for transposition. For example, after the direct memory access engine 204 stores an input matrix or vector in the memory 208, the input matrix or vector can be provided to, or made accessible to, the transpose unit 212 so that the transpose unit 212 can compute its transpose.

The transpose unit 212 is a circuit for computing matrix or vector transposes. In some implementations, the transpose unit 212 is designed so that it can be triggered to compute a matrix transpose upon receiving an argument and a control signal that initiates the transpose calculation. That is, the transpose unit 212 can be configured to require only a single control signal to carry out the entire transpose process on the argument and generate its transpose (i.e., the transposed output matrix or vector).

In such implementations, once the transpose calculation is initiated, the transpose unit 212 can perform the entire calculation in a fixed manner, that is, in the same manner regardless of the argument provided to it. The transpose unit 212 can therefore be configured to perform the same calculation whether the input matrix is a 64x64 element matrix, a 128x128 element matrix, and so on. The transpose unit 212 stores its output (i.e., the transposed output matrix or vector) in the memory 208.

In general, to compute a matrix or vector transpose, the transpose unit 212 performs an interleaved memory read of the argument stored in the memory 208. When the argument is a matrix, the interleaved memory read allows the transpose unit 212 to obtain, for each diagonal of the matrix, a vector in a register that holds the elements of that diagonal. The transpose unit 212 reverses the order of the elements of the diagonal stored in the register to generate a second vector of the elements of the diagonal, and stores it, for example, in the same register or in a second register. The elements of the second vector are then rotated by a determined number of positions to obtain a third vector that includes the elements of the diagonal, which is stored, for example, in the same register or in a third register. An interleaved memory write is then performed to place the elements of the third vector (e.g., held in the third register) in the appropriate memory locations. The process is repeated for each diagonal of the matrix to obtain a transposed output matrix, stored in memory, that is the transpose of the matrix.
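The four steps just described (interleaved read, reversal, rotation, interleaved write) can be traced in software with the following minimal sketch for a square matrix. Several details are assumptions rather than statements of this specification: the width `N`, the use of `None` for lanes with no element on the current diagonal, a counter that starts at N-1 and decrements per diagonal, and a rotate-right convention in which lane 0 is drawn rightmost so that elements move toward lower lane indices.

```python
N = 4  # assumed width of the reversal and rotation stages


def staggered_read(matrix, d):
    """Lane c carries the element of column c on anti-diagonal d, if any."""
    return [matrix[d - c][c] if 0 <= d - c < N else None for c in range(N)]


def reverse(vec):
    """Second vector: the first vector in reverse order."""
    return vec[::-1]


def rotate_right(vec, k):
    """Third vector: move every element k lanes toward lower indices.

    Assumed convention: lane 0 is the rightmost lane, as in a bitwise
    rotate-right.
    """
    k %= len(vec)
    return vec[k:] + vec[:k]


def staggered_write(out, vec, d):
    """Write lane q to row d - q of column q, skipping empty lanes."""
    for q, value in enumerate(vec):
        if value is not None:
            out[d - q][q] = value


def transpose(matrix):
    out = [[None] * N for _ in range(N)]
    count = N - 1  # rotation amount for the first diagonal
    for d in range(2 * N - 1):  # every anti-diagonal of an N x N matrix
        first = staggered_read(matrix, d)
        second = reverse(first)
        third = rotate_right(second, count)
        staggered_write(out, third, d)
        count = N - 1 if count == 0 else count - 1  # decrement, reset after 0
    return out


example = [[10 * r + c for c in range(N)] for r in range(N)]
result = transpose(example)
```

Under these assumptions, the reversal followed by the rotation moves the element read from column c on diagonal d into lane d - c, which is the column it occupies in the transposed matrix, so the write step only needs to stagger the row addresses.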

As discussed above, these same operations are performed when the argument is a vector. In that case, the interleaved memory read allows the transpose unit 212 to obtain a single element of the vector in the register for each iteration of the process. The element in the register for each iteration is manipulated according to the process above to obtain the transpose of the vector. When a transpose calculation is performed on a vector, the transposed output is also a vector; however, an input column vector is converted to a row vector, and an input row vector is converted to a column vector.

FIG. 3 shows an example architecture of a transpose unit 300. In the illustrated example, an interleaved memory reader 310 accesses an input matrix or vector and outputs elements corresponding to a diagonal of the input matrix. The interleaved memory reader can process each diagonal of the input matrix, starting with the diagonal that contains the (0,0) element of the input matrix. Each diagonal of the input matrix is a diagonal of elements extending from the lower left to the upper right of the input matrix (i.e., a diagonal of elements extending from the (n-1,0) element of the input matrix to the (0,n-1) element of the input matrix). The operation of the interleaved memory reader 310 is discussed in more detail with respect to FIG. 4.

The elements of a diagonal of the input matrix output by the interleaved memory reader 310 are received by value loaders 320, where each value loader 320 corresponds to a different column of the data (i.e., of the input matrix) accessed by the interleaved memory reader 310. The example transpose unit 300 shown in FIG. 3 can compute transposes of matrices up to 4x4; however, the same techniques can be extended to transpose units of any size. Thus, when a 4x4 input matrix is transposed, each of the value loaders 320 corresponds to a column of the input matrix. If the 4x4 transpose unit 300 is used to transpose a matrix smaller than 4x4, the values provided to the upper value loaders can be discarded or ignored. For example, if a 3x3 input matrix is read by the interleaved memory reader 310, the values output to value loader [3] can be ignored or discarded because they do not correspond to elements of the input matrix.

The value loaders 320 pass the received elements to an input register 330, which stores the elements as a first vector. For the example transpose unit 300, the input register may be a 1x4 register of elements, matching the dimensions of the largest input matrix that the transpose unit 300 can process (i.e., 4x4). Thus, the element received by value loader [0] may be stored in the (0,0) element of the input register 330, the element received by value loader [1] may be stored in the (0,1) element of the input register 330, and so on. In some embodiments, if the matrix input to the input register 330 is smaller than the maximum input matrix size of the transpose unit 300, the value loaders 320 may refrain from sending values that do not correspond to elements of the input matrix to the input register 330. For example, if a 3x3 matrix is input to the 4x4 transpose unit 300, value loader [3] may not send a value to the input register 330.

An inverter 340 receives the elements stored in the input register 330 and reverses the order of the elements to generate a second vector of elements. In some embodiments, the inverter 340 receives the first vector of elements stored at the input register 330 and reverses the order of the elements of the first vector to generate the second vector. For example, the elements of the input register 330 may be sent to the inverter 340, and the inverter 340 may write the elements to another register in the reverse of the order in which they were stored in the input register 330.

For the illustrated transpose unit 300, reversing the order of the elements may include storing the element in the [0] position of the input register 330 in the [3] position of the register of the inverter 340, storing the element in the [1] position of the input register 330 in the [2] position of the register of the inverter 340, storing the element in the [2] position of the input register 330 in the [1] position of the register of the inverter 340, and storing the element in the [3] position of the input register 330 in the [0] position of the register of the inverter 340. In some embodiments, the inverter 340 may reverse the order of the elements by having write lines that connect the respective positions of the input register 330 and of the register of the inverter 340 as specified above, so that the elements in the input register 330 are written to the appropriate positions of the register of the inverter 340. Because the elements received from the input register 330 correspond to a diagonal of the input matrix, reversing the order of the elements of the diagonal effectively reflects those elements across the main diagonal of the input matrix.
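The fixed position-to-position wiring described above can be modelled in a few lines of software. The following Python sketch is only an illustration: it assumes a register can be represented as a list indexed by lane position, and the name `reverse_lanes` is hypothetical rather than taken from the specification.

```python
def reverse_lanes(first_vector):
    """Model of the inverter wiring: lane i of the input goes to lane N-1-i."""
    n = len(first_vector)
    second_vector = [None] * n
    for i, value in enumerate(first_vector):
        # One "write line" per lane: position i feeds position n - 1 - i.
        second_vector[n - 1 - i] = value
    return second_vector


if __name__ == "__main__":
    # Lanes [0], [1], [2], [3] end up in lanes [3], [2], [1], [0].
    print(reverse_lanes(["e0", "e1", "e2", "e3"]))
```

Because the mapping depends only on the lane index and not on the data, it needs no control logic, which is consistent with implementing it as plain wiring.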

A rotator 350 receives the elements stored in the register of the inverter 340 and rotates the order of the elements to generate a third vector of elements. In some embodiments, the rotator 350 receives the second vector of elements stored at the register of the inverter 340 and rotates the elements to the right (i.e., performs a right bit-wise shift of the elements) to generate the third vector of elements. For example, the elements stored at the register of the inverter 340 may be sent to the rotator 350, and the rotator 350 may write the elements to another register in an order that reflects the rotation of the elements. To accomplish the rotation, the rotator 350 may have a barrel shift circuit that can bit-wise shift the elements in the register of the inverter 340 by a specified number of bits using combinational logic (i.e., without using sequential logic).
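As a rough software analogue of a barrel shift circuit, the sketch below rotates a list of whole elements in fixed stages of 1, 2, 4, ... positions, each stage enabled by one bit of the rotation amount. Two points are assumptions rather than statements from the specification: the staged structure is simply a common way to build barrel shifters, and "right" is taken to mean that an element moves from position i to position i-1 with wrap-around, as in a bit-wise right shift where position 0 is least significant. The name `barrel_rotate_right` is hypothetical.

```python
def barrel_rotate_right(lanes, amount):
    """Rotate `lanes` by `amount` positions toward lower indices, stage by stage."""
    n = len(lanes)
    if n == 0:
        return []
    amount %= n
    out = list(lanes)
    stage = 1
    while stage < n:
        if amount & stage:
            # This stage moves every element `stage` positions toward index 0.
            out = [out[(i + stage) % n] for i in range(n)]
        stage <<= 1
    return out


if __name__ == "__main__":
    print(barrel_rotate_right(["a", "b", "c", "d"], 1))
    print(barrel_rotate_right(["a", "b", "c", "d"], 3))
```

Each stage is a pure function of its inputs and of one bit of the amount, which mirrors the point in the text that the rotation can be done with combinational logic alone.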

The number of positions by which the elements received by the rotator 350 are rotated is determined based on a counter 315 that is in communication with the rotator 350. The counter 315 is set in response to an initiation signal 305. For example, the initiation signal 305 may be a single control signal that initiates operation of the transpose unit 300, including setting the counter 315. In some embodiments, the initiation signal 305 is a control signal provided by the sequencer 206 of FIG. 2, where the control signal may have been provided to the sequencer 206 (e.g., by the matrix-vector processing engine 150) or may have been generated by the sequencer 206 based on instructions received by the host interface 202.

In embodiments in which the rotator 350 performs right rotation, the initiation signal 305 causes the counter 315 to be set to a value of N-1, where N is equal to the number of elements that the rotator 350 can receive (i.e., equal to the width of the rotator 350). For the example architecture 300 of FIG. 3, the counter would therefore be set to 3 (i.e., 4-1) in response to the initiation signal 305. The counter 315 is configured to decrement each time the rotator 350 receives a different vector of elements of the input matrix (i.e., to decrement for each diagonal of elements of the input matrix processed by the rotator 350). The counter 315 is also configured to reset to N-1 after the rotator 350 has performed a rotation of 0 positions on a set of elements. Alternatively, the rotator 350 may be configured to determine when the counter 315 specifies a rotation of 0 positions for a set of elements, and in response may pass the values through the rotator 350 without performing a rotation operation.

Thus, for the transpose unit 300 of FIG. 3, the counter 315 causes the rotator 350 to rotate the first diagonal of elements of the input matrix by 3 positions, the second diagonal by 2 positions, the third diagonal by 1 position, and the fourth diagonal by 0 positions, and then to repeat this process for subsequent diagonals of elements of the input matrix, beginning with rotating the fifth diagonal by 3 positions. In effect, this rotation shifts the positions of the elements in the second vector received from the inverter 340, which represents the reflection of the elements of the diagonal of the input matrix across the main diagonal, into the appropriate positions so that the elements can be written as elements of the transposed output matrix.

Although described above as performing right rotation, in some embodiments the rotator 350 performs left rotation. In such embodiments, the counter may be initially set to 1 in response to the initiation signal 305, incremented for each set of elements processed by the rotator 350 until the rotator 350 rotates a set of elements by N-1 positions, and then reset to 0 after the rotation of elements by N-1 positions has been performed.
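The two counter schemes can be written out as short generators of rotation amounts. This is a sketch under stated assumptions: the function names are hypothetical, one amount is produced per diagonal, and the left-rotation counter is assumed to resume incrementing after it has been reset to 0 (the text does not say what follows the reset).

```python
def right_rotation_amounts(width, num_diagonals):
    """Counter starts at width-1, decrements per diagonal, resets after a 0 rotation."""
    counter = width - 1
    amounts = []
    for _ in range(num_diagonals):
        amounts.append(counter)
        counter = width - 1 if counter == 0 else counter - 1
    return amounts


def left_rotation_amounts(width, num_diagonals):
    """Counter starts at 1, increments per diagonal, resets to 0 after width-1."""
    counter = 1
    amounts = []
    for _ in range(num_diagonals):
        amounts.append(counter)
        counter = 0 if counter == width - 1 else counter + 1
    return amounts


if __name__ == "__main__":
    # A 4x4 matrix has 7 anti-diagonals.
    print(right_rotation_amounts(4, 7))  # [3, 2, 1, 0, 3, 2, 1]
    print(left_rotation_amounts(4, 7))   # [1, 2, 3, 0, 1, 2, 3]
```

For a width of 4, each right amount r and the corresponding left amount l satisfy (r + l) mod 4 = 0, so the two schemes describe the same net permutation of a 4-element register.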

The elements stored at the register of the rotator 350 can be accessed by value outputs 360, which then provide the elements to an interleaved memory writer 370 for writing into memory (e.g., into the memory 208). For example, after the rotated elements have been written to the register of the rotator 350 as the third vector, each of the value outputs 360 can access a corresponding element of the register of the rotator 350. For example, value output [0] 360 can access the element in the [0] position of the register of the rotator 350, value output [1] 360 can access the element in the [1] position of the register of the rotator 350, and so on.

The interleaved memory writer 370 receives the elements from the value outputs 360 and writes the elements into memory appropriately, so that the memory stores the output matrix (i.e., the transpose of the input matrix). For example, using techniques similar to those described below for the interleaved memory reader 310, the interleaved memory writer 370 can store the elements in the memory 208 so that the transposed output matrix is properly formatted. The transposed output matrix stored in the memory 208 can be returned as the result of a function computed by the dedicated hardware circuit 200 that includes the transpose unit, or can be further processed within the dedicated hardware circuit 200 to generate a result that may be returned by the matrix-vector processing system 100 in response to a request.
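The read, reverse, rotate, and write stages can be put together in a small end-to-end simulation. The sketch below is illustrative only and rests on several assumptions that are not spelled out in the specification: lane j of the reader delivers the element in row (cycle - j) of column j; "right rotation" moves elements toward lower lane indices; and lane m of the writer stores to output position (cycle - m, m), i.e., the writer follows the same per-cycle pattern as the reader. The function name `transpose_by_diagonals` is hypothetical. Under these assumptions the counter values N-1, N-2, ..., 0 do produce the transpose.

```python
def transpose_by_diagonals(matrix, width=4):
    """Simulate a width-lane transpose unit on a matrix of at most width x width."""
    rows, cols = len(matrix), len(matrix[0])
    assert rows <= width and cols <= width
    output = [[None] * rows for _ in range(cols)]
    counter = width - 1                      # set by the initiation signal
    for cycle in range(rows + cols - 1):     # one anti-diagonal per cycle
        # Interleaved read: lane j sees row (cycle - j) of column j, if it exists.
        first = [
            matrix[cycle - j][j] if j < cols and 0 <= cycle - j < rows else None
            for j in range(width)
        ]
        # Inverter: reverse the lane order.
        second = first[::-1]
        # Rotator: move elements `counter` positions toward lower lane indices.
        third = [second[(m + counter) % width] for m in range(width)]
        # Interleaved write: lane m stores to output position (cycle - m, m).
        for m in range(width):
            if m < rows and 0 <= cycle - m < cols:
                output[cycle - m][m] = third[m]
        counter = width - 1 if counter == 0 else counter - 1
    return output


if __name__ == "__main__":
    a = [[r * 4 + c for c in range(4)] for r in range(4)]
    for row in transpose_by_diagonals(a):
        print(row)
```

Lanes that carry no valid element hold `None` here; they are rotated along with the valid lanes but are never written, which matches the statement that unused values can be ignored or discarded.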

In some embodiments, the number of elements that can be received by the input register 330, the inverter 340, and the rotator 350 may be the same (i.e., the input register 330, the inverter 340, and the rotator 350 may all have the same width). In other embodiments, one or more of the input register 330, the inverter 340, or the rotator 350 may be able to receive a different number of elements and to store those elements as a vector, for example, in a register. In some embodiments, the value loaders 320 or the value outputs 360 may be optional components of the transpose unit architecture 300, for example, where the interleaved memory reader 310 can write data directly to the input register 330, or where the rotator 350 can send data directly to the interleaved memory writer 370.

In some embodiments, the transpose unit 300 can compute the transpose of an input matrix that is larger than the largest matrix that the transpose unit 300 can transpose directly. Because transposition is a recursive computation, the transpose of a larger matrix can be obtained by dividing the matrix into a set of smaller matrices, transposing the smaller matrices individually, and tiling the smaller transposed matrices to generate the transpose of the larger matrix. For example, the 4x4 transpose unit 300 can compute the transpose of a 16x16 input matrix by decomposing the 16x16 matrix into sixteen 4x4 matrices, computing the transpose of each of the sixteen 4x4 matrices, and tiling the sixteen transposed 4x4 matrices to obtain the transpose of the 16x16 input matrix.
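The decompose, transpose, and tile approach can be sketched as follows. The example assumes matrix dimensions that are exact multiples of the tile size, and it uses Python's `zip` as a stand-in for the hardware unit that would transpose each tile; the name `tiled_transpose` is hypothetical. The one detail worth making explicit is that the transposed tile taken from block position (i, j) must be placed at block position (j, i) of the result.

```python
def tiled_transpose(matrix, tile=4):
    """Transpose a large matrix by transposing tile x tile blocks independently."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    assert n_rows % tile == 0 and n_cols % tile == 0
    out = [[None] * n_rows for _ in range(n_cols)]
    for r0 in range(0, n_rows, tile):
        for c0 in range(0, n_cols, tile):
            # Extract one block; in hardware this would be fed to the transpose unit.
            block = [row[c0:c0 + tile] for row in matrix[r0:r0 + tile]]
            block_t = [list(col) for col in zip(*block)]
            # Tile the transposed block at the mirrored block position.
            for i in range(tile):
                for j in range(tile):
                    out[c0 + i][r0 + j] = block_t[i][j]
    return out


if __name__ == "__main__":
    big = [[r * 16 + c for c in range(16)] for r in range(16)]
    assert tiled_transpose(big) == [list(col) for col in zip(*big)]
    print("tiled transpose matches direct transpose")
```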

In some embodiments, computing the transpose of an input matrix that is larger than the largest matrix that the transpose unit 300 can transpose requires the input matrix to be processed by components external to the transpose unit 300. For example, the matrix-vector processing engine 150 of FIG. 1 may determine that the input matrix has dimensions exceeding those that the transpose unit 300 can process, and may therefore identify or generate sub-matrices of the input matrix that can be provided to the transpose unit 300 and processed individually by the transpose unit 300. The matrix-vector processing engine 150 may receive the transposes of the sub-matrices and tile them to obtain the transpose of the input matrix. In some embodiments, the transpose unit 300 or other components of the dedicated hardware circuit 110 may be able to decompose the input matrix and/or tile the transposes of the sub-matrices in hardware to generate the transpose of the input matrix. For example, control signals received by the dedicated hardware circuit may specify particular memory locations (e.g., in a split array) at which to store the transposes of the sub-matrices.

FIG. 4 shows an example architecture of an interleaved memory reader 400. The interleaved memory reader 400 accesses the elements of the diagonals of an input matrix and provides those elements to other components of a transpose unit (e.g., the transpose unit 300 of FIG. 3) to compute a matrix or vector transpose. The interleaved memory reader 400 can access a memory 430 (such as the memory 208) in which the input matrix or vector has been stored. For example, as part of processing a request to compute a matrix transpose, or a function that requires a matrix transpose, the input matrix or vector may be stored at the memory 430 and may be accessed by the interleaved memory reader 400 to compute the transpose of the input matrix.

The interleaved memory reader 400 includes multiplexers (Mux) 430. In some embodiments, the number of multiplexers 430 included in the interleaved memory reader 400 is equal to the number of elements that can be received by the inverter 340 of FIG. 3. In some embodiments, this number of multiplexers is also equal to the number of elements that can be received by the rotator 350 (i.e., when the inverter 340 and the rotator 350 have the same width). In those instances, the number of multiplexers is generally equal to the largest matrix dimension that the transpose unit can process. Thus, the example interleaved memory reader 400 shown in FIG. 4 may be used in a transpose unit that can transpose matrices up to 4x4 in size. In other examples, the interleaved memory reader 400 may have a greater number of multiplexers 430 (i.e., a greater width) than the inverter 340 or the rotator 350.

Each of the multiplexers 430 may be an N-to-1 multiplexer, where N is equal to the number of elements that can be received by the rotator 350 of FIG. 3. For example, the multiplexers 430 shown in FIG. 4 are 4-to-1 multiplexers, for use in a transpose unit that can transpose matrices up to 4x4 in size, so that the rotator 350 of the transpose unit also has a width of 4. Where the input matrix has the maximum size that the transpose unit can process, the respective inputs of each multiplexer 430 correspond to the rows of the input matrix, i.e., the 0th input of each multiplexer 430 corresponds to the 0th row of the input matrix, the 1st input of each multiplexer 430 corresponds to the 1st row of the input matrix, and so on. Additionally, each of the multiplexers 430 corresponds to a column of the input matrix, up to the multiplexer 430 that corresponds to the largest dimension of input matrix that the transpose unit can process. That is, where the input matrix has the maximum size that the transpose unit can process, multiplexer [0] corresponds to the 0th column of the input matrix, multiplexer [1] corresponds to the 1st column of the input matrix, and so on.

Thus, the multiplexers 430 enable access to every element of the input matrix, up to the largest matrix that the transpose unit can process. For example, the 0th input of multiplexer [2] provides access to the (0,2) element of the input matrix, the 3rd input of multiplexer [3] provides access to the (3,3) element of the input matrix, and so on.

To enable the interleaved memory read, the interleaved memory reader 400 includes an incrementer 435 that provides a control signal to each of the multiplexers 430. The incrementer 435 increments a control signal that is propagated to each of the multiplexers 430 in an interleaved fashion. For the example architecture 400 of FIG. 4, the incrementer 435 initially receives a value of 0 and provides that control signal as the select signal for multiplexer [0]. In the next iteration, the value of 0 is incremented to 1 and provided as the select signal for multiplexer [0], while the control signal with a value of 0 is propagated to multiplexer [1]. The control signal continues to propagate in this manner until the select signal at multiplexer [3] has a value of 3 (i.e., selects the (3,3) element of the input matrix). The pattern of select signals provided to the multiplexers 430 thus effectively specifies the order in which the diagonals of the input matrix are read for processing by the transpose unit, which is given in table 450.

As shown in table 450, at cycle 0 the first diagonal of a 4x4 input matrix (i.e., the (0,0) element of the input matrix) is read by the interleaved memory reader 400 and provided to the value loaders 420. At cycle 1, the elements (1,0) and (0,1), corresponding to the second diagonal of the 4x4 input matrix, are provided to the value loaders 420. At cycle 2, the elements (2,0), (1,1), and (0,2) of the third diagonal of the input matrix are provided to the value loaders 420. This process continues as shown in table 450 until all elements of the 4x4 input matrix have been read from the memory 430 in an interleaved fashion and provided to the value loaders 420. The value loaders 420 can receive the elements output by the multiplexers 430 and provide those elements to the input register 330 of FIG. 3 at each cycle. As shown in table 450, for a number of cycles one or more of the value loaders 420 may not receive an element that corresponds to an element of the input matrix. For these unused value loaders 420, the data that they receive from the corresponding multiplexers 430, or that they output to the input register, can be ignored. Additionally or alternatively, the value loaders 420 may be configured to refrain from outputting data to the input register 330 when their inputs do not correspond to elements of the input matrix.
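Table 450 itself is not reproduced in this passage, but the per-cycle pattern it describes can be reconstructed from the select-signal propagation. The sketch below assumes that multiplexer [j] receives the select value (cycle - j), so that it outputs element (cycle - j, j) whenever that row index is in range; cycles 0 to 2 match the elements listed in the text, and the later cycles are an extrapolation of the same rule. The name `read_schedule` is hypothetical.

```python
def read_schedule(n=4):
    """Return, per cycle, the (row, col) element delivered to each value loader.

    None marks a value loader that receives no valid matrix element in that cycle.
    """
    schedule = []
    for cycle in range(2 * n - 1):           # an n x n matrix has 2n - 1 anti-diagonals
        lanes = []
        for j in range(n):
            row = cycle - j                  # select value seen by multiplexer [j]
            lanes.append((row, j) if 0 <= row < n else None)
        schedule.append(lanes)
    return schedule


if __name__ == "__main__":
    for cycle, lanes in enumerate(read_schedule(4)):
        print(cycle, lanes)
```

Printing the schedule makes the unused value loaders visible: only cycle 3 of the 4x4 case fills all four lanes.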

In some embodiments, two control signals may be used to enable the transpose unit to compute multiple transposes simultaneously. For example, if a first control signal is provided to a first pair of the multiplexers 430 (e.g., multiplexers [0] and [1]) and a second control signal is provided to a second pair of the multiplexers 430 (e.g., multiplexers [2] and [3]), the 4x4 transpose unit can simultaneously compute two 2x2, 3x2, or 4x2 transposes. Each control signal can use the same propagation scheme discussed above, so that the 4x4 transpose unit computes the transposes of two 2x2, 3x2, or 4x2 matrices in the same number of cycles that computing a single one of those transposes would require.

In some embodiments, the interleaved memory reader 400 can support "bubbles" (i.e., erroneous gaps in the memory or data stream corresponding to the input matrix). To handle these errors, each of the multiplexers 430 may include a load enable input. The multiplexers 430 may be configured so that the load enable indicates whether a "bubble" has occurred, such that if a "bubble" does occur, the multiplexers 430 do not read the memory and the transpose process effectively stalls until the error has passed. The load enable may be configured to react to a "bubble" automatically, and to switch automatically to resume the transpose process after the "bubble" has passed. The load enable signal may be configured to allow the interleaved memory reader 400 to support "bubbles" that occur in every lane simultaneously (i.e., at each of the multiplexers 430 at the same time), or may be configured to allow the interleaved memory reader 400 to support "bubbles" that occur in selected lanes; for example, each multiplexer 430 may be controlled by a separate load enable signal or by a load enable signal shared by a subset of the multiplexers 430.
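One way to picture the stalling behaviour is a read-cycle counter that advances only on clocks where the load enable is asserted. This is a deliberately simplified, hypothetical model of a single load enable shared by all lanes; the specification does not describe the control logic at this level of detail, and the name `stalled_cycles` is invented for the example.

```python
def stalled_cycles(load_enable):
    """Map each clock tick to the read cycle performed, or None when stalled.

    `load_enable` is a sequence of booleans, one per clock tick; False models
    a "bubble" during which no memory read takes place.
    """
    cycle = 0
    trace = []
    for enabled in load_enable:
        if enabled:
            trace.append(cycle)   # perform the read for this diagonal
            cycle += 1            # and move on to the next diagonal
        else:
            trace.append(None)    # hold state; the transpose process waits
    return trace


if __name__ == "__main__":
    # Two-tick bubble after the second diagonal has been read.
    print(stalled_cycles([True, True, False, False, True, True]))
```

Because the cycle counter is held rather than skipped, every diagonal is still read exactly once and in order once the bubble has passed.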

In some embodiments, an interleaved memory writer, such as the interleaved memory writer 370 of FIG. 3, operates according to similar principles. For example, the interleaved memory writer may include N multiplexers, where N is the number of elements that can be received by the rotator 350 (i.e., the number of value outputs 360). Each multiplexer may be 1-to-N and may receive an element at its input from a value output (e.g., a value output 360 of FIG. 3). A control signal similar to that discussed above for the interleaved memory reader 400 is provided as a select signal to each multiplexer of the interleaved memory writer. The control signal for the interleaved memory writer may be provided by an incrementer similar to the incrementer 435 of the interleaved memory reader 400, and may be staggered so that select signals are provided to the multiplexers of the interleaved memory writer in the same way as in the interleaved memory reader 400. The multiplexers therefore write to the memory (e.g., the memory 208) in an interleaved fashion according to the same pattern discussed above and shown in table 450. That is, at cycle 0 the interleaved memory writer stores an element in the memory location corresponding to the (0,0) position of the transposed output matrix, and at cycle 1 it stores elements in the memory locations corresponding to the (1,0) and (0,1) positions of the transposed output matrix.

FIG. 5 is a flow diagram of an example process 500 for performing a matrix transpose computation. In general, the process 500 is performed by a system of one or more computers that includes a dedicated hardware circuit (e.g., the dedicated hardware circuit 110 of FIG. 1, which includes the transpose unit 120).

The system receives the elements of a diagonal of a matrix in a first vector (502). For example, an inversion circuit of the transpose unit (e.g., the inverter 340 of FIG. 3) can receive the elements of the diagonal of the input matrix. The elements can be received by the inversion circuit from, for example, the input register 330 or the value loaders 320 discussed with respect to FIG. 3. The elements of the diagonal may have been obtained by an interleaved memory reader (e.g., the interleaved memory reader 400 of FIG. 4) from the input matrix stored in a memory (e.g., a static random access memory (SRAM)). The inversion circuit can receive the elements of the diagonal in a register of the inversion circuit.

The system generates a second vector that includes the elements of the diagonal of the matrix in an order that is the reverse of their order in the first vector (504). The inversion circuit of the transpose unit can store the elements of the diagonal from the first vector in a register in the reverse of their order in the first vector. For example, the inverter 340 of FIG. 3 may store the elements of the diagonal of the input matrix, received from the input register 330 or the value loaders 320, in a register of the inverter 340 in the reverse of the order in which those elements are stored in the input register 330.

The system determines a number of positions by which to rotate the elements of the diagonal of the matrix in the second vector (506). For example, a rotation circuit of the transpose unit (e.g., the rotator 350 of FIG. 3) can determine the number of positions by which to rotate the elements of the diagonal in the second vector. In some embodiments, the rotation circuit may determine the number of positions by which to rotate the elements in the second vector based on a counter (e.g., the counter 315 of FIG. 3) that controls, or is accessible by, the rotation circuit.

In some embodiments, the counter may be initialized to a value of N-1, where N is equal to the number of elements that the rotation circuit can receive (i.e., the width of the register of the rotation circuit). The counter may be initialized in response to an initiation signal, such as a control signal that triggers the dedicated hardware circuit to perform the operations for computing a transpose. The counter may be decremented for each cycle in which the rotation circuit receives a second vector of elements of a diagonal of the input matrix. After the number of positions by which the rotation circuit rotates a second vector of elements is zero (i.e., after a cycle in which the rotation circuit does not rotate the elements of the second vector), the counter may be reset to the initialized value. In this way, the rotation circuit can determine the number of positions by which to rotate the elements in the second vector using a counter that requires only a single initiation control signal to output the correct number of positions for each of the cycles required to perform a complete transpose computation.

The system receives the second vector of elements of the diagonal of the matrix (508). For example, the rotation circuit of the system can receive the second vector of elements generated by the inversion circuit. In some embodiments, the rotation circuit (e.g., rotator 350 of FIG. 3) receives the second vector generated by the inversion circuit (e.g., inverter 340 of FIG. 3) by accessing the register of the inversion circuit that holds the elements of the second vector and storing those elements in a register of the rotation circuit. In other embodiments, the rotation circuit can access the second vector stored in the register of the inversion circuit without storing it in a register of the rotation circuit.

The system generates a third vector that includes the elements of the diagonal of the matrix in the order obtained by rotating the elements in the second vector by the determined number of positions (510). For example, the rotation circuit of the dedicated hardware circuit (e.g., rotator 350 of FIG. 3) can store the elements of the received second vector in its register in rotated order, so that the order of the elements in the register reflects a rotation by the determined number of positions. In some embodiments, the rotation circuit can perform a right rotation of the elements of the second vector when generating the third vector. For example, the right rotation rotates the elements of the second vector by the number of positions determined as described above with reference to step 506 (i.e., the number of positions specified by the counter). The rotation circuit can store the right-rotated elements in its register.
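
The reversal and right rotation applied to one diagonal can be sketched as follows. This is only an illustration under stated assumptions: the names `reverse_vector` and `rotate_right`, the element labels, and the lane numbering (lane 0 on the right, as with bit positions, so that a right rotation moves elements toward lower lane numbers) are not taken from this specification.

```python
from typing import List, Optional

Lane = Optional[str]  # None marks a lane that holds no valid element


def reverse_vector(first: List[Lane]) -> List[Lane]:
    """Second vector: the lanes of the first vector in reverse order."""
    return first[::-1]


def rotate_right(second: List[Lane], count: int) -> List[Lane]:
    """Third vector: the element in lane k moves to lane (k - count) mod N.

    Assumption: lanes are numbered like bits, with lane 0 on the right.
    """
    n = len(second)
    count %= n
    return second[count:] + second[:count]


# Second diagonal of a 4 x 4 matrix A, assuming lane k holds A[1 - k][k].
first = ["a10", "a01", None, None]
second = reverse_vector(first)      # [None, None, "a01", "a10"]
third = rotate_right(second, 2)     # counter value on the second cycle
print(third)                        # prints ['a01', 'a10', None, None]
```

Under these assumptions, lane 0 ends up holding a01 and lane 1 holds a10, which are the values that the transposed matrix has at positions (1,0) and (0,1).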

After the third vector of elements of the diagonal of the matrix is generated, the elements of the third vector can be stored in a memory (e.g., memory 208). An interleaved memory writer (e.g., as discussed with reference to FIGS. 3 and 4) can write the elements of the third vector held in the register of the rotation circuit to the appropriate locations of the memory, which effectively stores a diagonal of the transpose of the input matrix.

Process 500 can be repeated for each diagonal of the input matrix. For example, for a matrix of dimensions m × n, the system performs (m + n) - 1 iterations of process 500 to output the full transpose of the input matrix.
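
The iteration count can be checked with a short sketch that groups the positions of an m × n matrix by diagonal. The helper name `diagonals` is hypothetical, and the sketch assumes that a diagonal collects the positions whose row and column indices have the same sum, so that the first diagonal is the single element (0,0).

```python
from typing import List, Tuple


def diagonals(m: int, n: int) -> List[List[Tuple[int, int]]]:
    """Group the (row, column) positions of an m x n matrix by diagonal.

    Assumption: diagonal d holds the positions with row + column == d.
    """
    groups: List[List[Tuple[int, int]]] = [[] for _ in range(m + n - 1)]
    for row in range(m):
        for col in range(n):
            groups[row + col].append((row, col))
    return groups


groups = diagonals(3, 5)
print(len(groups))  # number of passes through process 500: prints 7
print(groups[1])    # second diagonal: prints [(0, 1), (1, 0)]
```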

FIGS. 6A-6C show an example of transposing matrices in a matrix-vector processor. In some embodiments, the example of FIGS. 6A-6C can be performed by the matrix-vector processing system 100 of FIG. 1, whose dedicated hardware circuit 110 includes the transposition unit 120. Specifically, FIGS. 6A-6C illustrate an example in which the matrix-vector processing system 100 computes multiple matrix transposes sequentially, such that a second matrix transpose can begin while the first matrix transpose computation is still in progress. The ability to overlap sequential transpose operations increases the efficiency of the matrix-vector processing system 100 when performing matrix transpose computations.

In each cycle of the example shown in FIGS. 6A-6C, memory 610 (e.g., a static random-access memory (SRAM) that can be used to implement memory 208) can be accessed by an interleaved memory read circuit (e.g., interleaved memory reader 400 of FIG. 4), and data from memory 610 can be placed in input register 620 (e.g., similar to input register 330 of FIG. 3). An inversion circuit (e.g., inverter 340 of FIG. 3) can reverse the order of the values in input register 620 and place the reversed values in the register of rotation circuit 630 (e.g., the register of rotator 350 of FIG. 3). Rotation circuit 630 rotates the values in its register by a determined number of positions (e.g., a number obtained from a counter similar to counter 315 of FIG. 3). The result of the rotation is placed in the register of interleaved memory write circuit 640 (e.g., the register of interleaved memory writer 370 or, alternatively, value output 360 of FIG. 3). An interleaved memory write of the values in the register of interleaved memory write circuit 640 is then performed to store the values in the appropriate locations of memory 650 (e.g., a static random-access memory (SRAM) that can be used to implement memory 208).
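
The per-cycle flow described above (interleaved read, reversal, rotation, interleaved write) can be simulated in software. The following is a minimal sketch rather than the circuit itself, and it rests on assumptions that are not stated in this specification: the matrix is square, lane k reads from and writes to row (cycle - k) of column k, and a right rotation moves the element in lane k to lane (k - count) mod N. The function name `transpose_by_diagonals` is hypothetical.

```python
from typing import List, Optional

Matrix = List[List[int]]


def transpose_by_diagonals(a: Matrix) -> Matrix:
    """Transpose a square matrix by handling one diagonal per cycle."""
    n = len(a)
    out: Matrix = [[0] * n for _ in range(n)]
    counter = n - 1  # rotation amount, initialised to N - 1
    for cycle in range(2 * n - 1):
        # Interleaved read: lane k takes row (cycle - k) of column k, if any.
        first: List[Optional[int]] = [
            a[cycle - k][k] if 0 <= cycle - k < n else None for k in range(n)
        ]
        second = first[::-1]                         # inversion circuit
        third = second[counter:] + second[:counter]  # rotation circuit
        # Interleaved write: lane k goes to row (cycle - k) of column k.
        for k in range(n):
            if 0 <= cycle - k < n:
                out[cycle - k][k] = third[k]
        # Decrement the counter, or reset it after a cycle with no rotation.
        counter = n - 1 if counter == 0 else counter - 1
    return out


a = [[4 * r + c for c in range(4)] for r in range(4)]
assert transpose_by_diagonals(a) == [list(col) for col in zip(*a)]
```

Lanes that hold no valid element in a given cycle carry `None` through the reversal and rotation and are simply not written, which corresponds to output data that can be ignored.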

Briefly, at cycle (a), shown in FIG. 6A, the first value (0,0), which corresponds to the first diagonal of the first input matrix, is received and processed according to the described method (e.g., process 500 of FIG. 5) so that the value is stored in the first location of memory 650. At cycle (b), the values of the second diagonal of the first input matrix are received and manipulated so that they are stored in the appropriate locations of memory 650. A similar process is repeated for cycles (c) and (d) to store the values of the third and fourth diagonals of the first input matrix in memory 650, such that the first input matrix is properly transposed (i.e., reflected across its main diagonal).

At cycle (e), shown in FIG. 6B, the fifth diagonal of the first input matrix is manipulated and stored in memory 650, and the same cycle also captures element (0,4), which corresponds to the first diagonal of the second input matrix and is stored in the appropriate location of memory 650. Thus, at cycle (e), the second matrix transpose computation begins without requiring additional computing capacity in the transposition unit. Cycles (f) and (g) show the completion of the transpose of the first input matrix, which at cycle (g) is stored in full in memory 650 and is the proper transpose of the first input matrix. These same cycles also complete the processing of the second and third diagonals of the second input matrix.

Cycle (h), shown in FIG. 6B, and cycles (i), (j), and (k), shown in FIG. 6C, process the remaining four diagonals of the second input matrix and result in the transpose of the second input matrix being stored in memory 650. The example of FIGS. 6A-6C thus shows that the transposition unit can process multiple input matrices sequentially and with overlap, which reduces the computational cost of performing multiple matrix transpose computations. Although FIGS. 6A-6C show the matrices being processed back to back without delay, in other embodiments the gap between two input matrices can be of any duration. In those cases, the transposition unit still correctly computes the transposes of the input matrices, and the data it outputs during the gap between input matrices can be ignored or discarded.
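
The overlap in FIGS. 6A-6C can be tabulated with a small scheduling sketch. It assumes, as the figures suggest for 4 × 4 matrices, that each n × n matrix enters n cycles after the previous one; the function name `overlapped_schedule` and the zero-based cycle, matrix, and diagonal indices are hypothetical.

```python
from typing import List, Tuple


def overlapped_schedule(n: int, matrices: int) -> List[List[Tuple[int, int]]]:
    """For each cycle, list the (matrix index, diagonal index) pairs handled.

    Assumption: each n x n matrix starts n cycles after the previous one.
    """
    total_cycles = (matrices + 1) * n - 1
    schedule: List[List[Tuple[int, int]]] = []
    for cycle in range(total_cycles):
        active = []
        for index in range(matrices):
            diagonal = cycle - index * n
            if 0 <= diagonal < 2 * n - 1:  # an n x n matrix has 2n - 1 diagonals
                active.append((index, diagonal))
        schedule.append(active)
    return schedule


schedule = overlapped_schedule(4, 2)
print(len(schedule))  # cycles (a) through (k): prints 11
print(schedule[4])    # cycle (e): prints [(0, 4), (1, 0)]
```

With these assumptions, two 4 × 4 matrices finish in 11 cycles, compared with 2 × 7 = 14 cycles if the second transpose waited for the first to complete.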

As described above, a circuit for transposing a matrix includes: an inversion circuit configured to, for each of one or more diagonals of the matrix, receive the elements of the matrix in a first vector and generate a second vector that includes the elements of the matrix in an order that is the reverse of their order in the first vector; and a rotation circuit configured to, for each of the one or more diagonals of the matrix, determine a number of positions by which to rotate the elements of the matrix in the second vector, receive the second vector of elements of the matrix, and generate a third vector that includes the elements of the matrix in the order obtained by rotating the elements in the second vector by the determined number of positions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, or in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs (i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus). Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, for example a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including, for example, a programmable processor, a computer, or multiple processors or computers. The apparatus can include special-purpose logic circuitry (e.g., an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit)). In addition to hardware, the apparatus can also include code that creates an execution environment for the computer program in question, such as code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit).

By way of example, a computer suitable for the execution of a computer program can be based on a general-purpose or special-purpose microprocessor, or both, or on any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random-access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data (e.g., magnetic disks, magneto-optical disks, or optical disks). However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), among others.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can send input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, such as a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN") (e.g., the Internet).

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features of a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain embodiments, multitasking and parallel processing may be advantageous.

Claims

1. A circuit for transposing a matrix, the circuit comprising:

an interleaved memory read circuit including a plurality of multiplexers, wherein the interleaved memory read circuit is configured to, for each of one or more diagonals of the matrix:

access each element of the diagonal; and

output, by the plurality of multiplexers, each element of the diagonal into a corresponding first vector;

an inversion circuit configured to:

receive, from the plurality of multiplexers of the interleaved memory read circuit and for each of the one or more diagonals of the matrix, the elements of the diagonal in the corresponding first vector, and

generate, for each of the one or more diagonals of the matrix, a corresponding second vector comprising the elements of the diagonal in an order that is the reverse of the order of the elements of the diagonal in the corresponding first vector; and

a rotation circuit configured to:

determine, for each of the one or more diagonals of the matrix, a number of positions by which to rotate the elements of the diagonal in the corresponding second vector;

receive, for each of the one or more diagonals of the matrix, the elements of the diagonal in the corresponding second vector; and

generate, for each of the one or more diagonals of the matrix, a corresponding third vector comprising the elements of the diagonal in an order formed by rotating the elements of the diagonal in the corresponding second vector by the determined number of positions.

2. The circuit according to claim 1, further comprising:

a counting circuit configured to output to the rotation circuit, for each of the one or more diagonals of the matrix, the number of positions by which to rotate the elements of the diagonal in the corresponding second vector.

3. The circuit of claim 2, wherein the counting circuit is further configured to output a value as the number of positions for each of the one or more diagonals of the matrix, wherein the initial value output by the counting circuit is equal to N-1, wherein N is equal to the width of the rotation circuit.

4. The circuit of claim 3, wherein the counting circuit is further configured to adjust the value for each of the one or more diagonals of the matrix, wherein adjusting the value includes:

decrementing the value output by the counting circuit when the value is positive; and

resetting the value to the initial value when the value is zero.

5. The circuit according to claim 1, wherein the matrix is a submatrix.

6. The circuit of claim 1, wherein the plurality of multiplexers comprises M multiplexers, wherein M is equal to the width of the inversion circuit, and wherein each multiplexer is configured to output one of a plurality of elements of a column of the matrix.

7. The circuit of claim 6, wherein the interleaved memory read circuit is further configured to receive a control signal that specifies the input of each of the M multiplexers.

8. The circuit of claim 6, wherein each of the M multiplexers is an N-to-1 multiplexer, wherein N is the number of elements that can be received by the rotation circuit.

9. The circuit of claim 6, wherein the interleaved memory read circuit is further configured to:

receive a first control signal that specifies an input for each multiplexer in a first subset of the M multiplexers; and

receive a second control signal that specifies an input for each multiplexer in a second subset of the M multiplexers.

10. The circuit according to claim 1, further comprising:

an interleaved memory write circuit configured to, for each of the one or more diagonals of the matrix, write the elements of the diagonal in the corresponding third vector to memory as a diagonal of a transposed output matrix.

11. The circuit of claim 1, wherein the matrix comprises two or more matrices stored in memory as a single matrix.

12. The circuit of claim 1, wherein the rotation circuit is configured to generate the corresponding third vector by performing a right rotation of the elements of the diagonal in the corresponding second vector by the determined number of positions.

13. The circuit of claim 1, wherein the matrix is stored in a static random access memory accessible to the circuit.

14. The circuit of claim 1, wherein, for each of the one or more diagonals of the matrix, the elements of the diagonal in the corresponding third vector are stored in a static random access memory as the diagonal of the transposed output matrix.

15. The circuit of claim 1, further comprising a second rotation circuit, the second rotation circuit being configured to:

determine, for each of one or more diagonals of a second matrix, a second number of positions by which to rotate the elements of the diagonal;

receive, for each of the one or more diagonals of the second matrix, a corresponding fourth vector comprising the elements of the diagonal; and

generate, for each of the one or more diagonals of the second matrix, a corresponding fifth vector comprising the elements of the diagonal in an order formed by rotating the elements of the diagonal in the corresponding fourth vector by the determined second number of positions.

16. The circuit of claim 15, further comprising:

a second counting circuit configured to output to the second rotation circuit, for each of the one or more diagonals of the second matrix, the second number of positions by which to rotate the elements of the diagonal in the corresponding fourth vector.

17. A circuit for transposing an input vector, the circuit comprising:

an interleaved memory read circuit including a plurality of multiplexers, wherein the interleaved memory read circuit is configured to:

access each element of the input vector; and

output, by the plurality of multiplexers, each element of the input vector into a first vector;

an inversion circuit configured to:

receive, for each element of one or more elements of the input vector, the first vector comprising the elements of the input vector, and

generate, for each element of the one or more elements of the input vector, a second vector comprising the elements of the first vector in an order that is the reverse of the order of the elements in the first vector; and

a rotation circuit configured to:

determine, for each element of the one or more elements of the input vector, a number of positions by which to rotate the elements in the second vector;

receive, for each element of the one or more elements of the input vector, the elements of the second vector; and

generate, for each element of the one or more elements of the input vector, a third vector comprising the elements of the second vector in an order formed by rotating the elements of the second vector by the determined number of positions.