WO2021088563A1

WO2021088563A1 - Convolution operation circuit, apparatus and method

Info

Publication number: WO2021088563A1
Application number: PCT/CN2020/118070
Authority: WO
Inventors: 王维伟; 罗飞
Original assignee: Stream Computing Inc
Current assignee: Stream Computing Inc
Priority date: 2019-11-04
Filing date: 2020-09-27
Publication date: 2021-05-14
Anticipated expiration: 2022-05-04
Also published as: CN112784973A; CN112784973B

Abstract

Embodiments of the present disclosure disclose a convolution operation circuit, an apparatus and a method. The convolution operation circuit comprises: a convolution operation control circuit; and an operation unit array comprising a plurality of operation units. The convolution operation control circuit is configured to receive a convolution operation instruction and to transmit, according to an operation order indicated by the convolution operation instruction, input data and weight data one by one to the operation units, among the plurality of operation units, that participate in the operation. The operation units participating in the operation perform a convolution operation on the input data and the weight data as indicated by the instruction, the instruction being a single instruction. This solves the technical problems in the prior art of low calculation efficiency and high power consumption during convolution calculation.

Description

Convolution operation circuit, device and method

Technical field

The present disclosure relates to the field of neural network computing, and in particular to a convolution operation circuit, device and method.

Background art

With the development of science and technology, human society is rapidly entering the intelligent era. A key feature of this era is that people obtain more and more types of data in ever larger volumes, while the demands on data processing speed keep rising. Chips are the cornerstone of data processing and fundamentally determine people's ability to process data. In terms of application fields, chips follow two main routes. One is the general-purpose route, such as the CPU (Central Processing Unit); such chips offer great flexibility, but their effective computing power is relatively low when processing domain-specific algorithms. The other is the special-purpose route, such as the TPU (Tensor Processing Unit); such chips deliver high effective computing power in certain specific domains, but handle flexible, more general-purpose workloads poorly or not at all. Because the data of the intelligent era is both highly varied and enormous in volume, chips are required to be extremely flexible, so that they can handle algorithms that differ across fields and change rapidly, and to have very strong processing capability, so that they can quickly process extremely large and rapidly growing volumes of data.

Artificial intelligence computing frequently requires convolution calculations. Existing implementations of convolution calculation generally follow one of two schemes:

(1) CPU scheme: With a single-core CPU, the matrices involved in the convolution calculation are broken down into scalars, and the convolution is implemented by combining scalar instructions. With a multi-core CPU, multiple cores may each execute their own scalar instructions in parallel, and the results are combined to implement the convolution. This scheme has the following disadvantages: the low-level program is complex and generally needs multiple nested loops to implement the convolution; implementing the convolution with general-purpose computing instructions is inefficient and requires many branch jumps; the CPU cache is limited, so a relatively large convolution requires moving data from off-chip many times, which reduces efficiency; the CPU must access data many times, which increases the computation time of the convolution; the same repeated data accesses also increase the power consumed by the convolution; and with multi-core parallel computing, inter-core communication is complex and communication performance may become a bottleneck.

(2) GPU (Graphics Processing Unit) scheme: The GPU breaks the convolution operation down into multiple instructions, mainly vector instructions, and implements the convolution by executing these vector instructions in combination. This scheme has the following disadvantages: the low-level program is complex and generally needs multiple nested loops to implement the convolution; implementing the convolution through repeated combinations of vector instructions is relatively inefficient; the GPU must access data many times, which increases the computation time of the convolution; the same repeated data accesses also increase the power consumed by the convolution; and the GPU cache is limited, so a relatively large convolution requires moving data from off-chip many times, which reduces efficiency.

Summary of the invention

This summary is provided to introduce, in simplified form, concepts that are described in detail in the detailed description below. It is not intended to identify key or essential features of the claimed technical solution, nor to limit the scope of the claimed technical solution.

To solve the technical problems in the prior art of low calculation efficiency and high power consumption when performing convolution calculations, embodiments of the present disclosure propose the following technical solutions:

In a first aspect, an embodiment of the present disclosure provides a convolution operation circuit, including:

a convolution operation control circuit; and

an operation unit array, the operation unit array including a plurality of operation units;

wherein the convolution operation control circuit is configured to receive a convolution operation instruction and to transmit the input data and the weight data one by one, in the operation order indicated by the convolution operation instruction, to the operation units participating in the operation among the plurality of operation units, and the operation units participating in the operation perform a convolution operation on the input data and the weight data as indicated by the instruction, the instruction being a single instruction.

Further, the convolution operation instruction includes an instruction name, a first address of the input data, a first address of the weight data, and a first address of the output data.

Further, the operation unit includes an arithmetic unit, and the arithmetic unit includes at least a multiplier and an adder;

the operation unit is configured to combine the multiplier and the adder according to the convolution operation instruction to perform the convolution operation.

Further, the operation units participating in the operation performing the convolution operation on the input data and the weight data as indicated by the instruction includes:

calculating the product of the input data and the weight data with the multiplier;

calculating the accumulated value of the products with the adder; and

outputting the accumulated value.

Further, the input data is row vector data of a first input matrix, and the weight data is row vector data of a convolution kernel.

In a second aspect, an embodiment of the present disclosure provides a convolution operation device, including:

a memory, configured to store a convolution operation instruction, input data, weight data and output data;

an instruction fetch module, connected to the memory and configured to fetch the convolution operation instruction from the memory;

a decoding module, connected to the instruction fetch module and configured to decode the convolution operation instruction fetched by the instruction fetch module;

a register, configured to store attribute data of the input data, attribute data of the weight data and attribute data of the output data; and

an execution module, connected to the decoding module, the memory and the register, including the convolution operation circuit according to claims 1-6, and configured to execute the decoded convolution operation instruction.

Further, the execution module obtains the decoded convolution operation instruction from the decoding module;

the execution module obtains the attribute data of the input data, the attribute data of the weight data and the attribute data of the output data from the register;

the execution module obtains the input data and the weight data used for the calculation from the memory according to the attribute data of the input data and the attribute data of the weight data;

the execution module performs the calculation on the input data and the weight data according to the decoded convolution operation instruction to obtain the output data; and

the execution module stores the output data in the memory according to the attribute data of the output data.

Further, the input data is data of a first input matrix, the weight data is data of a convolution kernel, and the output data is data of an output matrix obtained by the convolution; the attribute data of the input data includes the number of rows, the number of columns, the depth and the depth storage interval of the first input matrix; the attribute data of the weight data includes the number of rows, the number of columns and the stride of the convolution kernel, and the row storage interval of the convolution kernel; the attribute data of the output data includes the number of rows, the number of columns, the depth and the depth storage interval of the output matrix.

Further, the execution module obtaining the input data and the weight data used for the calculation from the memory according to the attribute data of the input data and the attribute data of the weight data includes:

the execution module reading the data of the first input matrix according to a preset first reading mode and the attribute data of the input data; and

the execution module reading the data of the convolution kernel according to a preset second reading mode and the attribute data of the weight data.

Further, the first reading mode is reading by row or reading by column, and the second reading mode is reading by row or reading by column.

In a third aspect, an embodiment of the present disclosure provides a matrix operation method based on the convolution operation circuit of any one of the foregoing first aspect, including:

fetching a convolution operation instruction from a memory;

decoding the convolution operation instruction and sending the decoded convolution operation instruction to the convolution operation circuit; and

based on the decoded convolution operation instruction, the convolution operation circuit obtaining input data and weight data from the memory, performing the operation, and storing the operation result in the memory after the operation is completed.
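The three method steps (fetch, decode, execute and store) can be pictured with a small software model. The following is a minimal sketch under stated assumptions, not the disclosed hardware: the tuple encoding of the instruction, the mnemonic `"conv"`, the field names, and the explicit `length` argument are all hypothetical. In the disclosure, sizes and storage intervals come from attribute data held in registers rather than from an argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvInstruction:
    """Decoded instruction: a name plus three first addresses (field names are hypothetical)."""
    name: str
    input_addr: int
    weight_addr: int
    output_addr: int


def fetch(memory, pc):
    """Step 1: fetch the raw instruction stored at address pc."""
    return memory[pc]


def decode(raw):
    """Step 2: decode the raw tuple into its four fields."""
    name, input_addr, weight_addr, output_addr = raw
    if name != "conv":  # hypothetical mnemonic
        raise ValueError(f"unsupported instruction: {name}")
    return ConvInstruction(name, input_addr, weight_addr, output_addr)


def execute(memory, inst, length):
    """Step 3: read operands from memory, multiply-accumulate, store the result."""
    acc = 0
    for offset in range(length):
        acc += memory[inst.input_addr + offset] * memory[inst.weight_addr + offset]
    memory[inst.output_addr] = acc


def run(memory, pc, length):
    """Run one instruction through fetch, decode and execute; return the memory."""
    execute(memory, decode(fetch(memory, pc)), length)
    return memory


# Toy memory: one instruction at address 0, inputs at 10.., weights at 20..
memory = {0: ("conv", 10, 20, 30), 10: 1, 11: 2, 12: 3, 20: 4, 21: 5, 22: 6}
run(memory, pc=0, length=3)
```

After the call, `memory[30]` holds the single accumulated output, which mirrors the point of the method: one instruction drives the whole read, compute and write-back sequence.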

In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that, when running, the processors implement the convolution operation method of any one of the foregoing third aspect.

In a fifth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the convolution operation method of any one of the foregoing third aspect.

In a sixth aspect, an embodiment of the present disclosure provides a computer program product including computer instructions which, when executed by a computing device, enable the computing device to execute the convolution operation method of any one of the foregoing third aspect.

In a seventh aspect, an embodiment of the present disclosure provides a chip including the matrix operation circuit of any one of the first aspect.

In an eighth aspect, an embodiment of the present disclosure provides a computing device including the chip of any one of the seventh aspect.

Embodiments of the present disclosure disclose a convolution operation circuit, device and method. The convolution operation circuit includes: a control circuit; a convolution operation control circuit; and an operation unit array including a plurality of operation units. The convolution operation control circuit is configured to receive a convolution operation instruction and to transmit the input data and the weight data one by one, in the operation order indicated by the convolution operation instruction, to the operation units participating in the operation among the plurality of operation units. The operation units participating in the operation perform a convolution operation on the input data and the weight data as indicated by the instruction, the instruction being a single instruction. This solves the technical problems in the prior art of low calculation efficiency and high power consumption when performing convolution calculations.

The above is only an overview of the technical solutions of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objectives, features and advantages of the present disclosure are more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

Description of the drawings

The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference signs denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

FIG. 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of an operation unit provided by an embodiment of the present disclosure;

FIG. 3 is a further schematic structural diagram of the operation unit provided by an embodiment of the present disclosure;

FIGS. 4a-4b are overall schematic diagrams of the convolution operation in an embodiment of the present disclosure;

FIGS. 5-9 show the decomposed calculation process of the convolution operation provided by an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a specific example of the convolution operation in an embodiment of the present disclosure;

FIG. 11 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the present disclosure;

FIGS. 12a-12e are schematic diagrams of the storage formats of the input data, the weight data and the output data in the present disclosure;

FIG. 13 is a schematic diagram of the block-wise calculation of the convolution operation in the present disclosure;

FIGS. 14a-14f are schematic diagrams of an example of performing a convolution operation using the convolution operation circuit disclosed in an embodiment of the present disclosure.

Detailed description

Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.

It should be understood that the steps recorded in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and its variations as used herein are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.

Note that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different devices, modules or units, and not to limit the order of, or the interdependence between, the functions performed by these devices, modules or units.

Note that the modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they are to be read as "one or more".

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.

FIG. 1 is a schematic structural diagram of a convolution operation circuit provided by an embodiment of the present disclosure. As shown in FIG. 1, the convolution operation circuit 100 includes a convolution operation control circuit 101 and an operation unit array 102, and the operation unit array includes a plurality of operation units (PU) 103. As shown in FIG. 2, each operation unit 103 includes a first input register (Rin1) 104, a second input register (Rin2) 105 and an output register (Rout) 106. The first input register 104 receives the input data, and the second input register 105 receives the weight data. The convolution operation control circuit 101 is configured to receive a convolution operation instruction and to transmit the input data and the weight data one by one, in the operation order indicated by the instruction, to the operation units 103 participating in the operation among the plurality of operation units 103. The participating operation units 103 perform a convolution operation on the input data and the weight data as indicated by the instruction, the instruction being a single instruction. The output register 106 stores the result of the operation, that is, the output data.

The operation order indicated by the convolution operation instruction is defined as follows: the convolution operation is divided into multiple steps for execution, and the sequential relationship between those steps is the operation order indicated by the convolution operation instruction. For the specific operation order, see the description of the instruction below.

An operation unit participating in the operation is at least one of the plurality of operation units that performs a specific operation.

In the present disclosure, the convolution operation instruction includes an instruction name, the first address of the input data, the first address of the weight data and the first address of the output data. The following table shows an exemplary instruction format:

[Table not reproduced in this text: exemplary instruction format.]

The instruction name corresponds to the meaning, format and implemented operation of the instruction. The first address of the input data and the first address of the weight data define the read addresses of the instruction's two source operands, and the first address of the output data defines the storage address of the instruction's destination operand. The convolution instruction includes a multiplication operation and an addition operation: it first multiplies the input data by the weight data and then accumulates the products to obtain the convolution result of the input data and the weight data, that is, the output data. One output datum is calculated as follows:

[Formula and its symbol definitions are images in the original publication and are not reproduced in this text.]
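The accumulation example later in this section forms a₁*b₁ + a₂*b₂ and continues in the same way, so the omitted formula is presumably a sum of products. A reconstruction under that assumption follows; the symbol $c$ for the output datum and the count $n$ are introduced here for illustration and do not come from the source.

```latex
c = \sum_{i=1}^{n} a_i \, b_i
```

Here $a_i$ is the $i$-th input datum delivered to the first input register, $b_i$ is the $i$-th weight datum delivered to the second input register, and $n$ is the number of products accumulated for one output datum.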

FIG. 3 is a further schematic structural diagram of the operation unit provided by an embodiment of the present disclosure. As shown in FIG. 3, in addition to the first input register, the second input register and the output register, the operation unit includes an arithmetic unit, which includes at least a multiplier 301 and an adder 302. The operation unit combines the multiplier and the adder according to the convolution operation instruction to perform the convolution operation.

Specifically, the operation units participating in the operation performing the convolution operation on the input data and the weight data as indicated by the instruction includes: calculating the product of the input data and the weight data with the multiplier; calculating the accumulated value of the products with the adder; and storing the accumulated value in the output register. The input data in the first input register is data read into the first input register sequentially, starting from the first address of the input data; the weight data in the second input register is data read into the second input register sequentially, starting from the first address of the weight data.

For example, after the convolution operation control circuit transmits the input data and the weight data to an operation unit, the operation unit uses the multiplier to calculate the product of the input data in the first input register and the weight data in the second input register. Continuing the example above, suppose that in one clock cycle the first input register receives a₁ and the second input register receives b₁. The multiplier calculates the product a₁*b₁ and sends it to the adder for accumulation; because the accumulator had no input in the previous clock cycle, its accumulated result is a₁*b₁. In the next clock cycle the same operation continues: the first input register receives a₂ and the second input register receives b₂, the multiplier calculates the product a₂*b₂ and sends it to the adder, and since the accumulator holds a₁*b₁ from the previous clock cycle, this cycle yields a₁*b₁+a₂*b₂. This continues until one row of the input data and one column of the weight data have been processed, giving the final accumulated value, which is then stored in the output register. The convolution operation control circuit then stores the accumulated value from the output register into the system memory according to the first address of the output data in the instruction.
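The cycle-by-cycle behaviour described above can be mimicked in software. The following is a minimal sketch, assuming one operand pair arrives per clock cycle; the class name `ProcessingUnit`, the `acc` field and the method names are hypothetical, and only Rin1, Rin2 and Rout correspond to registers named in the text.

```python
from dataclasses import dataclass


@dataclass
class ProcessingUnit:
    """Toy model of one operation unit (PU) with registers Rin1, Rin2 and Rout."""
    rin1: float = 0.0  # first input register: current input datum
    rin2: float = 0.0  # second input register: current weight datum
    acc: float = 0.0   # running sum kept by the adder (hypothetical name)
    rout: float = 0.0  # output register: final accumulated value

    def clock(self, a, b):
        """One clock cycle: latch the operands, multiply them, add to the running sum."""
        self.rin1, self.rin2 = a, b
        self.acc += self.rin1 * self.rin2

    def finish(self):
        """Move the final accumulated value into Rout and clear the running sum."""
        self.rout = self.acc
        self.acc = 0.0
        return self.rout


def run_mac(inputs, weights):
    """Feed operand pairs to a fresh PU one cycle at a time and return Rout."""
    pu = ProcessingUnit()
    for a, b in zip(inputs, weights):
        pu.clock(a, b)
    return pu.finish()


# Three cycles: 1*4, then + 2*5, then + 3*6
result = run_mac([1, 2, 3], [4, 5, 6])
```

In the first cycle the running sum starts at zero, matching the statement that the accumulator has no input from a previous cycle; the value only reaches Rout after the last pair has been processed.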

一个示例性的卷积运算如下：An exemplary convolution operation is as follows:

图4a-图4b为本公开实施例中的卷积运算的示意图。如图4a为卷积运算过程的整体示意图，其中的图示解释如下：4a-4b are schematic diagrams of convolution operations in an embodiment of the disclosure. Figure 4a is the overall schematic diagram of the convolution operation process, and the diagram is explained as follows:

Win: width of the input feature map;

Hin: height of the input feature map;

Cin: number of channels of the input feature map, hereinafter referred to as the depth of the input feature map;

S_Cin: physical storage interval along the depth of the input feature map;

Wout: width of the output feature map;

Hout: height of the output feature map;

Cout: number of channels of the output feature map, hereinafter referred to as the depth of the output feature map;

S_Cout: physical storage interval along the depth of the output feature map;

Kw: width of the convolution kernel;

Kh: height of the convolution kernel;

S_kernel: interval of the convolution kernel in physical storage;

N_dilation: dilation value of the convolution kernel;

S_K: stride of the convolution sliding window;

N_pad_u: number of zeros padded at the top;

N_pad_d: number of zeros padded at the bottom;

N_pad_l: number of zeros padded on the left;

N_pad_r: number of zeros padded on the right.
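The text does not state how the output size follows from these parameters. The sketch below uses the relation commonly applied to convolutions and assumes that N_dilation = 1 means no dilation; both the relation and the function name `output_size` are assumptions, not taken from the disclosure.

```python
def output_size(size_in, kernel, stride, pad_before, pad_after, dilation=1):
    """Output width (or height) for one spatial dimension.

    Assumes the common convention: a kernel with dilation d spans
    d * (kernel - 1) + 1 input positions, and d = 1 means no dilation.
    """
    effective_kernel = dilation * (kernel - 1) + 1
    padded = size_in + pad_before + pad_after
    if padded < effective_kernel:
        raise ValueError("kernel does not fit in the padded input")
    return (padded - effective_kernel) // stride + 1
```

Under this assumption, Wout would be `output_size(Win, Kw, S_K, N_pad_l, N_pad_r, N_dilation)` and Hout would be `output_size(Hin, Kh, S_K, N_pad_u, N_pad_d, N_dilation)`.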

The feature points of the input feature map constitute the first input matrix. The output feature map is the output matrix, and one feature point of the output feature map is one datum of the output matrix. During the convolution operation the convolution kernel slides over the input feature map. At each sliding position it performs a multiply-accumulate with the corresponding data of the input feature map and extracts one output feature point, that is, one datum of the output matrix.

Figure 4b is a schematic diagram of the computation for one feature point with depth. As shown in Figure 4b, the convolution kernel slides over the input feature map. When it stops at a position, it performs a multiply-accumulate with the feature points of the input feature map at that position and produces the output feature point corresponding to that position. There are Cout convolution kernels in total, and each of them performs a multiply-accumulate with the feature points of the input feature map at this same position, giving Cout output feature points in the depth direction. These Cout output feature points form one feature point with depth on the output feature map, and the depth of this point is Cout. The convolution kernels slide over the whole input feature map to produce the whole output feature map.

For a convolution kernel at depth l (1<=l<=Cout), the feature extraction formula is as follows:

Dout is a point with depth in the output feature map, and its superscript l is the output depth. Din is the data of the input feature map covered by the convolution kernel; its superscript i is the depth in the input feature map, and j and k are the positions along the kernel width and height respectively. w is the convolution kernel; its superscripts l and i are the depth in the output feature map and the depth in the input feature map respectively, and j and k are the positions along the width and height of this kernel.
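The formula itself is not reproduced in this text. A form consistent with the symbol definitions above, assuming zero-based indices i, j, k and leaving out the stride, dilation, and padding offsets, would be:

```latex
D_{out}^{l} \;=\; \sum_{k=0}^{K_h-1} \; \sum_{j=0}^{K_w-1} \; \sum_{i=0}^{C_{in}-1} D_{in}^{i}(j,k)\, w^{l,i}(j,k)
```

Here $D_{in}^{i}(j,k)$ denotes the input value at depth i under kernel position (j, k), and $w^{l,i}(j,k)$ the weight of the l-th kernel at the same depth and position; the argument notation $(j,k)$ is an illustrative choice.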

A convolution kernel of size Kh*Kw*Cin can be divided into Kh partial convolution kernels of size Kw*Cin, each of which performs a partial feature extraction. Each partial extraction accomplishes 1/Kh of the whole feature extraction, namely the features corresponding to one Kw*Cin partial kernel, and yields one partial result. Adding the Kh partial results gives the final result.

Each Kw*Cin partial result can in turn be divided into Kw steps, each covering 1/Kw of the partial kernel. Adding the Kw step results gives the partial result.

A single step is implemented by multiplying a one-row input data matrix (part of the first input matrix) by a one-column weight matrix (part of the convolution kernel), as shown in Figure 5.

A Kw*Cin partial result is likewise implemented by multiplying a one-row input data matrix by a one-column weight matrix, except that the row and the column each contain Kw times as many data as in a single step, as shown in Figure 6.

The convolution that produces one complete point of the output feature map is again implemented by multiplying a one-row input data matrix by a one-column weight matrix, except that the row and the column each contain Kh times as many elements as for a Kw*Cin partial result, as shown in Figure 7.
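The intermediate formulas are not reproduced in this text. Writing $D_{in}^{i}(j,k)$ for the input value at depth i under kernel position (j, k) and $w^{l,i}(j,k)$ for the matching weight of the l-th kernel, the decomposition described above can be sketched as follows. The names $P_{j,k}^{l}$ (one step) and $R_{k}^{l}$ (one Kw*Cin partial result) are hypothetical, and zero-based indices are assumed:

```latex
\begin{aligned}
P_{j,k}^{l} &= \sum_{i=0}^{C_{in}-1} D_{in}^{i}(j,k)\, w^{l,i}(j,k) \\
R_{k}^{l}   &= \sum_{j=0}^{K_w-1} P_{j,k}^{l} \\
D_{out}^{l} &= \sum_{k=0}^{K_h-1} R_{k}^{l}
\end{aligned}
```

Each line is a row-times-column product: $P_{j,k}^{l}$ over $C_{in}$ elements, $R_{k}^{l}$ over $K_w C_{in}$ elements, and $D_{out}^{l}$ over $K_h K_w C_{in}$ elements.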

There are Cout convolution kernels, so the depth of an output feature point is Cout. The one-row input data matrix can be multiplied by a convolution kernel matrix of Cout columns formed from the Cout kernels, that is, the weight matrix, to give one feature point with depth. This feature point is a vector whose length is the depth Cout of the output feature point, as shown in Figure 8.

Because convolution or partial convolution in a neural network is the process of sliding the convolution kernel over the input feature map, it can be viewed as a process in which the input feature map data changes with each slide while the weights stay the same. The convolution then becomes the following: the kernel slides Wout*Hout times, once per output feature point, and an input data matrix of Wout*Hout rows is multiplied by a weight matrix of Cout columns to give an output data matrix of Wout*Hout rows and Cout columns, as shown in Figure 9.
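A minimal pure-Python sketch of this matrix view follows, assuming sliding with a stride only and no padding or dilation. The function names and the nested-list layouts (`fm[h][w][c]` for the feature map, `kernels[l][kh][kw][c]` for the kernels) are illustrative choices, not part of the disclosure.

```python
def unfold_input(fm, kh, kw, stride=1):
    """Return one row per kernel position. Each row lists the covered input
    values in depth-first, then width, then height order."""
    hin, win = len(fm), len(fm[0])
    rows = []
    for top in range(0, hin - kh + 1, stride):
        for left in range(0, win - kw + 1, stride):
            rows.append([value
                         for dy in range(kh)
                         for dx in range(kw)
                         for value in fm[top + dy][left + dx]])
    return rows


def flatten_kernels(kernels):
    """Return a weight matrix with one column per kernel, using the same
    element order as unfold_input."""
    columns = [[value for kernel_row in kernel for point in kernel_row for value in point]
               for kernel in kernels]
    return [list(row) for row in zip(*columns)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def convolve(fm, kernels, stride=1):
    """Convolution as (Wout*Hout rows) x (Cout columns) matrix product."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    return matmul(unfold_input(fm, kh, kw, stride), flatten_kernels(kernels))
```

Each row of the result is one output feature point with depth, and each column corresponds to one convolution kernel.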

In the present disclosure, the convolution operation described above needs only a single instruction, namely the convolution operation instruction, to complete the whole convolution process. It only requires that the input data, weights, and output data be stored in a predetermined manner; the whole convolution is then carried out by a single convolution operation circuit.

Specifically, in the convolution operation circuit of the above embodiment, the input data is row vector data of the first input matrix, and the weight data is row vector data of the convolution kernel. That is, the data in the first input register and the data in the second input register must both be row vector data of a matrix; only then is the computed result the convolution result. Figure 10 is a schematic diagram of this convolution computation, taking as an example a first input matrix of size 3*3*2, two convolution kernels of size 2*2*2, and a stride of 1, which gives an output matrix of size 2*2*2. As shown in Figure 10, during convolution the kernel slides over the first input matrix, and the data read from the first input matrix is as shown in 1001: each row is the data of the first input matrix covered by the kernel at one position, comprising 4 points of depth 2, that is, 8 data in total. Each column in 1002 is the data of one convolution kernel read out row by row in depth-first order, likewise 4 points of depth 2, 8 data in total. One row of 1001 and one column of 1002 are multiplied and accumulated to give one datum in 1003. Each datum in 1003 is one of the 2 values of a depth-2 point of the output matrix, and each row of the output data corresponds to one depth-2 point of the output matrix.

With reference to the PU array shown in Figures 1-3, the following briefly describes how the PU array carries out the convolution computation.

As shown in Figure 10, 1001 is the storage structure of the first input matrix, 1002 is the storage structure of the convolution kernels, 1003 is the storage structure of the output matrix, and 1004 is the data that the control unit reads from 1001 in the order of the convolution operation. Each row of 1004 is the data block covered by the kernel at one sliding position on the first input matrix. After the data is read out, it is sent one by one to the operation units participating in the operation.

For example, the first value of each row of the first input matrix data 1004 is sent to the Rin1 registers of the corresponding PU row: the value 1 from the first row goes to Rin1 of PU11 and PU12, the value 3 from the second row to Rin1 of PU21 and PU22, the value 7 from the third row to Rin1 of PU31 and PU32, and the value 9 from the fourth row to Rin1 of PU41 and PU42. The first value of each column of the convolution kernel data 1002 is sent to the Rin2 registers of the corresponding PU column: the value 0.1 from the first column goes to Rin2 of PU11, PU21, PU31, and PU41, and the value 0.9 from the second column to Rin2 of PU12, PU22, PU32, and PU42. PU11, PU12, PU21, PU22, PU31, PU32, PU41, and PU42 each perform the multiplication and store the result in the output register.

In the next clock cycle the second values are sent in the same way. From 1004, the value 2 from the first row goes to Rin1 of PU11 and PU12, the value 4 from the second row to Rin1 of PU21 and PU22, the value 8 from the third row to Rin1 of PU31 and PU32, and the value 10 from the fourth row to Rin1 of PU41 and PU42. From 1002, the value 0.2 from the first column goes to Rin2 of PU11, PU21, PU31, and PU41, and the value 1.0 from the second column to Rin2 of PU12, PU22, PU32, and PU42. Each PU performs the multiplication and accumulates the result with the value previously saved in the output register.

The process continues in this way until the multiply-accumulate result of the convolution is obtained.
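The data distribution in this walk-through can be sketched as a small simulation: each input value is shared along one PU row and each weight value along one PU column. The function name `run_pu_array` and the list-based model are assumptions made for illustration. The sample call uses only the values quoted in the text for the first two clock cycles.

```python
def run_pu_array(input_rows, weight_columns):
    """Simulate a grid of multiply-accumulate units.

    PU(r, c) receives input_rows[r][t] in Rin1 and weight_columns[c][t] in
    Rin2 at cycle t, and accumulates the product in its output register.
    """
    n_rows, n_cols = len(input_rows), len(weight_columns)
    acc = [[0.0] * n_cols for _ in range(n_rows)]
    n_cycles = len(input_rows[0])
    for t in range(n_cycles):
        for r in range(n_rows):
            for c in range(n_cols):
                acc[r][c] += input_rows[r][t] * weight_columns[c][t]
    return acc


# First two cycles of the Figure 10 walk-through (values quoted in the text only).
partial = run_pu_array(
    input_rows=[[1, 2], [3, 4], [7, 8], [9, 10]],
    weight_columns=[[0.1, 0.2], [0.9, 1.0]],
)
```

After these two cycles, `partial[r][c]` holds the running sum in the output register of the PU in row r+1 and column c+1 of the 4*2 array.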

Figure 11 is a schematic structural diagram of the convolution operation device provided by an embodiment of the present disclosure. As shown in Figure 11, the convolution operation device 1100 includes: a memory 1101 for storing the convolution operation instruction, input data, weight data, and output data; an instruction fetching module 1102, connected to the memory 1101, for fetching the convolution operation instruction from the memory 1101; a decoding module 1103, connected to the instruction fetching module 1102, for decoding the convolution operation instruction fetched by the instruction fetching module 1102; a register 1104 for storing attribute data of the input data, attribute data of the weight data, and attribute data of the output data; and an execution module 1105, connected to the decoding module 1103, the memory 1101, and the register 1104, which includes the convolution operation circuit of the above embodiment and executes the decoded convolution operation instruction.

In one embodiment, the execution module obtains the decoded convolution operation instruction from the decoding module. The execution module obtains the attribute data of the input data, the attribute data of the weight data, and the attribute data of the output data from the register. According to the attribute data of the input data and of the weight data, the execution module fetches the input data and weight data to be used in the computation from the memory. The execution module computes the output data from the input data and weight data according to the decoded convolution operation instruction, and stores the output data in the memory according to the attribute data of the output data.

Here the input data is the data of the first input matrix, the weight data is the data of the convolution kernel, and the output data is the data of the output matrix obtained by the convolution. The attribute data of the input data includes the number of rows, the number of columns, the depth, and the depth storage interval of the first input matrix. The attribute data of the weight data includes the number of rows, the number of columns, the stride, and the storage interval between convolution kernels. The attribute data of the output data includes the number of rows, the number of columns, the depth, and the depth storage interval of the output matrix. The numbers of rows and columns define the size of a matrix. The input data storage interval defines the storage address difference of the input data along the depth, the output data storage interval defines the storage address difference of the output data along the depth, and the storage interval of the convolution kernel defines the storage address difference between two convolution kernels. For example, suppose the first input matrix has 10 int8 matrix elements in the depth direction. If they are stored contiguously, the storage interval is 10 bytes. If they are stored at a larger interval, for example 20 bytes, then 10 bytes are matrix elements and the other 10 bytes do not belong to this matrix; they may be invalid data or data used for other purposes.

Optionally, the execution module fetching the input data and the weight data from the memory according to the attribute data of the input data and the attribute data of the weight data includes: the execution module reads the data of the first input matrix according to a preset first reading mode and the attribute data of the input data; and the execution module reads the data of the convolution kernel according to a preset second reading mode and the attribute data of the weight data. Optionally, the first reading mode is reading by row or reading by column, and the second reading mode is reading by row or reading by column.

For example, suppose the attribute data defines the first input matrix as having 5 rows, 5 columns, a depth of 2, and a depth storage interval of 4, and the preset reading mode is reading by row. The first row of the first input matrix is read according to the start address of the first input matrix in the instruction together with the number of columns, the depth, and the depth storage interval. The number of columns shows that the first row has 5 matrix elements, each of depth 2, so reading 2*5=10 data yields one row of the first input matrix. The start address plus 5 depth storage intervals is then used as the new start address to read the second row, which also has 5 matrix elements. After 5 such reads, the execution module has obtained all the data of the first input matrix.
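The row-by-row read in this example can be sketched as follows. The sketch assumes that addresses and the depth storage interval are counted in data units and models the memory as a flat Python list; the function names `row_addresses` and `read_matrix` are hypothetical.

```python
def row_addresses(base, n_cols, depth, depth_interval):
    """Addresses of one matrix row: n_cols points, `depth` values per point,
    with consecutive points `depth_interval` addresses apart."""
    return [base + col * depth_interval + d
            for col in range(n_cols)
            for d in range(depth)]


def read_matrix(memory, base, n_rows, n_cols, depth, depth_interval):
    """Read the whole matrix row by row from a flat memory list."""
    rows = []
    for _ in range(n_rows):
        rows.append([memory[addr]
                     for addr in row_addresses(base, n_cols, depth, depth_interval)])
        # The next row starts n_cols depth storage intervals further on.
        base += n_cols * depth_interval
    return rows
```

With 5 rows, 5 columns, a depth of 2, and a depth storage interval of 4, each call to `row_addresses` yields 10 addresses and the start address advances by 20 per row.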

The data of the convolution kernel is obtained in the same way. Optionally, the execution module storing the data of the output matrix in the memory according to the attribute data of the output matrix includes: the execution module stores the data of the output matrix in the memory according to a preset storage mode and the attribute data of the output matrix. The preset storage mode is storing by row or storing by column. The storage procedure mirrors the reading procedure in the opposite direction and is not described again here.

Figure 12a is a schematic diagram of the storage order and format of the first input matrix in the present disclosure. It shows an example of the first input matrix of the above embodiment: in memory it is stored depth Cin first, then width Win, and finally height Hin, with a depth interval of S_Cin. Taking Cin=2, Win=3, Hin=3, S_Cin=4 as an example, the first point of the first of the Hin=3 rows is stored first; because the point has depth, it contains 2 data. Since S_Cin=4, the second point is stored at the start address plus 4 data positions and also contains 2 data. The third point is stored at the address of the second point plus 4 data positions and again contains 2 data. Once the third point is stored, the first row holds 3*2=6 data, covering the 3 points in the Win direction. The first point of the second of the Hin=3 rows is stored next, and so on until all points are stored. An example of the storage order and format of the first input matrix is shown in Figure 12b.

Figure 12c is a schematic diagram of the storage order and format of the convolution kernels of the present disclosure. It shows an example of the convolution kernels of the above embodiment: in memory they are stored row by row with Cout (the number of kernels) first, so that each column holds one convolution kernel. Along the column direction, storage proceeds with the kernel depth Cin first, then the width Kw, then the height Kh. Taking Cout=2, Cin=2, Kw=2, Kh=2, S_kernel=2 as an example, Cin*Kw*Kh=8 rows of data are stored in total, with 2 data per row. The first datum of the depth-2 point at the first row and first column of the first kernel is stored first, followed by the first datum of the depth-2 point at the first row and first column of the second kernel; this completes the first row of kernel data. The start address of the second row is the kernel start address plus S_kernel. There the second datum of the depth-2 point at the first row and first column of the first kernel is stored, followed by the second datum of the same point of the second kernel; this completes the second row of kernel data. Storage continues in this way until all points are stored. An example of the storage order and format of the convolution kernels is shown in Figure 12d.
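The kernel layout described above can be expressed as an address calculation. The sketch below assumes addresses counted in data units, a flat list as memory, and zero-based indices; `kernel_address` and `store_kernels` are hypothetical helper names, and `kernels[l][kh][kw][c]` is an illustrative nested-list layout.

```python
def kernel_address(rs2, l, kh, kw, c, n_kw, n_cin, s_kernel):
    """Address of weight (kernel l, height kh, width kw, depth c).

    Rows run depth first, then width, then height; each row holds one value
    per kernel, and consecutive rows are s_kernel addresses apart.
    """
    row = (kh * n_kw + kw) * n_cin + c
    return rs2 + row * s_kernel + l


def store_kernels(kernels, s_kernel, rs2=0):
    """Lay out kernels[l][kh][kw][c] in a flat memory list."""
    n_cout = len(kernels)
    n_kh, n_kw, n_cin = len(kernels[0]), len(kernels[0][0]), len(kernels[0][0][0])
    if s_kernel < n_cout:
        raise ValueError("row interval must hold one value per kernel")
    memory = [None] * (rs2 + n_kh * n_kw * n_cin * s_kernel)
    for l in range(n_cout):
        for kh in range(n_kh):
            for kw in range(n_kw):
                for c in range(n_cin):
                    addr = kernel_address(rs2, l, kh, kw, c, n_kw, n_cin, s_kernel)
                    memory[addr] = kernels[l][kh][kw][c]
    return memory
```

For Cout=2, Cin=2, Kw=2, Kh=2, S_kernel=2 this produces 8 rows of 2 values, with the two kernels interleaved within each row as in the example.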

Because the output matrix can serve as the first input matrix of the next convolution computation, in one embodiment the storage order and format of the output matrix are the same as those of the first input matrix. Figure 12e is a schematic diagram of the storage order and format of the output matrix of the present disclosure.

rs1 in Figure 12a is the start address of the first input matrix. The storage format and the reading mode together control which matrix data is read in. For example, with data stored in depth order as in the example above, reading by row with the depth interval yields the row vector data of the first input matrix. rs2 in Figure 12c is the start address of the convolution kernels. Again the storage format and the reading mode control which matrix data is read in: with data stored in the order of the example above, reading by row with the kernel row interval yields the row vector data of the convolution kernels. Once the data of the first input matrix and the convolution kernels has been read, the convolution operation instruction completes the convolution as a single instruction. rd in Figure 12e is the start address of the output matrix. The output data computed by the convolution operation instruction is stored starting from rd in the manner shown in Figure 12e until all output data is stored.

Before the convolution operation instruction is invoked, the parameters of the convolution operation must be defined. In the present disclosure, the register 1104 stores the convolution parameters and can be used to set the attribute data of the first input matrix, the convolution kernel, and the output matrix. An example configuration of the register 1104 is shown in the following table:

Register name: register function description (bits 31:0)

Conv_FM_in: 31:16 (width Win of FM_in); 15:0 (height Hin of FM_in)

Conv_Depth_in: 31:16 (depth interval S_Cin of FM_in); 15:0 (depth Cin of FM_in)

Conv_FM_out: 31:16 (width Wout of FM_out); 15:0 (height Hout of FM_out)

Conv_Depth_out: 31:16 (depth interval S_Cout of FM_out); 15:0 (depth Cout of FM_out)

Conv_S_kernel: 15:0 (row interval S_kernel of the convolution kernel in physical storage)

Conv_kernel: 31:24 (kernel width Kw); 23:16 (kernel height Kh); 15:8 (kernel dilation N_dilation); 7:0 (convolution sliding-window stride S_K)

Conv_padding: 31:24 (top zero padding N_pad_u); 23:16 (bottom zero padding N_pad_d); 15:8 (left zero padding N_pad_l); 7:0 (right zero padding N_pad_r)
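The bit fields in this table can be illustrated by packing values into 32-bit words. This is only a sketch of the field layout given in the table; the helper names `pack_fields`, `conv_kernel_reg`, `conv_fm_in_reg`, and `conv_padding_reg` are hypothetical, and how software actually writes these registers is not described in the text.

```python
def pack_fields(*fields):
    """Pack (value, width_in_bits) pairs, most significant field first,
    into one 32-bit word."""
    word = 0
    total_width = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
        total_width += width
    if total_width > 32:
        raise ValueError("fields exceed 32 bits")
    return word


def conv_fm_in_reg(win, hin):
    """Conv_FM_in: bits 31:16 = Win, bits 15:0 = Hin."""
    return pack_fields((win, 16), (hin, 16))


def conv_kernel_reg(kw, kh, n_dilation, s_k):
    """Conv_kernel: bits 31:24 = Kw, 23:16 = Kh, 15:8 = N_dilation, 7:0 = S_K."""
    return pack_fields((kw, 8), (kh, 8), (n_dilation, 8), (s_k, 8))


def conv_padding_reg(pad_u, pad_d, pad_l, pad_r):
    """Conv_padding: bits 31:24 = N_pad_u, 23:16 = N_pad_d, 15:8 = N_pad_l, 7:0 = N_pad_r."""
    return pack_fields((pad_u, 8), (pad_d, 8), (pad_l, 8), (pad_r, 8))
```

One consequence of the field widths in the table is that Kw, Kh, N_dilation, S_K, and the padding counts are each limited to 8 bits, while the feature map sizes and depths have 16 bits.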

Here FM_in is the first input matrix, FM_out is the output matrix, and Kernel is the convolution kernel.

During the convolution operation, the convolution operation control circuit computes the addresses of the input data, weight data, and output data in the logical order of the operation. The operation unit array reads the input data and weight data from the memory according to these addresses, performs the computation, and stores the result in the memory at the address of the output data, completing the whole convolution operation.

However, the number of operation units in the operation unit array is limited, for example to M*N, while the output matrix usually has far more than M*N elements. In the convolution operation the input data and weights are therefore usually read in blocks, and the output data is likewise output and stored in blocks. Figure 13 is a schematic diagram of this convolution operation: the input data and weight data are divided into multiple M*N data blocks that are fed to the operation unit array, and the computed results are also divided into multiple M*N data blocks that are output separately.
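The blocking can be sketched as a tiled matrix product in which each pass of an M*N array fills one M*N block of the output matrix. The function names and the order in which blocks are visited are assumptions for illustration; the text does not specify the block order.

```python
def tile_ranges(total, tile):
    """Split range(total) into consecutive (start, stop) blocks of at most `tile`."""
    return [(start, min(start + tile, total)) for start in range(0, total, tile)]


def tiled_matmul(inputs, weights, m, n):
    """Multiply inputs (R x K) by weights (K x C) one M*N output block at a time."""
    n_rows, n_cols, n_inner = len(inputs), len(weights[0]), len(weights)
    out = [[0] * n_cols for _ in range(n_rows)]
    for r0, r1 in tile_ranges(n_rows, m):
        for c0, c1 in tile_ranges(n_cols, n):
            # One pass of the array: each operation unit produces one element
            # of this block by multiply-accumulating a row with a column.
            for r in range(r0, r1):
                for c in range(c0, c1):
                    out[r][c] = sum(inputs[r][k] * weights[k][c]
                                    for k in range(n_inner))
    return out
```

For an output matrix of 16 rows and 4 columns on a 2*2 array, `tile_ranges` gives 8 row blocks and 2 column blocks, that is, 16 passes.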

Figures 14a-14f show an example of the convolution operation circuit performing a convolution operation according to the convolution operation instruction. In this example the parameters of the convolution operation are as follows:

The operation unit array M*N is 2*2;

The attribute data of the input data (first input matrix) is: Win=Hin=Cin=4, S_Cin=8;

The attribute data of the output data (output matrix) is: Wout=Hout=Cout=4, S_Cout=8;

The attribute data of the weight data (convolution kernel) is: S_kernel=8, Kw=Kh=1, N_dilation=1, S_K=1;

The padding parameters are: N_pad_u=N_pad_d=N_pad_l=N_pad_r=0.

The whole convolution is shown schematically in Figure 14a: the 4*4*4 input data is convolved with 4 weight data of size 1*1*4 to give 4*4*4 output data.

The storage format of the input data is shown in Figure 14b: each row stores one point of the input data with depth 4, and the interval between adjacent points is 8. The storage format of the weight data is shown in Figure 14c: each column stores the 4 points of one convolution kernel, each row stores the points of the 4 convolution kernels at the same depth, and the row interval is 8. The storage format of the output data, shown in Figure 14d, is the same as that of the input data: each row stores one point of the output data with depth 4, and the interval between adjacent points is 8.

The actual convolution calculation is shown in Figures 14e and 14f. Specifically, in one clock cycle the convolution operation control circuit reads input data from the first input matrix and sends it to the first input register of an operation unit, and reads weight data from the convolution kernel and sends it to the second input register of the operation unit. In the next clock cycle, the arithmetic unit in the operation unit multiplies the input data in the first input register by the weight data in the second input register and adds the product to the value accumulated by the multiply-accumulate operations of the previous cycles. As shown in Figure 14e, because the operation unit array is 2*2 in size, the input data and weight data are fetched in 2*2 blocks for calculation. The first fetch of data and weights undergoes a matrix multiply-accumulate operation, after which the intermediate data are held temporarily. Each operation unit PU computes the multiply-accumulate of its corresponding row of input data and column of weight data.

For example, once the input data and weight data have been sent to the first and second input registers of the operation units, in the first clock cycle PU11 calculates 1*0.1=0.1. Because 0.1 is only part of the full convolution, it is held in the output register rather than written to the memory. Likewise, PU12 calculates 1*0.5=0.5; PU21 calculates 5*0.1=0.5; PU22 calculates 5*0.5=2.5.

In the second clock cycle, PU11 calculates 2*0.2=0.4 and adds the previous result 0.1 to obtain 0.4+0.1=0.5, which is held in the output register; PU12 calculates 2*0.6=1.2 and adds the previous result 0.5 to obtain 1.2+0.5=1.7, which is held in the output register; PU21 calculates 6*0.2=1.2 and adds the previous result 0.5 to obtain 1.2+0.5=1.7, which is held in the output register; PU22 calculates 6*0.6=3.6 and adds the previous result 2.5 to obtain 3.6+2.5=6.1, which is held in the output register.

In the third clock cycle, PU11 calculates 3*0.3=0.9 and adds the previous result 0.5 to obtain 0.9+0.5=1.4, which is held in the output register; PU12 calculates 3*0.7=2.1 and adds the previous result 1.7 to obtain 2.1+1.7=3.8, which is held in the output register; PU21 calculates 7*0.3=2.1 and adds the previous result 1.7 to obtain 2.1+1.7=3.8, which is held in the output register; PU22 calculates 7*0.7=4.9 and adds the previous result 6.1 to obtain 4.9+6.1=11, which is held in the output register.

In the fourth clock cycle, PU11 calculates 4*0.4=1.6 and adds the previous result 1.4 to obtain 1.6+1.4=3, which is held in the output register; PU12 calculates 4*0.8=3.2 and adds the previous result 3.8 to obtain 3.2+3.8=7, which is held in the output register; PU21 calculates 8*0.4=3.2 and adds the previous result 3.8 to obtain 3.2+3.8=7, which is held in the output register; PU22 calculates 8*0.8=6.4 and adds the previous result 11 to obtain 6.4+11=17.4, which is held in the output register. This yields the output data shown in Figure 14f.
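The four-cycle accumulation in this example can be reproduced with a short simulation. The Python sketch below is illustrative only: the function name `simulate_pu_array` and the list-based data layout are assumptions made for this example and do not appear in the disclosure. It models only the per-cycle multiply-accumulate in each PU's output register, not the register transfers or hardware timing. The operand values are those used in the worked example.

```python
# Operands from the worked example: PU(i, j) pairs input row i with weight column j.
INPUT_ROWS = [[1, 2, 3, 4], [5, 6, 7, 8]]
WEIGHT_COLS = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]


def simulate_pu_array(input_rows, weight_cols):
    """Accumulate one product per PU per clock cycle (hypothetical model).

    Returns the final accumulators and a per-cycle snapshot of every PU's
    output register.
    """
    n_rows, n_cols = len(input_rows), len(weight_cols)
    n_cycles = len(input_rows[0])
    acc = [[0.0] * n_cols for _ in range(n_rows)]  # one output register per PU
    history = []
    for t in range(n_cycles):
        for i in range(n_rows):
            for j in range(n_cols):
                # Multiply the operands held in the two input registers and
                # add the product to the value accumulated so far.
                acc[i][j] += input_rows[i][t] * weight_cols[j][t]
        history.append([row[:] for row in acc])
    return acc, history


final, history = simulate_pu_array(INPUT_ROWS, WEIGHT_COLS)
for cycle, snapshot in enumerate(history, start=1):
    print(cycle, [[round(v, 6) for v in row] for row in snapshot])
```

Up to floating-point rounding, the snapshot after the fourth cycle matches the values 3, 7, 7 and 17.4 derived above for PU11, PU12, PU21 and PU22.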

Finally, all of the output results are stored in the storage area.

It can be understood that each operation unit PU may have multiple first input registers, second input registers, and output registers. The convolution operation control circuit can then alternate between different registers when transferring data to the first and second input registers and when reading data from the output registers, so that data transfer to the registers and computation in the operation unit proceed in parallel, improving calculation efficiency.
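The alternation between registers can be pictured as a two-bank ("ping-pong") scheme. The following Python sketch is a simplified software model under stated assumptions: the function name `double_buffered_mac`, the choice of exactly two banks, and the tuple layout are hypothetical and not taken from the disclosure. Software executes the load and the compute sequentially, so the sketch shows only which bank is written and which is read in each cycle, not true hardware parallelism.

```python
def double_buffered_mac(inputs, weights):
    """Multiply-accumulate using two alternating input-register banks.

    While the arithmetic unit reads the active bank, the control circuit
    would write the next operands into the idle bank (modelled sequentially).
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    if not inputs:
        return 0.0

    banks = [None, None]                 # each bank holds (input, weight)
    banks[0] = (inputs[0], weights[0])   # preload before the first compute cycle
    acc = 0.0
    for t in range(len(inputs)):
        active = t % 2                   # bank read by the arithmetic unit
        idle = 1 - active                # bank written by the control circuit
        if t + 1 < len(inputs):
            banks[idle] = (inputs[t + 1], weights[t + 1])
        x, w = banks[active]
        acc += x * w
    return acc


print(round(double_buffered_mac([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4]), 6))
```

With the PU11 operands from the earlier example, the accumulated value agrees with the result 3 obtained there, up to floating-point rounding.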

The embodiments of the present disclosure also provide a convolution operation method based on any of the foregoing convolution operation circuits, comprising: fetching a convolution operation instruction from a memory; decoding the convolution operation instruction and sending the decoded convolution operation instruction to the convolution operation circuit; and, based on the decoded convolution operation instruction, the convolution operation circuit obtaining input data and weight data from the memory, performing the operation, and storing the operation result in the memory after the operation is completed.
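The fetch, decode, and execute steps of this method can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the disclosed hardware: the tuple instruction encoding, the dictionary used as memory, the opcode string `"conv"`, and all function and class names are hypothetical. The instruction fields (a name plus first addresses for input, weight, and output data) follow the fields listed in the claims.

```python
from dataclasses import dataclass


@dataclass
class ConvInstruction:
    """Decoded instruction: a name plus three first addresses (assumed layout)."""
    name: str
    input_addr: int
    weight_addr: int
    output_addr: int


def fetch(memory, pc):
    """Fetch the raw instruction stored at address pc."""
    return memory[pc]


def decode(raw):
    """Decode a raw (name, input_addr, weight_addr, output_addr) tuple."""
    name, input_addr, weight_addr, output_addr = raw
    if name != "conv":  # hypothetical opcode name
        raise ValueError(f"unsupported instruction: {name}")
    return ConvInstruction(name, input_addr, weight_addr, output_addr)


def execute(instr, memory):
    """Read operands from memory, multiply-accumulate, store the result."""
    rows = memory[instr.input_addr]    # rows of input data
    cols = memory[instr.weight_addr]   # columns of weight data
    memory[instr.output_addr] = [
        [sum(x * w for x, w in zip(row, col)) for col in cols] for row in rows
    ]


def run(memory, pc=0):
    instr = decode(fetch(memory, pc))
    execute(instr, memory)
    return memory[instr.output_addr]


memory = {
    0: ("conv", 100, 200, 300),
    100: [[1, 2, 3, 4], [5, 6, 7, 8]],
    200: [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
}
result = run(memory)
print([[round(v, 6) for v in row] for row in result])
```

A single instruction drives the whole operation here, which mirrors the single-instruction property described for the circuit; a real implementation would also consult the attribute data held in the register.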

An embodiment of the present disclosure also provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that, when running, the processors implement any of the convolution operation methods of the foregoing embodiments.

An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions being used to cause a computer to execute any of the convolution operation methods of the foregoing embodiments.

An embodiment of the present disclosure provides a computer program product including computer instructions which, when executed by a computing device, enable the computing device to perform any of the convolution operation methods of the foregoing embodiments.

An embodiment of the present disclosure provides a chip including the convolution operation circuit of any of the foregoing embodiments.

An embodiment of the present disclosure provides a computing device including the chip of any of the foregoing embodiments.

The flowcharts and block diagrams in the drawings of the present disclosure illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.

The functions described above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in combination with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Claims

A convolution operation circuit, including:

A convolution operation control circuit;

An operation unit array, the operation unit array including a plurality of operation units;

The convolution operation control circuit is configured to receive a convolution operation instruction and, according to the operation sequence indicated by the convolution operation instruction, to transmit the input data and weight data one by one to the operation units participating in the operation among the plurality of operation units, so that the operation units participating in the operation perform a convolution operation on the input data and the weight data as indicated by the instruction, wherein the instruction is a single instruction.

3. The convolution operation circuit according to claim 1, wherein the convolution operation instruction includes an instruction name, a first address of input data, a first address of weight data, and a first address of output data.

The convolution operation circuit according to claim 3, wherein:

The operation unit includes an arithmetic unit, and the arithmetic unit includes at least a multiplier and an adder;

The operation unit is configured to combine the multiplier and the adder according to the convolution operation instruction to perform a convolution operation.

The convolution operation circuit according to claim 3, wherein the operation unit participating in the operation performing a convolution operation on the input data and the weight data as indicated by the instruction comprises:

Calculating the product of the input data and the weight data by the multiplier;

Calculating the accumulated value of the product by the adder;

The accumulated value is output.

The convolution operation circuit according to claim 4, wherein:

The input data is row vector data of the first input matrix;

The weight data is row vector data of the convolution kernel.

A convolution operation device, including:

Memory, used to store convolution operation instructions, input data, weight data and output data;

An instruction fetching module, connected to the memory, and configured to acquire the convolution operation instruction from the memory;

A decoding module, connected to the instruction fetching module, and configured to decode the convolution operation instruction acquired by the instruction fetching module;

Register for storing the attribute data of the input data, the attribute data of the weight data, and the attribute data of the output data;

The execution module is connected to the decoding module, the memory and the register, and includes the convolution operation circuit according to claims 1-6, which is used to execute the decoded convolution operation instruction.

The convolution operation device according to claim 6, wherein:

The execution module obtains the decoded convolution operation instruction from the decoding module;

The execution module obtains the attribute data of the input data, the attribute data of the weight data, and the attribute data of the output data from the register;

The execution module obtains the input data and the weight data for calculation from the memory according to the attribute data of the input data and the attribute data of the weight data;

The execution module calculates the input data and weight data according to the decoded convolution operation instruction to obtain output data;

The execution module stores the output data in the memory according to the attribute data of the output data.

The convolution operation device according to claim 7, wherein:

The input data is data of a first input matrix, the weight data is data of a convolution kernel, and the output data is data of an output matrix obtained by convolution;

The attribute data of the input data includes the number of rows, the number of columns, the depth, and the depth storage interval of the first input matrix;

The attribute data of the weight data includes the number of rows, the number of columns, the step size of the convolution kernel, and the row storage interval of the convolution kernel;

The attribute data of the output data includes the number of rows, the number of columns, the depth, and the depth storage interval of the output matrix.

The convolution operation device according to claim 7, wherein the execution module obtaining the input data and the weight data for calculation from the memory according to the attribute data of the input data and the attribute data of the weight data includes:

The execution module reads the data of the first input matrix according to the preset first reading method and the attribute data of the input data;

The execution module reads the data of the convolution kernel according to the preset second reading method and the attribute data of the weight data.

The convolution operation device according to claim 9, wherein:

The first reading method is reading by row or reading by column;

The second reading method is reading by row or reading by column.

A convolution operation method based on the convolution operation circuit of any one of claims 1 to 5, characterized in that it comprises:

Fetching the convolution operation instruction from the memory;

Decoding the convolution operation instruction, and sending the decoded convolution operation instruction to the convolution operation circuit;

Based on the decoded convolution operation instruction, the convolution operation circuit obtains input data and weight data from the memory and performs operations, and stores the operation result in the memory after the operation is completed.