CN106203617B

CN106203617B - A kind of acceleration processing unit and array structure based on convolutional neural networks

Info

Publication number: CN106203617B
Application number: CN201610482653.7A
Authority: CN
Inventors: 宋博扬; 赵秋奇; 马芝; 刘记朋; 韩宇菲; 王明江
Original assignee: SHENZHEN INTEGRATED CIRCUIT DESIGN INDUSTRIALIZATION BASE ADMINISTRATION CENTER; Harbin Institute of Technology Shenzhen
Current assignee: SHENZHEN INTEGRATED CIRCUIT DESIGN INDUSTRIALIZATION BASE ADMINISTRATION CENTER; Harbin Institute of Technology Shenzhen
Priority date: 2016-06-27
Filing date: 2016-06-27
Publication date: 2018-08-21
Anticipated expiration: 2036-06-27
Also published as: CN106203617A

Abstract

The invention discloses an accelerated processing unit based on a convolutional neural network, which is used for performing convolution operations on local data, the local data includes a plurality of multimedia data, and the accelerated processing unit includes a first register, a second register, and a second register. Three registers, a fourth register, a fifth register, a multiplier, an adder, and a first multiplexer and a second multiplexer. A single accelerated processing unit controls the first multiplexer and the second multiplexer, so that the multiplier and the adder can be reused, so that an accelerated processing unit only needs one multiplier and one adder to complete Convolution operation reduces the use of multipliers and adders. When implementing the same convolution operation, reducing the use of multipliers and adders will increase processing speed and reduce energy consumption. At the same time, the single acceleration processing unit has a smaller on-chip area .

Description

An accelerated processing unit and array structure based on convolutional neural network

技术领域technical field

本发明涉及卷积神经网络，具体涉及卷积神经网络的卷积层中的加速处理单元及阵列结构。The invention relates to a convolutional neural network, in particular to an accelerated processing unit and an array structure in a convolutional layer of the convolutional neural network.

背景技术Background technique

深度学习(deep learning)相对于浅层学习，是指机器通过算法，从历史数据中学习规律，并对事物作出智能识别和预测。Compared with shallow learning, deep learning refers to the machine learning rules from historical data through algorithms, and intelligently identifying and predicting things.

卷积神经网络(Convolutional Neural Network，CNN)属于deep learning的一种，其发明于1980年代初，由多层排列的人工神经元组成，卷积神经网络反映出了人类大脑处理视觉的方法。随着摩尔定律推动着计算机技术越来越强大，卷积神经网络能够更好的模仿生物神经网络的实际运作方式，避免了对图像的复杂前期预处理，可以直接输入原始图像，因而得到了更为广泛的应用，目前已成功应用于手写字符识别、人脸识别、人眼检测、行人检测和机器人导航中。Convolutional Neural Network (CNN) is a kind of deep learning. It was invented in the early 1980s and consists of artificial neurons arranged in multiple layers. The convolutional neural network reflects the way the human brain processes vision. As Moore's Law drives computer technology to become more and more powerful, convolutional neural networks can better imitate the actual operation of biological neural networks, avoid complex pre-processing of images, and can directly input original images, thus obtaining better results. For a wide range of applications, it has been successfully used in handwritten character recognition, face recognition, human eye detection, pedestrian detection and robot navigation.

卷积神经网络的基本机构中包括多个卷积层，每层由多个二维平面组成，而每个平面由多个独立神经元组成。每个神经元用于对多媒体数据的局部数据进行卷积运算，并且其一输入端还与前一卷积层的局部感受野相连，通过对前一卷积层的局部感受野的数据进行卷积运算，以提取该局部感受野的特征。The basic mechanism of convolutional neural network includes multiple convolutional layers, each layer is composed of multiple two-dimensional planes, and each plane is composed of multiple independent neurons. Each neuron is used to perform convolution operations on the local data of the multimedia data, and one of its input terminals is also connected to the local receptive field of the previous convolutional layer. product operation to extract the features of the local receptive field.

现有技术中，通常也采用加速处理单元来作为神经元，对多媒体数据的局部数据进行卷积运算。现有的加速处理单元对输入的每个多媒体数据都设计有一个加法器和一个乘法器，当该加速处理单元需要处理的局部数据有多个时，意味着每个加速处理单元包括多个加法器和多个乘法器，这种设计导致加速处理单元片的面积较大，功耗大，处理速度也有待提高。In the prior art, an accelerated processing unit is usually used as a neuron to perform convolution operations on partial data of multimedia data. The existing accelerated processing unit is designed with an adder and a multiplier for each input multimedia data. When the accelerated processing unit needs to process multiple local data, it means that each accelerated processing unit includes multiple additions. and multiple multipliers, this design leads to a larger area of the accelerated processing unit chip, a large power consumption, and a processing speed that needs to be improved.

发明内容Contents of the invention

本申请提供一种基于卷积神经网络的加速处理单元,用于对局部数据进行卷积运算，所述局部数据包括多个多媒体数据，所述加速处理单元包括第一寄存器、第二寄存器、第三寄存器、第四寄存器、第五寄存器、乘法器、加法器和第一多路选择器和第二多路选择器；The present application provides an accelerated processing unit based on a convolutional neural network, which is used to perform convolution operations on local data, the local data includes a plurality of multimedia data, and the accelerated processing unit includes a first register, a second register, and a second register. Three registers, a fourth register, a fifth register, a multiplier, an adder, and a first multiplexer and a second multiplexer;

第一寄存器用于输入多媒体数据，其输出端与乘法器的输入端连接，将多媒体数据发送到乘法器；The first register is used for inputting multimedia data, and its output end is connected with the input end of multiplier, and multimedia data is sent to multiplier;

第二寄存器用于输入滤波器权值，其输出端与乘法器的输入端连接，将滤波器权值发送到乘法器；The second register is used to input the filter weight value, and its output terminal is connected with the input terminal of the multiplier, and the filter weight value is sent to the multiplier;

乘法器用于将多媒体数据和滤波器权值相乘，其输出端与第三寄存器连接，将相乘后的结果发送到第三寄存器；The multiplier is used to multiply the multimedia data and the filter weight, and its output terminal is connected to the third register, and the multiplied result is sent to the third register;

第三寄存器的输出端与第一多路选择器的第一端连接；The output end of the third register is connected with the first end of the first multiplexer;

所述第一多路选择器的第二端连接加法器，第三端为前一加速处理单元的部分和输入端，所述第一多路选择器通过状态切换将第三寄存器和加法器连接，或将前一加速处理单元的部分和输入端和加法器连接；The second end of the first multiplexer is connected to the adder, and the third end is part and input of the previous acceleration processing unit, and the first multiplexer connects the third register and the adder through state switching , or connect the part of the previous accelerated processing unit with the input end and the adder;

所述加法器还与第五寄存器和第四寄存器连接，用于将第一多路选择器传送的相乘后的结果或前一加速处理单元的部分和与第五寄存器中的数据进行加法运算，并将相加后的结果输出到第四寄存器；The adder is also connected with the fifth register and the fourth register, and is used for adding the multiplied result transmitted by the first multiplexer or the partial sum of the previous acceleration processing unit to the data in the fifth register , and output the added result to the fourth register;

所述第二多路选择器的第一端和第二端分别连接第四寄存器和第五寄存器，所述第四寄存器通过第二多路选择器连接到第五寄存器。The first terminal and the second terminal of the second multiplexer are respectively connected to the fourth register and the fifth register, and the fourth register is connected to the fifth register through the second multiplexer.

优选地，所述第一多路选择器在加速处理单元未完成局部数据的乘加操作时保持第一状态，将第三寄存器连接到加法器，在加速处理单元完成局部数据的乘加操作后切换为第二状态，将前一加速处理单元的部分和输入端连接到加法器。Preferably, the first multiplexer maintains the first state when the acceleration processing unit has not completed the multiplication and addition operation of the local data, and connects the third register to the adder, and after the acceleration processing unit completes the multiplication and addition operation of the local data To switch to the second state, connect the part and the input of the previous accelerated processing unit to the adder.

优选地，所述第二多路选择器在加速处理单元未完成局部数据的乘加操作时保持为第一状态，将第四寄存器连接到第五寄存器，在加速处理单元完成局部数据的乘加操作后切换为第二状态，以将第五寄存器清零。Preferably, the second multiplexer remains in the first state when the acceleration processing unit has not completed the multiplication and addition operation of the local data, connects the fourth register to the fifth register, and completes the multiplication and addition of the local data in the acceleration processing unit switch to the second state after the operation to clear the fifth register.

优选地，所述第二多路选择器的第三端为重置端，所述第二多路选择器在加速处理单元完成局部数据的乘加操作后切换为第二状态，将重置端连接到第五寄存器。Preferably, the third terminal of the second multiplexer is a reset terminal, and the second multiplexer switches to the second state after the acceleration processing unit completes the multiplication and addition operation of local data, and the reset terminal connected to the fifth register.

优选地，还包括第一存储器、第二存储器和第三存储器，所述第一存储器与第一寄存器的输入端连接，用于输入并存储需要进行卷积运算的局部数据，并将局部数据中的多个多媒体数据依次发送给第一寄存器；所述第二存储器与第二寄存器的输入端连接，用于输入并存储滤波器权值，并将滤波器权值发送给第二寄存器；所述第三存储器与第四寄存器的输入端连接，用于输入并存储加法器输出的相加后的结果，并将相加后的结果发送给第四寄存器。Preferably, it also includes a first memory, a second memory and a third memory, the first memory is connected to the input end of the first register, and is used to input and store the local data that needs to be convoluted, and convert the local data into A plurality of multimedia data are sent to the first register in turn; the second memory is connected to the input of the second register for inputting and storing the filter weight, and sending the filter weight to the second register; The third memory is connected to the input terminal of the fourth register, and is used for inputting and storing the added result output by the adder, and sending the added result to the fourth register.

优选地，其特征在于，所述加法器还将相加后的结果输出到后一加速处理单元。Preferably, it is characterized in that the adder also outputs the added result to the next accelerated processing unit.

本申请提供还一种基于卷积神经网络的阵列结构，包括多个所述加速处理单元，多个加速处理单元呈现为M行N列的矩阵形态，其中M和N为大于或等于1的整数，每一列的加速处理单元前后相连。The present application also provides an array structure based on a convolutional neural network, including multiple accelerated processing units, and the multiple accelerated processing units are in the form of a matrix with M rows and N columns, where M and N are integers greater than or equal to 1 , the accelerated processing units of each column are connected back and forth.

优选地，每一列中，前一加速处理单元的加法器的输出端连接后一加速处理单元的第一多路选择器的第三端。Preferably, in each column, the output terminal of the adder of the preceding accelerated processing unit is connected to the third terminal of the first multiplexer of the subsequent accelerated processing unit.

优选地，同一行的加速处理单元中，输入的滤波器权值相同；位于同一对角线上的加速处理单元中，输入的局部数据相同。Preferably, in the accelerated processing units in the same row, the input filter weights are the same; in the accelerated processing units located in the same diagonal line, the input local data are the same.

优选地，不同行的加速处理单元中，输入的滤波器权值不同。Preferably, in the accelerated processing units of different rows, input filter weights are different.

本发明的有益效果是：单个加速处理单元通过对第一多路选择器和第二多路选择器的控制，使得乘法器和加法器可重复使用，从而使得一个加速处理单元只需要一个乘法器和一个加法器即可完成卷积运算，减少了乘法器和加法器的使用，在实现同样的卷积运算时，减少乘法器和加法器的使用将会提高处理速度并降低能耗，同时单个加速处理单元片上面积更小。The beneficial effects of the present invention are: a single accelerated processing unit controls the first multiplexer and the second multiplexer, so that the multiplier and the adder can be reused, so that one accelerated processing unit only needs one multiplier The convolution operation can be completed with one adder, reducing the use of multipliers and adders. When implementing the same convolution operation, reducing the use of multipliers and adders will increase processing speed and reduce energy consumption. At the same time, a single Accelerated processing units have smaller on-chip areas.

附图说明Description of drawings

图1为本发明实施例提供的一种基于卷积神经网络的加速处理单元结构框图；FIG. 1 is a structural block diagram of an accelerated processing unit based on a convolutional neural network provided by an embodiment of the present invention;

图2为本发明实施例提供的一种基于卷积神经网络的加速处理单元的卷积运算过程示意图；2 is a schematic diagram of a convolution operation process of an accelerated processing unit based on a convolutional neural network provided by an embodiment of the present invention;

图3为本发明实施例一种基于卷积神经网络的阵列结构列向分布示意图；3 is a schematic diagram of column distribution of an array structure based on a convolutional neural network according to an embodiment of the present invention;

图4为本发明实施例一种基于卷积神经网络的阵列结构行向分布示意图；FIG. 4 is a schematic diagram of row distribution of an array structure based on a convolutional neural network according to an embodiment of the present invention;

图5为本发明实施例一种基于卷积神经网络的阵列结构对角线分布示意图。FIG. 5 is a schematic diagram of a diagonal distribution of an array structure based on a convolutional neural network according to an embodiment of the present invention.

具体实施方式Detailed ways

下面通过具体实施方式结合附图对本发明的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明的一部分实施例，而不是全部的实施例。The technical solutions of the present invention will be clearly and completely described below through specific embodiments in conjunction with the accompanying drawings. Apparently, the described embodiments are only some of the embodiments of the present invention, rather than all of them.

实施例一：Embodiment one:

请参考图1，本实施例提供一种基于卷积神经网络的加速处理单元，加速处理单元61包括第一寄存器21、第二寄存器22、第三寄存器23、第四寄存器24、第五寄存器25、乘法器41、加法器51和第一多路选择器31和第二多路选择器 32。Please refer to FIG. 1 , the present embodiment provides an accelerated processing unit based on a convolutional neural network. The accelerated processing unit 61 includes a first register 21, a second register 22, a third register 23, a fourth register 24, and a fifth register 25. , a multiplier 41 , an adder 51 , and the first multiplexer 31 and the second multiplexer 32 .

第一寄存器21与乘法器41的一输入端连接，第一寄存器21用于输入多媒体数据，并将多媒体数据发送到乘法器41。第二寄存器22与乘法器41的另一输入端连接，第二寄存器22用于输入滤波器权值，并将滤波器权值发送到乘法器41。乘法器41的输出端与第三寄存器23连接，用于将多媒体数据和滤波器权值相乘，并将相乘后的结果发送到第三寄存器23。The first register 21 is connected to an input end of the multiplier 41 , and the first register 21 is used for inputting multimedia data and sending the multimedia data to the multiplier 41 . The second register 22 is connected to the other input terminal of the multiplier 41 , and the second register 22 is used to input the filter weight and send the filter weight to the multiplier 41 . The output terminal of the multiplier 41 is connected to the third register 23 for multiplying the multimedia data and the filter weight, and sending the multiplied result to the third register 23 .

第一多路选择器31的第一端与第三寄存器23的输出端连接，第二端连接加法器51的一输入端，第三端为前一加速处理单元的部分和输入端。当第一多路选择器31切换到第一状态(例如置0)时，第一多路选择器31将第三寄存器 23和加法器51连接，将第三寄存器23中的数据发送到加法器51；当第一多路选择器31切换到第二状态(例如置1)时，第一多路选择器31将其第三端和加法器51连接，将前一加速处理单元的部分和发送到加法器51。The first terminal of the first multiplexer 31 is connected to the output terminal of the third register 23 , the second terminal is connected to an input terminal of the adder 51 , and the third terminal is part and input terminal of the previous accelerated processing unit. When the first multiplexer 31 is switched to the first state (such as setting 0), the first multiplexer 31 connects the third register 23 and the adder 51, and sends the data in the third register 23 to the adder 51; when the first multiplexer 31 is switched to the second state (for example, set to 1), the first multiplexer 31 connects its third terminal to the adder 51, and sends the part sum of the previous accelerated processing unit to adder 51.

加法器51的另一输入端与第五寄存器25连接，加法器51的输出端与第四寄存器24连接，加法器51输入第三寄存器23和第五寄存器25中的数据，将两寄存器中的数据进行加法运算，并将相加后的结果(也称为内部部分和)输出到第四寄存器24。The other input end of adder 51 is connected with the 5th register 25, and the output end of adder 51 is connected with the 4th register 24, and adder 51 inputs the data in the 3rd register 23 and the 5th register 25, with the data in the two registers. The data is added, and the added result (also called internal partial sum) is output to the fourth register 24 .

第二多路选择器32的第一端和第二端分别连接第四寄存器24和第五寄存器25，第二多路选择器32的第三端为重置端。当第二多路选择器32切换到第一状态(例如置0)时，第二多路选择器32将第四寄存器24和第五寄存器25 连接，将第四寄存器24中的内部部分和发送到第五寄存器25；当第二多路选择器32切换到第二状态(例如置1)时，第二多路选择器32将其第三端和第五寄存器25连接，重置第五寄存器25，使第五寄存器25中数据清零。The first terminal and the second terminal of the second multiplexer 32 are connected to the fourth register 24 and the fifth register 25 respectively, and the third terminal of the second multiplexer 32 is a reset terminal. When the second multiplexer 32 is switched to the first state (for example, set to 0), the second multiplexer 32 will connect the fourth register 24 and the fifth register 25, and send the internal part in the fourth register 24 to To the fifth register 25; when the second multiplexer 32 switches to the second state (for example, setting 1), the second multiplexer 32 connects its third terminal to the fifth register 25, and resets the fifth register 25. Clear the data in the fifth register 25 to zero.

在有的实施例中，为方便向寄存器发送数据，加速处理单元61还包括第一存储器11、第二存储器12和第三存储器13，第一存储器11与第一寄存器21 的输入端连接，用于输入并存储需要进行卷积运算的局部数据，并将局部数据中的多个多媒体数据依次发送给第一寄存器21；第二存储器12与第二寄存器 22的输入端连接，用于输入并存储滤波器权值，并将滤波器权值发送给第二寄存器22；第三存储器13与第四寄存器24的输入端连接，用于输入并存储加法器51输出的内部部分和，并将内部部分和发送给第四寄存器24。In some embodiments, for the convenience of sending data to the registers, the accelerated processing unit 61 further includes a first memory 11, a second memory 12, and a third memory 13, the first memory 11 is connected to the input end of the first register 21 for It is used to input and store the local data that needs to be convoluted, and send multiple multimedia data in the local data to the first register 21 in sequence; the second memory 12 is connected to the input end of the second register 22 for input and storage filter weight, and send the filter weight to the second register 22; the third memory 13 is connected to the input end of the fourth register 24, and is used to input and store the internal part sum of the adder 51 output, and the internal part and sent to the fourth register 24.

加速处理单元61用于对局部数据进行卷积运算，局部数据包括多个多媒体数据，多媒体数据可以是视频数据，图像数据，也可以是音频数据。当多媒体数据是视频数据时，可以认为每个多媒体数据对应一个像素。The acceleration processing unit 61 is used for performing convolution operation on partial data, the partial data includes a plurality of multimedia data, and the multimedia data may be video data, image data, or audio data. When the multimedia data is video data, it can be considered that each multimedia data corresponds to one pixel.

下面以图像数据为例，说明加速处理单元61的卷积运算过程。Taking image data as an example, the convolution operation process of the accelerated processing unit 61 will be described below.

结合图1和图2，单个基于卷积神经网络的加速处理单元61的工作过程如下：1 and 2, the working process of a single convolutional neural network-based accelerated processing unit 61 is as follows:

步骤10，读取需要进行卷积运算的视频数据和滤波器权值。若图像数据不为0，图像数据被存储于第一存储器11，在需要的时候被发送至第一寄存器21 用来提取图像数据，若图像数据为0，图像数据0直接被发送至第一寄存器21 而无需提取，采取跳过或门控的策略避免非必要的读取和计算；滤波器权值被存储于第二存储器12，在需要的时候被发送至第二寄存器22用来滤波器权值数据，其中，数据提取方式为依次串行提取，即在第一个循环中，将由本加速处理单元61进行卷积运算的局部数据中的第一个图像数据发送到第一寄存器21；在第二个循环中，将第二个图像数据发送到第一寄存器21，后面依次读入图像数据。滤波器权值由处理器根据卷积算法要求产生。Step 10, read the video data and filter weights that need to be convolved. If the image data is not 0, the image data is stored in the first memory 11, and sent to the first register 21 to extract the image data when needed, and if the image data is 0, the image data 0 is directly sent to the first register 21 without extracting, skipping or gating strategy is adopted to avoid unnecessary reading and calculation; the filter weight value is stored in the second memory 12, and is sent to the second register 22 for filter weight when needed Value data, wherein, the data extraction method is serial extraction in sequence, that is, in the first cycle, the first image data in the local data that is convoluted by the acceleration processing unit 61 is sent to the first register 21; In the second cycle, the second image data is sent to the first register 21, and then the image data is sequentially read in. The filter weights are generated by the processor according to the requirements of the convolution algorithm.

步骤20，乘法运算。第一寄存器21中的图像数据和第二寄存器22中的滤波器权值被发送到乘法器41中执行乘法运算，经过乘法器41相乘后的结果被输出到第三寄存器23。Step 20, multiplication operation. The image data in the first register 21 and the filter weight in the second register 22 are sent to the multiplier 41 for multiplication, and the multiplied result by the multiplier 41 is output to the third register 23 .

步骤30，加法运算。由于加速处理单元61中的乘加操作还没有结束，此时第一多路选择器31置0，当第一多路选择器31被置0时，第三寄存器23中图像数据被发送到加法器51中，加法器51将图像数据与第五寄存器25中的前一次内部部分和相加。对于第一次的内部卷积操作，第五寄存器25中为零，对于后续的内部卷积操作，第五寄存器25中为前一次卷积操作后的内部部分和。本次卷积操作中相加后的结果(即内部部分和)被输出到第四寄存器24，此时便完成了一次内部卷积操作，最终得到第一个图像数据和滤波器权值的部分和。由于加速处理单元61中的乘加操作还没有结束，此时第二多路选择器32置0，当第二多路选择器32被置0时，内部部分和被第四寄存器24发送至第五寄存器25。Step 30, addition operation. Since the multiplication and addition operation in the acceleration processing unit 61 has not yet ended, the first multiplexer 31 is set to 0, and when the first multiplexer 31 is set to 0, the image data in the third register 23 is sent to the addition In the register 51, the adder 51 adds the image data to the previous internal partial sum in the fifth register 25. For the first internal convolution operation, the fifth register 25 is zero, and for the subsequent internal convolution operation, the fifth register 25 is the internal partial sum after the previous convolution operation. The result of the addition in this convolution operation (ie, the internal partial sum) is output to the fourth register 24, and an internal convolution operation is completed at this time, and finally the first image data and the part of the filter weight are obtained and. Since the multiplication and addition operation in the acceleration processing unit 61 has not yet ended, the second multiplexer 32 is set to 0 at this moment, and when the second multiplexer 32 is set to 0, the internal part sum is sent to the first by the fourth register 24 Five Register 25.

步骤40，本加速处理单元61判断是否完成了所有局部数据的内部卷积操作，若内部卷积操作未完成时，将依次重复步骤10、步骤20和步骤30，提取第二个图像数据，并输入到第一寄存器21，第二个滤波器权值输入到第二寄存器22，第一寄存器21中的图像数据和第二寄存器22中的滤波器权值都发送到乘法器 41中，图像数据和滤波器权值相乘，得出的结果由乘法器41发送到第三寄存器中，由于加速处理单元61中的乘加操作还没有结束，此时第一多路选择器31 置0，第三寄存器23中的数据通过第一多路选择器发送到加法器51中，与来自第五寄存器25中的数据求和，最终得到第二次的图像数据和滤波器权值的部分和。加法器51中的部分和发送到第四寄存器24中，此时，由于加速处理单元中的乘加操作还没有结束，此时第二多路选择器32置0，第四寄存器24中的数据通过第二多路选择器31发送到第五寄存器25中。从而完成对第二个图像数据的内部卷积操作。直到提取局部数据的最后一个图像数据，该图像信息和滤波器权值经过上述的相乘和相加操作后，得到本加速处理单元的部分和，该部分和通过与前述相同的操作，最终进入第五寄存器25中。当所有内部卷积操作已完成时，则将进行步骤50。Step 40, the acceleration processing unit 61 judges whether the internal convolution operation of all local data is completed, if the internal convolution operation is not completed, step 10, step 20 and step 30 will be repeated in sequence to extract the second image data, and Input to the first register 21, the second filter weight is input to the second register 22, the image data in the first register 21 and the filter weight in the second register 22 are all sent in the multiplier 41, the image data Multiplied with the filter weight, the result obtained is sent to the third register by the multiplier 41. Since the multiplication and addition operation in the acceleration processing unit 61 has not yet ended, the first multiplexer 31 is set to 0 at this moment, and the first multiplexer 31 is set to 0. The data in the third register 23 is sent to the adder 51 through the first multiplexer, and summed with the data in the fifth register 25 to finally obtain the partial sum of the second image data and the filter weight. The partial sum in the adder 51 is sent in the fourth register 24. At this time, because the multiplication and addition operation in the accelerated processing unit has not yet ended, the second multiplexer 32 is set to 0 at this time, and the data in the fourth register 24 sent to the fifth register 25 through the second multiplexer 31. Thus completing the inner convolution operation on the second image data. Until the last image data of the local data is extracted, the image information and the filter weights undergo the above-mentioned multiplication and addition operations to obtain the partial sum of this accelerated processing unit, and this partial sum is finally entered into in the fifth register 25. When all inner convolution operations have been completed, then step 50 will proceed.

步骤50，当加速处理单元61中的对局部数据的乘加操作结束后，第一多路选择器31和第二多路选择器32置1，当第一多路选择器32被置1时，前一个加速处理单元中的部分和通过第一多路选择器31发送到加法器51中，第五寄存器25将本加速处理单元61中的最终部分和发送到加法器51中，最后，前一个加速处理单元的部分和与本加速处理单元61中的部分和求和，得到两个加速处理单元叠加的部分和，该叠加的部分和输出，发送到下一个加速处理单元。当第二多路选择器32将由状态0被置1，第四寄存器24不再向第五寄存器25 发送数据，并且第五寄存器25中的数据将被清零。Step 50, after the multiplication and addition operation on the local data in the accelerated processing unit 61 ends, the first multiplexer 31 and the second multiplexer 32 are set to 1, and when the first multiplexer 32 is set to 1 , the part sum in the previous acceleration processing unit is sent to the adder 51 by the first multiplexer 31, and the fifth register 25 sends the final part sum in the acceleration processing unit 61 to the adder 51, and finally, the previous The partial sum of one accelerated processing unit is summed with the partial sum of this accelerated processing unit 61 to obtain the superimposed partial sum of the two accelerated processing units, and the output of the superimposed partial sum is sent to the next accelerated processing unit. When the second multiplexer 32 will be set to 1 from state 0, the fourth register 24 will no longer send data to the fifth register 25, and the data in the fifth register 25 will be cleared.

本实施例中，单个加速处理单元通过对第一多路选择器31和第二多路选择器32的控制，使得乘法器41和加法器51可重复使用，从而使得一个加速处理单元只需要一个乘法器和一个加法器即可完成卷积运算，减少了乘法器和加法器的使用，在实现同样的卷积运算时，减少乘法器和加法器的使用将会提高处理速度并降低能耗，同时单个加速处理单元片上面积更小。In this embodiment, a single accelerated processing unit controls the first multiplexer 31 and the second multiplexer 32 to make the multiplier 41 and the adder 51 reusable, so that one accelerated processing unit only needs one A multiplier and an adder can complete the convolution operation, reducing the use of multipliers and adders. When implementing the same convolution operation, reducing the use of multipliers and adders will increase processing speed and reduce energy consumption. At the same time, the area on a single accelerated processing unit chip is smaller.

实施例二：Embodiment two:

请参考图3至图5，示出了一种基于卷积神经网络的阵列结构，包括多个所述加速处理单元，多个加速处理单元呈现为M行N列的矩阵形态，其中M和N 为大于或等于1的整数，每一列的加速处理单元前后相连。Please refer to Figures 3 to 5, which show an array structure based on a convolutional neural network, including multiple accelerated processing units, and the multiple accelerated processing units are in the form of a matrix with M rows and N columns, where M and N is an integer greater than or equal to 1, and the accelerated processing units of each column are connected successively.

本实施例中，多个加速处理单元呈现为3行3列的矩阵形态，每一列中，前一加速处理单元的加法器的输出端连接后一加速处理单元的第一多路选择器的第三端。In this embodiment, multiple accelerated processing units are in the form of a matrix of 3 rows and 3 columns. In each column, the output end of the adder of the previous accelerated processing unit is connected to the first multiplexer of the next accelerated processing unit. Three ends.

同一行的加速处理单元中，输入的滤波器权值相同；位于同一对角线上的加速处理单元中，输入的局部数据相同。In the accelerated processing units in the same row, the input filter weights are the same; in the accelerated processing units located in the same diagonal line, the input local data are the same.

不同行的加速处理单元中，输入的滤波器权值不同。In the accelerated processing units of different rows, the input filter weights are different.

下面结合附图，说明多个加速处理单元的卷积层运算过程。The following describes the convolutional layer operation process of multiple acceleration processing units with reference to the accompanying drawings.

结合图1至图5，基于卷积神经网络的阵列结构的运算过程如下：Combining Figures 1 to 5, the operation process of the convolutional neural network-based array structure is as follows:

如图3所示，前一个加速处理单元的加法器51连接后一个加速处理单元的第一多路选择器31,每一行输出的部分和都垂直移动，将前后两个加速处理单元的部分和累加，在计算过程结束时可以在顶行被读出，在下一个计算过程的开始由缓冲器送到阵列的底行。As shown in Figure 3, the adder 51 of the preceding acceleration processing unit is connected to the first multiplexer 31 of the subsequent acceleration processing unit, and the partial sum output by each row is moved vertically, so that the partial sums of the two accelerated processing units before and after are moved vertically. Accumulations, which can be read out on the top row at the end of a computation pass, are buffered into the bottom row of the array at the beginning of the next computation pass.

例如，加速处理单元PE1.1、加速处理单元PE2.1和加速处理单元PE3.1，先分别进行内部卷积运算，最终结果存储在各自的第五寄存器25中，然后，加速处理单元PE3.1中输出的部分和与加速处理单元PE2.1中第五寄存器25的部分和在加速处理单元PE2.1的加法器51中再次求和累加，得到第一次累加的部分和，所述第一次累加的部分和由加速处理单元PE2.1发送到加速处理单元 PE1.1中，与加速处理单元PE1.1中第五寄存器25的部分和在加速处理单元 PE1.1的加法器51中再次求和，最终输出本列所有加速处理单元1的部分和。For example, the accelerated processing unit PE1.1, the accelerated processing unit PE2.1, and the accelerated processing unit PE3.1 first perform internal convolution operations respectively, and the final results are stored in their respective fifth registers 25, and then the accelerated processing unit PE3. The partial sum output in 1 and the partial sum of the fifth register 25 in the accelerated processing unit PE2.1 are summed and accumulated again in the adder 51 of the accelerated processed unit PE2.1 to obtain the first accumulated partial sum. The part of one accumulation is sent to the acceleration processing unit PE1.1 by the acceleration processing unit PE2.1, and the part of the fifth register 25 in the acceleration processing unit PE1.1 and in the adder 51 of the acceleration processing unit PE1.1 Sum again, and finally output the partial sum of all accelerated processing units 1 in this column.

还需指出的是，如图4和图5所示，同一行的加速处理单元中，输入的滤波器权值相同，位于同一对角线上的加速处理单元中，输入的图像数据相同，不同行的加速处理单元中，输入的滤波器权值不同。由于整个图像数据有若干行，而每个加速处理单元只是处理整个图像数据中的单行数据，因此就需要将加速处理单元分别处理完每一行数据再对每一行数据的卷积结果进行累加操作。同一对角线上的输入数据相同，不同对角线上的输入图像数据不同，相当于不同对角线上的输入的图像数据是图像数据的不同行数据。而处理不同行的图像数据需要不同的滤波器权值，例如一个滤波器权值只是用来处理第一行的图像数据，当要处理第二行的图像数据的时候，就需要用新的滤波器权值。因此可使同一行的加速处理单元采用相同的滤波器权值，不同行的加速处理单元采用不同的滤波器权值。It should also be pointed out that, as shown in Figure 4 and Figure 5, in the accelerated processing units in the same row, the input filter weights are the same, and in the accelerated processing units located on the same diagonal line, the input image data are the same, but different In the accelerated processing units of the row, the input filter weights are different. Since the entire image data has several rows, and each accelerated processing unit only processes a single row of data in the entire image data, the accelerated processing unit needs to process each row of data separately and then accumulate the convolution results of each row of data. The input data on the same diagonal are the same, and the input image data on different diagonals are different, which means that the input image data on different diagonals are different rows of image data. Processing different rows of image data requires different filter weights. For example, a filter weight is only used to process the first row of image data. When processing the second row of image data, a new filter weight is required. device weight. Therefore, accelerated processing units in the same row can use the same filter weight, and accelerated processing units in different rows can use different filter weights.

例如，加速处理单元PE1.1、加速处理单元PE1.2和加速处理单元PE1.3中的滤波器权值相同，加速处理单元PE2.1和加速处理单元PE1.2中输入的图像数据相同，加速处理单元PE1.1、加速处理单元PE2.2和加速处理单元PE3.1中的滤波器权值不相同。For example, the filter weights in the accelerated processing unit PE1.1, the accelerated processing unit PE1.2 and the accelerated processing unit PE1.3 are the same, and the input image data in the accelerated processing unit PE2.1 and the accelerated processing unit PE1.2 are the same, The filter weights in the accelerated processing unit PE1.1, the accelerated processing unit PE2.2 and the accelerated processing unit PE3.1 are different.

如此实现了，同时处理一行的多媒体数据，再对不同行的多媒体数据使用不同的滤波器权值，在分别处理完每一行数据后，在对前后各行多媒体数据进行累加操作，从而快速、可靠地处理全部的多媒体数据。In this way, one line of multimedia data is processed at the same time, and then different filter weights are used for different lines of multimedia data. After each line of data is processed separately, the multimedia data of the previous and subsequent lines is accumulated, so as to quickly and reliably Handle all multimedia data.

以上应用了具体个例对本发明进行阐述，只是用于帮助理解本发明，并不用以限制本发明。对于本发明所属技术领域的技术人员，依据本发明的思想，还可以做出若干简单推演、变形或替换。The above uses specific examples to illustrate the present invention, which is only used to help understand the present invention, and is not intended to limit the present invention. For those skilled in the technical field to which the present invention belongs, some simple deduction, deformation or replacement can also be made according to the idea of the present invention.

Claims

1. a kind of acceleration processing unit based on convolutional neural networks, for carrying out convolution algorithm, the part to local data Data include multiple multi-medium datas, it is characterised in that are posted including the first register, the second register, third register, the 4th Storage, the 5th register, multiplier, adder and the first multiple selector and the second multiple selector；

First register sends out multi-medium data for inputting multi-medium data, the input terminal connection of output end and multiplier It is sent to multiplier；

Second register is used for input filter weights, and filter weights are sent out in the input terminal connection of output end and multiplier It is sent to multiplier；

Multiplier is for multi-medium data to be multiplied with filter weights, and output end is connect with third register, after multiplication Result be sent to third register；

The output end of third register is connect with the first end of the first multiple selector；

The second end of first multiple selector connects adder, and third end is part and the input of previous acceleration processing unit End, first multiple selector is switched by state connects third register with adder, or previous acceleration is handled list The part of member is connected with input terminal with adder；

The adder is also connect with the 5th register and the 4th register, after the multiplication for transmitting the first multiple selector Result or previous acceleration processing unit part and carry out add operation with the data in the 5th register, and after will add up As a result it is output to the 4th register；

The first end and second end of second multiple selector is separately connected the 4th register and the 5th register, and the described 4th Register is connected to the 5th register by the second multiple selector.

2. acceleration processing unit as described in claim 1, which is characterized in that first multiple selector is accelerating processing single First state is kept when the multiply-add operation of the unfinished local data of member, third register is connected to adder, is handled in acceleration It is switched to the second state after the multiply-add operation of unit completion local data, the part of previous acceleration processing unit and input terminal are connected It is connected to adder.

3. acceleration processing unit as described in claim 1, which is characterized in that second multiple selector is accelerating processing single First state is remained when the multiply-add operation of the unfinished local data of member, the 4th register is connected to the 5th register, is being added It is switched to the second state after the multiply-add operation of fast processing unit completion local data, the 5th register is reset.

4. acceleration processing unit as claimed in claim 3, which is characterized in that attach most importance at the third end of second multiple selector End is set, second multiple selector is switched to the second state after the multiply-add operation that acceleration processing unit completes local data, Resetting end is connected to the 5th register.

5. acceleration processing unit according to any one of claims 1 to 4, which is characterized in that further include first memory, Two memories and third memory, the first memory are connect with the input terminal of the first register, are needed for inputting and storing The local data of convolution algorithm is carried out, and multiple multi-medium datas in local data are sent to the first register successively； The second memory is connect with the input terminal of the second register, is weighed for inputting and storing filter weights, and by filter Value is sent to the second register；The third memory is connect with the input terminal of the 4th register, for inputting and storing addition Device output after being added as a result, and the result after will add up be sent to the 4th register.

6. acceleration processing unit according to any one of claims 1 to 4, which is characterized in that the adder also will add up Result afterwards is output to latter acceleration processing unit.

7. a kind of array structure based on convolutional neural networks, which is characterized in that including multiple as any in claim 1 to 6 Acceleration processing unit described in, multiple acceleration processing units are rendered as the matrix shape of M rows N row, wherein M and N be more than or Integer equal to 1 is connected before and after the acceleration processing unit of each row.

8. array structure as claimed in claim 7, which is characterized in that in each row, the adder of previous acceleration processing unit Output end connect latter acceleration processing unit the first multiple selector third end.

9. array structure as claimed in claim 8, which is characterized in that in the acceleration processing unit of a line, the filtering of input Device weights are identical；In the acceleration processing unit on same diagonal line, the local data of input is identical.

10. array structure as claimed in claim 9, which is characterized in that in the acceleration processing unit that do not go together, the filtering of input Device weights are different.