
WO2020199476A1 - Neural network acceleration method and apparatus based on systolic array, and computer device and storage medium - Google Patents


Info

Publication number
WO2020199476A1
WO2020199476A1 · PCT/CN2019/103137 · CN2019103137W
Authority
WO
WIPO (PCT)
Prior art keywords
convolution
filter
feature
feature map
convolved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/103137
Other languages
French (fr)
Chinese (zh)
Inventor
郭跃超
高鹏
谢国彤
唐义君
张萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2020199476A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • This application relates to the technical field of neural networks, and in particular to a neural network acceleration method, device, computer equipment, and storage medium based on a systolic array.
  • In the prior art, the convolution result is generally computed with a convolution stride of 1, after which the unneeded convolution results are discarded by down-sampling to obtain the feature map for a specific convolution stride. This obviously wastes calculation and scheduling resources, and it also slows down the convolution calculation.
  • the embodiments of this application provide a systolic array-based neural network acceleration method, apparatus, computer device, and storage medium, which better solve the problem of systolic array calculation and scheduling resources being wasted in convolution calculations whose stride is not 1.
  • this application provides a neural network acceleration method based on a systolic array, the method including:
  • acquiring convolution parameters of a convolution filter, the convolution parameters including a convolution stride and the size of the convolution filter;
  • if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, segmenting a number of sub-filters from the convolution filter according to a preset filter segmentation rule, the size of each sub-filter being smaller than the size of the convolution filter;
  • acquiring a feature map to be convolved, and segmenting a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule, the feature sub-maps corresponding one-to-one with the sub-filters;
  • performing, based on the systolic array, a convolution calculation on the corresponding feature sub-map according to each sub-filter, the stride of the convolution calculation being 1;
  • superimposing the convolution calculation results corresponding to each of the sub-filters, and outputting the superimposed result as the result of convolving the feature map to be convolved with the convolution filter.
  • the present application provides a neural network acceleration device based on a systolic array, the device including:
  • a convolution parameter acquisition module configured to acquire convolution parameters of a convolution filter, where the convolution parameters include a convolution step size and a size of the convolution filter;
  • the filter segmentation module is configured to: if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, segment a number of sub-filters from the convolution filter according to a preset filter segmentation rule, the size of each sub-filter being smaller than the size of the convolution filter;
  • the feature map segmentation module is configured to acquire a feature map to be convolved and segment a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule, the feature sub-maps corresponding one-to-one with the sub-filters;
  • the convolution module is configured to perform convolution calculation on the corresponding feature sub-maps according to each of the sub-filters based on the systolic array, and the step size of the convolution calculation is 1;
  • the superposition module is configured to superimpose the convolution calculation results corresponding to each of the sub-filters, and output the superposition result as the convolution calculation result of the feature map to be convolved by the convolution filter.
  • the present application provides a computer device that includes a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program and, when executing it, implement the above-mentioned systolic array-based neural network acceleration method.
  • the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the above-mentioned systolic array-based neural network acceleration method is implemented.
  • This application discloses a neural network acceleration method, device, equipment and storage medium based on a systolic array.
  • When the convolution stride is not 1, a number of sub-filters are segmented from the convolution filter according to a preset filter segmentation rule, and a number of feature sub-maps are segmented from the feature map to be convolved according to a preset feature map segmentation rule.
  • The convolution calculation can then be performed with a convolution stride of 1, and the convolution calculation results corresponding to the sub-filters are superimposed.
  • The superimposed result is the same as the result of convolving the feature map to be convolved with the original convolution filter at the original stride (which is not 1); that is, the convolution calculations before and after the segmentation operation are equivalent. Because the stride after the segmentation operation is 1, the computing power of the systolic array can be used more fully.
  • FIG. 1 is a schematic flowchart of a neural network acceleration method based on a systolic array according to an embodiment of the application;
  • FIG. 2 is a schematic flowchart of a neural network acceleration method based on a systolic array according to another embodiment of the application;
  • FIG. 3 is a schematic diagram of a segmented convolution filter when the convolution stride is 2 and the convolution filter size is 2×2;
  • FIG. 4 is a schematic diagram of segmenting the feature map to be convolved when the convolution stride is 2 and the convolution filter size is 2×2;
  • FIG. 5 is a schematic diagram of a sub-process of an embodiment of segmenting the feature map to be convolved in FIG. 1;
  • FIG. 6 is a schematic diagram of a feature map to be convolved after segmentation and zero padding;
  • FIG. 7 is a schematic diagram of the structure of a systolic array;
  • FIG. 8 is a schematic diagram of convolution calculation performed by a systolic array;
  • FIG. 9 is a schematic diagram of a sub-process of performing convolution calculation based on a systolic array in FIG. 1;
  • FIG. 10 is a schematic flowchart of a neural network acceleration method based on a systolic array according to another embodiment of the application;
  • FIG. 11 is a schematic diagram of a segmented convolution filter when the convolution stride is 2 and the convolution filter size is 3×3;
  • FIG. 12 is a schematic diagram of a sub-process of segmenting the convolution filter when the convolution stride is 2 and the convolution filter size is 3×3;
  • FIG. 13 is a schematic diagram of segmenting the feature map to be convolved when the convolution stride is 2 and the convolution filter size is 3×3;
  • FIG. 14 is a schematic flowchart of a neural network acceleration method based on a systolic array according to another embodiment of the application;
  • FIG. 15 is a schematic diagram of a segmented convolution filter when the convolution stride is 3 and the convolution filter size is 3×3;
  • FIG. 16 is a schematic diagram of a sub-process of segmenting the feature map to be convolved when the convolution stride is 3 and the convolution filter size is 3×3;
  • FIG. 17 is a schematic diagram of segmenting the feature map to be convolved when the convolution stride is 3 and the convolution filter size is 3×3;
  • FIG. 18 is a schematic diagram of an equivalent transformation of the down-sampling topology of a deep convolutional neural network according to the neural network acceleration method;
  • FIG. 19 is a schematic structural diagram of a neural network acceleration device based on a systolic array according to an embodiment of the application;
  • FIG. 20 is a schematic structural diagram of a neural network acceleration device based on a systolic array according to another embodiment of the application;
  • FIG. 21 is a schematic structural diagram of a computer device provided by an embodiment of this application.
  • the embodiments of the present application provide a neural network acceleration method, device, equipment and storage medium based on a systolic array.
  • the systolic array-based neural network acceleration method can be applied to a terminal or a server to accelerate the training or inference of the systolic array-based neural network.
  • FIG. 1 is a schematic flowchart of a neural network acceleration method based on a systolic array provided by an embodiment of the present application.
  • the neural network acceleration method based on systolic array includes the following steps:
  • Step S110 Obtain convolution parameters of the convolution filter.
  • the convolution parameter includes the convolution step size and the size of the convolution filter.
  • A filter, also known as a kernel or feature detector, is slid over the input image or feature map; computing the dot product at each position is the convolution operation, and the output matrix of the convolution operation is called the convolved feature (Convolved Feature), activation map (Activation Map), or feature map (Feature Map).
  • the pre-stored or initialized convolution parameters of the convolution filter are first obtained.
  • the convolution parameters include the convolution stride (stride) and the size of the convolution filter, that is, the height h and width w of the convolution filter; in other embodiments, the convolution parameters also include the number of input channels and/or the number of output channels. The number of input channels (in_depth) is determined by the number of channels of the feature map to be convolved, and the number of output channels (out_depth) is equal to the number of convolution filters, which determines the number of channels of the feature map output after convolution.
  • the systolic array-based neural network acceleration method can be used in scenarios where the number of input channels is equal to or greater than 1, and can also be used in scenarios where the number of output channels is equal to or greater than 1.
  • Step S120 If the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, segment a number of sub-filters from the convolution filter according to a preset filter segmentation rule.
  • the size of each sub-filter is smaller than the size of the convolution filter.
  • Some deep learning accelerators, such as FPGAs and dedicated NPUs, usually implement the convolution part with a systolic array structure, but this structure is very unfriendly to cases where the convolution stride is not equal to 1. This embodiment divides a convolution filter larger than 1×1 into several sub-filters, so that each sub-filter performs a convolution operation with a stride equal to 1, thereby making full use of the performance of the systolic array structure.
  • step S120, segmenting a number of sub-filters from the convolution filter according to a preset filter segmentation rule if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, specifically includes:
  • Step S121 If the convolution stride is 2 and the size of the convolution filter is 2×2, 4 sub-filters are segmented from the convolution filter, and the size of each sub-filter is 1×1.
  • For example, the convolution parameters corresponding to a certain convolution operation are [6 6 2 2], that is, the number of input channels (in_depth) is 6, the number of output channels (out_depth) is 6, and the sizes of the 6 convolution filters (KernelTensor) are all 2×2.
  • each 2×2 convolution filter is divided into 4 1×1 sub-filters.
  • the first sub-filter includes the weight w1 of the odd rows and odd columns of the convolution filter;
  • the second sub-filter includes the weight w2 of the odd rows and even columns of the convolution filter;
  • the third sub-filter includes the weight w3 of the even rows and odd columns of the convolution filter;
  • the fourth sub-filter includes the weight w4 of the even rows and even columns of the convolution filter.
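  • The filter segmentation rule above can be sketched in NumPy (an illustration only, not the patented implementation; array names are hypothetical): slicing the 2×2 kernel by row and column parity yields the four 1×1 sub-filters w1 to w4.

```python
import numpy as np

# A 2x2 convolution filter; the four weights play the roles of w1..w4.
kernel = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# Slice by (row parity, column parity): each slice is a 1x1 sub-filter.
sub_filters = [kernel[r::2, c::2] for r in (0, 1) for c in (0, 1)]
# sub_filters[0] -> [[1.]] (odd row, odd column, i.e. w1)
# sub_filters[3] -> [[4.]] (even row, even column, i.e. w4)
```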
  • Step S130 Obtain a feature map to be convolved, and segment a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule.
  • the plurality of feature sub-images correspond to the plurality of sub-filters one to one.
  • the number of channels of the feature map to be convolved may be equal to or greater than 1, and the number of channels of the feature map to be convolved may determine the number of input channels in the convolution parameter of the corresponding convolution filter.
  • step S130, acquiring the feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to the preset feature map segmentation rule, specifically includes:
  • Step S1311 Assign the values of the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature submap.
  • Step S1312 Assign the values of the odd rows and even columns of the feature map to be convolved to the corresponding positions of the second feature submap.
  • Step S1313 Assign the values of the even rows and odd columns of the feature map to be convolved to the corresponding positions of the third feature submap.
  • Step S1314 Assign the values of the even rows and even columns of the feature map to be convolved to the corresponding positions of the fourth feature submap.
  • the values in the same row in the feature map to be convolved are also located in the same row in each feature submap, and the values in the same column in the feature map to be convolved are also located in the same column in each feature submap.
  • the acquired feature map to be convolved, inputTensor, is a feature map of shape [1 6 4 4].
  • the number of channels of the feature map to be convolved is 6, and the width and height are both 4.
  • 4 feature submaps are segmented from the feature map to be convolved.
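  • The feature map segmentation rule of steps S1311 to S1314 can be sketched as follows (a NumPy illustration with hypothetical names); parity slicing preserves the stated property that values sharing a row or column in the original map stay in the same row or column of each sub-map.

```python
import numpy as np

# One 4x4 channel of the feature map to be convolved (values are arbitrary).
fmap = np.arange(16, dtype=float).reshape(4, 4)

# Steps S1311-S1314 for stride 2: split by (row parity, column parity).
sub_maps = [fmap[r::2, c::2] for r in (0, 1) for c in (0, 1)]
# sub_maps[0] holds the odd-row/odd-column values (1-based in the text),
# sub_maps[3] the even-row/even-column values; each sub-map is 2x2.
```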
  • step S130 acquiring a feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes:
  • Step S131 Obtain a feature map to be convolved.
  • the acquired convolution feature map is shown in FIG. 6.
  • Step S132 If the length or width of the acquired feature map to be convolved is not an integer multiple of the convolution stride, perform zero padding at preset positions of the feature map to be convolved, so that the length or width of the zero-padded feature map to be convolved is an integer multiple of the convolution stride.
  • For example, the length and width of the acquired feature map to be convolved are both 3; zeros are exemplarily padded to the right of and below the feature map to be convolved, so that the length and width of the zero-padded feature map to be convolved are both 4.
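  • The zero-padding step can be sketched as follows (an illustrative helper, not taken from the application; `pad_to_stride` is a hypothetical name, and the padding positions are assumed to be the right and bottom as in the example above):

```python
import numpy as np

def pad_to_stride(fmap, stride):
    """Append zero rows/columns until height and width are multiples of stride."""
    h, w = fmap.shape
    pad_h = (-h) % stride  # rows to add at the bottom
    pad_w = (-w) % stride  # columns to add at the right
    return np.pad(fmap, ((0, pad_h), (0, pad_w)))  # constant zeros by default

padded = pad_to_stride(np.ones((3, 3)), 2)  # 3x3 -> 4x4 for stride 2
```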
  • Step S133 According to a preset feature map segmentation rule, a number of feature sub-maps are segmented from the feature map to be convolved after zero padding.
  • the feature maps to be convolved for different channels under the same number can be divided and convolved first, and then the feature maps to be convolved under the next number and different channels can be divided and convolved.
  • Step S140 based on the systolic array, perform convolution calculation on the corresponding feature submap according to each of the subfilters, and the step size of the convolution calculation is 1.
  • The core concept of the systolic array (Systolic Array) is to let data flow through an array of arithmetic units, reducing the number of memory accesses and making the structure more regular, the wiring more uniform, and the operating frequency higher.
  • the systolic array includes L×L processing units (PE), and the systolic array is connected to a weight register (filter buffer), an input register (in buffer), and an output register (out buffer).
  • the left side of each row of processing units PE and the upper side of each column of processing units PE are provided with first-in first-out (FIFO) registers.
  • the weight of the filter is stored in the first-in first-out register FIFO and transmitted to all processing units PE in the same row.
  • the processing units PE in the first row and the first column receive data from the feature map to be convolved in the input register, and each processing unit PE in the first row and the first column transmits the feature-map data onward to the processing unit PE at its lower right. This design maximizes data reuse.
  • the systolic array performs two-dimensional convolution on a 5 ⁇ 5 feature map X according to a 3 ⁇ 3 filter W.
  • wi and xj respectively represent a certain row of data of filter W and feature map X
  • the three processing units PE in the last row output three rows of convolution results:
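  • The row-wise dataflow described above can be imitated in software (a simplified sketch, not a cycle-accurate model of the array): the 2D convolution of the 5×5 map X with the 3×3 filter W decomposes into 1D correlations of filter rows wi with feature-map rows xj, which is the accumulation each PE row performs as data flows through it.

```python
import numpy as np

X = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 feature map
W = np.ones((3, 3))                           # 3x3 filter (illustrative weights)

out = np.zeros((3, 3))                        # 5-3+1 = 3 output rows/columns
for i in range(3):        # output row i
    for r in range(3):    # filter row w_r paired with feature-map row x_{i+r}
        # 1D correlation of one filter row with one map row, accumulated.
        out[i] += np.correlate(X[i + r], W[r], mode="valid")
```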
  • step S140, performing, based on the systolic array, a convolution calculation on the corresponding feature sub-map according to each of the sub-filters, specifically includes:
  • Step S141 Load the weight of the sub-filter into the weight register connected to the systolic array.
  • the weight of the sub-filter is loaded into the weight register filter buffer, and the weight of the sub-filter is stored in the first-in first-out register FIFO and transmitted to the processing unit PE in the same row.
  • Step S142 Load the characteristic sub-map corresponding to the sub-filter into the input register connected to the systolic array.
  • the feature submap corresponding to the subfilter is loaded into the input register in buffer, and the processing unit PE in the first row and the first column of the systolic array receives data from the feature submap in the input register in buffer.
  • Step S143 Obtain the output result of the systolic array convolution calculation.
  • the processing units PE in the first row and the first column of the systolic array each transmit data from the feature sub-map to the processing unit PE at their lower right; the processing units PE in the last row output the result of performing, with a convolution stride of 1, the convolution calculation of the sub-filter on its corresponding feature sub-map.
  • the weight of the first sub-filter is w1, and it performs convolution calculation on the corresponding first feature sub-map in FIG. 4;
  • the weight of the second sub-filter is w2, and it performs convolution calculation on the second feature sub-map;
  • the weight of the third sub-filter is w3, and it performs convolution calculation on the third feature sub-map;
  • the weight of the fourth sub-filter is w4, and it performs convolution calculation on the fourth feature sub-map.
  • the results of the convolution calculation corresponding to the first to fourth sub-filters are as follows:
  • Step S150 Superimpose the convolution calculation results corresponding to each of the sub-filters, and output the superimposed result as the result of the convolution calculation of the feature map to be convolved by the convolution filter.
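  • The equivalence underlying step S150 can be checked numerically (a sketch under the stride-2, 2×2-filter assumption of FIG. 3; function and variable names are illustrative): the direct stride-2 convolution equals the superposition of the four stride-1 1×1 sub-convolutions.

```python
import numpy as np

def conv2d(x, k, stride):
    """Plain valid 2D convolution (cross-correlation) with the given stride."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 feature map
k = np.array([[1.0, 2.0], [3.0, 4.0]])         # 2x2 filter (w1..w4)

direct = conv2d(x, k, stride=2)                # stride-2 convolution, 2x2 output
# Superposition of the four 1x1 stride-1 sub-convolutions on the sub-maps.
superposed = sum(k[r, c] * x[r::2, c::2] for r in (0, 1) for c in (0, 1))
assert np.allclose(direct, superposed)         # the two computations agree
```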
  • The neural network acceleration method based on the systolic array of this embodiment segments a number of sub-filters from the convolution filter according to a preset filter segmentation rule when the convolution stride is not 1, and segments a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule.
  • the convolution calculation can be performed with a convolution step length of 1, and the convolution calculation results corresponding to each sub-filter are superimposed.
  • the convolution filter's convolution calculation result for the feature map to be convolved is output for subsequent processing such as further convolution, pooling, and classification; and because the convolution stride after the segmentation operation is 1, the computing power of the systolic array can be fully utilized.
  • step S120, segmenting a number of sub-filters from the convolution filter according to a preset filter segmentation rule if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, includes:
  • Step S122 If the convolution stride is 2 and the size of the convolution filter is 3×3, 4 sub-filters are segmented from the convolution filter, and the size of each sub-filter is 2×2.
  • the first subfilter includes the weights of odd rows and odd columns of the convolution filter
  • the second subfilter includes the weights of odd rows and even columns of the convolution filter
  • the third sub-filter includes the weights of the even rows and odd columns of the convolution filter.
  • the fourth sub-filter includes the weights of the even rows and even columns of the convolution filter.
  • When the size of the convolution filter (kernel) is not an integer multiple of the convolution stride, zero padding can be performed at preset positions of the convolution filter so that the length or width of the zero-padded convolution filter is an integer multiple of the convolution stride.
  • For example, the size of the convolution filter is 3×3 and the convolution stride is 2, so the size of the convolution filter is not an integer multiple of the convolution stride. The zero-padding operation can be used to make the length or width of the zero-padded convolution filter an integer multiple of the convolution stride, so that a number of sub-filters can be segmented from the convolution filter according to the preset filter segmentation rule.
  • Segmenting 4 sub-filters, each of size 2×2, from the convolution filter specifically includes:
  • Step S11 Assign weights of odd rows and odd columns of the convolution filter to the first subfilter.
  • Step S12 Assign the weights of the odd rows and even columns of the convolution filter to the first column of the second sub-filter, and fill the second column of the second sub-filter with 0.
  • Step S13 Assign the weights of the even rows and odd columns of the convolution filter to the first row of the third sub-filter, and fill the second row of the third sub-filter with 0.
  • Step S14 Assign the weights of the even rows and even columns of the convolution filter to the first row and first column of the fourth subfilter, and fill the remaining positions of the fourth subfilter with 0.
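  • Steps S11 to S14 can be sketched as follows (an illustration; the function name is hypothetical): parity-sliced pieces of the 3×3 filter are placed in the top-left of 2×2 sub-filters, and the remaining positions are zero-filled.

```python
import numpy as np

def split_3x3_stride2(k):
    """Split a 3x3 filter (stride 2) into four zero-padded 2x2 sub-filters."""
    subs = []
    for r in (0, 1):
        for c in (0, 1):
            part = k[r::2, c::2]                        # one parity class of weights
            sub = np.zeros((2, 2))
            sub[:part.shape[0], :part.shape[1]] = part  # remaining positions stay 0
            subs.append(sub)
    return subs

k = np.arange(1.0, 10.0).reshape(3, 3)
s1, s2, s3, s4 = split_3x3_stride2(k)
# s1 = [[1, 3], [7, 9]]   (odd rows, odd columns; full 2x2)
# s2 = [[2, 0], [8, 0]]   (odd rows, even columns; second column zero-filled)
# s3 = [[4, 6], [0, 0]]   (even rows, odd columns; second row zero-filled)
# s4 = [[5, 0], [0, 0]]   (even rows, even columns; rest zero-filled)
```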
  • step S130, acquiring the feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to the preset feature map segmentation rule, specifically includes:
  • Step S1321 Allocate the values of odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature submap.
  • Step S1322 assign the values of odd rows and even columns of the feature map to be convolved to the corresponding positions of the second feature submap.
  • Step S1323 Assign the values of the even rows and odd columns of the feature map to be convolved to the corresponding positions of the third feature submap.
  • Step S1324 Assign the values of the even rows and even columns of the feature map to be convolved to the corresponding positions of the fourth feature submap.
  • step S120, segmenting a number of sub-filters from the convolution filter according to the preset filter segmentation rule if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, specifically includes:
  • Step S123 If the convolution stride is 3 and the size of the convolution filter is 3×3, 9 sub-filters are segmented from the convolution filter; the size of each sub-filter is 1×1, and each includes one of the 9 weights of the convolution filter.
  • step S130, acquiring the feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to the preset feature map segmentation rule, specifically includes:
  • Step S1331 Assign the value of the 3n+1th row and 3n+1th column of the feature map to be convolved to the corresponding position of the first feature submap.
  • n is a natural number.
  • Step S1332 assign the value in the 3n+1th row and 3n+2th column of the feature map to be convolved to the corresponding position of the second feature submap.
  • Step S1333 Assign the values in the 3n+1th row and 3n+3th column of the feature map to be convolved to the corresponding position of the third feature submap.
  • Step S1334 Assign the value in the 3n+2th row and 3n+1th column of the feature map to be convolved to the corresponding position of the fourth feature submap.
  • Step S1335 Assign the values in the 3n+2th row and 3n+2th column of the feature map to be convolved to the corresponding position of the fifth feature submap.
  • Step S1336 Assign the value in the 3n+2th row and 3n+3th column of the feature map to be convolved to the corresponding position of the sixth feature submap.
  • Step S1337 Assign the values in the 3n+3th row and 3n+1th column of the feature map to be convolved to the corresponding position of the seventh feature submap.
  • Step S1338 Assign the value in the 3n+3th row and 3n+2th column of the feature map to be convolved to the corresponding position of the eighth feature submap.
  • Step S1339 Assign the value in the 3n+3th row and 3n+3th column of the feature map to be convolved to the corresponding position of the ninth feature submap.
  • In this example, the length and width of the acquired feature map to be convolved are both 8, which is not an integer multiple of the convolution stride 3; zero padding is therefore performed at preset positions of the feature map to be convolved, so that the length and width of the zero-padded feature map to be convolved are integer multiples of the convolution stride. Then, according to the preset feature map segmentation rule, 9 feature sub-maps are segmented from the zero-padded feature map to be convolved.
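  • For the stride-3 case, the nine 1×1 sub-filters and nine sub-maps reduce the superposition to a weighted sum (a NumPy sketch with illustrative values; padding positions are assumed to be the right and bottom): each sub-map takes rows 3n+r and columns 3n+c of the zero-padded map.

```python
import numpy as np

# 8x8 feature map, zero-padded to 9x9 (a multiple of the stride 3).
x = np.pad(np.arange(64, dtype=float).reshape(8, 8), ((0, 1), (0, 1)))
k = np.arange(1.0, 10.0).reshape(3, 3)  # 3x3 filter; its 9 weights are the sub-filters

# Superpose the nine 1x1 stride-1 sub-convolutions: weight k[r, c] times
# the sub-map x[r::3, c::3] (steps S1331-S1339).
superposed = sum(k[r, c] * x[r::3, c::3] for r in range(3) for c in range(3))
# superposed equals the stride-3 convolution of x with k (a 3x3 output here).
```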
  • The systolic array-based neural network acceleration method of the present application segments several sub-filters from the convolution filter according to the preset filter segmentation rule when the convolution stride is not 1, and segments several feature sub-maps from the feature map to be convolved according to the preset feature map segmentation rule, so that the convolution calculation can be performed with a convolution stride of 1. It adapts well to the underlying systolic array (Systolic Array) structure of special-purpose deep network accelerators such as FPGAs and NPUs, saving computing resources; and this segmentation method is itself a special computing logic that can be integrated into various deep learning frameworks.
  • the segmentation transformation method provided in this application does not affect the forward and backward paths of the deep network itself, and because it saves computing resources, it actually improves the speed of training and inference.
  • FIG. 18 is a schematic diagram of transforming the down-sampling topology of the traditional deep convolutional neural network ResNet50 according to the systolic array-based neural network acceleration method of the present application; the left side of the arrow is a simplified model of the down-sampling topology of the traditional deep convolutional neural network ResNet50, and the right side of the arrow is the calculation topology after the equivalent segmentation transformation.
  • the calculation topology after the equivalent segmentation transformation has the following advantages: 1. it eliminates the two 1×1 mapping convolutions on the left side of the traditional ResNet50, reducing computing resources; 2. the residual component on the right side of the traditional ResNet50 can be converted into a direct identity mapping (Identity Mapping), which is conducive to the propagation of residuals.
  • the systolic array-based neural network acceleration method of this application can be applied to many network models, such as DenseNet or a Shake-Shake network; as long as a down-sampling part exists in the network, the neural network acceleration method provided in this application can be used for the transformation, after which calculation and training can be performed.
  • FIG. 19 is a schematic structural diagram of a systolic array-based neural network acceleration device provided by an embodiment of the present application.
  • the systolic array-based neural network acceleration device can be configured in a server to execute the aforementioned systolic array-based neural network acceleration method.
  • the systolic array-based neural network acceleration device includes:
  • the convolution parameter obtaining module 110 is configured to obtain convolution parameters of a convolution filter, where the convolution parameters include a convolution step size and a size of the convolution filter.
  • the filter segmentation module 120 is configured to: if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, segment a number of sub-filters from the convolution filter according to a preset filter segmentation rule; the size of each sub-filter is smaller than the size of the convolution filter.
  • the feature map segmentation module 130 is configured to obtain a feature map to be convolved and segment a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule, the number of feature sub-maps and the number of sub-filters One to one correspondence.
  • the convolution module 140 is configured to perform convolution calculation on the corresponding feature sub-maps according to each of the sub-filters based on the systolic array, and the step size of the convolution calculation is 1.
  • the superposition module 150 is configured to superimpose the convolution calculation results corresponding to each of the sub-filters, and output the superimposition result as the convolution calculation result of the feature map to be convolved by the convolution filter.
  • the feature map segmentation module 130 includes:
  • the feature map acquiring sub-module 131 is used to acquire the feature map to be convolved.
  • the zero padding sub-module 132 is configured to, if the length or width of the acquired feature map to be convolved is not an integer multiple of the convolution step size, perform zero padding at preset positions of the feature map to be convolved so that the length or width of the zero-padded feature map is an integer multiple of the convolution step size.
  • the feature map segmentation submodule 133 is used to segment a number of feature submaps from the feature map to be convolved after zero padding according to preset feature map segmentation rules.
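The zero-padding step above can be sketched as a small self-contained example (written for illustration, not taken from the application; the function name and the bottom/right padding position are our assumptions):

```python
def pad_to_multiple(fmap, stride):
    """Zero-pad a 2-D feature map (list of lists) on the bottom and
    right -- an assumed padding position -- until its height and width
    are integer multiples of `stride`."""
    h, w = len(fmap), len(fmap[0])
    new_h = -(-h // stride) * stride  # ceiling division, scaled back up
    new_w = -(-w // stride) * stride
    padded = [row + [0.0] * (new_w - w) for row in fmap]
    padded += [[0.0] * new_w for _ in range(new_h - h)]
    return padded

fmap = [[1.0] * 5 for _ in range(5)]   # a 5x5 feature map
padded = pad_to_multiple(fmap, 2)      # stride 2 -> padded to 6x6
assert len(padded) == 6 and len(padded[0]) == 6
```

After padding, the feature map can be segmented evenly into feature sub-maps, since every row and column index group now has the same size.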
  • the convolution module 140 includes:
  • the weight loading sub-module 141 is used to load the weight of the sub-filter into the weight register connected to the systolic array;
  • the sub-map loading sub-module 142 is used to load the feature sub-map corresponding to the sub-filter into the input register connected to the systolic array.
  • the output sub-module 143 is used to obtain the output result of the systolic array convolution calculation.
  • the filter segmentation module 120 includes a first filter segmentation sub-module 121, which is configured to, if the convolution step size is 2 and the size of the convolution filter is 2×2, segment 4 sub-filters from the convolution filter, each with a size of 1×1; the first sub-filter includes the weights of the odd rows and odd columns of the convolution filter, the second sub-filter includes the weights of the odd rows and even columns, the third sub-filter includes the weights of the even rows and odd columns, and the fourth sub-filter includes the weights of the even rows and even columns of the convolution filter.
  • the feature map segmentation module 130 includes a first feature map segmentation sub-module 1301, which is configured to, if the convolution step size is 2 and the size of the convolution filter is 2×2, assign the values of the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature sub-map, the values of the odd rows and even columns to the corresponding positions of the second feature sub-map, the values of the even rows and odd columns to the corresponding positions of the third feature sub-map, and the values of the even rows and even columns to the corresponding positions of the fourth feature sub-map.
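The 2×2, step-size-2 segmentation above can be checked numerically. The following is a self-contained sketch written by us (names such as `conv2d` are illustrative, not from the application): the four 1×1 sub-filters, applied with step size 1 to the four feature sub-maps, sum to exactly the strided convolution result.

```python
def conv2d(x, w, stride):
    """Valid 2-D convolution of nested lists; used only for checking."""
    kh, kw = len(w), len(w[0])
    oh = (len(x) - kh) // stride + 1
    ow = (len(x[0]) - kw) // stride + 1
    return [[sum(w[a][b] * x[i * stride + a][j * stride + b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

# 6x6 input feature map and a 2x2 filter, convolved with step size 2.
x = [[float(i * 6 + j) for j in range(6)] for i in range(6)]
w = [[1.0, 2.0], [3.0, 4.0]]

direct = conv2d(x, w, stride=2)  # the ordinary strided convolution

# Split: the (a, b) feature sub-map keeps rows a, a+2, ... and columns
# b, b+2, ...; the matching 1x1 sub-filter is just the weight w[a][b].
subs = {(a, b): [row[b::2] for row in x[a::2]]
        for a in range(2) for b in range(2)}
split = [[sum(w[a][b] * subs[(a, b)][i][j]
              for a in range(2) for b in range(2))
          for j in range(3)] for i in range(3)]

assert split == direct  # step-size-1 sub-convolutions reproduce the result
```

Because each sub-convolution has step size 1, it maps directly onto the systolic array without discarding any computed values.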
  • the filter segmentation module 120 includes a second filter segmentation sub-module 122, which is configured to, if the convolution step size is 2 and the size of the convolution filter is 3×3, segment 4 sub-filters from the convolution filter, each with a size of 2×2; the first sub-filter includes the weights of the odd rows and odd columns of the convolution filter, the second sub-filter includes the weights of the odd rows and even columns, the third sub-filter includes the weights of the even rows and odd columns, and the fourth sub-filter includes the weights of the even rows and even columns of the convolution filter.
  • the feature map segmentation module 130 includes a second feature map segmentation sub-module 1302, which is configured to, if the convolution step size is 2 and the size of the convolution filter is 3×3, assign the values of the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature sub-map, the values of the odd rows and even columns to the corresponding positions of the second feature sub-map, the values of the even rows and odd columns to the corresponding positions of the third feature sub-map, and the values of the even rows and even columns to the corresponding positions of the fourth feature sub-map.
  • the filter segmentation module 120 includes a third filter segmentation sub-module 123, which is configured to, if the convolution step size is 3 and the size of the convolution filter is 3×3, segment 9 sub-filters from the convolution filter, each with a size of 1×1 and each including one of the 9 weights of the convolution filter.
  • the feature map segmentation module 130 includes a third feature map segmentation sub-module 1303, which is configured to, if the convolution step size is 3 and the size of the convolution filter is 3×3, assign the value in row 3n+1, column 3n+1 of the feature map to be convolved to the corresponding position of the first feature sub-map, the value in row 3n+1, column 3n+2 to the corresponding position of the second feature sub-map, the value in row 3n+1, column 3n+3 to the corresponding position of the third feature sub-map, the value in row 3n+2, column 3n+1 to the corresponding position of the fourth feature sub-map, the value in row 3n+2, column 3n+2 to the corresponding position of the fifth feature sub-map, the value in row 3n+2, column 3n+3 to the corresponding position of the sixth feature sub-map, the value in row 3n+3, column 3n+1 to the corresponding position of the seventh feature sub-map, the value in row 3n+3, column 3n+2 to the corresponding position of the eighth feature sub-map, and the value in row 3n+3, column 3n+3 to the corresponding position of the ninth feature sub-map, where n is a natural number.
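The 3×3, step-size-3 case can be checked the same way. This is an illustrative sketch of our own (names are not from the application): the nine 1×1 sub-filters, each applied with step size 1 to its feature sub-map, sum to the strided convolution result.

```python
def conv2d(x, w, stride):
    """Valid 2-D convolution of nested lists; used only for checking."""
    kh, kw = len(w), len(w[0])
    oh = (len(x) - kh) // stride + 1
    ow = (len(x[0]) - kw) // stride + 1
    return [[sum(w[a][b] * x[i * stride + a][j * stride + b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

# 9x9 input feature map and a 3x3 filter, convolved with step size 3.
x = [[float(i * 9 + j) for j in range(9)] for i in range(9)]
w = [[float(a * 3 + b + 1) for b in range(3)] for a in range(3)]

direct = conv2d(x, w, stride=3)  # 3x3 output

# Nine 1x1 sub-filters w[a][b], each applied with step size 1 to the
# feature sub-map holding rows a, a+3, a+6 and columns b, b+3, b+6.
submap = lambda a, b: [row[b::3] for row in x[a::3]]
split = [[sum(w[a][b] * submap(a, b)[i][j]
              for a in range(3) for b in range(3))
          for j in range(3)] for i in range(3)]

assert split == direct
```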
  • the method and device of this application can be used in many general or special computing system environments or configurations.
  • the foregoing method and apparatus may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 21.
  • FIG. 21 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • the computer equipment can be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions.
  • the processor can execute any neural network acceleration method based on a systolic array.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for the operation of the computer program in the non-volatile storage medium.
  • the processor can execute any neural network acceleration method based on the systolic array.
  • the network interface is used for network communication, such as sending assigned tasks.
  • the structure of the computer device is only a block diagram of the part of the structure related to the solution of this application, and does not constitute a limitation on the computer device to which the solution is applied; the specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different component arrangement.
  • the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the processor is used to run a computer program stored in a memory to implement the steps of the aforementioned method for accelerating a neural network based on a systolic array.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • if the convolution step size is not 1 and the size of the convolution filter is greater than 1×1, a number of sub-filters are segmented from the convolution filter according to a preset filter segmentation rule, the size of each sub-filter being smaller than the size of the convolution filter;
  • the convolution calculation results corresponding to each of the subfilters are superimposed, and the superimposed result is output as the convolution calculation result of the feature map to be convolved by the convolution filter.
  • a computer-readable storage medium stores a computer program
  • the computer program includes program instructions
  • the processor executes the program instructions to implement any systolic array-based neural network acceleration method provided in the embodiments of this application.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, such as the hard disk or memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

A neural network acceleration method and apparatus based on a systolic array, and a computer device and a storage medium. The method comprises: acquiring convolution parameters of a convolution filter; segmenting a plurality of sub-filters from the convolution filter according to a preset filter segmentation rule; segmenting, according to a preset feature map segmentation rule, a plurality of feature sub-maps from a feature map to be convolved; performing, on the basis of the systolic array, convolution calculation on the corresponding feature sub-maps according to the sub-filters; and superposing the convolution calculation results corresponding to the sub-filters.

Description

Neural network acceleration method, apparatus, computer device and storage medium based on a systolic array

This application claims priority to Chinese patent application No. 201910268881.8, filed with the Chinese Patent Office on April 4, 2019 and entitled "Neural network acceleration method, apparatus, computer device and storage medium based on a systolic array", the entire content of which is incorporated herein by reference.

Technical Field

This application relates to the technical field of neural networks, and in particular to a systolic array-based neural network acceleration method, apparatus, computer device and storage medium.

Background

The most computation-intensive part of a typical neural network is the convolution calculation, and convolutions whose step size is not equal to 1 are frequently encountered. In this case, mainstream neural network computing libraries, such as cuDNN (NVIDIA's deep neural network library), become significantly slower. Deep learning accelerators such as field-programmable gate arrays (FPGAs) and dedicated neural network processing units (NPUs) usually implement the convolution part with a systolic array structure, and this structure is very unfriendly to convolutions whose step size is not equal to 1.

The prior art generally first computes the convolution result with a step size of 1, and then down-samples and discards the unneeded results to obtain the feature map for the desired step size. This obviously wastes computation and scheduling resources, and likewise slows down the convolution calculation.

Summary

The embodiments of this application provide a systolic array-based neural network acceleration method, apparatus, computer device and storage medium, which can better solve the problem that convolution calculations with a step size other than 1 waste the computation and scheduling resources of a systolic array.

In a first aspect, this application provides a systolic array-based neural network acceleration method, the method including:

acquiring convolution parameters of a convolution filter, the convolution parameters including a convolution step size and the size of the convolution filter;

if the convolution step size is not 1 and the size of the convolution filter is greater than 1×1, segmenting a number of sub-filters from the convolution filter according to a preset filter segmentation rule, the size of each sub-filter being smaller than the size of the convolution filter;

acquiring a feature map to be convolved and segmenting a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule, the feature sub-maps corresponding one-to-one with the sub-filters;

performing, based on a systolic array, convolution calculation on the corresponding feature sub-maps according to each of the sub-filters, the step size of the convolution calculation being 1;

superimposing the convolution calculation results corresponding to each of the sub-filters, and outputting the superimposed result as the result of the convolution calculation of the feature map to be convolved by the convolution filter.

In a second aspect, this application provides a systolic array-based neural network acceleration apparatus, the apparatus including:

a convolution parameter acquisition module, configured to acquire convolution parameters of a convolution filter, the convolution parameters including a convolution step size and the size of the convolution filter;

a filter segmentation module, configured to, if the convolution step size is not 1 and the size of the convolution filter is greater than 1×1, segment a number of sub-filters from the convolution filter according to a preset filter segmentation rule, the size of each sub-filter being smaller than the size of the convolution filter;

a feature map segmentation module, configured to acquire a feature map to be convolved and segment a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule, the feature sub-maps corresponding one-to-one with the sub-filters;

a convolution module, configured to perform, based on a systolic array, convolution calculation on the corresponding feature sub-maps according to each of the sub-filters, the step size of the convolution calculation being 1;

a superposition module, configured to superimpose the convolution calculation results corresponding to each of the sub-filters, and output the superimposed result as the result of the convolution calculation of the feature map to be convolved by the convolution filter.

In a third aspect, this application provides a computer device including a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, implement the above systolic array-based neural network acceleration method.

In a fourth aspect, this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above systolic array-based neural network acceleration method.

This application discloses a systolic array-based neural network acceleration method, apparatus, device and storage medium. When the convolution step size is not 1, a number of sub-filters are segmented from the convolution filter according to a preset filter segmentation rule, and a number of feature sub-maps are segmented from the feature map to be convolved according to a preset feature map segmentation rule, so that the convolution calculation can be performed with a step size of 1. The superimposed result of the convolution calculations corresponding to the sub-filters is identical to the result of the original convolution, with a step size other than 1, of the original convolution filter over the feature map to be convolved; that is, the two convolution calculations before and after the segmentation are equivalent. However, because the step size after segmentation is 1, the computing power of the systolic array can be utilized more fully.

Brief Description of the Drawings

In order to explain the technical solutions of the embodiments of this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.

FIG. 1 is a schematic flowchart of a systolic array-based neural network acceleration method according to an embodiment of this application;

FIG. 2 is a schematic flowchart of a systolic array-based neural network acceleration method according to another embodiment of this application;

FIG. 3 is a schematic diagram of segmenting a convolution filter when the convolution step size is 2 and the convolution filter size is 2×2;

FIG. 4 is a schematic diagram of segmenting the feature map to be convolved when the convolution step size is 2 and the convolution filter size is 2×2;

FIG. 5 is a schematic sub-flowchart of an implementation of segmenting the feature map to be convolved in FIG. 1;

FIG. 6 is a schematic diagram of segmenting the zero-padded feature map to be convolved;

FIG. 7 is a schematic structural diagram of a systolic array;

FIG. 8 is a schematic diagram of a convolution calculation performed by a systolic array;

FIG. 9 is a schematic sub-flowchart of the systolic array-based convolution calculation in FIG. 1;

FIG. 10 is a schematic flowchart of a systolic array-based neural network acceleration method according to still another embodiment of this application;

FIG. 11 is a schematic diagram of segmenting a convolution filter when the convolution step size is 2 and the convolution filter size is 3×3;

FIG. 12 is a schematic sub-flowchart of segmenting a convolution filter when the convolution step size is 2 and the convolution filter size is 3×3;

FIG. 13 is a schematic diagram of segmenting the feature map to be convolved when the convolution step size is 2 and the convolution filter size is 3×3;

FIG. 14 is a schematic flowchart of a systolic array-based neural network acceleration method according to yet another embodiment of this application;

FIG. 15 is a schematic diagram of segmenting a convolution filter when the convolution step size is 3 and the convolution filter size is 3×3;

FIG. 16 is a schematic sub-flowchart of segmenting the feature map to be convolved when the convolution step size is 3 and the convolution filter size is 3×3;

FIG. 17 is a schematic diagram of segmenting the feature map to be convolved when the convolution step size is 3 and the convolution filter size is 3×3;

FIG. 18 is a schematic diagram of an equivalent transformation of the down-sampling topology of a deep convolutional neural network according to the neural network acceleration method;

FIG. 19 is a schematic structural diagram of a systolic array-based neural network acceleration apparatus according to an embodiment of this application;

FIG. 20 is a schematic structural diagram of a systolic array-based neural network acceleration apparatus according to another embodiment of this application;

FIG. 21 is a schematic structural diagram of a computer device provided by an embodiment of this application.

Detailed Description

The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.

The flowcharts shown in the drawings are merely illustrations and do not necessarily include all contents and operations/steps, nor must they be executed in the described order. For example, some operations/steps may be decomposed, combined or partially merged, so the actual execution order may change according to actual conditions. In addition, although the functional modules are divided in the apparatus schematic diagrams, in some cases they may be divided differently from the modules shown.

The embodiments of this application provide a systolic array-based neural network acceleration method, apparatus, device and storage medium. The systolic array-based neural network acceleration method can be applied in a terminal or a server to accelerate the training or inference of a systolic array-based neural network.

Some embodiments of this application are described in detail below with reference to the accompanying drawings. In the absence of conflict, the following embodiments and the features in the embodiments may be combined with each other.

Please refer to FIG. 1, which is a schematic flowchart of a systolic array-based neural network acceleration method provided by an embodiment of this application.

As shown in FIG. 1, the systolic array-based neural network acceleration method includes the following steps:

Step S110: acquire convolution parameters of a convolution filter.

The convolution parameters include a convolution step size and the size of the convolution filter.

A filter, also called a kernel or feature detector, is slid over the input image or feature map, and computing the dot product at each position constitutes the convolution operation; the matrix output by the convolution operation is called a convolved feature, activation map or feature map.

Exemplarily, before a neuron in the neural network performs a convolution operation, the pre-stored or initialized convolution parameters of the convolution filter are first acquired.

In this embodiment, the convolution parameters include the convolution step size (stride) and the size of the convolution filter, that is, the height h and width w of the convolution filter. In other embodiments, the convolution parameters further include the number of input channels and/or the number of output channels; the number of input channels (in depth) is determined by the number of channels of the feature map to be convolved, and the number of output channels (out depth) is equal to the number of convolution filters and determines the number of channels of the feature map output after the convolution.

The systolic array-based neural network acceleration method can be used in scenarios where the number of input channels is equal to or greater than 1, and also in scenarios where the number of output channels is equal to or greater than 1.

Step S120: if the convolution step size is not 1 and the size of the convolution filter is greater than 1×1, segment a number of sub-filters from the convolution filter according to a preset filter segmentation rule.

The size of each sub-filter is smaller than the size of the convolution filter.

Some deep learning accelerators, such as FPGAs and dedicated NPUs, usually implement the convolution part with a systolic array structure, but this structure is very unfriendly to the case where the convolution step size is not equal to 1. In this embodiment, a convolution filter whose size is greater than 1×1 is segmented into a number of sub-filters, so that each sub-filter performs its convolution operation with a step size equal to 1, thereby making full use of the performance of the systolic array structure.
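In notation of our own choosing (a summary sketch, not part of the original application text), the segmentation rules used below are all instances of a single identity. For a convolution step size $s$ and a $k\times k$ convolution filter $w$ applied to a feature map $x$:

```latex
y_{i,j}
  = \sum_{a=0}^{k-1}\sum_{b=0}^{k-1} w_{a,b}\, x_{s i + a,\; s j + b}
  = \sum_{p=0}^{s-1}\sum_{q=0}^{s-1}\bigl( w^{(p,q)} \ast_{1} x^{(p,q)} \bigr)_{i,j},
\qquad
w^{(p,q)}_{\alpha,\beta} = w_{s\alpha + p,\; s\beta + q},
\quad
x^{(p,q)}_{m,n} = x_{s m + p,\; s n + q},
```

where $\ast_1$ denotes convolution with step size 1, and $p, q$ index the sub-filters and feature sub-maps. For $s=2$, $k=2$ this yields the four 1×1 sub-filters of step S121; for $s=3$, $k=3$ it yields nine 1×1 sub-filters; and for $s=2$, $k=3$ the sub-filters have size at most 2×2 (zero-padded to 2×2 where needed).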

在一些实施例中,如图2和图3所示,步骤S120若所述卷积步长不为1且所述卷积滤波器的尺寸大于1×1,根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器,具体包括:In some embodiments, as shown in FIG. 2 and FIG. 3, in step S120, if the convolution step size is not 1 and the size of the convolution filter is greater than 1×1, according to a preset filter segmentation rule The convolution filter divides into several sub-filters, which specifically include:

步骤S121、若所述卷积步长为2且所述卷积滤波器的尺寸为2×2,从所述卷积滤波器分割出4个子滤波器,各所述子滤波器的尺寸为1×1。Step S121: If the convolution step size is 2 and the size of the convolution filter is 2×2, 4 subfilters are divided from the convolution filter, and the size of each subfilter is 1. ×1.

如图3所示,某卷积操作对应的卷积参数为[6 6 2 2],即输入通道数in depth为6,输出通道数out depth等于6,6个卷积滤波器Kernel Tenseor的尺寸均为2×2。As shown in Figure 3, the convolution parameter corresponding to a certain convolution operation is [6 6 2 2], that is, the number of input channels in depth is 6, the number of output channels out depth is equal to 6, and the size of the 6 convolution filters KernelTenseor Both are 2×2.

如图3所示,将每个2×2的卷积滤波器分别分割为4个1×1的子滤波器。以第一个卷积滤波器分割的4个子滤波器为例,其中第一个子滤波器包括所述卷积滤波器奇数行奇数列的权值w1,第二个子滤波器包括所述卷积滤波器奇数行偶数列的权值w2,第三个子滤波器包括所述卷积滤波器偶数行奇数列的权值w3,第四个子滤波器包括所述卷积滤波器偶数行偶数列的权值w4。As shown in Figure 3, each 2×2 convolution filter is divided into 4 1×1 sub-filters. Take the four sub-filters divided by the first convolution filter as an example, where the first sub-filter includes the weight w1 of the odd-numbered rows and odd-numbered columns of the convolution filter, and the second sub-filter includes the convolution The weights w2 of the odd rows and even columns of the filter, the third subfilter includes the weights w3 of the even rows and odd columns of the convolution filter, and the fourth subfilter includes the weights of the even rows and even columns of the convolution filter. The value w4.

具体的，将卷积滤波器第一行第一列的权值分配给第一个1×1的子滤波器，将卷积滤波器第一行第二列的权值分配给第二个1×1的子滤波器，将卷积滤波器第二行第一列的权值分配给第三个1×1的子滤波器，将卷积滤波器第二行第二列的权值分配给第四个1×1的子滤波器。Specifically, the weight in the first row and first column of the convolution filter is assigned to the first 1×1 sub-filter, the weight in the first row and second column is assigned to the second 1×1 sub-filter, the weight in the second row and first column is assigned to the third 1×1 sub-filter, and the weight in the second row and second column is assigned to the fourth 1×1 sub-filter.
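The assignment above amounts to parity indexing of the weight matrix. A minimal NumPy sketch (the weight values are hypothetical, chosen only for illustration; they are not from the embodiment):

```python
import numpy as np

# Hypothetical 2x2 filter; w1..w4 follow the numbering used in the text.
kernel = np.array([[1.0, 2.0],   # w1 w2
                   [3.0, 4.0]])  # w3 w4

# Each 1x1 sub-filter keeps exactly one weight of the 2x2 filter.
w1 = kernel[0:1, 0:1]  # odd row,  odd column  (1-indexed)
w2 = kernel[0:1, 1:2]  # odd row,  even column
w3 = kernel[1:2, 0:1]  # even row, odd column
w4 = kernel[1:2, 1:2]  # even row, even column
```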

步骤S130、获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图。Step S130: Obtain a feature map to be convolved, and segment a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule.

其中,所述若干特征子图与所述若干子滤波器一一对应。Wherein, the plurality of feature sub-images correspond to the plurality of sub-filters one to one.

示例性的,待卷积特征图的通道数可以等于1或大于1,待卷积特征图的通道数可以决定相应卷积滤波器卷积参数中的输入通道数。Exemplarily, the number of channels of the feature map to be convolved may be equal to or greater than 1, and the number of channels of the feature map to be convolved may determine the number of input channels in the convolution parameter of the corresponding convolution filter.

在一些实施例中，如图2和图4所示，若所述卷积步长为2且所述卷积滤波器的尺寸为2×2，步骤S130获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图，具体包括：In some embodiments, as shown in FIG. 2 and FIG. 4, if the convolution stride is 2 and the size of the convolution filter is 2×2, step S130 of acquiring the feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes:

步骤S1311、将所述待卷积特征图奇数行奇数列的数值分配至第一个特征子图的相应位置。Step S1311: Assign the values of the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature submap.

步骤S1312、将所述待卷积特征图奇数行偶数列的数值分配至第二个特征子图的相应位置。Step S1312. Assign the values of the odd rows and even columns of the feature map to be convolved to the corresponding positions of the second feature submap.

步骤S1313、将所述待卷积特征图偶数行奇数列的数值分配至第三个特征子图的相应位置。Step S1313: Assign the values of the even rows and odd columns of the feature map to be convolved to the corresponding positions of the third feature submap.

步骤S1314、将所述待卷积特征图偶数行偶数列的数值分配至第四个特征子图的相应位置。Step S1314: Assign the values of the even rows and even columns of the feature map to be convolved to the corresponding positions of the fourth feature submap.

示例性的,待卷积特征图中同一行的数值在各特征子图中也位于同一行,待卷积特征图中同一列的数值在各特征子图中也位于同一列。Exemplarily, the values in the same row in the feature map to be convolved are also located in the same row in each feature submap, and the values in the same column in the feature map to be convolved are also located in the same column in each feature submap.
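Steps S1311 to S1314 correspond to parity slicing of the feature map. A small NumPy sketch (the 4×4 values are hypothetical; the text's 1-indexed "odd/even" rows become 0-based slices here):

```python
import numpy as np

x = np.arange(16).reshape(4, 4)  # hypothetical 4x4 feature map to be convolved

sub1 = x[0::2, 0::2]  # odd rows,  odd columns  (1-indexed)
sub2 = x[0::2, 1::2]  # odd rows,  even columns
sub3 = x[1::2, 0::2]  # even rows, odd columns
sub4 = x[1::2, 1::2]  # even rows, even columns

# Values from the same row of x stay in the same row of each sub-map,
# and likewise for columns, as stated in the text.
```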

如图4所示，获取到的待卷积特征图input Tensor是[1 6 4 4]的特征图，该待卷积特征图的通道数为6，宽和高均为4。根据预设的特征图分割规则从所述待卷积特征图分割出了4个特征子图。As shown in Figure 4, the acquired feature map to be convolved, input Tensor, is a [1 6 4 4] feature map; the number of channels of the feature map to be convolved is 6, and its width and height are both 4. According to the preset feature map segmentation rule, 4 feature sub-maps are segmented from the feature map to be convolved.

在一些实施例中,如图5所示,步骤S130获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图具体包括:In some embodiments, as shown in FIG. 5, step S130 acquiring a feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes:

步骤S131、获取待卷积特征图。Step S131: Obtain a feature map to be convolved.

示例性的,获取的卷积特征图如图6所示。Exemplarily, the acquired convolution feature map is shown in FIG. 6.

步骤S132、若获取的待卷积特征图的长或宽不是所述卷积步长的整数倍，对所述待卷积特征图的预设位置进行补零以使补零后的待卷积特征图的长或宽是所述卷积步长的整数倍。Step S132: If the length or width of the acquired feature map to be convolved is not an integer multiple of the convolution stride, perform zero padding at preset positions of the feature map to be convolved so that the length or width of the zero-padded feature map to be convolved is an integer multiple of the convolution stride.

在本实施例中，获取的待卷积特征图的长、宽均为3，示例性的在该特征图的右侧和下方补零padding，使得补零后的待卷积特征图的长、宽均为4。In this embodiment, the length and width of the acquired feature map to be convolved are both 3. Exemplarily, zero padding is added to the right and bottom of the feature map, so that the length and width of the zero-padded feature map to be convolved are both 4.

步骤S133、根据预设的特征图分割规则从补零后的待卷积特征图分割出若干特征子图。Step S133: According to a preset feature map segmentation rule, a number of feature sub-maps are segmented from the feature map to be convolved after zero padding.

如图6所示,根据上述步骤S1311-步骤S1314从补零后的待卷积特征图分割出了4个特征子图。As shown in FIG. 6, according to the above steps S1311-step S1314, 4 feature submaps are segmented from the feature map to be convolved after zero padding.
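The padding of step S132 can be sketched in a few lines. This is only an illustrative helper under the assumption of right-and-bottom padding described above; `pad_to_multiple` is a hypothetical name, not from the embodiment:

```python
import numpy as np

def pad_to_multiple(x, stride):
    """Zero-pad on the right and bottom so that height and width
    become integer multiples of the convolution stride (step S132)."""
    h, w = x.shape
    return np.pad(x, ((0, (-h) % stride), (0, (-w) % stride)))

x = np.ones((3, 3))             # hypothetical 3x3 feature map
padded = pad_to_multiple(x, 2)  # padded to 4x4, as in Fig. 6
```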

在一些实施例中，待卷积特征图的结构为NCHW，如[1 6 4 4]，N代表数量，C代表通道数channel，H代表高度，W代表宽度；因此实例数batchsize=1，通道数channel=6，待卷积特征图的高度H=4，待卷积特征图的宽度W=4时，即待卷积张量的数量为1，有6个通道，各通道均为一个待卷积特征图。可以先分割、卷积同一数量编号下不同通道的待卷积特征图，然后分割、卷积下一数量编号下不同通道的待卷积特征图。In some embodiments, the structure of the feature map to be convolved is NCHW, such as [1 6 4 4], where N represents the number, C represents the number of channels, H represents the height, and W represents the width; therefore, when batchsize = 1, channel = 6, H = 4 and W = 4, the number of tensors to be convolved is 1 with 6 channels, and each channel is one feature map to be convolved. The feature maps to be convolved of the different channels under one number can be segmented and convolved first, and then those of the different channels under the next number are segmented and convolved.

步骤S140、基于脉动阵列,根据各所述子滤波器对各自对应的特征子图进行卷积计算,卷积计算的步长为1。Step S140, based on the systolic array, perform convolution calculation on the corresponding feature submap according to each of the subfilters, and the step size of the convolution calculation is 1.

脉动阵列(Systolic Array)核心概念就是让数据在运算单元的阵列中进行流动,减少访存的次数,并且使得结构更加规整,布线更加统一,提高频率。The core concept of Systolic Array is to allow data to flow in the array of arithmetic units, reduce the number of memory accesses, and make the structure more regular, the wiring more uniform, and the frequency.

在一些实施例中，如图7所示，脉动阵列Systolic Array包括L×L个处理单元PE，所述脉动阵列连接于权值寄存器filter buffer、输入寄存器in buffer和输出寄存器out buffer。每行处理单元PE的左侧、每列处理单元PE的上侧都设有先进先出寄存器FIFO。滤波器的权值通过先进先出寄存器FIFO存储并传输给同一行的所有处理单元PE，第一行和第一列的处理单元PE接收来自输入寄存器中待卷积特征图的数据，并且第一行和第一列的处理单元PE均向各自右下角的处理单元PE传输来自待卷积特征图的数据。这样的设计最大化了数据的复用。In some embodiments, as shown in FIG. 7, the systolic array includes L×L processing elements (PE), and the systolic array is connected to the weight register (filter buffer), the input register (in buffer) and the output register (out buffer). A first-in first-out (FIFO) register is provided on the left side of each row of PEs and on the upper side of each column of PEs. The weights of the filter are stored in the FIFO registers and transmitted to all PEs in the same row; the PEs in the first row and the first column receive data from the feature map to be convolved in the input register, and each PE in the first row and the first column transmits the data from the feature map to be convolved to the PE at its lower right. Such a design maximizes data reuse.

示例性的,如图8所示,脉动阵列根据一个3×3滤波器W对一个5×5的特征图X进行二维卷积。Exemplarily, as shown in FIG. 8, the systolic array performs two-dimensional convolution on a 5×5 feature map X according to a 3×3 filter W.

假设滤波器W和特征图X有以下形式:Assume that the filter W and the feature map X have the following forms:

Figure PCTCN2019103137-appb-000001
即W=(w1; w2; w3)，X=(x1; x2; x3; x4; x5)。That is, W = (w1; w2; w3) and X = (x1; x2; x3; x4; x5), written row by row.

其中,wi和xj分别代表滤波器W和特征图X的某一行数据,则最后一行的三个处理单元PE输出三行卷积结果:Among them, wi and xj respectively represent a certain row of data of filter W and feature map X, and the three processing units PE in the last row output three rows of convolution results:

Figure PCTCN2019103137-appb-000002
即y1=w1*x1+w2*x2+w3*x3，y2=w1*x2+w2*x3+w3*x4，y3=w1*x3+w2*x4+w3*x5。That is, y1 = w1*x1 + w2*x2 + w3*x3, y2 = w1*x2 + w2*x3 + w3*x4, and y3 = w1*x3 + w2*x4 + w3*x5.

其中*表示一维卷积计算。Where * represents one-dimensional convolution calculation.
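The row-wise decomposition computed by the last row of PEs can be checked numerically. A sketch assuming CNN-style convolution (a sliding dot product, i.e. cross-correlation), with random W and X standing in for the symbols above:

```python
import numpy as np

def corr1d(x, w):
    # 'valid' one-dimensional sliding dot product (CNN-style convolution)
    return np.correlate(x, w, mode='valid')

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))  # feature map, rows x1..x5
W = rng.standard_normal((3, 3))  # filter, rows w1..w3

# Each output row, as produced by the last row of PEs:
# yi = w1 * x_i + w2 * x_{i+1} + w3 * x_{i+2}
Y = np.stack([corr1d(X[i], W[0]) + corr1d(X[i + 1], W[1]) + corr1d(X[i + 2], W[2])
              for i in range(3)])

# Reference: direct 3x3, stride-1 two-dimensional convolution
ref = np.array([[np.sum(X[i:i + 3, j:j + 3] * W) for j in range(3)]
                for i in range(3)])

assert np.allclose(Y, ref)
```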

在一些实施例中,如图9所示,步骤S140基于脉动阵列,根据各所述子滤波器对各自对应的特征子图进行卷积计算,具体包括:In some embodiments, as shown in FIG. 9, step S140 is based on the systolic array, and performs convolution calculation on the corresponding feature sub-map according to each of the sub-filters, which specifically includes:

步骤S141、将所述子滤波器的权值加载至连接于所述脉动阵列的权值寄存器中。Step S141: Load the weight of the sub-filter into the weight register connected to the systolic array.

示例性的,将子滤波器的权值加载至权值寄存器filter buffer中,子滤波器的权值通过先进先出寄存器FIFO存储并传输给同一行的处理单元PE。Exemplarily, the weight of the sub-filter is loaded into the weight register filter buffer, and the weight of the sub-filter is stored in the first-in first-out register FIFO and transmitted to the processing unit PE in the same row.

步骤S142、将所述子滤波器对应的特征子图加载至连接于所述脉动阵列的输入寄存器中。Step S142: Load the characteristic sub-map corresponding to the sub-filter into the input register connected to the systolic array.

示例性的,将与子滤波器对应的特征子图加载至输入寄存器in buffer中,脉动阵列第一行和第一列的处理单元PE接收来自输入寄存器in buffer中特征子图的数据。Exemplarily, the feature submap corresponding to the subfilter is loaded into the input register in buffer, and the processing unit PE in the first row and the first column of the systolic array receives data from the feature submap in the input register in buffer.

步骤S143、获取所述脉动阵列卷积计算后的输出结果。Step S143: Obtain the output result of the systolic array convolution calculation.

示例性的，脉动阵列第一行和第一列的处理单元PE均向各自右下角的处理单元PE传输来自特征子图的数据；最后一行的处理单元PE输出所述子滤波器以卷积步长1对对应的特征子图进行卷积计算的卷积结果。Exemplarily, each PE in the first row and the first column of the systolic array transmits the data from the feature sub-map to the PE at its lower right; the PEs in the last row output the result of the convolution calculation performed, with a stride of 1, by the sub-filter on its corresponding feature sub-map.

如图3所示，第一个子滤波器的权值为w1，对图4中与其对应的第一个特征子图进行卷积计算；第二个子滤波器的权值为w2，对第二个特征子图进行卷积计算；第三个子滤波器的权值为w3，对第三个特征子图进行卷积计算；第四个子滤波器的权值为w4，对第四个特征子图进行卷积计算。第一至第四个子滤波器对应的卷积计算的结果如下：As shown in Figure 3, the weight of the first sub-filter is w1, and it performs a convolution calculation on the corresponding first feature sub-map in Figure 4; the weight of the second sub-filter is w2, and it performs a convolution calculation on the second feature sub-map; the weight of the third sub-filter is w3, and it performs a convolution calculation on the third feature sub-map; the weight of the fourth sub-filter is w4, and it performs a convolution calculation on the fourth feature sub-map. The results of the convolution calculations corresponding to the first to fourth sub-filters are as follows:

Figure PCTCN2019103137-appb-000003

步骤S150、将各所述子滤波器对应的卷积计算结果进行叠加,并将叠加的结果作为所述卷积滤波器对所述待卷积特征图卷积计算的结果进行输出。Step S150: Superimpose the convolution calculation results corresponding to each of the sub-filters, and output the superimposed result as the result of the convolution calculation of the feature map to be convolved by the convolution filter.

示例性的,将4个子滤波器对应的卷积计算结果进行叠加,得到:Exemplarily, the convolution calculation results corresponding to the 4 sub-filters are superimposed to obtain:

Figure PCTCN2019103137-appb-000004

如果直接根据图3左侧的卷积滤波器对图4左侧的卷积特征图以卷积步长为2进行卷积计算,卷积计算的结果为:If you directly perform convolution calculation on the convolution feature map on the left side of Fig. 4 according to the convolution filter on the left side of Fig. 3 with a convolution step size of 2, the result of the convolution calculation is:

Figure PCTCN2019103137-appb-000005

因此，本实施例的基于脉动阵列的神经网络加速方法，在卷积步长不为1时根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器，并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图，从而可以以卷积步长1执行卷积计算；各子滤波器对应的卷积计算结果叠加后的结果，与根据原卷积滤波器对待卷积特征图以不为1的卷积步长执行卷积计算的结果相同，即分割操作前后的两种卷积计算是等价的；因此叠加的结果可以作为所述卷积滤波器对所述待卷积特征图卷积计算的结果进行输出，以进行后续的处理，如再一次卷积、池化、分类等；由于分割操作后卷积步长为1，可以更充分地利用脉动阵列的计算能力。Therefore, in the systolic-array-based neural network acceleration method of this embodiment, when the convolution stride is not 1, several sub-filters are divided from the convolution filter according to the preset filter segmentation rule, and several feature sub-maps are segmented from the feature map to be convolved according to the preset feature map segmentation rule, so that the convolution calculation can be performed with a stride of 1. The result of superimposing the convolution calculation results corresponding to the sub-filters is the same as the result of performing, with the original convolution filter, the convolution calculation with a stride other than 1 on the feature map to be convolved; that is, the two convolution calculations before and after the segmentation operation are equivalent. The superimposed result can therefore be output as the result of the convolution calculation of the convolution filter on the feature map to be convolved, for subsequent processing such as another convolution, pooling or classification. Since the convolution stride after the segmentation operation is 1, the computing power of the systolic array can be utilized more fully.
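The claimed equivalence for the 2×2, stride-2 case can be verified end to end. An illustrative NumPy sketch with random data (a 1×1 stride-1 convolution is simply a scalar multiplication of the corresponding sub-map):

```python
import numpy as np

def conv2d(x, k, stride):
    """Plain 'valid' two-dimensional convolution (cross-correlation)."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i * stride:i * stride + kh,
                               j * stride:j * stride + kw] * k)
                      for j in range(ow)] for i in range(oh)])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))  # feature map to be convolved
k = rng.standard_normal((2, 2))  # 2x2 filter, applied with stride 2

direct = conv2d(x, k, stride=2)

# Split computation: four parity sub-maps, four 1x1 sub-filters applied
# with stride 1, then superposition of the four partial results.
split = sum(k[a, b] * x[a::2, b::2] for a in (0, 1) for b in (0, 1))

assert np.allclose(direct, split)  # the two computations are equivalent
```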

示例性的，如图10和图11所示，步骤S120若所述卷积步长不为1且所述卷积滤波器的尺寸大于1×1，根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器，具体包括：Exemplarily, as shown in FIG. 10 and FIG. 11, step S120 of dividing several sub-filters from the convolution filter according to a preset filter segmentation rule, if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, specifically includes:

步骤S122、若所述卷积步长为2且所述卷积滤波器的尺寸为3×3，从所述卷积滤波器分割出4个子滤波器，各所述子滤波器的尺寸为2×2。Step S122: If the convolution stride is 2 and the size of the convolution filter is 3×3, 4 sub-filters are divided from the convolution filter, and the size of each sub-filter is 2×2.

其中第一个子滤波器包括所述卷积滤波器奇数行奇数列的权值，第二个子滤波器包括所述卷积滤波器奇数行偶数列的权值，第三个子滤波器包括所述卷积滤波器偶数行奇数列的权值，第四个子滤波器包括所述卷积滤波器偶数行偶数列的权值。The first sub-filter includes the weights at the odd rows and odd columns of the convolution filter, the second sub-filter includes the weights at the odd rows and even columns, the third sub-filter includes the weights at the even rows and odd columns, and the fourth sub-filter includes the weights at the even rows and even columns of the convolution filter.

在一些实施例中，卷积滤波器kernel(filter)的尺寸无法整除卷积步长(stride)，可以通过在卷积滤波器的预设位置进行补零以使补零后的卷积滤波器的长或宽是所述卷积步长的整数倍。在本实施例中，卷积滤波器的尺寸为3×3，卷积步长为2，卷积滤波器的尺寸无法整除卷积步长，可以通过补零操作以使补零后的卷积滤波器的长或宽是所述卷积步长的整数倍，从而使卷积滤波器可以根据预设的滤波器分割规则分割出若干子滤波器。In some embodiments, when the size of the convolution filter (kernel) is not divisible by the convolution stride, zero padding can be performed at preset positions of the convolution filter so that the length or width of the zero-padded convolution filter is an integer multiple of the convolution stride. In this embodiment, the size of the convolution filter is 3×3 and the convolution stride is 2, so the size of the convolution filter is not divisible by the convolution stride; a zero padding operation can be performed so that the length or width of the zero-padded convolution filter is an integer multiple of the convolution stride, and the convolution filter can then be divided into several sub-filters according to the preset filter segmentation rule.

具体的，如图11和图12所示，所述若所述卷积步长为2且所述卷积滤波器的尺寸为3×3，从所述卷积滤波器分割出4个子滤波器，各所述子滤波器的尺寸为2×2，具体包括：Specifically, as shown in FIG. 11 and FIG. 12, the step of dividing 4 sub-filters, each of size 2×2, from the convolution filter, if the convolution stride is 2 and the size of the convolution filter is 3×3, specifically includes:

步骤S11、将所述卷积滤波器奇数行奇数列的权值分配至第一个子滤波器。Step S11: Assign weights of odd rows and odd columns of the convolution filter to the first subfilter.

步骤S12、将所述卷积滤波器奇数行偶数列的权值分配至第二个子滤波器的第一列,并以0填充所述第二个子滤波器的第二列。Step S12: Assign the weights of the odd rows and even columns of the convolution filter to the first column of the second sub-filter, and fill the second column of the second sub-filter with 0.

步骤S13、将所述卷积滤波器偶数行奇数列的权值分配至第三个子滤波器的第一行,并以0填充所述第三个子滤波器的第二行。Step S13: Assign the weights of the even rows and odd columns of the convolution filter to the first row of the third sub-filter, and fill the second row of the third sub-filter with 0.

步骤S14、将所述卷积滤波器偶数行偶数列的权值分配至第四个子滤波器的第一行第一列,并以0填充所述第四个子滤波器的其余位置。Step S14: Assign the weights of the even rows and even columns of the convolution filter to the first row and first column of the fourth subfilter, and fill the remaining positions of the fourth subfilter with 0.
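Steps S11 to S14 are equivalent to zero-padding the 3×3 filter to 4×4 and then parity-slicing it, in the same way the feature map is sliced. An illustrative sketch verifying that the stride-2 result is unchanged (the feature map is also padded, per step S132; `conv2d` is a hypothetical helper, not from the embodiment):

```python
import numpy as np

def conv2d(x, k, stride):
    """Plain 'valid' two-dimensional convolution (cross-correlation)."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i * stride:i * stride + kh,
                               j * stride:j * stride + kw] * k)
                      for j in range(ow)] for i in range(oh)])

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
K = rng.standard_normal((3, 3))

direct = conv2d(X, K, stride=2)   # 2x2 output

Kp = np.pad(K, ((0, 1), (0, 1)))  # 3x3 filter zero-padded to 4x4 (steps S11-S14)
Xp = np.pad(X, ((0, 1), (0, 1)))  # feature map zero-padded to 6x6 (step S132)

# Four 2x2 sub-filters applied, with stride 1, to the four parity
# sub-maps, then superimposed.
split = sum(conv2d(Xp[a::2, b::2], Kp[a::2, b::2], stride=1)
            for a in (0, 1) for b in (0, 1))

assert np.allclose(direct, split)
```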

在本实施例中，如图10和图13所示，若所述卷积步长为2且所述卷积滤波器的尺寸为3×3，步骤S130获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图，具体包括：In this embodiment, as shown in FIG. 10 and FIG. 13, if the convolution stride is 2 and the size of the convolution filter is 3×3, step S130 of acquiring the feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes:

步骤S1321、将所述待卷积特征图奇数行奇数列的数值分配至第一个特征子图的相应位置。Step S1321: Allocate the values of odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature submap.

步骤S1322、将所述待卷积特征图奇数行偶数列的数值分配至第二个特征子图的相应位置。Step S1322, assign the values of odd rows and even columns of the feature map to be convolved to the corresponding positions of the second feature submap.

步骤S1323、将所述待卷积特征图偶数行奇数列的数值分配至第三个特征子图的相应位置。Step S1323: Assign the values of the even rows and odd columns of the feature map to be convolved to the corresponding positions of the third feature submap.

步骤S1324、将所述待卷积特征图偶数行偶数列的数值分配至第四个特征子图的相应位置。Step S1324: Assign the values of the even rows and even columns of the feature map to be convolved to the corresponding positions of the fourth feature submap.

卷积步长为2且所述卷积滤波器的尺寸为3×3时，根据预设的特征图分割规则从所述待卷积特征图分割出4个特征子图；如果待卷积特征图的通道数为1，则分割后的通道数为4。When the convolution stride is 2 and the size of the convolution filter is 3×3, 4 feature sub-maps are segmented from the feature map to be convolved according to the preset feature map segmentation rule; if the number of channels of the feature map to be convolved is 1, the number of channels after segmentation is 4.

在一些实施例中，如图14和图15所示，步骤S120若所述卷积步长不为1且所述卷积滤波器的尺寸大于1×1，根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器，具体包括：In some embodiments, as shown in FIG. 14 and FIG. 15, step S120 of dividing several sub-filters from the convolution filter according to a preset filter segmentation rule, if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, specifically includes:

步骤S123、若所述卷积步长为3且所述卷积滤波器的尺寸为3×3，从所述卷积滤波器分割出9个子滤波器，各所述子滤波器的尺寸为1×1且分别包括所述卷积滤波器9个权值中的一个。Step S123: If the convolution stride is 3 and the size of the convolution filter is 3×3, 9 sub-filters are divided from the convolution filter; the size of each sub-filter is 1×1, and each includes one of the 9 weights of the convolution filter.

示例性的，将卷积滤波器第一行第一列的权值分配给第一个1×1的子滤波器，将卷积滤波器第一行第二列的权值分配给第二个1×1的子滤波器，将卷积滤波器第一行第三列的权值分配给第三个1×1的子滤波器，将卷积滤波器第二行第一列的权值分配给第四个1×1的子滤波器，将卷积滤波器第二行第二列的权值分配给第五个1×1的子滤波器，以此类推。Exemplarily, the weight in the first row and first column of the convolution filter is assigned to the first 1×1 sub-filter, the weight in the first row and second column is assigned to the second 1×1 sub-filter, the weight in the first row and third column is assigned to the third 1×1 sub-filter, the weight in the second row and first column is assigned to the fourth 1×1 sub-filter, the weight in the second row and second column is assigned to the fifth 1×1 sub-filter, and so on.

在本实施例中，如图16和图17所示，若所述卷积步长为3且所述卷积滤波器的尺寸为3×3，步骤S130获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图，具体包括：In this embodiment, as shown in FIG. 16 and FIG. 17, if the convolution stride is 3 and the size of the convolution filter is 3×3, step S130 of acquiring the feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes:

步骤S1331、将所述待卷积特征图第3n+1行第3n+1列的数值分配至第一个特征子图的相应位置。其中n为自然数。Step S1331: Assign the value of the 3n+1th row and 3n+1th column of the feature map to be convolved to the corresponding position of the first feature submap. Where n is a natural number.

步骤S1332、将所述待卷积特征图第3n+1行第3n+2列的数值分配至第二个特征子图的相应位置。Step S1332, assign the value in the 3n+1th row and 3n+2th column of the feature map to be convolved to the corresponding position of the second feature submap.

步骤S1333、将所述待卷积特征图第3n+1行第3n+3列的数值分配至第三个特征子图的相应位置。Step S1333: Assign the values in the 3n+1th row and 3n+3th column of the feature map to be convolved to the corresponding position of the third feature submap.

步骤S1334、将所述待卷积特征图第3n+2行第3n+1列的数值分配至第四个特征子图的相应位置。Step S1334: Assign the value in the 3n+2th row and 3n+1th column of the feature map to be convolved to the corresponding position of the fourth feature submap.

步骤S1335、将所述待卷积特征图第3n+2行第3n+2列的数值分配至第五个特征子图的相应位置。Step S1335: Assign the values in the 3n+2th row and 3n+2th column of the feature map to be convolved to the corresponding position of the fifth feature submap.

步骤S1336、将所述待卷积特征图第3n+2行第3n+3列的数值分配至第六个特征子图的相应位置。Step S1336: Assign the value in the 3n+2th row and 3n+3th column of the feature map to be convolved to the corresponding position of the sixth feature submap.

步骤S1337、将所述待卷积特征图第3n+3行第3n+1列的数值分配至第七个特征子图的相应位置。Step S1337: Assign the values in the 3n+3th row and 3n+1th column of the feature map to be convolved to the corresponding position of the seventh feature submap.

步骤S1338、将所述待卷积特征图第3n+3行第3n+2列的数值分配至第八个特征子图的相应位置。Step S1338: Assign the value in the 3n+3th row and 3n+2th column of the feature map to be convolved to the corresponding position of the eighth feature submap.

步骤S1339、将所述待卷积特征图第3n+3行第3n+3列的数值分配至第九个特征子图的相应位置。Step S1339: Assign the value in the 3n+3th row and 3n+3th column of the feature map to be convolved to the corresponding position of the ninth feature submap.

示例性的，如图17所示，获取到的待卷积特征图的长宽均为8，不是所述卷积步长（即3）的整数倍，则对所述待卷积特征图的预设位置进行补零以使补零后的待卷积特征图的长或宽是所述卷积步长的整数倍；然后根据预设的特征图分割规则从补零后的待卷积特征图分割出9个特征子图。Exemplarily, as shown in FIG. 17, the length and width of the acquired feature map to be convolved are both 8, which is not an integer multiple of the convolution stride (namely 3); therefore, zero padding is performed at preset positions of the feature map to be convolved so that the length or width of the zero-padded feature map to be convolved is an integer multiple of the convolution stride, and then 9 feature sub-maps are segmented from the zero-padded feature map to be convolved according to the preset feature map segmentation rule.
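The stride-3 case follows the same pattern: nine 1×1 sub-filters, nine parity sub-maps, superposition. A sketch with a hypothetical 9×9 feature map (already a multiple of the stride, so no padding is needed; `conv2d` is an illustrative helper):

```python
import numpy as np

def conv2d(x, k, stride):
    """Plain 'valid' two-dimensional convolution (cross-correlation)."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i * stride:i * stride + kh,
                               j * stride:j * stride + kw] * k)
                      for j in range(ow)] for i in range(oh)])

rng = np.random.default_rng(0)
X = rng.standard_normal((9, 9))
K = rng.standard_normal((3, 3))

direct = conv2d(X, K, stride=3)  # 3x3 output

# Nine 1x1 sub-filters K[a, b] applied to the nine sub-maps X[a::3, b::3],
# each with stride 1 (i.e. a scalar multiplication), then superimposed.
split = sum(K[a, b] * X[a::3, b::3] for a in range(3) for b in range(3))

assert np.allclose(direct, split)
```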

本申请的基于脉动阵列的神经网络加速方法通过在卷积步长不为1时根据预设的滤波器分割规则从卷积滤波器分割出若干子滤波器，以及根据预设的特征图分割规则从待卷积特征图分割出若干特征子图，实现以卷积步长1执行卷积计算。该方法可以很好地适配一些特殊的专用深度网络加速器（例如FPGA、NPU等）底层都会采用的脉动阵列(Systolic Array)结构，可以节省计算资源，而且这种分割方法本身就是一种特殊的计算逻辑，可以集成进入各种深度学习的框架中。本申请提供的分割变换方法并不影响深度网络本身的前传和后传路径，而且因为节省了计算资源，实际上提升了训练和推理的速度。The systolic-array-based neural network acceleration method of the present application divides several sub-filters from the convolution filter according to the preset filter segmentation rule when the convolution stride is not 1, and segments several feature sub-maps from the feature map to be convolved according to the preset feature map segmentation rule, so that the convolution calculation is performed with a stride of 1. The method adapts well to the systolic array structure adopted at the bottom layer of some special dedicated deep network accelerators such as FPGAs and NPUs, and saves computing resources; moreover, this segmentation method is itself a special piece of computing logic that can be integrated into various deep learning frameworks. The segmentation transformation method provided in this application does not affect the forward and backward propagation paths of the deep network itself, and because it saves computing resources, it actually improves the speed of training and inference.

在一些实施例中，如图18所示为根据本申请的基于脉动阵列的神经网络加速方法对传统的深度卷积神经网络ResNet50的下采样的拓扑结构进行分割变换的示意图；箭头左侧为传统的深度卷积神经网络ResNet50的下采样的拓扑结构部分的简化模型，箭头右侧为经过分割变换等价变换后的计算拓扑结构。In some embodiments, FIG. 18 is a schematic diagram of performing the segmentation transformation on the down-sampling topology of the traditional deep convolutional neural network ResNet50 according to the systolic-array-based neural network acceleration method of the present application; the left side of the arrow is a simplified model of the down-sampling part of the topology of the traditional deep convolutional neural network ResNet50, and the right side of the arrow is the computation topology after the equivalent segmentation transformation.

相较于传统的深度卷积神经网络的计算图结构，经过分割变换等价变换后的计算拓扑结构具有如下优势：1.省去了传统ResNet50在左侧的两个1×1的映射卷积，减少计算资源。2.传统ResNet50右侧的残差分量部分，可以转换为直接恒等映射(Identity Mapping)，有利于残差的传播。本申请的基于脉动阵列的神经网络加速方法可以应用在很多网络模型中，例如DenseNet或者Shake-Shake网络等等，只要网络中存在下采样部分，都可以采用本申请提供的神经网络加速方法做变换后，再进行计算训练等。Compared with the computation graph structure of the traditional deep convolutional neural network, the computation topology after the equivalent segmentation transformation has the following advantages: 1. The two 1×1 mapping convolutions on the left side of the traditional ResNet50 are eliminated, reducing computing resources. 2. The residual component part on the right side of the traditional ResNet50 can be converted into a direct identity mapping, which is conducive to the propagation of residuals. The systolic-array-based neural network acceleration method of this application can be applied to many network models, such as DenseNet or Shake-Shake networks; as long as a network contains a down-sampling part, the neural network acceleration method provided in this application can be used to transform it before calculation and training.

请参阅图19，图19是本申请一实施例提供的基于脉动阵列的神经网络加速装置的结构示意图，该基于脉动阵列的神经网络加速装置可以配置于服务器中，用于执行前述的基于脉动阵列的神经网络加速方法。Please refer to FIG. 19, which is a schematic structural diagram of a systolic-array-based neural network acceleration device provided by an embodiment of the present application. The systolic-array-based neural network acceleration device can be configured in a server to execute the aforementioned systolic-array-based neural network acceleration method.

如图19所示,该基于脉动阵列的神经网络加速装置,包括:As shown in Figure 19, the systolic array-based neural network acceleration device includes:

卷积参数获取模块110,用于获取卷积滤波器的卷积参数,所述卷积参数包括卷积步长和所述卷积滤波器的尺寸。The convolution parameter obtaining module 110 is configured to obtain convolution parameters of a convolution filter, where the convolution parameters include a convolution step size and a size of the convolution filter.

滤波器分割模块120，用于若所述卷积步长不为1且所述卷积滤波器的尺寸大于1×1，根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器，各所述子滤波器的尺寸小于所述卷积滤波器的尺寸。The filter segmentation module 120 is configured to, if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, divide several sub-filters from the convolution filter according to a preset filter segmentation rule, where the size of each sub-filter is smaller than the size of the convolution filter.

特征图分割模块130，用于获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图，所述若干特征子图与所述若干子滤波器一一对应。The feature map segmentation module 130 is configured to acquire a feature map to be convolved and segment several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule, where the several feature sub-maps correspond to the several sub-filters one-to-one.

卷积模块140,用于基于脉动阵列,根据各所述子滤波器对各自对应的特征子图进行卷积计算,卷积计算的步长为1。The convolution module 140 is configured to perform convolution calculation on the corresponding feature sub-maps according to each of the sub-filters based on the systolic array, and the step size of the convolution calculation is 1.

叠加模块150,用于将各所述子滤波器对应的卷积计算结果进行叠加,并将叠加的结果作为所述卷积滤波器对所述待卷积特征图卷积计算的结果进行输出。The superposition module 150 is configured to superimpose the convolution calculation results corresponding to each of the sub-filters, and output the superimposition result as the convolution calculation result of the feature map to be convolved by the convolution filter.

在一些实施例中,如图20所示,特征图分割模块130包括:In some embodiments, as shown in FIG. 20, the feature map segmentation module 130 includes:

特征图获取子模块131,用于获取待卷积特征图。The feature map acquiring sub-module 131 is used to acquire the feature map to be convolved.

补零子模块132,用于若获取的待卷积特征图的长或宽不是所述卷积步长的整数倍,对所述待卷积特征图的预设位置进行补零以使补零后的待卷积特征图的长或宽是所述卷积步长的整数倍。The zero padding sub-module 132 is configured to, if the length or width of the acquired feature map to be convolved is not an integer multiple of the convolution step length, perform zero padding on the preset position of the feature map to be convolved to make zero padding The length or width of the subsequent feature map to be convolved is an integer multiple of the convolution step length.

特征图分割子模块133,用于根据预设的特征图分割规则从补零后的待卷积特征图分割出若干特征子图。The feature map segmentation submodule 133 is used to segment a number of feature submaps from the feature map to be convolved after zero padding according to preset feature map segmentation rules.

在一些实施例中,如图20所示,卷积模块140包括:In some embodiments, as shown in FIG. 20, the convolution module 140 includes:

权值加载子模块141,用于将所述子滤波器的权值加载至连接于所述脉动阵列的权值寄存器中;The weight loading sub-module 141 is used to load the weight of the sub-filter into the weight register connected to the systolic array;

子图加载子模块142,用于将所述子滤波器对应的特征子图加载至连接于所述脉动阵列的输入寄存器中。The sub-picture loading sub-module 142 is used to load the characteristic sub-picture corresponding to the sub-filter to the input register connected to the systolic array.

输出子模块143,用于获取所述脉动阵列卷积计算后的输出结果。The output sub-module 143 is used to obtain the output result of the systolic array convolution calculation.

在一些实施例中，如图20所示，滤波器分割模块120包括第一滤波器分割子模块121，用于若所述卷积步长为2且所述卷积滤波器的尺寸为2×2，从所述卷积滤波器分割出4个子滤波器，各所述子滤波器的尺寸为1×1；其中第一个子滤波器包括所述卷积滤波器奇数行奇数列的权值，第二个子滤波器包括所述卷积滤波器奇数行偶数列的权值，第三个子滤波器包括所述卷积滤波器偶数行奇数列的权值，第四个子滤波器包括所述卷积滤波器偶数行偶数列的权值。In some embodiments, as shown in FIG. 20, the filter segmentation module 120 includes a first filter segmentation sub-module 121, configured to, if the convolution stride is 2 and the size of the convolution filter is 2×2, divide 4 sub-filters from the convolution filter, each of size 1×1, where the first sub-filter includes the weight at the odd row and odd column of the convolution filter, the second sub-filter includes the weight at the odd row and even column, the third sub-filter includes the weight at the even row and odd column, and the fourth sub-filter includes the weight at the even row and even column of the convolution filter.

特征图分割模块130包括第一特征图分割子模块1301，用于若所述卷积步长为2且所述卷积滤波器的尺寸为2×2，将所述待卷积特征图奇数行奇数列的数值分配至第一个特征子图的相应位置，将所述待卷积特征图奇数行偶数列的数值分配至第二个特征子图的相应位置，将所述待卷积特征图偶数行奇数列的数值分配至第三个特征子图的相应位置，将所述待卷积特征图偶数行偶数列的数值分配至第四个特征子图的相应位置。The feature map segmentation module 130 includes a first feature map segmentation sub-module 1301, configured to, if the convolution stride is 2 and the size of the convolution filter is 2×2, assign the values at the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature sub-map, assign the values at the odd rows and even columns to the corresponding positions of the second feature sub-map, assign the values at the even rows and odd columns to the corresponding positions of the third feature sub-map, and assign the values at the even rows and even columns to the corresponding positions of the fourth feature sub-map.

In some embodiments, as shown in FIG. 20, the filter segmentation module 120 includes a second filter segmentation sub-module 122, configured to, if the convolution stride is 2 and the size of the convolution filter is 3×3, divide the convolution filter into four sub-filters, each of size 2×2; the first sub-filter includes the weights of the convolution filter in the odd rows and odd columns, the second sub-filter includes the weights in the odd rows and even columns, the third sub-filter includes the weights in the even rows and odd columns, and the fourth sub-filter includes the weight in the even row and even column.

The feature map segmentation module 130 includes a second feature map segmentation sub-module 1302, configured to, if the convolution stride is 2 and the size of the convolution filter is 3×3, assign the values in the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature sub-map, the values in the odd rows and even columns to the corresponding positions of the second feature sub-map, the values in the even rows and odd columns to the corresponding positions of the third feature sub-map, and the values in the even rows and even columns to the corresponding positions of the fourth feature sub-map.
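In this 3×3, stride-2 case each parity class contains a different number of weights (4, 2, 2, and 1), so the 2×2 sub-filters are zero-padded in the positions that have no source weight, as the claims detail. The following NumPy sketch (illustrative; not the hardware implementation) shows that the four stride-1 2×2 convolutions, summed, equal the original stride-2 3×3 convolution.

```python
import numpy as np

def direct_conv(x, w, stride):
    """Reference: plain valid convolution with the given stride."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * w)
                      for j in range(ow)] for i in range(oh)])

def split_conv_s2_k3(x, w):
    """Stride-2 3x3 convolution rebuilt from four stride-1 2x2 convolutions.

    Each 2x2 sub-filter keeps the weights of one (row, column) parity
    class; positions with no corresponding weight are filled with 0.
    """
    assert x.shape[0] % 2 == 0 and x.shape[1] % 2 == 0, "zero-pad the map first"
    sub_filters = [
        np.array([[w[0, 0], w[0, 2]], [w[2, 0], w[2, 2]]]),  # odd rows, odd cols
        np.array([[w[0, 1], 0.0], [w[2, 1], 0.0]]),          # odd rows, even cols
        np.array([[w[1, 0], w[1, 2]], [0.0, 0.0]]),          # even rows, odd cols
        np.array([[w[1, 1], 0.0], [0.0, 0.0]]),              # even rows, even cols
    ]
    sub_maps = [x[0::2, 0::2], x[0::2, 1::2], x[1::2, 0::2], x[1::2, 1::2]]
    return sum(direct_conv(m, f, stride=1) for m, f in zip(sub_maps, sub_filters))
```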

In some embodiments, as shown in FIG. 20, the filter segmentation module 120 includes a third filter segmentation sub-module 123, configured to, if the convolution stride is 3 and the size of the convolution filter is 3×3, divide the convolution filter into nine sub-filters, each of size 1×1 and each including one of the nine weights of the convolution filter.

The feature map segmentation module 130 includes a third feature map segmentation sub-module 1303, configured to, if the convolution stride is 3 and the size of the convolution filter is 3×3, assign the values in rows 3n+1, columns 3n+1 of the feature map to be convolved to the corresponding positions of the first feature sub-map, the values in rows 3n+1, columns 3n+2 to the second feature sub-map, the values in rows 3n+1, columns 3n+3 to the third feature sub-map, the values in rows 3n+2, columns 3n+1 to the fourth feature sub-map, the values in rows 3n+2, columns 3n+2 to the fifth feature sub-map, the values in rows 3n+2, columns 3n+3 to the sixth feature sub-map, the values in rows 3n+3, columns 3n+1 to the seventh feature sub-map, the values in rows 3n+3, columns 3n+2 to the eighth feature sub-map, and the values in rows 3n+3, columns 3n+3 to the ninth feature sub-map, where n is a natural number.
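As in the 2×2, stride-2 case, the nine 1×1 sub-filters reduce to scalar multiplies over the nine offset sub-maps. A minimal NumPy sketch (illustrative only), where the 0-based offset p corresponds to the 1-based rows 3n+1+p of the embodiment:

```python
import numpy as np

def direct_conv(x, w, stride):
    """Reference: plain valid convolution with the given stride."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * w)
                      for j in range(ow)] for i in range(oh)])

def split_conv_s3_k3(x, w):
    """Stride-3 3x3 convolution rebuilt from nine stride-1 1x1 convolutions."""
    h, wd = x.shape
    assert h % 3 == 0 and wd % 3 == 0, "zero-pad the map to a multiple of 3 first"
    out = np.zeros((h // 3, wd // 3))
    for p in range(3):                       # row offset within each 3x3 tile
        for q in range(3):                   # column offset
            out += w[p, q] * x[p::3, q::3]   # 1x1 sub-filter => scalar multiply
    return out
```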

It should be noted that, as those skilled in the art can clearly understand, for convenience and brevity of description, the specific working processes of the apparatus, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.

The method and apparatus of the present application may be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.

Exemplarily, the foregoing method and apparatus may be implemented in the form of a computer program that can run on the computer device shown in FIG. 21.

Referring to FIG. 21, FIG. 21 is a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device may be a server or a terminal.

As shown in FIG. 21, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.

The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any of the systolic-array-based neural network acceleration methods.

The processor provides computing and control capabilities and supports the operation of the entire computer device.

The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any of the systolic-array-based neural network acceleration methods.

The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown is merely a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.

In one embodiment, the processor is configured to run a computer program stored in the memory to implement the steps of the foregoing systolic-array-based neural network acceleration method.

Exemplarily, the processor is configured to run a computer program stored in the memory to implement the following steps:

acquiring convolution parameters of a convolution filter, the convolution parameters including a convolution stride and the size of the convolution filter;

if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, dividing the convolution filter into a number of sub-filters according to a preset filter segmentation rule, each sub-filter being smaller than the convolution filter;

acquiring a feature map to be convolved and dividing it into a number of feature sub-maps according to a preset feature map segmentation rule, the feature sub-maps corresponding one-to-one to the sub-filters;

performing, on the systolic array, a convolution of each sub-filter with its corresponding feature sub-map, the stride of the convolution being 1;

superimposing the convolution results corresponding to the sub-filters, and outputting the superimposed result as the result of convolving the feature map to be convolved with the convolution filter.
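The steps above compose into a single equivalence: one stride-s convolution becomes s×s stride-1 convolutions whose partial results are summed. The following NumPy sketch of the whole flow is illustrative only; it models the arithmetic rather than the systolic-array dataflow, assumes the feature map has already been zero-padded to a multiple of the stride, and generalizes the pattern that the embodiments instantiate for the three stated cases.

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Valid convolution; with stride=1 this is the systolic-array primitive."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * w)
                      for j in range(ow)] for i in range(oh)])

def strided_conv_via_split(x, w, s):
    """Split the filter and the feature map by (row, column) offset modulo
    the stride, run s*s stride-1 convolutions, and superimpose the results."""
    k = w.shape[0]
    assert x.shape[0] % s == 0 and x.shape[1] % s == 0, "zero-pad the map first"
    ksub = -(-k // s)                        # ceil(k / s): sub-filter size
    out = None
    for p in range(s):
        for q in range(s):
            f = np.zeros((ksub, ksub))       # zero-padded sub-filter
            part_w = w[p::s, q::s]           # weights of this offset class
            f[:part_w.shape[0], :part_w.shape[1]] = part_w
            part = conv2d(x[p::s, q::s], f, stride=1)
            out = part if out is None else out + part
    return out
```

The three cases in the embodiments (stride 2 with a 2×2 or 3×3 filter, stride 3 with a 3×3 filter) are instances of this pattern.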

From the description of the foregoing implementations, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or parts thereof, such as:

a computer-readable storage medium storing a computer program, the computer program including program instructions, where a processor executes the program instructions to implement any of the systolic-array-based neural network acceleration methods provided by the embodiments of the present application.

The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device.

The foregoing are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

一种基于脉动阵列的神经网络加速方法,包括:A neural network acceleration method based on systolic array, including: 获取卷积滤波器的卷积参数,所述卷积参数包括卷积步长和所述卷积滤波器的尺寸;Acquiring a convolution parameter of a convolution filter, where the convolution parameter includes a convolution step size and the size of the convolution filter; 若所述卷积步长不为1且所述卷积滤波器的尺寸大于1×1,根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器,各所述子滤波器的尺寸小于所述卷积滤波器的尺寸;If the convolution step length is not 1 and the size of the convolution filter is greater than 1×1, a number of sub-filters are segmented from the convolution filter according to a preset filter segmentation rule, and each sub-filter is The size of the filter is smaller than the size of the convolution filter; 获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图,所述若干特征子图与所述若干子滤波器一一对应;Acquiring a feature map to be convolved, and segmenting a number of feature submaps from the feature map to be convolved according to a preset feature map segmentation rule, the plurality of feature submaps corresponding to the plurality of subfilters one-to-one; 基于脉动阵列,根据各所述子滤波器对各自对应的特征子图进行卷积计算,卷积计算的步长为1;Based on the systolic array, perform convolution calculation on the corresponding feature submap according to each of the subfilters, and the step size of the convolution calculation is 1; 将各所述子滤波器对应的卷积计算结果进行叠加,并将叠加的结果作为所述卷积滤波器对所述待卷积特征图卷积计算的结果进行输出。The convolution calculation results corresponding to each of the subfilters are superimposed, and the superimposed result is output as the convolution calculation result of the feature map to be convolved by the convolution filter. 如权利要求1所述的神经网络加速方法,其中,所述获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图,具体包括:8. 
The neural network acceleration method according to claim 1, wherein said obtaining the feature map to be convolved and segmenting a number of feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes: 获取待卷积特征图;Obtain the feature map to be convolved; 若获取的待卷积特征图的长或宽不是所述卷积步长的整数倍,对所述待卷积特征图的预设位置进行补零以使补零后的待卷积特征图的长或宽是所述卷积步长的整数倍;If the length or width of the acquired feature map to be convolved is not an integer multiple of the convolution step length, zero padding is performed on the preset position of the feature map to be convolved to make the zero padding feature map to be convolved The length or width is an integer multiple of the convolution step length; 根据预设的特征图分割规则从补零后的待卷积特征图分割出若干特征子图。According to preset feature map segmentation rules, a number of feature sub-maps are segmented from the zero-filled feature map to be convolved. 如权利要求2所述的神经网络加速方法,其中,所述若所述卷积步长不为1且所述卷积滤波器的尺寸大于1,根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器,具体包括:The neural network acceleration method according to claim 2, wherein, if the convolution step size is not 1 and the size of the convolution filter is greater than 1, according to a preset filter segmentation rule from the convolution The product filter divides into several sub-filters, including: 若所述卷积步长为2且所述卷积滤波器的尺寸为2×2,从所述卷积滤波器分割出4个子滤波器,各所述子滤波器的尺寸为1×1;其中第一个子滤波器包括所述卷积滤波器奇数行奇数列的权值,第二个子滤波器包括所述卷积滤波器奇数行偶数列的权值,第三个子滤波器包括所述卷积滤波器偶数行奇数列的权值,第四个子滤波器包括所述卷积滤波器偶数行偶数列的权值;If the convolution step length is 2 and the size of the convolution filter is 2×2, 4 sub-filters are divided from the convolution filter, and the size of each sub-filter is 1×1; Wherein the first subfilter includes the weights of odd rows and odd columns of the convolution filter, the second subfilter includes the weights of odd rows and even columns of the convolution filter, and the third subfilter includes the weights of the odd rows and even columns of the convolution filter. 
Weights of even rows and odd columns of a convolution filter, and the fourth subfilter includes weights of even rows and even columns of the convolution filter; 若所述卷积步长为2且所述卷积滤波器的尺寸为2×2,所述获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图,具体包括:If the convolution step size is 2 and the size of the convolution filter is 2×2, the acquiring a feature map to be convolved and segmenting it from the feature map to be convolved according to a preset feature map segmentation rule Several feature subgraphs, including: 将所述待卷积特征图奇数行奇数列的数值分配至第一个特征子图的相应位置,将所述待卷积特征图奇数行偶数列的数值分配至第二个特征子图的相应位置,将所述待卷积特征图偶数行奇数列的数值分配至第三个特征子图的相应位置,将所述待卷积特征图偶数行偶数列的数值分配至第四个特征子图的相应位置。Assign the values of odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature submap, and assign the values of odd rows and even columns of the feature map to be convolved to the corresponding positions of the second feature submap Position, assign the values of the even rows and odd columns of the feature map to be convolved to the corresponding positions of the third feature submap, and assign the values of the even rows and even columns of the feature map to be convolved to the fourth feature submap Corresponding position. 
如权利要求2所述的神经网络加速方法,其中,所述若所述卷积步长不为1且所述卷积滤波器的尺寸大于1,根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器,具体包括:The neural network acceleration method according to claim 2, wherein, if the convolution step size is not 1 and the size of the convolution filter is greater than 1, according to a preset filter segmentation rule from the convolution The product filter divides into several sub-filters, including: 若所述卷积步长为2且所述卷积滤波器的尺寸为3×3,从所述卷积滤波器分割出4个子滤波器,各所述子滤波器的尺寸为2×2;其中第一个子滤波器包括所述卷积滤波器奇数行奇数列的权值,第二个子滤波器包括所述卷积滤波器奇数行偶数列的权值,第三个子滤波器包括所述卷积滤波器偶数行奇数列的权值,第四个子滤波器包括所述卷积滤波器偶数行偶数列的权值;If the convolution step length is 2 and the size of the convolution filter is 3×3, 4 sub-filters are divided from the convolution filter, and the size of each sub-filter is 2×2; Wherein the first subfilter includes the weights of odd rows and odd columns of the convolution filter, the second subfilter includes the weights of odd rows and even columns of the convolution filter, and the third subfilter includes the weights of the odd rows and even columns of the convolution filter. 
Weights of even rows and odd columns of a convolution filter, and the fourth subfilter includes weights of even rows and even columns of the convolution filter; 若所述卷积步长为2且所述卷积滤波器的尺寸为3×3,所述获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图,具体包括:If the convolution step size is 2 and the size of the convolution filter is 3×3, the acquiring feature map to be convolved and segmenting it from the feature map to be convolved according to a preset feature map segmentation rule Several feature subgraphs, including: 将所述待卷积特征图奇数行奇数列的数值分配至第一个特征子图的相应位置,将所述待卷积特征图奇数行偶数列的数值分配至第二个特征子图的相应位置,将所述待卷积特征图偶数行奇数列的数值分配至第三个特征子图的相应位置,将所述待卷积特征图偶数行偶数列的数值分配至第四个特征子图的相应位置。Assign the values of odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature submap, and assign the values of odd rows and even columns of the feature map to be convolved to the corresponding positions of the second feature submap Position, assign the values of the even rows and odd columns of the feature map to be convolved to the corresponding positions of the third feature submap, and assign the values of the even rows and even columns of the feature map to be convolved to the fourth feature submap Corresponding position. 
如权利要求4所述的神经网络加速方法,其中,所述若所述卷积步长为2且所述卷积 滤波器的尺寸为3×3,从所述卷积滤波器分割出4个子滤波器,各所述子滤波器的尺寸为2×2,具体包括:The neural network acceleration method according to claim 4, wherein, if the convolution step size is 2 and the size of the convolution filter is 3×3, 4 sub-convolution filters are divided The filter, the size of each sub-filter is 2×2, and specifically includes: 将所述卷积滤波器奇数行奇数列的权值分配至第一个子滤波器;Allocating weights of odd rows and odd columns of the convolution filter to the first subfilter; 将所述卷积滤波器奇数行偶数列的权值分配至第二个子滤波器的第一列,并以0填充所述第二个子滤波器的第二列;Assign the weights of the odd rows and even columns of the convolution filter to the first column of the second subfilter, and fill the second column of the second subfilter with 0; 将所述卷积滤波器偶数行奇数列的权值分配至第三个子滤波器的第一行,并以0填充所述第三个子滤波器的第二行;Assign the weights of the even rows and odd columns of the convolution filter to the first row of the third subfilter, and fill the second row of the third subfilter with 0; 将所述卷积滤波器偶数行偶数列的权值分配至第四个子滤波器的第一行第一列,并以0填充所述第四个子滤波器的其余位置。Assign the weights of the even rows and even columns of the convolution filter to the first row and first column of the fourth subfilter, and fill the remaining positions of the fourth subfilter with 0. 
如权利要求2所述的神经网络加速方法,其中,所述若所述卷积步长不为1且所述卷积滤波器的尺寸大于1,根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器,具体包括:The neural network acceleration method according to claim 2, wherein, if the convolution step size is not 1 and the size of the convolution filter is greater than 1, according to a preset filter segmentation rule from the convolution The product filter divides into several sub-filters, including: 若所述卷积步长为3且所述卷积滤波器的尺寸为3×3,从所述卷积滤波器分割出9个子滤波器,各所述子滤波器的尺寸为1×1且分别包括所述卷积滤波器9个权值中的一个;If the convolution step length is 3 and the size of the convolution filter is 3×3, 9 sub-filters are divided from the convolution filter, and the size of each sub-filter is 1×1 and Each includes one of the 9 weights of the convolution filter; 若所述卷积步长为3且所述卷积滤波器的尺寸为3×3,所述获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图,具体包括:If the convolution step length is 3 and the size of the convolution filter is 3×3, the acquiring the feature map to be convolved and segmenting it from the feature map to be convolved according to a preset feature map segmentation rule Several feature subgraphs, including: 将所述待卷积特征图第3n+1行第3n+1列的数值分配至第一个特征子图的相应位置,将所述待卷积特征图第3n+1行第3n+2列的数值分配至第二个特征子图的相应位置,将所述待卷积特征图第3n+1行第3n+3列的数值分配至第三个特征子图的相应位置,将所述待卷积特征图第3n+2行第3n+1列的数值分配至第四个特征子图的相应位置,将所述待卷积特征图第3n+2行第3n+2列的数值分配至第五个特征子图的相应位置,将所述待卷积特征图第3n+2行第3n+3列的数值分配至第六个特征子图的相应位置,将所述待卷积特征图第3n+3行第3n+1列的数值分配至第七个特征子图的相应位置,将所述待卷积特征图第3n+3行第3n+2列的数值分配至第八个特征子图的相应位置,将所述待卷积特征图第3n+3行第3n+3列的数值分配至第九个特征子图的相应位置,其中n为自然数。Assign the value in row 3n+1 and column 3n+1 of the feature map to be convolved to the corresponding position of the first feature submap, and assign the value in row 3n+1 and column 3n+2 of the feature map to be convolved The value of is assigned to the corresponding position of the second feature submap, and the value of row 3n+1 and column 3n+3 of the feature map to be convolved is assigned to the corresponding position of the third feature submap. 
The value in row 3n+2, column 3n+1 of the convolution feature map is allocated to the corresponding position of the fourth feature submap, and the value in row 3n+2, column 3n+2 of the feature map to be convolved is allocated to Assign the value in row 3n+2 and column 3n+3 of the feature map to be convolved to the corresponding position of the fifth feature submap to the corresponding location of the feature map to be convolved. The value in row 3n+3 and column 3n+1 is allocated to the corresponding position of the seventh feature submap, and the value in row 3n+3 and column 3n+2 of the feature map to be convolved is allocated to the eighth feature For the corresponding position of the sub-image, the value in the 3n+3 row and 3n+3 column of the feature map to be convolved is assigned to the corresponding position of the ninth feature sub-image, where n is a natural number. 如权利要求1-6中任一项所述的神经网络加速方法,其中:所述基于脉动阵列,根据各所述子滤波器对各自对应的特征子图进行卷积计算,具体包括:7. The neural network acceleration method according to any one of claims 1 to 6, wherein: the systolic array is used to perform convolution calculation on the corresponding feature submap according to each of the subfilters, which specifically includes: 将所述子滤波器的权值加载至连接于所述脉动阵列的权值寄存器中;Loading the weight of the sub-filter into the weight register connected to the systolic array; 将所述子滤波器对应的特征子图加载至连接于所述脉动阵列的输入寄存器中;Loading the feature sub-map corresponding to the sub-filter into the input register connected to the systolic array; 获取所述脉动阵列卷积计算后的输出结果。Obtain the output result of the systolic array convolution calculation. 
一种基于脉动阵列的神经网络加速装置,包括:A neural network acceleration device based on a systolic array, including: 卷积参数获取模块,用于获取卷积滤波器的卷积参数,所述卷积参数包括卷积步长和所述卷积滤波器的尺寸;A convolution parameter acquisition module, configured to acquire convolution parameters of a convolution filter, where the convolution parameters include a convolution step size and a size of the convolution filter; 滤波器分割模块,用于若所述卷积步长不为1且所述卷积滤波器的尺寸大于1×1,根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器,各所述子滤波器的尺寸小于所述卷积滤波器的尺寸;The filter segmentation module is configured to: if the convolution step size is not 1 and the size of the convolution filter is greater than 1×1, segment the convolution filter according to a preset filter segmentation rule. Filter, the size of each of the sub-filters is smaller than the size of the convolution filter; 特征图分割模块,用于获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图,所述若干特征子图与所述若干子滤波器一一对应;The feature map segmentation module is used to obtain the feature map to be convolved and segment the feature map to be convolved according to a preset feature map segmentation rule into several feature sub-images, the several feature sub-images and the several sub-filters One-to-one correspondence 卷积模块,用于基于脉动阵列,根据各所述子滤波器对各自对应的特征子图进行卷积计算,卷积计算的步长为1;The convolution module is configured to perform convolution calculation on the corresponding feature sub-maps according to each of the sub-filters based on the systolic array, and the step size of the convolution calculation is 1; 叠加模块,用于将各所述子滤波器对应的卷积计算结果进行叠加,并将叠加的结果作为所述卷积滤波器对所述待卷积特征图卷积计算的结果进行输出。The superposition module is configured to superimpose the convolution calculation results corresponding to each of the sub-filters, and output the superposition result as the convolution calculation result of the feature map to be convolved by the convolution filter. 
一种计算机设备,所述计算机设备包括存储器和处理器;A computer device including a memory and a processor; 所述存储器用于存储计算机程序;The memory is used to store computer programs; 所述处理器,用于执行所述计算机程序并在执行所述计算机程序时实现如下步骤:The processor is configured to execute the computer program and implement the following steps when executing the computer program: 获取卷积滤波器的卷积参数,所述卷积参数包括卷积步长和所述卷积滤波器的尺寸;Acquiring a convolution parameter of a convolution filter, where the convolution parameter includes a convolution step size and the size of the convolution filter; 若所述卷积步长不为1且所述卷积滤波器的尺寸大于1×1,根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器,各所述子滤波器的尺寸小于所述卷积滤波器的尺寸;If the convolution step length is not 1 and the size of the convolution filter is greater than 1×1, a number of sub-filters are segmented from the convolution filter according to a preset filter segmentation rule, and each sub-filter is The size of the filter is smaller than the size of the convolution filter; 获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图,所述若干特征子图与所述若干子滤波器一一对应;Acquiring a feature map to be convolved, and segmenting a number of feature submaps from the feature map to be convolved according to a preset feature map segmentation rule, the plurality of feature submaps corresponding to the plurality of subfilters one-to-one; 基于脉动阵列,根据各所述子滤波器对各自对应的特征子图进行卷积计算,卷积计算的步长为1;Based on the systolic array, perform convolution calculation on the corresponding feature submap according to each of the subfilters, and the step size of the convolution calculation is 1; 将各所述子滤波器对应的卷积计算结果进行叠加,并将叠加的结果作为所述卷积滤波器对所述待卷积特征图卷积计算的结果进行输出。The convolution calculation results corresponding to each of the subfilters are superimposed, and the superimposed result is output as the convolution calculation result of the feature map to be convolved by the convolution filter. 
如权利要求9所述的计算机设备,其中,所述处理器在实现所述获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图时,用于实现如下步骤:The computer device according to claim 9, wherein the processor implements the acquisition of the feature map to be convolved and the segmentation of several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule , Used to implement the following steps: 获取待卷积特征图;Obtain the feature map to be convolved; 若获取的待卷积特征图的长或宽不是所述卷积步长的整数倍,对所述待卷积特征图的预设位置进行补零以使补零后的待卷积特征图的长或宽是所述卷积步长的整数倍;If the length or width of the acquired feature map to be convolved is not an integer multiple of the convolution step length, zero padding is performed on the preset position of the feature map to be convolved to make the zero padding feature map to be convolved The length or width is an integer multiple of the convolution step length; 根据预设的特征图分割规则从补零后的待卷积特征图分割出若干特征子图。According to preset feature map segmentation rules, a number of feature sub-maps are segmented from the zero-filled feature map to be convolved. 
如权利要求10所述的计算机设备,其中,所述处理器在实现所述若所述卷积步长不为1且所述卷积滤波器的尺寸大于1,根据预设的滤波器分割规则从所述卷积滤波器分割出若干子滤波器时,用于实现如下步骤:The computer device according to claim 10, wherein the processor implements that if the convolution step size is not 1 and the size of the convolution filter is greater than 1, according to a preset filter division rule When several sub-filters are segmented from the convolution filter, it is used to implement the following steps: 若所述卷积步长为2且所述卷积滤波器的尺寸为2×2,从所述卷积滤波器分割出4个子滤波器,各所述子滤波器的尺寸为1×1;其中第一个子滤波器包括所述卷积滤波器奇数行奇数列的权值,第二个子滤波器包括所述卷积滤波器奇数行偶数列的权值,第三个子滤波器包括所述卷积滤波器偶数行奇数列的权值,第四个子滤波器包括所述卷积滤波器偶数行偶数列的权值;If the convolution step length is 2 and the size of the convolution filter is 2×2, 4 sub-filters are divided from the convolution filter, and the size of each sub-filter is 1×1; Wherein the first subfilter includes the weights of odd rows and odd columns of the convolution filter, the second subfilter includes the weights of odd rows and even columns of the convolution filter, and the third subfilter includes the weights of the odd rows and even columns of the convolution filter. 
Weights of even rows and odd columns of a convolution filter, and the fourth subfilter includes weights of even rows and even columns of the convolution filter; 若所述卷积步长为2且所述卷积滤波器的尺寸为2×2,所述获取待卷积特征图并根据预设的特征图分割规则从所述待卷积特征图分割出若干特征子图,具体包括:If the convolution step size is 2 and the size of the convolution filter is 2×2, the acquiring a feature map to be convolved and segmenting it from the feature map to be convolved according to a preset feature map segmentation rule Several feature subgraphs, including: 将所述待卷积特征图奇数行奇数列的数值分配至第一个特征子图的相应位置,将所述待卷积特征图奇数行偶数列的数值分配至第二个特征子图的相应位置,将所述待卷积特征图偶数行奇数列的数值分配至第三个特征子图的相应位置,将所述待卷积特征图偶数行偶数列的数值分配至第四个特征子图的相应位置。Assign the values of odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature submap, and assign the values of odd rows and even columns of the feature map to be convolved to the corresponding positions of the second feature submap Position, assign the values of the even rows and odd columns of the feature map to be convolved to the corresponding positions of the third feature submap, and assign the values of the even rows and even columns of the feature map to be convolved to the fourth feature submap Corresponding position. 
The computer device according to claim 10, wherein, when implementing the segmenting of several sub-filters from the convolution filter according to a preset filter segmentation rule if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, the processor is configured to implement the following steps:

if the convolution stride is 2 and the size of the convolution filter is 3×3, segmenting 4 sub-filters from the convolution filter, the size of each sub-filter being 2×2, wherein the first sub-filter includes the weights of the odd rows and odd columns of the convolution filter, the second sub-filter includes the weights of the odd rows and even columns of the convolution filter, the third sub-filter includes the weights of the even rows and odd columns of the convolution filter, and the fourth sub-filter includes the weights of the even rows and even columns of the convolution filter;

if the convolution stride is 2 and the size of the convolution filter is 3×3, the acquiring a feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes: assigning the values of the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature sub-map, assigning the values of the odd rows and even columns of the feature map to be convolved to the corresponding positions of the second feature sub-map, assigning the values of the even rows and odd columns of the feature map to be convolved to the corresponding positions of the third feature sub-map, and assigning the values of the even rows and even columns of the feature map to be convolved to the corresponding positions of the fourth feature sub-map.
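For the stride-2, 3×3 case, the four parity classes of a 3×3 filter contain 4, 2, 2 and 1 weights respectively, so each sub-filter can be zero-padded to the uniform 2×2 size stated in the claim. The NumPy sketch below is our own reference model under that zero-padding assumption (not the hardware implementation, and the function names are illustrative); it checks that the four padded stride-1 convolutions superimpose to the original stride-2 result.

```python
import numpy as np

def strided_conv2d(x, w, stride):
    """Reference 2-D convolution (cross-correlation) with the given stride."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * w)
                      for j in range(ow)] for i in range(oh)])

def split_conv_3x3_stride2(x, w):
    """Four 2x2 sub-filters (zero-padded parity classes of the 3x3 filter)
    convolved at stride 1 with the four parity sub-maps, then superimposed."""
    # Pad the feature map by one row/column of zeros so that all four
    # parity sub-maps have the same size; the padded cells only ever meet
    # zero-padded filter weights, so the result is unchanged.
    xp = np.zeros((x.shape[0] + 1, x.shape[1] + 1))
    xp[:x.shape[0], :x.shape[1]] = x
    out = 0
    for r in (0, 1):        # row parity: 0 -> odd rows (1-indexed), 1 -> even rows
        for c in (0, 1):    # column parity
            block = w[r::2, c::2]                 # weights of this parity class
            sub_w = np.zeros((2, 2))
            sub_w[:block.shape[0], :block.shape[1]] = block   # zero-pad to 2x2
            sub_x = xp[r::2, c::2]                # matching parity sub-map
            out = out + strided_conv2d(sub_x, sub_w, stride=1)
    return out

x = np.random.default_rng(0).random((7, 7))
w = np.random.default_rng(1).random((3, 3))
assert np.allclose(split_conv_3x3_stride2(x, w), strided_conv2d(x, w, stride=2))
```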
The computer device according to claim 10, wherein, when implementing the segmenting of several sub-filters from the convolution filter according to a preset filter segmentation rule if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, the processor is configured to implement the following steps:

if the convolution stride is 3 and the size of the convolution filter is 3×3, segmenting 9 sub-filters from the convolution filter, each sub-filter having a size of 1×1 and each including one of the 9 weights of the convolution filter;

if the convolution stride is 3 and the size of the convolution filter is 3×3, the acquiring a feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes: assigning the values in rows 3n+1 and columns 3n+1 of the feature map to be convolved to the corresponding positions of the first feature sub-map, the values in rows 3n+1 and columns 3n+2 to the corresponding positions of the second feature sub-map, the values in rows 3n+1 and columns 3n+3 to the corresponding positions of the third feature sub-map, the values in rows 3n+2 and columns 3n+1 to the corresponding positions of the fourth feature sub-map, the values in rows 3n+2 and columns 3n+2 to the corresponding positions of the fifth feature sub-map, the values in rows 3n+2 and columns 3n+3 to the corresponding positions of the sixth feature sub-map, the values in rows 3n+3 and columns 3n+1 to the corresponding positions of the seventh feature sub-map, the values in rows 3n+3 and columns 3n+2 to the corresponding positions of the eighth feature sub-map, and the values in rows 3n+3 and columns 3n+3 to the corresponding positions of the ninth feature sub-map, where n is a natural number.

The computer device according to any one of claims 9-13, wherein, when implementing the convolution calculation on each corresponding feature sub-map according to each sub-filter based on the systolic array, the processor is configured to implement the following steps: loading the weights of the sub-filter into a weight register connected to the systolic array; loading the feature sub-map corresponding to the sub-filter into an input register connected to the systolic array; and obtaining the output result of the convolution calculation of the systolic array.
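The stride-3, 3×3 decomposition admits a particularly compact software check, since each of the nine sub-filters is a single weight. The sketch below is an illustrative NumPy reference model of our own (not the claimed hardware): sub-map (r, c) keeps the rows 3n+r+1 and columns 3n+c+1 of the feature map (1-indexed, as in the claims), and a 1×1 stride-1 convolution is simply a scaling of that sub-map.

```python
import numpy as np

def strided_conv2d(x, w, stride):
    """Reference 2-D convolution (cross-correlation) with the given stride."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * w)
                      for j in range(ow)] for i in range(oh)])

def split_conv_3x3_stride3(x, w):
    """Nine 1x1 sub-filters, one per residue class (r, c) mod 3; each stride-1
    convolution is a scaling of the matching sub-map, and the nine scaled
    sub-maps are superimposed."""
    out = 0
    for r in range(3):
        for c in range(3):
            out = out + w[r, c] * x[r::3, c::3]
    return out

x = np.arange(81, dtype=float).reshape(9, 9)   # toy 9x9 feature map
w = np.arange(9, dtype=float).reshape(3, 3)    # toy 3x3 filter
assert np.allclose(split_conv_3x3_stride3(x, w), strided_conv2d(x, w, stride=3))
```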
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:

acquiring convolution parameters of a convolution filter, the convolution parameters including a convolution stride and the size of the convolution filter;

if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, segmenting several sub-filters from the convolution filter according to a preset filter segmentation rule, the size of each sub-filter being smaller than the size of the convolution filter;

acquiring a feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule, the several feature sub-maps corresponding one-to-one to the several sub-filters;

performing, based on a systolic array, a convolution calculation on each corresponding feature sub-map according to each sub-filter, the stride of the convolution calculation being 1; and

superimposing the convolution calculation results corresponding to the sub-filters, and outputting the superimposed result as the result of the convolution calculation of the feature map to be convolved by the convolution filter.
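The five steps above can be condensed into a single software reference model for an arbitrary stride s: one sub-filter and one feature sub-map per residue class (r, c) mod s, a stride-1 convolution on each pair, and a superposition of the results. The NumPy sketch below is our own generalization for illustration (the patent only spells out the specific stride-2 and stride-3 cases), and it assumes square inputs and filters for brevity.

```python
import numpy as np

def strided_conv2d(x, w, stride):
    """Reference stride-s convolution (cross-correlation, no padding)."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * w)
                      for j in range(ow)] for i in range(oh)])

def decomposed_conv2d(x, w, s):
    """Stride-s convolution rebuilt from stride-1 convolutions, following the
    claimed method: sub-filter w[r::s, c::s] paired with sub-map x[r::s, c::s]
    for each residue class (r, c) mod s, results cropped and superimposed."""
    n, k = x.shape[0], w.shape[0]
    m = (n - k) // s + 1                 # output size of the stride-s convolution
    xp = np.zeros((n + s, n + s))        # safety zero border so every sub-map
    xp[:n, :n] = x                       # is large enough for its sub-filter
    out = np.zeros((m, m))
    for r in range(s):
        for c in range(s):
            sub_w = w[r::s, c::s]        # sub-filter for this residue class
            sub_x = xp[r::s, c::s]       # matching feature sub-map
            out += strided_conv2d(sub_x, sub_w, stride=1)[:m, :m]
    return out

rng = np.random.default_rng(42)
for n, k, s in [(6, 2, 2), (7, 3, 2), (9, 3, 3)]:
    x, w = rng.random((n, n)), rng.random((k, k))
    assert np.allclose(decomposed_conv2d(x, w, s), strided_conv2d(x, w, s))
```

Because every stride-s output element reads each input residue class through exactly one sub-filter, the superposition is exact, which is what allows a fixed stride-1 systolic array to serve all strides.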
The storage medium according to claim 15, wherein, when implementing the acquiring of a feature map to be convolved and the segmenting of several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule, the processor is configured to implement the following steps:

acquiring a feature map to be convolved;

if the length or width of the acquired feature map to be convolved is not an integer multiple of the convolution stride, zero-padding preset positions of the feature map to be convolved so that the length or width of the zero-padded feature map to be convolved becomes an integer multiple of the convolution stride; and

segmenting several feature sub-maps from the zero-padded feature map to be convolved according to the preset feature map segmentation rule.
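A minimal sketch of the padding step, assuming the "preset positions" are the bottom and right edges (the claim does not fix where the zeros go, so this placement is our illustrative choice):

```python
import numpy as np

def pad_to_stride_multiple(x, s):
    """Zero-pad the bottom/right of the feature map so that height and width
    become integer multiples of the stride s."""
    h, w = x.shape
    ph = (-h) % s    # rows of zeros needed
    pw = (-w) % s    # columns of zeros needed
    return np.pad(x, ((0, ph), (0, pw)))

x = np.ones((5, 7))
y = pad_to_stride_multiple(x, 2)
assert y.shape == (6, 8)
assert np.all(y[:5, :7] == x)      # original values are untouched
```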
The storage medium according to claim 16, wherein, when implementing the segmenting of several sub-filters from the convolution filter according to a preset filter segmentation rule if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, the processor is configured to implement the following steps:

if the convolution stride is 2 and the size of the convolution filter is 2×2, segmenting 4 sub-filters from the convolution filter, the size of each sub-filter being 1×1, wherein the first sub-filter includes the weight of the odd row and odd column of the convolution filter, the second sub-filter includes the weight of the odd row and even column, the third sub-filter includes the weight of the even row and odd column, and the fourth sub-filter includes the weight of the even row and even column;

if the convolution stride is 2 and the size of the convolution filter is 2×2, the acquiring a feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes: assigning the values of the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature sub-map, the values of the odd rows and even columns to the corresponding positions of the second feature sub-map, the values of the even rows and odd columns to the corresponding positions of the third feature sub-map, and the values of the even rows and even columns to the corresponding positions of the fourth feature sub-map.
The storage medium according to claim 16, wherein, when implementing the segmenting of several sub-filters from the convolution filter according to a preset filter segmentation rule if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, the processor is configured to implement the following steps:

if the convolution stride is 2 and the size of the convolution filter is 3×3, segmenting 4 sub-filters from the convolution filter, the size of each sub-filter being 2×2, wherein the first sub-filter includes the weights of the odd rows and odd columns of the convolution filter, the second sub-filter includes the weights of the odd rows and even columns of the convolution filter, the third sub-filter includes the weights of the even rows and odd columns of the convolution filter, and the fourth sub-filter includes the weights of the even rows and even columns of the convolution filter;

if the convolution stride is 2 and the size of the convolution filter is 3×3, the acquiring a feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes: assigning the values of the odd rows and odd columns of the feature map to be convolved to the corresponding positions of the first feature sub-map, the values of the odd rows and even columns to the corresponding positions of the second feature sub-map, the values of the even rows and odd columns to the corresponding positions of the third feature sub-map, and the values of the even rows and even columns to the corresponding positions of the fourth feature sub-map.
The storage medium according to claim 16, wherein, when implementing the segmenting of several sub-filters from the convolution filter according to a preset filter segmentation rule if the convolution stride is not 1 and the size of the convolution filter is greater than 1×1, the processor is configured to implement the following steps:

if the convolution stride is 3 and the size of the convolution filter is 3×3, segmenting 9 sub-filters from the convolution filter, each sub-filter having a size of 1×1 and each including one of the 9 weights of the convolution filter;

if the convolution stride is 3 and the size of the convolution filter is 3×3, the acquiring a feature map to be convolved and segmenting several feature sub-maps from the feature map to be convolved according to a preset feature map segmentation rule specifically includes: assigning the values in rows 3n+1 and columns 3n+1 of the feature map to be convolved to the corresponding positions of the first feature sub-map, the values in rows 3n+1 and columns 3n+2 to the corresponding positions of the second feature sub-map, the values in rows 3n+1 and columns 3n+3 to the corresponding positions of the third feature sub-map, the values in rows 3n+2 and columns 3n+1 to the corresponding positions of the fourth feature sub-map, the values in rows 3n+2 and columns 3n+2 to the corresponding positions of the fifth feature sub-map, the values in rows 3n+2 and columns 3n+3 to the corresponding positions of the sixth feature sub-map, the values in rows 3n+3 and columns 3n+1 to the corresponding positions of the seventh feature sub-map, the values in rows 3n+3 and columns 3n+2 to the corresponding positions of the eighth feature sub-map, and the values in rows 3n+3 and columns 3n+3 to the corresponding positions of the ninth feature sub-map, where n is a natural number.

The storage medium according to any one of claims 15-19, wherein, when implementing the convolution calculation on each corresponding feature sub-map according to each sub-filter based on the systolic array, the processor is configured to implement the following steps: loading the weights of the sub-filter into a weight register connected to the systolic array; loading the feature sub-map corresponding to the sub-filter into an input register connected to the systolic array; and obtaining the output result of the convolution calculation of the systolic array.
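The register-loading steps above feed a systolic array, whose dataflow can be illustrated with a small cycle-level simulation. The sketch below is a generic output-stationary matrix-multiply model of our own (the claims do not specify the array's dataflow; CNN accelerators commonly lower convolution to matrix multiplication): operand streams are skewed so that A[i][k] and B[k][j] meet at processing element (i, j) exactly at cycle i + j + k, where the local accumulator adds their product.

```python
def systolic_matmul(A, B):
    """Cycle-level sketch of an n x n output-stationary systolic array.

    Each PE (i, j) holds one accumulator; rows of A enter from the left and
    columns of B from the top, each skewed by one cycle per index, so operand
    pair (A[i][k], B[k][j]) reaches PE (i, j) at cycle t = i + j + k.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    total_cycles = 3 * n - 2           # last operand reaches PE (n-1, n-1)
    for t in range(total_cycles):      # one iteration = one array cycle
        for i in range(n):
            for j in range(n):
                k = t - i - j          # operand index arriving this cycle
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

# 2x2 example: the array reproduces the ordinary matrix product.
assert systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19.0, 22.0],
                                                               [43.0, 50.0]]
```

In hardware the skew is realized by the weight and input registers named in the claim, which delay each row and column stream by one cycle per position; the simulation models that timing with the index test `0 <= k < n`.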
PCT/CN2019/103137 2019-04-04 2019-08-28 Neural network acceleration method and apparatus based on pulsation array, and computer device and storage medium Ceased WO2020199476A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910268881.8 2019-04-04
CN201910268881.8A CN110135556B (en) 2019-04-04 2019-04-04 Neural network acceleration method and device based on pulse array, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020199476A1 true WO2020199476A1 (en) 2020-10-08

Family

ID=67569234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103137 Ceased WO2020199476A1 (en) 2019-04-04 2019-08-28 Neural network acceleration method and apparatus based on pulsation array, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110135556B (en)
WO (1) WO2020199476A1 (en)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135556B (en) * 2019-04-04 2024-08-23 平安科技(深圳)有限公司 Neural network acceleration method and device based on pulse array, computer equipment and storage medium
CN110826710B (en) * 2019-10-18 2021-04-23 南京大学 A Hardware Acceleration Implementation Method of RNN Forward Propagation Model Based on Lateral Systolic Array
CN112766474B (en) * 2019-11-04 2024-03-22 北京地平线机器人技术研发有限公司 Method, device, medium and electronic equipment for realizing convolution operation
US11861485B2 (en) * 2019-11-22 2024-01-02 Baidu Usa Llc Data format transform method to improve AI engine MAC utilization
CN113033761B (en) * 2019-12-09 2024-05-14 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
CN112862667A (en) * 2021-01-29 2021-05-28 成都商汤科技有限公司 Pooling method, chip, equipment and storage medium
CN113870273B (en) * 2021-12-02 2022-03-25 之江实验室 A feature map segmentation method of neural network accelerator based on systolic array
CN116167425B (en) * 2023-04-26 2023-08-04 浪潮电子信息产业股份有限公司 Neural network acceleration method, device, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885700A (en) * 2017-12-29 2018-04-06 中国人民解放军国防科技大学 A multi-core implementation method of large-scale matrix convolution
CN108268931A (en) * 2016-12-30 2018-07-10 华为技术有限公司 Data processing method, device and system
CN108875908A (en) * 2017-05-16 2018-11-23 三星电子株式会社 Optimized neural network input stride method and apparatus
KR20190035445A (en) * 2017-09-26 2019-04-03 삼성전자주식회사 Electronic apparatus and control method thereof
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A Universal Convolutional Neural Network Accelerator Based on One-Dimensional Systolic Array
CN110135556A (en) * 2019-04-04 2019-08-16 平安科技(深圳)有限公司 Neural network accelerated method, device, computer equipment and storage medium based on systolic arrays

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10990648B2 (en) * 2017-08-07 2021-04-27 Intel Corporation System and method for an optimized winograd convolution accelerator
CN107564025B (en) * 2017-08-09 2020-05-29 浙江大学 A Semantic Segmentation Method of Infrared Image of Power Equipment Based on Deep Neural Network
CN108491926B (en) * 2018-03-05 2022-04-12 东南大学 A low-bit efficient deep convolutional neural network hardware acceleration design method, module and system based on logarithmic quantization
CN108875904A (en) * 2018-04-04 2018-11-23 北京迈格威科技有限公司 Image processing method, image processing apparatus and computer readable storage medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230009202A1 (en) * 2020-10-30 2023-01-12 Boe Technology Group Co., Ltd. Image processing method and device, electronic apparatus and readable storage medium
US12277703B2 (en) * 2020-10-30 2025-04-15 Boe Technology Group Co., Ltd. Image processing method and device, electronic apparatus and readable storage medium
CN114781619A (en) * 2022-04-29 2022-07-22 吉林大学 Soft error detection method and device
CN119558366A (en) * 2024-11-28 2025-03-04 中国科学技术大学 Configurable depth-aware convolution operator hardware architecture

Also Published As

Publication number Publication date
CN110135556A (en) 2019-08-16
CN110135556B (en) 2024-08-23

Similar Documents

Publication Publication Date Title
WO2020199476A1 (en) Neural network acceleration method and apparatus based on pulsation array, and computer device and storage medium
TWI811291B (en) Deep learning accelerator and method for accelerating deep learning operations
US11960566B1 (en) Reducing computations for data including padding
JP6857286B2 (en) Improved performance of neural network arrays
US10445638B1 (en) Restructuring a multi-dimensional array
CN107862650B (en) Method for accelerating calculation of CNN convolution of two-dimensional image
US20190138567A1 (en) Hardware Implementation of Convolutional Layer of Deep Neural Network
JP7710507B2 (en) Table folding and acceleration
US11315344B2 (en) Reconfigurable 3D convolution engine
WO2019136762A1 (en) Artificial intelligence processor and processing method applied thereto
CN109635630B (en) Hand joint point detection method, device and storage medium
CN112598673A (en) Panorama segmentation method, device, electronic equipment and computer readable medium
WO2019085709A1 (en) Pooling method and system applied to convolutional neural network
WO2019215907A1 (en) Arithmetic processing device
JP7108702B2 (en) Processing for multiple input datasets
CN113297973A (en) Key point detection method, device, equipment and computer readable medium
CN110738317A (en) FPGA-based deformable convolution network operation method, device and system
CN109815789A (en) Real-time multi-scale face detection method and system and related equipment on CPU
JP2020135871A (en) Object recognition methods, devices and single-step object recognition neural networks
CN112802177B (en) Method and device for processing aerial survey data, electronic equipment and storage medium
CN115147297B (en) Image processing method and device
WO2024152797A1 (en) Video supplementation method and apparatus, medium and electronic device
US11899743B2 (en) Reconfigurable parallel 3-dimensional convolution cluster engine
CN111832714A (en) Computing method and device
CN115346099A (en) Image convolution method, chip, equipment and medium based on accelerator chip

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19923089

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19923089

Country of ref document: EP

Kind code of ref document: A1