CN114003201B

CN114003201B - Matrix transformation method, device and convolutional neural network accelerator

Info

Publication number: CN114003201B
Application number: CN202111277687.XA
Authority: CN
Inventors: 陈世达; 张宏; 李永配
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2024-11-19
Anticipated expiration: 2041-10-29
Also published as: CN114003201A

Abstract

The embodiment of the present invention provides a matrix transformation method, device, convolutional neural network accelerator, storage medium and electronic device, wherein the method comprises: sequentially caching multiple sub-feature maps included in an input feature map in a predetermined order; performing matrix transformation on each cached sub-feature map to obtain multiple target sub-feature matrices; and determining a target output feature map corresponding to the input feature map based on the multiple target sub-feature matrices and the weight parameters of the neural network model. Through the present invention, the problem of low efficiency of matrix transformation of input feature maps in matrix operations involved in the existing accelerator operation process is solved, memory access and bandwidth redundancy are reduced, and the speed of operation is improved.

Description

Matrix transformation method, device and convolutional neural network accelerator

Technical Field

The embodiment of the invention relates to the field of communication, in particular to a matrix transformation method and device, a convolutional neural network accelerator, a storage medium and an electronic device.

Background

At present, the deep learning field is widely focused on, wherein a Convolutional Neural Network (CNN) becomes a research hot spot in various fields such as image classification, target detection, semantic segmentation and the like, and can achieve good effects. The most computationally intensive part of CNN is the convolutional layer and the fully-connected layer, and their bottom layers can be realized based on matrix multiplication. With the continuous improvement of the calculation scale and complexity of the CNN model, the traditional CPU platform cannot meet the requirement of practicality. Therefore, accelerator implementation adopting a computing platform such as a GPU and an FPGA is widely focused, however, compared with the characteristics of high energy efficiency, easiness in reconstruction, rapid iterative updating and convenience in deployment of a mobile edge end of the GPU and the FPGA, the accelerator implementation can be more suitable for the requirement of rapid development of a deep learning algorithm.

In the related art, the implementation method of the CNN accelerator based on the FPGA mainly comprises two aspects: loop unrolling parallel computation and systolic array computation. However, the first method achieves computational acceleration by increasing parallelism, but faces the problem of high fan-in/fan-out, resulting in lower final speed and poor computation mode versatility; the second method converts convolution operation and full-join operation in CNN into matrix multiplication, but the implementation of matrix transformation img2col method has a larger influence on the overall performance.

As is clear from the above, the related art has a problem that the matrix conversion efficiency of the input feature map in the matrix operation related to the accelerator operation process is low.

In view of the above problems in the related art, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the invention provides a matrix transformation method and device, a convolutional neural network accelerator, a storage medium and an electronic device, which are used for at least solving the problem of low matrix transformation efficiency of an input feature diagram in matrix operation related to an accelerator operation process in the related technology.

According to an embodiment of the present invention, there is provided a matrix transformation method including: sequentially caching according to a preset sequence to cache a plurality of sub-feature images included in the input feature images; performing matrix transformation on the sub-feature graphs cached each time to obtain a plurality of target sub-feature matrixes; and determining a target output feature map corresponding to the input feature map based on the target sub-feature matrixes and the weight parameters of the neural network model.

According to another embodiment of the present invention, there is provided a matrix conversion apparatus including: the caching module is used for caching a plurality of sub-feature images included in the input feature images in sequence according to a preset sequence; the transformation module is used for carrying out matrix transformation on the sub-feature images in each cache so as to obtain a plurality of target sub-feature matrixes; and the determining module is used for determining a target output characteristic diagram corresponding to the input characteristic diagram based on the target sub-characteristic matrixes and weight parameters of the neural network model.

In accordance with yet another embodiment of the present invention, a convolutional neural network accelerator is provided, comprising the apparatus of the above embodiments.

According to yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program when executed by a processor implements the steps of the method as described in any of the above.

According to a further embodiment of the invention, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

According to the method and the device, the multiple sub-feature images included in the input feature images are sequentially cached according to a preset sequence, matrix transformation is carried out on each cached sub-feature image so as to obtain multiple target sub-feature matrixes, and the target output feature images corresponding to the input feature images are determined according to the multiple sub-feature matrixes and the weight parameters of the neural network model. Because only one sub-feature map included in the output feature map is cached each time to perform matrix transformation on the sub-feature map, the number of times of accessing read data in calculation is reduced, and only one sub-feature map is cached each time, the problem of data redundancy access is avoided, so that the problem of low matrix transformation efficiency of the input feature map in matrix operation related to an accelerator operation process in the related art can be solved, access and bandwidth redundancy are reduced, and the running speed is improved.

Drawings

Fig. 1 is a block diagram of a hardware structure of a mobile terminal of a matrix transformation method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a matrix transformation method according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a convolutional neural network accelerator structure in accordance with an exemplary embodiment of the present invention;

FIG. 4 is a schematic diagram of a convolution calculation process in the related art;

FIG. 5 is a graphical illustration of splitting input features by input channel parallelism, line width, and line number according to an exemplary embodiment of the invention;

FIG. 6 is a block overlap condition schematic diagram after segmentation of an input feature map according to an exemplary embodiment of the present invention;

FIG. 7 is a diagram showing the transformation of convolution calculations into a matrix operation in the related art;

FIG. 8 is a schematic diagram showing a convolution calculation converted into a matrix operation in the related art;

FIG. 9 is a sub-feature pictorial representation of a cache in accordance with an exemplary embodiment of the present invention;

FIG. 10 is a schematic diagram of a line cache sliding window process in accordance with an exemplary embodiment of the present invention;

Fig. 11 is a block diagram of a pulse array unit according to an exemplary embodiment of the present invention;

Fig. 12 is a block diagram of a matrix conversion apparatus according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal or similar computing device. Taking the mobile terminal as an example, fig. 1 is a block diagram of a hardware structure of a mobile terminal of a matrix transformation method according to an embodiment of the present application. As shown in fig. 1, a mobile terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, wherein the mobile terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.

The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a matrix transformation method in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

In this embodiment, a matrix transformation method is provided, fig. 2 is a flowchart of matrix transformation according to an embodiment of the present invention, and as shown in fig. 2, the flowchart includes the following steps:

Step S202, sequentially caching a plurality of sub-feature images included in an input feature image according to a predetermined sequence;

Step S204, performing matrix transformation on the sub-feature graphs cached each time to obtain a plurality of target sub-feature matrices;

and S206, determining a target output feature map corresponding to the input feature map based on the multiple sub-feature matrices and weight parameters of the neural network model.

In the above embodiment, when the convolutional neural network performs convolutional operation, an input feature map of an original input picture may be first determined, sub-feature maps in the input feature map are cached, matrix transformation is performed on each cached sub-feature map to obtain a plurality of target sub-feature matrices, and then a target output feature map corresponding to the input feature map is determined according to the plurality of target sub-feature matrices and weight parameters of the neural network model. The input feature map may be a feature map for systolic array matrix calculation.

Alternatively, the main body of execution of the above steps may be an FPGA, a convolutional neural network accelerator, a background processor, or other devices with similar processing capability, and may also be a machine integrated with at least a data processing device, where the data processing device may include a terminal such as a computer, a mobile phone, or the like, but is not limited thereto.

In the above embodiment, when the main execution body of the above steps is an FPGA or a convolutional neural network accelerator, the structural schematic diagram of the convolutional neural network accelerator may be shown in fig. 3, and as shown in fig. 3, the convolution operation may be performed by using the original input image as the input feature map of the first layer of convolution layer and the corresponding weight parameters. The original input feature map and the weight parameters of each convolution layer or full connection layer in the network are stored in an external memory (typically DDR). Before performing the calculation, a software end (i.e. a processing system) can configure a hardware module of a programmable logic end, for example, in the convolution calculation, the processing system configures a working register of a direct memory access controller (DMA) through a bus, so that the DMA carries data (input feature images and weight parameters) from an external DDR to the logic on the FPGA chip. Because the on-chip resources of the FPGA are limited, an input feature map caching unit and a weight caching unit can be added to be respectively responsible for caching part of input feature maps (namely part of feature maps) and storing parameter data, and after the data of the two parts are operated for a plurality of times, the direct memory access controller reads and caches the data of the next batch. The input feature map caching unit can adopt a channel priority strategy to realize storage, and the matrix transformation unit can adopt a line cache window sliding method to realize convolution torque array operation, namely hardware realization of img2col operation. The pulse array unit is a main calculation unit for realizing matrix operation on the input feature map and the weight parameters, and the result is a partial sum of convolution operation. Therefore, the accumulator can buffer the results of the next batch until the calculation of the input feature map corresponding to the current weight parameter is completed, and the final result is output. The bias module is responsible for increasing bias parameters in the direction of an output channel of the input feature diagram, and the result is output to a next stage module through nonlinear processing by an activation function (such as a ReLU) module, so that calculation of the convolution layer of the next stage is completed.

In the above embodiment, the next layer of the convolution layer may be a pooling layer or an element-level operation layer (for example, a splicing Concat layer), so that the pooling/element-level operation processing unit is responsible for the pooling operation and the element-level operation, and the output result thereof is written back to the external memory through DMA, and is used as an input feature map of the next convolution layer, and the processing procedure of the new round of convolution layer-bias-activation function-pooling/element-level operation is performed until all the layers of the network are completely calculated.

In the above embodiment, the inputs of the convolution calculation include the weight parameters weight (Co, ci, ky, kx) and the input feature map ifmp (N, ci, hi, wi), and the final calculation result is the output feature map ofmp (N, co, ho, wo). Where N represents a batch size and may be set to 1 (this value is merely an exemplary illustration and the invention is not limited in this regard). Co and Ci represent the output and input channel data, kx and Ky represent the length and width of the convolution kernel, hi, wi and Ho, and Wo represent the input and output feature map height and width, respectively, with the final result being biased bias (Co, 1), i.e., one bias for each output channel. The calculation of the convolution layer is shown in equation (1), where S represents the step size of the sliding window.

Where ho= ((Hi-ky+ 2*P)/S) +1, wo= ((Wi-kx+ 2*P)/S) +1, and p represents the zero padding number of lines. Taking the first convolution layer as an example, the input feature map ifmp is (1, 3, 416, 416), the weight is (16,3,3,3), s=1, and p=1, and the output feature map ofmp of the first layer is (1, 16, 416, 416).

Meanwhile, the fully connected layer can be regarded as vector matrix multiplication, that is, ifmp is (Ci, 1), weight is (Co, ci), bias is (Co, 1), and ofmp (Co, 1) =weight× ifmp +bias.

According to the method and the device, the multiple sub-feature images included in the input feature images are sequentially cached according to a preset sequence, matrix transformation is conducted on each cached sub-feature image, so that multiple target sub-feature matrixes are obtained, and the multiple sub-feature matrixes and the weight parameters of the neural network model determine the target output feature images corresponding to the input feature images. Because only one sub-feature map included in the output feature map is cached each time to perform matrix transformation on the sub-feature map, the number of times of accessing read data in calculation is reduced, and only one sub-feature map is cached each time, the problem of data redundancy access is avoided, so that the problem of low matrix transformation efficiency of the input feature map in matrix operation related to an accelerator operation process in the related art can be solved, access and bandwidth redundancy are reduced, and the operation speed is improved.

In an exemplary embodiment, sequentially buffering the plurality of sub-feature maps included in the input feature map in a predetermined order includes: determining the parallelism of the input channels and the row width of the input feature map; determining the number of lines of each cache; dividing the input feature map according to the parallelism of the input channel, the line width and the line number to obtain a plurality of sub feature maps; and sequentially caching the plurality of sub-feature images according to the preset sequence. In this embodiment, when the sub-feature images included in the input feature images are cached according to a predetermined sequence, the input channel parallelism of the input feature images and the line width of the input feature images may be determined first, then the number of lines cached each time is determined, the input feature images are divided according to the input channel parallelism, the line width and the line number, so as to obtain a plurality of sub-feature images, and after a plurality of sub-feature images are obtained, the plurality of sub-feature images are sequentially cached according to the predetermined sequence. Wherein the predetermined order may be a top-down order.

In the related art, the calculation process of convolution can be seen in fig. 4, as shown in fig. 4, each filter weight slides from left to right to top down on the input feature map ifmp, and at the same time, the two corresponding positions are multiplied and accumulated to generate a feature point of the output feature map ofmp, when the sliding window traverses the whole ifmp, an output feature map of one channel is generated, and Co weights generate ofmp of the final Co channels. Because the convolution process involves the storage and multiplexing of a large number of input feature images and parameter data, the FPGA on-chip resources are limited. Thus, only part of the data can be cached on-chip at a time for operation, such as the gray part in the figure. The number of gray cubes in the Ci direction is PC, which is called input channel parallelism, and the number of gray cubes in the Co direction is PF, which is called output channel parallelism. It follows that the accelerator calculates only the convolution results for the entire input signature and a subset of the weight parameters at a time (the dashed portion in fig. 4). In order to effectively read corresponding calculation data in the external memory, in the related art, an input feature map is divided into a plurality of blocks horizontally and longitudinally by partitioning, and only one block of the input feature map is read and calculated at a time, however, due to the existence of a sliding window, it is obvious that two blocks adjacent to each other up and down/left and right have a data overlapping condition, so that data of overlapping portions need to be considered, and in addition, after a convolution result of each block is calculated, additional control and operation are required to restore the physical form of the original output feature map, which introduces additional time delay.

In the above embodiment, the input feature diagram is divided according to the input channel parallelism, the line width, and the number of lines, and the input feature diagram is shown in fig. 5, and as shown in fig. 5, the line W direction and the input channel Ci direction of the input feature diagram are all stored, and the column H direction stores only a part of lines, for example, 32 lines (the value is configurable), so that w×ci×32 data is required to be stored in total, the resource addresses stored on the chip are continuous, and the line size of each layer is compatible up to the nth power of 2, for example, 416→512, 208→256, and the like. Note that the input channel parallelism PC may be 8 (this value is only one example, and the present invention is not limited thereto, and may be set to 16, 32, or the like, for example). PC means that each input data is stored externally, stored on-chip, and calculated on-chip for each 8 sets in the input channel direction. In the figure, step1 represents calculating the first group 8 of parallelism according to the row-column direction, step2 represents calculating the next group 8 of parallelism according to the input channel direction until the input channel direction is calculated. The problem of limited on-chip storage resources can be effectively relieved by adopting a storage strategy with the input channel direction priority, and the output characteristic diagrams corresponding to the blocks do not need to be spliced and restored after the output characteristic diagrams are obtained through calculation. Furthermore, overlapping portions introduced by regular blocking can reduce access bandwidth by multiplexing.

In one exemplary embodiment, determining the number of lines per cache includes: determining the sliding step length of a sliding window; determining the overlapping number of overlapping rows in two adjacent slides based on the slide step size; acquiring a pre-cache number of a pre-cache line; and determining the difference value between the pre-cache quantity and the overlapping quantity as the line quantity. In this embodiment, when determining the number of lines cached each time, the sliding step length of the sliding window may be determined first, the overlapping number of the overlapped lines in the two adjacent sliding steps is determined according to the sliding step length, the pre-caching number of the pre-caching lines determined in advance is obtained, and the difference between the pre-caching number and the overlapping number is determined as the number of lines.

In the above embodiment, after the input feature map is divided, n feature maps are obtained, and only Block n is stored on each slice, where n=1, 2,3,4 …, and the current Block is calculated and then the next Block is cached. Taking the first channel as an example, when ky=kx=3, s=1, and p=1, two blocks adjacent to each other above and below overlap, and when the number of pre-buffers is 32, a schematic diagram of Block overlap after splitting the input feature map may be seen in fig. 6, and as shown in fig. 6, block1 and Block2 overlap in two rows. Thus, 2 lines of cache may be available per on-chip cache; the final result of the blocks does not need extra recovery operation, and the direct continuous addresses are written back to the external memory; each block contains all input feature maps of the input channel dimension, so when all filters traverse and calculate the current block, the final output feature map result can be directly generated, and the part and the result of the intermediate convolution calculation do not need to be cached on a chip. Thereby reducing the frequent access times of the external memory and reducing the bandwidth requirement.

In an exemplary embodiment, performing matrix transformation on the sub-feature map cached each time to obtain a plurality of target sub-feature matrices includes: the following is performed for each time period: determining row data included in the sub-feature map acquired in the time period; and performing matrix transformation on the data to obtain the target sub-feature matrix. In this embodiment, when determining the target feature matrix, the matrix transformation may be performed on the data acquired in each time period to obtain a target sub-feature matrix, the convolution calculation is performed on the target sub-feature matrix to obtain an output feature map, and after the calculation is completed, the matrix transformation is performed on the data acquired in the next time period again to obtain the target sub-feature matrix, and the output feature matrix is obtained by calculation in sequence. Therefore, redundant data access can be effectively avoided, and hardware design of a pipeline is realized, which has important significance on the storage space, access bandwidth and computing resources of the FPGA. The time period may be N clock beats, such as 3 clock beats. The time period may be determined based on the sliding window and the number of lines per cache.

In the related art, the convolution layer and the full-connection layer in the CNN calculation can be converted into matrix multiplication, that is, can be realized by adopting a systolic array, and the full-connection layer is matrix vector multiplication, so that the calculation process is not repeated. The matrix transformation of the convolution computation, i.e., the hardware implementation of img2 col. As shown in fig. 7, the img2col is a process of converting the input feature map into a matrix format corresponding to the filter, and since the sliding window has overlapping data, a large amount of redundant data exists inside.

The convolution operation of fig. 4 can be converted into the operation of matrix multiplication by img2col, and corresponding calculation is performed by using a systolic array as shown in fig. 8. One more intuitive method is to implement img2col by using a high-level language at the software end, and then transmit the transformed data to the FPGA for calculation, however, a large amount of redundant data and each layer of calculation need matrix conversion operation, and the time cost caused by such a processing mode is huge. As can be seen from fig. 7 and 8, img2col converts the original convolution input feature map into a matrix form, which uses the data of the corresponding position covered ifmp by each convolution kernel sliding on the input feature map ifmp, and the data covered by the convolution kernels is developed into a 1-dimensional vector as a result of the corresponding img2 col.

In the above embodiment, it is assumed that the data already stored on the current tile is a cube as shown in fig. 9, i.e., pc=8, and ci×w×32 pieces of data in total. Assuming that the convolution kernel of the current layer is 3×3×ci, its parallelism PC remains corresponding to ifmp. The first window is positioned as window 1, the second window is positioned as window 2, then window 3, window 4 and window 5 are sequentially positioned until one row of the window slides. PKKC =3×3×pc=72 data per window coverage, corresponding to a 1-dimensional vector of img2 col. As shown in fig. 10, assuming that the input signature data bit width is DW, each beat of data is pc×dw, and for a 3×3 convolution kernel, 2 line buffers may be employed, where the buffers may be implemented based on FIFOs, the depth of which depends on the line maximum of the input signature, e.g., 512. It is easy to know that when the 3 rd line data flows in after 2 lines of data are cached, the data of the first 2 lines flow into the register group according to the sequence and the 3 rd line data at the same time, each clock beat flows out 3 groups of PC multiplied by DW numbers, after waiting for 3 clock beats, 9 groups of data corresponding to 3 multiplied by 3 windows in the register group are all valid, the data are output to the pulse array for longitudinal input for 1 time, the next beat corresponds to the next 9 groups of data corresponding to the sliding 1 window, and the corresponding img2col matrix converted input feature diagram can be obtained by analogy.

In the embodiment, the matrix transformation operation of img2col is realized by adopting the line cache sliding window mode, so that the problem of redundant data access and storage faced by the img2col operation is effectively avoided, and meanwhile, the method is based on hardware realization, and the requirements of the on-chip storage space and the access and storage bandwidth of an external memory are effectively reduced.

In an exemplary embodiment, determining the target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrices and the weight parameters of the neural network model includes: determining an output feature map corresponding to each sub-feature matrix based on each sub-feature matrix and the weight parameter; caching the output feature graphs corresponding to each sub-feature matrix to obtain a plurality of output feature graphs; the target output feature map is determined based on a plurality of the output feature maps. In this embodiment, when determining the target output feature map, determining an output feature map corresponding to each sub-feature matrix according to each sub-feature matrix and the weight parameter, and caching the output feature map corresponding to each sub-feature matrix, that is, caching each output feature map obtained from the sub-feature map, and performing corresponding calculation according to a plurality of output feature maps to determine the target output feature map.

In an exemplary embodiment, determining the output feature map corresponding to each of the sub-feature matrices based on each of the sub-feature matrices and the weight parameter includes: determining sub-weight parameters of the target input channel corresponding to the sub-feature matrix based on the weight parameters; determining a current sub-feature matrix obtained in a current time period; determining a first product of the current sub-feature matrix and the sub-weight parameters; determining a second product of a previous feature matrix and the sub-weight parameter in a previous time period of the current time period; a sum of the first product and the second product is determined as the output signature. In this embodiment, when determining the output feature map, for each target input channel, a sub-weight parameter of the target input channel may be determined, a current sub-feature matrix obtained in a current time period is determined, a first product of the current sub-feature matrix and the sub-weight parameter is determined, a second product of a previous feature matrix and the weight parameter in a previous time period of the current time period is determined, and a sum of the first product and the second product is determined as the output feature map.

In the above embodiment, the output characteristic diagram may be determined by using a pulse array unit, and the pulse array unit structure may be referred to as fig. 11, and the pulse array unit may include a two-dimensional PE matrix as shown in fig. 11. The weight parameters of the target input channels may be first determined, wherein the weight parameters employed by each input channel may be fixed. The input feature map data and the output feature map data can be calculated in a directional propagation mode, wherein the weight parameters can be pre-stored longitudinally, and then the input feature map data enters the pulse array unit transversely one by one to be calculated and flows to the adjacent PE on the right, and the calculation result of each PE is longitudinally and directionally propagated. All flows are once per clock tick. Each PE is a multiply-accumulate unit, i.e., when Wi and Xi meet in each PE, multiplication is performed, and the product result is accumulated with the result flowing in from the adjacent upper side thereof, and then flows to the adjacent lower PE. After the partial input feature map of the current batch is calculated, partial sum results are obtained, and the final output feature map results are the sum of all data products in the direction of the input channel. The temporary storage part and the part corresponding to the next batch are needed to be accumulated until the final calculation result is output, namely the output characteristic diagram of the current convolution layer.

In one exemplary embodiment, determining the current sub-feature matrix obtained during the current time period includes: determining a target sub-feature matrix as the current sub-feature matrix in the case that no input channel exists before the target input channel; and in the case that an input channel exists before the target input channel, determining a first product of adjacent input channels, which are included in the input channels and are adjacent to the target input channel, as the current sub-feature matrix. In this embodiment, in the case where there is no input channel before the target input channel, the target sub-feature matrix is determined as the current sub-feature matrix, and in the case where there is an input channel before the target input channel, the first product of adjacent input channels adjacent to the target input channel included in the input channel is determined as the current sub-feature matrix. As shown in fig. 11, when the target input channel is a first column channel, the target sub-feature matrix is determined as the current sub-feature matrix, and when the target input channel is a channel other than the first column channel, such as a second column channel, the first product of the first column channel is determined as the current sub-feature matrix.

In the above embodiment, based on the systolic array method, each PE is only connected with the adjacent PEs, the fan-in and fan-out is 1, and the systolic array has a regular rectangular physical structure, so that the systolic array is more effective when mapped onto the FPGA, and because the physical space of DSP resource distribution on the corresponding FPGA is a rectangular area, the accelerator can reach a higher clock frequency.

In the foregoing embodiment, the problem of on-chip storage resource limitation is effectively alleviated by the storage strategy with the input channel direction prioritized, and compared with other partitioning strategies, after the output feature map is obtained by calculation, the output feature map corresponding to each partitioning is not required to be spliced and restored. Furthermore, overlapping portions introduced by regular blocking can reduce access bandwidth by multiplexing. The fan-in/fan-out is 1 through the pulse array scheme, and meanwhile, the regular matrix physical structure of the fan-in/fan-out is easy for layout and wiring of the FPGA, so that higher clock frequency can be realized, and the overall performance of the accelerator is improved. In addition, the accelerator based on the FPGA has the characteristics of flexible design, reconfigurability and low power consumption.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The embodiment also provides a device for determining an output feature map, which is used for implementing the foregoing embodiments and preferred embodiments, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.

Fig. 12 is a block diagram of a matrix transforming apparatus according to an embodiment of the present invention, as shown in fig. 12, the apparatus comprising:

A caching module 1202, configured to sequentially cache, according to a predetermined order, a plurality of sub-feature graphs included in an input feature graph;

The transformation module 1204 is configured to perform matrix transformation on the sub-feature graphs cached each time, so as to obtain a plurality of target sub-feature matrices;

the determining module 1206 is configured to determine a target output feature map corresponding to the input feature map based on a plurality of target sub-feature matrices and weight parameters of a neural network model.

Wherein the buffer module 1202 corresponds to the input feature map buffer unit in fig. 3, the transform module 1204 corresponds to the matrix transform (line buffer window sliding) unit in fig. 3, and the determination module 1206 corresponds to the systolic array and accumulator buffer in fig. 3.

In an exemplary embodiment, the caching module 1202 may implement sequentially caching the plurality of sub-feature maps included in the input feature map in a predetermined order by: determining the parallelism of the input channels and the row width of the input feature map; determining the number of lines of each cache; dividing the input feature map according to the parallelism of the input channel, the line width and the line number to obtain a plurality of sub feature maps; and sequentially caching the plurality of sub-feature images according to the preset sequence.

In one exemplary embodiment, the cache module 1202 may implement determining the number of lines per cache by: determining the sliding step length of a sliding window; determining the overlapping number of overlapping rows in two adjacent slides based on the slide step size; acquiring a pre-cache number of a pre-cache line; and determining the difference value between the pre-cache quantity and the overlapping quantity as the line quantity.

In an exemplary embodiment, the transformation module 1204 may perform matrix transformation on the sub-feature map cached each time to obtain a plurality of target sub-feature matrices by: the following is performed for each time period: determining row data included in the sub-feature map acquired in the time period; and performing matrix transformation on the data to obtain the target sub-feature matrix.

In an exemplary embodiment, the determining module 1206 may determine the target output feature map corresponding to the input feature map based on a plurality of the target sub-feature matrices and weight parameters of the neural network model by: determining an output feature map corresponding to each sub-feature matrix based on each sub-feature matrix and the weight parameter; caching the output feature graphs corresponding to each sub-feature matrix to obtain a plurality of output feature graphs; the target output feature map is determined based on a plurality of the output feature maps.

In an exemplary embodiment, the determining module 1206 may determine the output feature map corresponding to each of the sub-feature matrices based on each of the sub-feature matrices and the weight parameters by: determining sub-weight parameters of the determined target input channel corresponding to the sub-feature matrix based on the weight parameters; determining a current sub-feature matrix obtained in a current time period; determining a first product of the current feature matrix and the sub-weight parameters; determining a second product of a previous feature matrix and the weight parameter in a previous time period of the current time period; a sum of the first product and the second product is determined as the output signature.

In one exemplary embodiment, the determining module 1206 may implement determining the current sub-feature matrix obtained during the current time period by: determining a target sub-feature matrix as the current sub-feature matrix in the case that no input channel exists before the target input channel; and in the case that an input channel exists before the target input channel, determining a first product of adjacent input channels, which are included in the input channels and are adjacent to the target input channel, as the current sub-feature matrix.

In this embodiment, a convolutional neural network accelerator is further provided, which includes the apparatus in the foregoing embodiment, and may perform the method in the foregoing method embodiment. See fig. 3 for a structural view thereof.

It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; or the above modules may be located in different processors in any combination.

Embodiments of the present invention also provide a computer readable storage medium having a computer program stored therein, wherein the computer program when executed by a processor implements the steps of the method described in any of the above.

In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.

An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.

Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and the exemplary implementation, and this embodiment is not described herein.

It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A matrix transformation method, comprising:

Sequentially caching a plurality of sub-feature images included in the input feature images according to a preset sequence;

Performing matrix transformation on the sub-feature graphs cached each time to obtain a plurality of target sub-feature matrixes;

determining a target output feature map corresponding to the input feature map based on the weight parameters of the target sub-feature matrixes and the neural network model;

sequentially caching the plurality of sub-feature maps included in the input feature map in a predetermined order includes: determining the parallelism of the input channels and the row width of the input feature map; determining the number of lines of each cache; dividing the input feature map according to the parallelism of the input channel, the line width and the line number to obtain a plurality of sub feature maps; sequentially caching a plurality of sub-feature images according to the preset sequence;

Determining the number of lines per cache includes: determining the sliding step length of a sliding window; determining the overlapping number of overlapping rows in two adjacent slides based on the slide step size; acquiring a pre-cache number of a pre-cache line; and determining the difference value between the pre-cache quantity and the overlapping quantity as the line quantity.

2. The method of claim 1, wherein matrix transforming the sub-feature map of each cache to obtain a plurality of target sub-feature matrices comprises:

the following is performed for each time period:

Determining row data included in the sub-feature map acquired in the time period;

and performing matrix transformation on the data to obtain the target sub-feature matrix.

3. The method of claim 1, wherein determining a target output feature map corresponding to the input feature map based on a plurality of the target sub-feature matrices and weight parameters of a neural network model comprises:

determining an output feature map corresponding to each sub-feature matrix based on each sub-feature matrix and the weight parameter;

Caching the output feature graphs corresponding to each sub-feature matrix to obtain a plurality of output feature graphs;

The target output feature map is determined based on a plurality of the output feature maps.

4. The method of claim 3, wherein determining an output feature map corresponding to each of the sub-feature matrices based on each of the sub-feature matrices and the weight parameters comprises:

determining sub-weight parameters of the target input channel corresponding to the sub-feature matrix based on the weight parameters;

Determining a current sub-feature matrix obtained in a current time period;

determining a first product of the current sub-feature matrix and the sub-weight parameters;

Determining a second product of a previous feature matrix and the sub-weight parameter in a previous time period of the current time period;

a sum of the first product and the second product is determined as the output signature.

5. The method of claim 4, wherein determining the current sub-feature matrix obtained during the current time period comprises:

determining a target sub-feature matrix as the current sub-feature matrix in the case that no input channel exists before the target input channel;

And in the case that an input channel exists before the target input channel, determining a first product of adjacent input channels, which are included in the input channels and are adjacent to the target input channel, as the current sub-feature matrix.

6. A determination apparatus for an output feature map, comprising:

The caching module is used for caching the sub-feature images included in the input feature images in sequence according to a preset sequence;

The transformation module is used for carrying out matrix transformation on the sub-feature graphs cached each time so as to obtain a plurality of target sub-feature matrixes;

The determining module is used for determining a target output feature map corresponding to the input feature map based on a plurality of target sub-feature matrixes and weight parameters of the neural network model;

The caching module sequentially caches a plurality of sub-feature images included in the input feature image according to a preset sequence in the following manner: determining the parallelism of the input channels and the row width of the input feature map; determining the number of lines of each cache; dividing the input feature map according to the parallelism of the input channel, the line width and the line number to obtain a plurality of sub feature maps; sequentially caching a plurality of sub-feature images according to the preset sequence;

The device determines the number of lines of each cache by the following method: determining the sliding step length of a sliding window; determining the overlapping number of overlapping rows in two adjacent slides based on the slide step size; acquiring a pre-cache number of a pre-cache line; and determining the difference value between the pre-cache quantity and the overlapping quantity as the line quantity.

7. A convolutional neural network accelerator comprising the apparatus of claim 6.

8. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, wherein the computer program, when being executed by a processor, implements the steps of the method according to any of the claims 1 to 5.

9. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 5.