
TW201818264A - Buffer device and convolution operation device and method - Google Patents

Buffer device and convolution operation device and method

Info

Publication number
TW201818264A
Authority
TW
Taiwan
Prior art keywords
data
input
convolution
remapping
buffer
Application number
TW105137126A
Other languages
Chinese (zh)
Other versions
TWI634436B (en)
Inventor
杜源
杜力
李一雷
管延城
劉峻誠
Original Assignee
耐能股份有限公司
Application filed by 耐能股份有限公司
Priority to TW105137126A
Publication of TW201818264A
Application granted
Publication of TWI634436B


Landscapes

  • Complex Calculations (AREA)

Abstract

A buffer device includes input lines, an input buffer unit and a remap unit. The input lines are coupled to a memory and configured to receive data from the memory in a current clock cycle. The input buffer unit is coupled to the input lines and configured to buffer one part of the input data and output that part in a later clock cycle. The remap unit is coupled to the input lines and the input buffer unit, and configured to generate, in the current clock cycle, remap data for convolution according to the data on the input lines and the output of the input buffer unit.

Description

Buffer device and convolution operation device and method

The present invention relates to a buffer device and a computing device, and more particularly to a buffer device for convolution operations and to a convolution operation device.

Convolution is a mathematical operator that generates a third function from two functions. It is widely used in science, engineering, and mathematics, for example in image processing, electronic engineering, and signal processing.

A convolutional neural network (CNN) is also an application of convolution operations; it includes one or more convolutional layers with associated weights, as well as pooling layers. A convolutional neural network is a feedforward neural network whose artificial neurons respond to surrounding units within part of their receptive field, and it performs well on large-scale image processing.

Convolutional neural networks are also used in deep learning. Compared with other deep learning architectures, they produce better results in image and speech recognition. In addition, models using convolutional neural networks can be trained with the backpropagation algorithm. Because convolutional neural networks require fewer parameters to be estimated than other deep learning architectures and feedforward neural networks, they have become an important development trend in deep learning.

However, convolution is a computationally expensive operation, and in convolutional neural network applications the convolution operations take up most of the processor's capacity. Therefore, how to provide a buffer device and a convolution operation device that improve convolution performance is one of the important issues at present. In addition, multimedia applications on the Internet, such as streaming media, are becoming increasingly widespread, so how to provide a buffer device and a convolution operation device that can also process data streams is another important issue.

In view of the above, an object of the present invention is to provide a buffer device and a convolution operation device capable of processing a data stream.

Another object of the present invention is to provide a buffer device and a convolution operation device that can improve convolution performance.

A buffer device is coupled to a memory and includes an input line, an input buffer unit, and a remapping unit. The input line is coupled to the memory and configured to receive data from the memory in a current clock cycle. The input buffer unit is coupled to the input line and configured to buffer part of the input data in the current clock cycle and output it in a subsequent clock cycle. The remapping unit is coupled to the input line and the input buffer unit and configured to generate, in the current clock cycle, a plurality of remapping data according to the data on the input line and the output of the input buffer unit; the remapping data serve as inputs to convolution operations.

In an embodiment, the input line receives W data from the memory in the current clock cycle, and the remapping unit generates W sets of remapping data as inputs to W convolution operations, respectively.

In an embodiment, the input line receives W data from the memory in the current clock cycle, the input buffer unit buffers the last K of the W data and outputs them in a subsequent clock cycle, and the output of the input buffer unit is arranged in front of the input line.

In an embodiment, each set of remapping data includes M remapping data, and the convolution operation is an M×M convolution.

In an embodiment, the remapping unit takes M data at every stride of J, from the starting end toward the ending end, across the output of the input buffer unit and the input line, as one set of remapping data.
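
Purely as an illustrative sketch of this window-forming rule (not a description of the claimed circuit), the following Python-style code shows one clock cycle; the names carry_over, new_inputs, M and J are assumptions used only for illustration:

def make_remap_sets(carry_over, new_inputs, M, J):
    """Form remapping sets for one clock cycle: the data buffered from the
    previous cycle (carry_over) are placed in front of the data arriving on
    the input line (new_inputs), then M consecutive data are taken every J."""
    lane = list(carry_over) + list(new_inputs)   # buffered output sits in front
    sets = []
    start = 0
    while start + M <= len(lane):
        sets.append(lane[start:start + M])       # one set of M remapping data
        start += J                               # advance by the stride J
    return sets

# Example: 2 buffered data in front of 8 new data with M=3, J=1 gives 8 sets.
# make_remap_sets(['R6', 'R7'], ['R8', 'R9', 'R10', 'R11', 'R12', 'R13', 'R14', 'R15'], 3, 1)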

In an embodiment, the stride between the convolution operations is 1, each set of remapping data includes 3 remapping data, the convolution operation is a 3×3 convolution, and the input buffer unit is configured to buffer the last 2 input data in the current clock cycle and output them in a subsequent clock cycle.

In an embodiment, the buffer device further includes a control unit coupled to and controlling the remapping unit.

In an embodiment, the data stored in the memory is two-dimensional array data, the buffer device serves as a column buffer, and the input buffer unit serves as a partial row buffer.

In an embodiment, the remapping unit can operate in a first convolution mode and a second convolution mode. When operating in the first convolution mode, the remapping unit generates, in the current clock cycle, the remapping data according to the data on the input line and the output of the input buffer unit, and the remapping data serve as inputs to convolution operations. When operating in the second convolution mode, the remapping unit outputs, in the current clock cycle, the data on the input line as inputs to convolution operations.

In an embodiment, the first convolution mode is a 3×3 convolution mode and the second convolution mode is a 1×1 convolution mode.

A convolution operation device includes a memory, a convolution operation module, and a buffer device. The buffer device is coupled to the memory and the convolution operation module and includes an input line, an input buffer unit, and a remapping unit. The input line is coupled to the memory and configured to receive data from the memory in a current clock cycle. The input buffer unit is coupled to the input line and configured to buffer part of the input data in the current clock cycle and output it in a subsequent clock cycle. The remapping unit is coupled to the input line and the input buffer unit and configured to generate, in the current clock cycle, a plurality of remapping data according to the data on the input line and the output of the input buffer unit; the remapping data are input to the convolution operation module.

In an embodiment, the input line receives W data from the memory in the current clock cycle, the remapping unit generates W sets of remapping data for the convolution operation module, and the convolution operation module performs W convolution operations according to the W sets of remapping data.

In an embodiment, the input buffer unit buffers the last K of the W data and outputs them in a subsequent clock cycle, and the output of the input buffer unit is arranged in front of the input line.

In an embodiment, each set of remapping data includes M remapping data, and the convolution operation is an M×M convolution.

In an embodiment, the remapping unit takes M data at every stride of J, from the starting end toward the ending end, across the output of the input buffer unit and the input line, as one set of remapping data.

In an embodiment, the stride between the convolution operations is 1, each set of remapping data includes 3 remapping data, the convolution operation is a 3×3 convolution, and the input buffer unit is configured to buffer the last 2 input data in the current clock cycle and output them in a subsequent clock cycle.

In an embodiment, the buffer device further includes a control unit coupled to and controlling the remapping unit.

In an embodiment, the data stored in the memory is two-dimensional array data, the buffer device serves as a column buffer, and the input buffer unit serves as a partial row buffer.

In an embodiment, the number of data input in the same clock cycle is equal to the number of convolution operations performed by the convolution operation module.

In an embodiment, the convolution operation module and the buffer device can operate in a first convolution mode and a second convolution mode. When operating in the first convolution mode, the remapping unit generates, in the current clock cycle, a plurality of remapping data according to the data on the input line and the output of the input buffer unit, and the remapping data are input to the convolution operation module. When operating in the second convolution mode, the remapping unit outputs, in the current clock cycle, the data on the input line to the convolution operation module.

In an embodiment, the first convolution mode is a 3×3 convolution mode and the second convolution mode is a 1×1 convolution mode.

A convolution operation method for a data stream includes: obtaining, from a buffer, data that has already been input to a previous round of convolution operations; obtaining, from the data stream in a memory, data that has not yet been input to the previous round of convolution operations; generating multiple sets of remapping data from the data obtained from the buffer and from the data stream; performing the current round of convolution operations based on a filter and each set of remapping data; and retaining part of the data of the current round of convolution operations in the buffer for the next round of convolution operations.
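
A minimal software sketch of one such round is given below; it models the buffer as a Python list, uses the W=8, K=2, M=3, J=1 values of the later embodiments as defaults, and takes the per-set convolution as a caller-supplied callable, since the method itself leaves that step to the convolution units:

def stream_round(kept, stream_iter, conv, W=8, K=2, M=3, J=1):
    """One round of the data-stream convolution method.
    kept        -- data retained from the previous round (the buffer)
    stream_iter -- iterator over the data stream held in the memory
    conv        -- callable standing in for one convolution operation
    Returns this round's results and the data to keep for the next round."""
    new_data = [next(stream_iter) for _ in range(W)]   # not yet input previously
    lane = list(kept) + new_data                       # kept data placed in front
    remap_sets = [lane[i:i + M] for i in range(0, len(lane) - M + 1, J)]
    results = [conv(s) for s in remap_sets]            # this round's convolutions
    return results, new_data[-K:]                      # last K kept for next round

# Illustrative use with a simple sum standing in for the convolution:
# out, kept = stream_round([], iter(range(100)), conv=lambda s: sum(s))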

In an embodiment, W data not yet input to the previous round of convolution operations are obtained from the data stream in the memory, and W sets of remapping data are generated as inputs to W convolution operations, respectively.

In an embodiment, the last K of the W data not yet input to the previous round of convolution operations are retained in the buffer for the next round of convolution operations.

In an embodiment, each set of remapping data includes M remapping data, and the convolution operation is an M×M convolution.

In an embodiment, the stride between the convolution operations is 1, each set of remapping data includes 3 remapping data, the convolution operation is a 3×3 convolution, and the last 2 of the data not yet input to the previous round of convolution operations are retained in the buffer for the next round of convolution operations.

In an embodiment, the buffer is a register inside a processor, and the memory is a cache memory inside the processor.

As described above, in the buffer device and the convolution operation device, the input buffer unit of the buffer device can buffer part of the input data as input data for the next clock cycle. Even if a convolution operation requires more input data than can be read from the memory at one time, the remapping unit can still obtain the missing data from the input buffer unit and therefore provide enough remapping data for the convolution operations, which improves overall convolution performance. In addition, since the number of data input from the memory equals the number of results output by the convolution operations, this architecture is also well suited to processing data streams.

1‧‧‧memory
2‧‧‧buffer device
21‧‧‧input line
22‧‧‧input buffer unit
23‧‧‧remapping unit
24‧‧‧control unit
3‧‧‧convolution operation module
30~37‧‧‧convolution units
4‧‧‧buffer unit
5‧‧‧control unit
6‧‧‧convolution unit
61‧‧‧address decoder
62‧‧‧adder
conv_size, Stride, num_row, num_col‧‧‧signals
PE0~PE8‧‧‧processing elements (process engines)
data[47:0]‧‧‧line
fc_bus[47:0]‧‧‧line
psum[35:0]‧‧‧output
pm_0[31:0], pm_1[31:0], pm_2[31:0]‧‧‧outputs

FIG. 1 is a block diagram of a convolution operation device according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a convolution operation performed on two-dimensional data by the convolution operation device of FIG. 1.

FIG. 3 is a block diagram of the memory, the buffer device, and the convolution operation module according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of the operation of the buffer device of FIG. 3.

FIG. 5 is a block diagram of a convolution unit according to an embodiment of the present invention.

A buffer device and a convolution operation device according to embodiments of the present invention will be described below with reference to the related drawings, in which the same elements are denoted by the same reference numerals. The drawings are for illustration only and are not intended to limit the invention.

FIG. 1 is a block diagram of a convolution operation device according to an embodiment of the present invention. Referring to FIG. 1, a convolution operation device includes a memory 1, a convolution operation module 3, a buffer device 2, a control unit 5, and a buffer unit 4. The convolution operation device can be used in convolutional neural network (CNN) applications.

The memory 1 stores the data to be convolved, for example image, video, audio, or statistical data, or the data of one layer of a convolutional neural network. For image data, this is, for example, pixel data; for video data, it is, for example, the pixel data or motion vectors of video frames, or the audio in the video; the data of one layer of a convolutional neural network is usually a two-dimensional array, and often an image.

All or most of the data can first be stored elsewhere, for example in another memory, and be loaded wholly or partially into the memory 1 when a convolution operation is to be performed; the data are then input through the buffer device 2 to the convolution operation module 3 for the convolution operations. If the input data come from a data stream, the memory 1 keeps writing the latest data from the data stream for the convolution operations.

For image data or video frame data, the processing order is column by column while reading multiple rows at a time, so in one clock cycle the buffer device 2 receives from the memory 1 the data of different rows in the same column; that is, the buffer device 2 serves as a column buffer.

The buffer device 2 is coupled to the memory 1 and the convolution operation module 3 and has a finite memory access width toward the memory 1. The convolution operations that the convolution operation module 3 can actually perform are also related to this memory access width. If the memory input becomes a bottleneck, the performance of the convolution operations drops accordingly.

In practice, the inputs required by the convolution operation module 3 include coefficients in addition to data. Because of parallel processing, the convolution operation module 3 does not perform only one convolution operation; it performs convolution operations on multiple adjacent data at the same time to improve performance. Since the stride does not exceed the size of the sliding window (the convolution size), adjacent convolution operations often share overlapping data. In typical convolutional neural network applications, commonly used sliding windows are 1×1, 3×3, 5×5, 7×7, and so on, among which 3×3 is the most common.

For example, the convolution operation module 3 has a plurality of convolution units. Each convolution unit performs a convolution operation based on a filter and a plurality of current data, and retains part of the current data after the convolution operation. The buffer device 2 obtains a plurality of new data from the memory 1 and inputs the new data to the convolution units; the new data do not overlap with the current data and are, for example, data that were not used in the previous round of convolution operations but are needed in the current round. The convolution units of the convolution operation module 3 then perform the next round of convolution operations based on the filter, the retained current data, and the new data.

In one implementation, the convolution operation device is, for example, a processor, and the memory 1 is, for example, a cache memory inside the processor. The convolution operation module 3 may include one or more convolution unit arrays; a convolution unit array has a plurality of convolution units, and each convolution unit can process the convolution operation of a different set of input data in parallel, where different sets of input data may partially overlap with the previous or the next set. The buffer device includes a plurality of functional units to increase the performance of the parallel convolution operations. The control unit 5, the buffer unit 4, the convolution unit array, and the units of the buffer device are implemented with digital logic circuits, and each unit may internally include multiple logic elements to realize its function. The memory 1, the convolution operation module 3, the buffer device 2, the control unit 5, and the buffer unit 4 can be integrated in the same integrated circuit.

In other implementations, the memory 1 can be an ordinary random access memory (DRAM), and the convolution operation module 3, the buffer device 2, the control unit 5, and the buffer unit 4 can be integrated in the same integrated circuit.

The control unit 5 may include an instruction decoder and a controller. The instruction decoder receives instructions from the controller and decodes them to obtain the current input data size, the number of columns of the input data, the number of rows of the input data, the number of the sliding window (also called the convolution size), and the start address of the input data in the memory 1. The instruction decoder also obtains the sliding window type information and the output feature number from the controller and outputs appropriate control signals to the buffer device 2. The buffer device 2 operates according to these signals and also controls the operation of the convolution unit array 3 and the buffer unit 4, for example the timing at which data are input from the memory 1 to the buffer device 2 and the convolution operation module 3, the scale of the convolution operations of the convolution operation module 3, the read addresses of data from the memory 1 to the buffer device 2, the write addresses of data from the buffer unit 4 to the memory 1, and the convolution mode in which the convolution operation module 3 and the buffer device 2 operate.

For example, the buffer device 2 and the convolution operation module 3 can operate in a first convolution mode and a second convolution mode. Which convolution mode they operate in is determined by the control unit in the buffer device 2; that is, the control unit in the buffer device 2 controls the buffer device 2 and the convolution operation module 3 to operate in the first convolution mode or the second convolution mode. Different convolution modes perform convolution operations with sliding windows (convolution sizes) of different sizes. For example, the first convolution mode is a 3×3 convolution mode and the second convolution mode is a 1×1 convolution mode.

For example, the control unit 5 can receive a control signal or a mode instruction and, according to this control signal or mode instruction, determine in which mode the other modules and units operate. This control signal or mode instruction may come from another control unit or processing unit.

When the buffer device 2 and the convolution operation module 3 operate in the first convolution mode, the buffer device 2 retains part of the input data of the previous clock cycle into the current clock cycle and generates a plurality of remapping data according to that part of the previous clock cycle's input data and the input data of the current clock cycle; the remapping data are input to the convolution operation module 3. The remapping data include the data required by several sliding windows, for example 3 remapping data for the 3×3 convolution mode. Since the convolution operation module 3 has parallel operation and shifting capabilities, with a stride of 1, providing 3 remapping data per clock cycle is enough for the convolution operation module 3 to perform one 3×3 convolution operation.

When operating in the second convolution mode, the buffer device 2 outputs the input data of the current clock cycle to the convolution operation module 3; in this case it is not necessary to retain part of the input data of the previous clock cycle into the current clock cycle.

The buffer unit 4 temporarily stores the results of the convolution operations and, when necessary, first performs a pooling operation. The operation results are stored into the memory 1 through the buffer device 2 and are then output elsewhere through the memory 1.
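
The description does not fix which pooling operation is used; purely as an illustrative sketch, the following assumes 2×2 max pooling with stride 2, one common choice, applied to a result organized as a list of rows:

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over a list-of-lists feature map
    (an assumed example of the pooling the buffer unit 4 may perform)."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[r]) - 1, 2):
            block = (feature_map[r][c], feature_map[r][c + 1],
                     feature_map[r + 1][c], feature_map[r + 1][c + 1])
            row.append(max(block))   # keep the maximum of each 2x2 block
        pooled.append(row)
    return pooled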

Referring to FIG. 2, FIG. 2 is a schematic diagram of a convolution operation performed on two-dimensional data by the convolution operation device of FIG. 1. The two-dimensional data has multiple columns and rows and is, for example, an image, of which only 5×4 pixels are schematically shown. A filter of 3×3 matrix size is used for the convolution of the two-dimensional data; the filter has coefficients FC0~FC8, and the stride of the filter movement is smaller than the shortest width of the filter. The size of the filter corresponds to the sliding window (convolution window). The sliding window moves across the 5×4 image at intervals, and each time it moves, a 3×3 convolution operation is performed on the corresponding data P0~P8 in the window; the result of the convolution operation may be called a feature value. The interval by which the sliding window S moves each time is called the stride. Since the stride does not exceed the size of the sliding window S (the convolution size), the stride of the sliding window in this embodiment is smaller than a movement of 3 pixels. Moreover, adjacent convolution operations often share overlapping data. With a stride equal to 1, data P2, P5, and P8 are new data, and data P0, P1, P3, P4, P6, and P7 are data that were already input in the previous round of convolution operations. For typical convolutional neural network applications, commonly used sliding window sizes are 1×1, 3×3, 5×5, and 7×7, among which the sliding window size of this embodiment (3×3) is the most common.
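
Stated as a formula, the feature value produced for one window position is simply the sum of the element-wise products of the window data and the filter coefficients, with the coefficients aligned to the window as shown in FIG. 2:

feature value = P0×FC0 + P1×FC1 + P2×FC2 + P3×FC3 + P4×FC4 + P5×FC5 + P6×FC6 + P7×FC7 + P8×FC8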

FIG. 3 is a block diagram of the memory, the buffer device, and the convolution operation module according to an embodiment of the present invention. Referring to FIG. 3, the buffer device 2 includes an input line 21, an input buffer unit 22, and a remapping unit 23. The input line 21 is coupled to the memory 1 and is configured to receive data from the memory 1 in a current clock cycle. The input buffer unit 22 is coupled to the input line 21 and is configured to buffer part of the input data in the current clock cycle and output it in a subsequent clock cycle. The remapping unit 23 is coupled to the input line 21 and the input buffer unit 22 and is configured to generate, in the current clock cycle, a plurality of remapping data according to the data on the input line 21 and the output of the input buffer unit 22; the remapping data are input to the convolution operation module 3 as inputs to the convolution operations.

For example, the input line 21 receives W data from the memory 1 in the current clock cycle, the remapping unit generates W sets of remapping data for the convolution operation module 3, and the convolution operation module performs W convolution operations according to the W sets of remapping data; each set of remapping data can be input to one convolution unit of the convolution operation module 3 for a convolution operation. The input buffer unit 22 buffers the last K of the W data and outputs them in a subsequent clock cycle, and the output of the input buffer unit 22 is arranged in front of the input line 21. Each set of remapping data includes M remapping data, and the convolution operation is an M×M convolution. The remapping unit 23 takes M data at every stride of J, from the starting end toward the ending end, across the output of the input buffer unit 22 and the input line 21, as one set of remapping data. In some implementations, the number of data input in the same clock cycle is equal to the number of convolution operations performed by the convolution operation module.

In this embodiment, W is 8, K is 2, M is 3, and J is 1. These values are merely illustrative and not limiting; in other embodiments W, K, M, and J can be other values.

In this embodiment, the input line 21 receives 8 data from the memory 1 in the current clock cycle, the unit of data is 1 byte, and the input buffer unit 22 is configured to buffer the last 2 input data in the current clock cycle and output them in the subsequent clock cycle. For example, in the 1st clock cycle, data R0~R7 are input to the remapping unit 23 and data R6~R7 are also input to the input buffer unit 22; in the 2nd clock cycle, data R8~R15 are input to the remapping unit 23 and data R14~R15 are also input to the input buffer unit 22; in the (i+1)-th clock cycle, data Ri*8~Ri*8+7 are input to the remapping unit 23 and data Ri*8+6~Ri*8+7 are also input to the input buffer unit 22.

In one clock cycle, 10 data are thus input to the remapping unit 23. For example, in the 1st clock cycle, the data R0~R7 on the input line 21 are input to the remapping unit 23; in the 2nd clock cycle, the output data R6~R7 of the input buffer unit 22 and the data R8~R15 on the input line 21 are input to the remapping unit 23; in the (i+1)-th clock cycle, the output data Ri*8-2~Ri*8-1 of the input buffer unit 22 and the data Ri*8~Ri*8+7 on the input line 21 are input to the remapping unit 23.

The remapping unit 23 generates 8 sets of remapping data for the convolution operation module 3. Each set of remapping data includes 3 remapping data, each remapping datum is 1 byte, the convolution operation of each convolution unit 30~37 is a 3×3 convolution, and the stride between the convolution operations is 1. The generation of the remapping data will be described with reference to FIG. 4.

Since the convolution operation is a 3×3 convolution with a stride of 1, the remapping unit 23 needs at least 10 data to generate 8 sets of remapping data for 8 convolution operations. Without the input buffer unit 22, the remapping unit 23 could obtain only 8 data from the input line 21 and could therefore generate only 6 sets of remapping data for 6 convolution operations; moreover, each load from the memory 1 would have to reload 2 data that were already loaded the previous time, and the number of data input from the memory 1 in the same clock cycle would not equal the number of convolution operations performed by the convolution operation module 3, which would reduce overall performance.

In contrast, with the input buffer unit 22 of the present case, the remapping unit 23 can obtain 10 data in one clock cycle and can therefore generate 8 sets of remapping data for 8 convolution operations, so that the number of data input from the memory 1 equals the number of convolution operations performed by the convolution operation module 3.

FIG. 4 is a schematic diagram of the operation of the buffer device of FIG. 3. Referring to FIG. 3 and FIG. 4, for example, the data stored in the memory 1 is two-dimensional array data, the buffer device 2 serves as a column buffer, and the input buffer unit 22 serves as a partial row buffer.

The sliding window reads, column by column, the data of a fixed set of 8 rows from the memory 1 in sequence. In each clock cycle, the 8 rows of data of the memory 1 serve as 8 input data.

In clock cycle Clk n, data R0~R7 are input from the memory 1 through the input line 21 to the remapping unit 23, and data R6~R7 are also input to the input buffer unit 22. The remapping unit 23 takes, within the input line 21, that is, within data R0~R7, 3 data at every stride of 1 from the starting end (data R0) toward the ending end (data R7) as one set of remapping data, so the sets of remapping data are R0~R2, R1~R3, ..., and R5~R7. However, because the data input to the remapping unit 23 are insufficient, the remapping unit 23 obtains only 8 data and can therefore produce only 6 sets of remapping data for valid convolution operations, along with 2 sets that cannot be used for valid convolution operations; this happens only at the beginning.

In clock cycle Clk n+1, data R6~R7 are input from the input buffer unit 22 to the remapping unit 23, data R8~R15 are input from the memory 1 through the input line 21 to the remapping unit 23, and data R14~R15 are also input to the input buffer unit 22. The remapping unit 23 takes, across the output of the input buffer unit 22 and the input line 21, that is, from data R6~R15, 3 data at every stride of 1 from the starting end (R6) toward the ending end (R15) as one set of remapping data, so the sets of remapping data are R6~R8, R7~R9, ..., and R13~R15. Ten data are now input to the remapping unit 23, so the data of 8 valid convolution operations are produced.

By analogy, in clock cycle Clk n+i, data Ri*8-2~Ri*8-1 are input from the input buffer unit 22 to the remapping unit 23, data Ri*8~Ri*8+7 are input from the memory 1 through the input line 21 to the remapping unit 23, and data Ri*8+6~Ri*8+7 are also input to the input buffer unit 22. The remapping unit 23 takes, across the output of the input buffer unit 22 and the input line 21, that is, from data Ri*8-2~Ri*8+7, 3 data at every stride of 1 from the starting end (Ri*8-2) toward the ending end (Ri*8+7) as one set of remapping data, so the sets of remapping data are Ri*8-2~Ri*8, Ri*8-1~Ri*8+1, ..., and Ri*8+5~Ri*8+7. Ten data are input to the remapping unit 23, so the data of 8 valid convolution operations are produced; in the same clock cycle, reading 8 data from the memory 1 correspondingly produces 8 convolution operation results.
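
A compact software model of this clock-by-clock behaviour, offered only as a sketch under the same W=8, K=2, M=3, J=1 assumptions, is:

def simulate_column_buffer(memory_data, num_clocks, W=8, K=2, M=3, J=1):
    """Model the buffer device over several clock cycles: memory_data is the
    sequence R0, R1, R2, ... read W at a time; in each cycle the last K inputs
    are also latched so that, from the second cycle on, 10 data are available
    and 8 valid remapping sets are produced."""
    latched = []                                        # input buffer unit 22
    per_clock_sets = []
    for clk in range(num_clocks):
        new_inputs = memory_data[clk * W:(clk + 1) * W]     # input line 21
        lane = latched + new_inputs                         # latched data in front
        sets = [lane[i:i + M] for i in range(0, len(lane) - M + 1, J)]
        per_clock_sets.append(sets)
        latched = new_inputs[-K:]                           # kept for the next cycle
    return per_clock_sets

# With memory_data = ['R0', 'R1', ..., 'R23'] and num_clocks = 3, the first
# cycle yields the 6 warm-up sets R0~R2 through R5~R7, and the next two cycles
# each yield 8 sets starting with (R6, R7, R8) and (R14, R15, R16).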

In addition, referring again to FIG. 3, the buffer device 2 further includes a control unit 24 coupled to and controlling the remapping unit 23, while the control unit 5 controls the operation of the convolution operation module 3, the buffer device 2, and the buffer unit 4, for example the timing at which data are input from the memory 1 to the buffer device 2 and the convolution operation module 3, the scale of the convolution operations of the convolution operation module 3, the read addresses of data from the memory 1 to the buffer device 2, the write addresses of data from the buffer unit 4 to the memory 1, and the convolution mode in which the convolution operation module 3 and the buffer device 2 operate. The signals conv_size, Stride, num_row, and num_col are input to the control unit 24: conv_size indicates the convolution size currently to be performed, Stride indicates the stride between convolution operations, num_row indicates the number of rows of the current data, and num_col indicates the number of columns of the current data. The control unit 24 controls the operation mode of the remapping unit 23 and of the convolution operations of the convolution operation module 3 according to conv_size, and controls the read displacement according to Stride.
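
As a non-limiting illustration, these control settings could be modeled in software as follows; the field names mirror the signal names in FIG. 3, while the helper methods and their decode logic are assumptions made only for this sketch:

from dataclasses import dataclass

@dataclass
class RemapControl:
    conv_size: int   # conv_size: current convolution size, e.g. 3 for 3x3, 1 for 1x1
    stride: int      # Stride: stride between adjacent convolution operations
    num_row: int     # num_row: number of rows of the current data
    num_col: int     # num_col: number of columns of the current data

    def use_first_mode(self) -> bool:
        """Assumed decode rule: use the remapping path (e.g. 3x3 mode) when conv_size > 1."""
        return self.conv_size > 1

    def read_step(self) -> int:
        """Read displacement used when taking the remapping sets."""
        return self.stride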

For example, the remapping unit 23 and the convolution operation module 3 can operate in the first convolution mode and the second convolution mode. Which convolution mode they operate in is determined by the control unit 24; that is, the control unit 24 controls the remapping unit 23 and the convolution operation module 3 to operate in the first convolution mode or the second convolution mode. Different convolution modes perform convolution operations with sliding windows (convolution sizes) of different sizes. For example, the first convolution mode is a 3×3 convolution mode and the second convolution mode is a 1×1 convolution mode.

When the remapping unit 23 and the convolution operation module 3 operate in the first convolution mode, the remapping unit 23 generates, in the current clock cycle, the remapping data according to the data on the input line 21 and the output of the input buffer unit 22, and the remapping data are input to the convolution operation module 3 as inputs to the convolution operations. The input buffer unit 22 retains part of the input data of the previous clock cycle into the current clock cycle, and the remapping unit 23 generates a plurality of remapping data according to the part of the previous clock cycle's input data output by the input buffer unit 22 and the input data on the input line 21 in the current clock cycle; the remapping data are input to the convolution operation module 3. The remapping data include the data required by several sliding windows, for example 3 remapping data for the 3×3 convolution mode. Since the convolution operation module 3 has parallel operation and shifting capabilities, with a stride of 1, providing 3 remapping data per clock cycle is enough for the convolution operation module 3 to perform one 3×3 convolution operation.

When operating in the second convolution mode, the remapping unit 23 outputs the input data on the input line 21 in the current clock cycle to the convolution operation module 3; in this case the remapping unit 23 does not need to obtain the retained part of the previous clock cycle's input data from the input buffer unit 22.

FIG. 5 is a block diagram of a convolution unit according to an embodiment of the present invention. A 3×3 convolution unit is taken as an example below. Referring to FIG. 5, the convolution unit 6 includes 9 processing elements PE0~PE8 (process engines), an address decoder 61, and an adder 62. The input data to be convolved are input to processing elements PE0~PE2 via the line data[47:0]; processing elements PE0~PE2 pass the input data of the current clock cycle to processing elements PE3~PE5 in the next clock cycle, and processing elements PE3~PE5 pass the input data of the current clock cycle to processing elements PE6~PE8 in the next clock cycle. The filter coefficients are input to processing elements PE0~PE8 through the line fc_bus[47:0]. When a convolution operation is performed, each processing element PE multiplies the filter coefficient at the address selected through the address decoder 61 with the input data supplied to processing elements PE0~PE8. When the convolution unit 6 performs a 3×3 convolution operation, the adder 62 adds the results of the multiplications to obtain the result of the convolution operation as the output psum[35:0]. When the convolution unit 6 performs a 1×1 convolution operation, the adder 62 directly takes the results of the convolution operations of processing elements PE0~PE2 as the outputs pm_0[31:0], pm_1[31:0], and pm_2[31:0].
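
A behavioural software sketch of this unit is shown below; it is a model only, with the internal shifting made explicit, and it assumes that the three values arriving per clock cycle stand for the data carried on data[47:0]:

class ConvUnitModel:
    """Software model of convolution unit 6: three rows of processing
    elements; in each clock cycle the new column of three data enters the
    first row and the previous rows shift down, so after three cycles a
    full 3x3 window is held and can be multiplied and summed."""
    def __init__(self, coeffs):
        self.coeffs = coeffs                                  # FC0..FC8
        self.rows = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]         # PE0~PE2 / PE3~PE5 / PE6~PE8

    def clock(self, new_three, mode_3x3=True):
        # Shift: PE3~PE5 values move to PE6~PE8, PE0~PE2 values move to
        # PE3~PE5, and the new data enter PE0~PE2.
        self.rows = [list(new_three), self.rows[0], self.rows[1]]
        held = self.rows[0] + self.rows[1] + self.rows[2]
        products = [d * c for d, c in zip(held, self.coeffs)]  # per-PE multiplications
        if mode_3x3:
            return sum(products)        # psum: the 3x3 convolution result
        return products[0:3]            # pm_0, pm_1, pm_2 in the 1x1 mode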

In addition, a convolution operation method for a data stream includes: obtaining, from a buffer, data that has already been input to a previous round of convolution operations; obtaining, from the data stream in a memory, data that has not yet been input to the previous round of convolution operations; generating multiple sets of remapping data from the data obtained from the buffer and from the data stream; performing the current round of convolution operations based on a filter and each set of remapping data; and retaining part of the data of the current round of convolution operations in the buffer for the next round of convolution operations.

In an embodiment, W data not yet input to the previous round of convolution operations are obtained from the data stream in the memory, and W sets of remapping data are generated as inputs to W convolution operations, respectively. The last K of the W data not yet input to the previous round of convolution operations are retained in the buffer for the next round of convolution operations. Each set of remapping data includes M remapping data, and the convolution operation is an M×M convolution.

In an embodiment, the stride between the convolution operations is 1, each set of remapping data includes 3 remapping data, the convolution operation is a 3×3 convolution, and the last 2 of the data not yet input to the previous round of convolution operations are retained in the buffer for the next round of convolution operations.

In an embodiment, the buffer is a register inside a processor, and the memory is a cache memory inside the processor.

The convolution operation method can be applied to or implemented in the convolution operation device of the foregoing embodiments, and the related variations and implementations are therefore not repeated. The convolution operation method can also be applied to or implemented in other computing devices. For example, the convolution operation method for a data stream can be used in a processor capable of executing instructions; the instructions configured to perform the convolution operation method are stored in a memory, and the processor is coupled to the memory and executes the instructions to carry out the convolution operation method. For example, the processor includes a cache memory, an arithmetic unit, and internal registers; the cache memory stores the data stream, the arithmetic unit performs the convolution operations, and the internal registers retain part of the data of the current round of convolution operations within the convolution operation module for the next round of convolution operations.

In summary, in the buffer device and the convolution operation device, the input buffer unit of the buffer device can buffer part of the input data as input data for the next clock cycle. Even if a convolution operation requires more input data than can be read from the memory at one time, the remapping unit can still obtain the missing data from the input buffer unit and therefore provide enough remapping data for the convolution operations, which improves overall convolution performance. In addition, since the number of data input from the memory equals the number of results output by the convolution operations, this architecture is also well suited to processing data streams.

The above embodiments are not intended to limit the present invention; any equivalent modification or change made by those skilled in the art without departing from the spirit and scope of the present invention shall be included in the scope of the appended claims.

Claims (17)

1. A buffer device coupled to a memory, comprising: an input line coupled to the memory and configured to receive data from the memory in a current clock cycle; an input buffer unit coupled to the input line and configured to buffer part of the input data in the current clock cycle and output it in a subsequent clock cycle; and a remapping unit coupled to the input line and the input buffer unit and configured to generate, in the current clock cycle, a plurality of remapping data according to the data on the input line and the output of the input buffer unit, the remapping data serving as inputs to convolution operations.

2. The buffer device of claim 1, wherein the input line receives W of the data from the memory in the current clock cycle, and the remapping unit generates W sets of the remapping data as inputs to W convolution operations, respectively.

3. The buffer device of claim 2, wherein the input buffer unit buffers the last K of the W data and outputs them in a subsequent clock cycle, and the output of the input buffer unit is arranged in front of the input line.

4. The buffer device of claim 3, wherein each set of the remapping data includes M of the remapping data, and the convolution operation is an M×M convolution.

5. The buffer device of claim 4, wherein the remapping unit takes M data at every stride of J, from the starting end toward the ending end, across the output of the input buffer unit and the input line, as one set of remapping data.

6. The buffer device of claim 3, wherein the stride between the convolution operations is 1, each set of remapping data includes 3 of the remapping data, the convolution operation is a 3×3 convolution, and the input buffer unit is configured to buffer the last 2 input data in the current clock cycle and output them in a subsequent clock cycle.

7. The buffer device of claim 1, further comprising: a control unit coupled to and controlling the remapping unit.

8. The buffer device of claim 1, wherein the data stored in the memory is two-dimensional array data, the buffer device serves as a column buffer, and the input buffer unit serves as a partial row buffer.

9. The buffer device of claim 1, wherein the remapping unit is operable in a first convolution mode and a second convolution mode; when operating in the first convolution mode, the remapping unit generates, in the current clock cycle, the remapping data according to the data on the input line and the output of the input buffer unit, and the remapping data serve as inputs to convolution operations; when operating in the second convolution mode, the remapping unit outputs, in the current clock cycle, the data on the input line as inputs to convolution operations.

10. The buffer device of claim 9, wherein the first convolution mode is a 3×3 convolution mode and the second convolution mode is a 1×1 convolution mode.

11. A convolution operation device, comprising: a memory; a convolution operation module; and the buffer device of any one of claims 1 to 10, wherein the remapping data are input to the convolution operation module.

12. A convolution operation method for a data stream, comprising: obtaining, from a buffer, data that has already been input to a previous round of convolution operations; obtaining, from the data stream in a memory, data that has not yet been input to the previous round of convolution operations; generating multiple sets of remapping data from the data obtained from the buffer and from the data stream; performing a current round of convolution operations based on a filter and each set of the remapping data; and retaining part of the data of the current round of convolution operations in the buffer for a next round of convolution operations.

13. The convolution operation method of claim 12, wherein W data not yet input to the previous round of convolution operations are obtained from the data stream in the memory, and W sets of the remapping data are generated as inputs to W convolution operations, respectively.

14. The convolution operation method of claim 13, wherein the last K of the W data not yet input to the previous round of convolution operations are retained in the buffer for the next round of convolution operations.

15. The convolution operation method of claim 12, wherein each set of the remapping data includes M of the remapping data, and the convolution operation is an M×M convolution.

16. The convolution operation method of claim 12, wherein the stride between the convolution operations is 1, each set of remapping data includes 3 of the remapping data, the convolution operation is a 3×3 convolution, and the last 2 of the data not yet input to the previous round of convolution operations are retained in the buffer for the next round of convolution operations.

17. The convolution operation method of claim 12, wherein the buffer is a register inside a processor, and the memory is a cache memory inside the processor.
TW105137126A (filed 2016-11-14): Buffer device and convolution operation device and method, granted as TWI634436B

Priority Applications (1)

TW105137126A, priority and filing date 2016-11-14: Buffer device and convolution operation device and method (granted as TWI634436B)

Publications (2)

Publication Number Publication Date
TW201818264A, published 2018-05-16
TWI634436B (granted patent), published 2018-09-01

Family

ID=62949403

Family Applications (1)

TW105137126A (published as TW201818264A, granted as TWI634436B): Buffer device and convolution operation device and method

Country Status (1)

TW: TWI634436B

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI696127B (en) * 2018-06-29 2020-06-11 奇景光電股份有限公司 Framebuffer-less system and method of convolutional neural network
TWI718634B (en) * 2018-08-21 2021-02-11 創鑫智慧股份有限公司 Feature map caching method of convolutional neural network and system thereof

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI766568B (en) * 2020-04-17 2022-06-01 神盾股份有限公司 Processing device for executing convolution neural network computation and operation method thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5151953A (en) * 1990-12-10 1992-09-29 Harris Corporation Single chip 2-D convolver
KR100298327B1 (en) * 1999-06-30 2001-11-01 구자홍 Method and Apparatus for high speed Convolution
US20060215929A1 (en) * 2005-03-23 2006-09-28 David Fresneau Methods and apparatus for image convolution
US9514563B2 (en) * 2013-08-30 2016-12-06 Arm Limited Graphics processing systems


Also Published As

Publication number Publication date
TWI634436B (en) 2018-09-01


Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees