TWI766193B - Convolutional neural network processor and data processing method thereof - Google Patents
- Publication number
- TWI766193B (application TW108136729A)
- Authority
- TW
- Taiwan
- Prior art keywords
- sub
- parallel
- output
- bias
- input
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
- Image Analysis (AREA)
Abstract
Description
The present invention relates to a convolutional neural network processor and a data processing method thereof, and more particularly to a convolutional neural network processor having a message decoding unit and a convolution determination unit, and a data processing method thereof.
Convolutional neural networks (CNNs) have recently been widely applied in computer vision and image processing. Recent applications, however, focus mostly on object recognition and object detection, so CNN hardware has not been optimized for image-processing networks. Such hardware ignores that, in image-processing networks, (1) the spatial resolution is not heavily downsampled and (2) model sparsity cannot be relied upon, which leads to extremely high memory-bandwidth and compute requirements.
In view of this, the present invention provides a convolutional neural network processor capable of highly parallel operation, and a data processing method thereof, to deliver high-performance computation.
The convolutional neural network processor and data processing method provided by the present invention perform highly parallel operations through a message decoding unit and a convolution determination unit.
According to one embodiment of the present invention, a convolutional neural network processor for operating on input data includes a message decoding unit and a convolution determination unit. The message decoding unit receives an input program and a plurality of input weight parameters, and includes a decoding module and a parallel processing module. The decoding module receives the input program and outputs an operation instruction according to the input program. The parallel processing module is electrically connected to the decoding module and receives the input weight parameters; it includes a plurality of parallel processing sub-modules, which generate a plurality of output weight parameters according to the operation instruction and the input weight parameters. The convolution determination unit is electrically connected to the message decoding unit and includes an operation module. The operation module is electrically connected to the parallel processing module and generates output data by operating on the input data and the output weight parameters.
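The dataflow of this embodiment can be sketched as follows. This is a minimal illustrative model, not the patented circuit: all function names are invented here, the "operation instruction" is reduced to a dictionary, and the convolution is stood in for by a dot product.

```python
# Illustrative sketch of the described dataflow: decoding module -> parallel
# processing sub-modules -> operation module. Names and data shapes are
# assumptions for illustration only.

def decode(input_program):
    # Decoding module: map the input program to an operation instruction.
    return {"op": input_program["op"]}

def parallel_process(instruction, input_weights, n_submodules=4):
    # Each parallel processing sub-module prepares one contiguous slice of
    # the weights; together the slices form the output weight parameters.
    k = -(-len(input_weights) // n_submodules)  # ceil division: chunk size
    chunks = [input_weights[i * k:(i + 1) * k] for i in range(n_submodules)]
    return [w for chunk in chunks for w in chunk]

def operate(input_data, output_weights):
    # Operation module: a dot product stands in for the convolution.
    return sum(d * w for d, w in zip(input_data, output_weights))

def cnn_processor(input_program, input_weights, input_data):
    instruction = decode(input_program)
    output_weights = parallel_process(instruction, input_weights)
    return operate(input_data, output_weights)
```

In hardware, the three stages run concurrently on independent modules; the sequential calls here only show the direction of the dataflow.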
Thereby, the convolutional neural network processor can perform highly parallel operations through the decoding module and the parallel processing module of the message decoding unit and the operation module of the convolution determination unit, providing high-performance, low-power computation.
In the convolutional neural network processor of the foregoing embodiment, the decoding module includes a program memory and an instruction decoder. The program memory stores the input program. The instruction decoder is electrically connected to the program memory and decodes the input program to output the operation instruction.
In the convolutional neural network processor of the foregoing embodiment, when the input weight parameters are a plurality of uncompressed input weight parameters, each parallel processing sub-module includes a plurality of parallel sub-memories and a plurality of parallel sub-processors. The parallel sub-memories store the uncompressed input weight parameters in parallel. The parallel sub-processors are electrically connected to the decoding module and the parallel sub-memories, respectively, and receive the uncompressed input weight parameters in parallel according to the operation instruction to generate the output weight parameters.
In the convolutional neural network processor of the foregoing embodiment, when the input weight parameters are a plurality of compressed input weight parameters, each parallel processing sub-module includes a plurality of parallel sub-memories and a plurality of parallel sub-processors. The parallel sub-memories store the compressed input weight parameters in parallel. The parallel sub-processors are electrically connected to the decoding module and the parallel sub-memories, respectively, and receive and decompress the compressed input weight parameters in parallel according to the operation instruction to generate the output weight parameters.
In the convolutional neural network processor of the foregoing embodiment, the input weight parameters include a plurality of first input weight parameters, and the output weight parameters include a plurality of first output weight parameters. Each parallel processing sub-module includes a plurality of parallel sub-memories and a plurality of parallel sub-processors. The parallel sub-memories store the input weight parameters in parallel and include a plurality of first parallel sub-memories, which receive and store the first input weight parameters separately and in parallel. The parallel sub-processors are electrically connected to the decoding module and the parallel sub-memories, respectively, and include a plurality of first parallel sub-processors. The first parallel sub-processors are electrically connected to the first parallel sub-memories, respectively, and receive the first input weight parameters according to the operation instruction to output the first output weight parameters.
In the convolutional neural network processor of the foregoing embodiment, each first output weight parameter includes a plurality of 3×3 weight parameters, and the operation module includes a 3×3 operation sub-module. The 3×3 operation sub-module is electrically connected to the first parallel sub-processors and operates on the input data according to the first output weight parameters to produce 3×3 post-processed operation data. The 3×3 operation sub-module includes a plurality of 3×3 convolution distributor groups, a plurality of 3×3 local convolution operation units, and a plurality of 3×3 post-processing operation units. Each 3×3 convolution distributor group is electrically connected to a first parallel sub-processor and receives and distributes the 3×3 weight parameters of the first output weight parameters. The 3×3 local convolution operation units are each electrically connected to a 3×3 convolution distributor group, and each includes a 3×3 local register group and a 3×3 local filter operation unit. The 3×3 local register group is electrically connected to a 3×3 convolution distributor group; it receives and stores the 3×3 weight parameters of the first output weight parameters and outputs a plurality of 3×3 operation parameters accordingly. The 3×3 local filter operation unit is electrically connected to the 3×3 local register group and operates on the input data according to the 3×3 operation parameters to generate a plurality of 3×3 operation data. The 3×3 post-processing operation units are electrically connected to the 3×3 local convolution operation units and perform 3×3 post-processing on the 3×3 operation data to produce the 3×3 post-processed operation data, the output data being the 3×3 post-processed operation data.
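The 3×3 local filtering and post-processing described above can be sketched as a plain sliding-window convolution. The post-processing step is modeled here as bias-add plus ReLU, which is an assumption for illustration; the patent does not fix the post-processing operation.

```python
# Minimal sketch of one 3×3 local convolution unit followed by its
# post-processing unit. `image` is a 2-D list (the input data), `kernel`
# holds the nine 3×3 weight parameters.

def conv3x3(image, kernel, bias=0.0):
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):           # valid positions of the 3×3 window
        row = []
        for x in range(w - 2):
            acc = sum(image[y + dy][x + dx] * kernel[dy][dx]
                      for dy in range(3) for dx in range(3))
            # Post-processing (assumed): bias-add then ReLU.
            row.append(max(acc + bias, 0.0))
        out.append(row)
    return out
```

In the described hardware, many such units run in parallel, each fed its own 3×3 weight parameters by a 3×3 convolution distributor group.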
In the convolutional neural network processor of the foregoing embodiment, each 3×3 local register group includes two sub 3×3 local register groups, which alternately store a 3×3 weight parameter or output the 3×3 operation parameters to the 3×3 local filter operation unit.
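The two alternating sub register groups form a double buffer: while one bank supplies operation parameters to the filter unit, the other loads the next weight set, so weight loading overlaps with computation. A minimal sketch of this ping-pong behavior, with invented class and method names:

```python
# Illustrative double-buffered register group: two banks that alternate
# between "being loaded" and "being read by the filtering unit".

class DoubleBufferedRegisters:
    def __init__(self):
        self.banks = [None, None]
        self.active = 0  # index of the bank the filter unit currently reads

    def load(self, weights):
        # Writes always target the inactive bank, so the active bank's
        # contents stay stable for the ongoing convolution.
        self.banks[1 - self.active] = weights

    def swap(self):
        # Role exchange at a layer/tile boundary.
        self.active = 1 - self.active

    def read(self):
        return self.banks[self.active]
```

The same scheme is reused by the 1×1 local register groups described later in the embodiment.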
In the convolutional neural network processor of the foregoing embodiment, the input weight parameters further include a bias input weight parameter, and the output weight parameters further include a bias output weight parameter. The parallel sub-memories further include a bias parallel sub-memory, which stores the bias input weight parameter in parallel. The parallel sub-processors further include a bias parallel sub-processor, which is electrically connected to the bias parallel sub-memory and receives the bias input weight parameter according to the operation instruction to output the bias output weight parameter.
In the convolutional neural network processor of the foregoing embodiment, the bias output weight parameter includes a plurality of bias weight parameters. The operation module further includes a bias distributor electrically connected to the bias parallel sub-processor and the 3×3 operation sub-module. The bias distributor generates a plurality of 3×3 bias weight parameters from the bias output weight parameter and outputs them to the 3×3 post-processing operation units.
In the convolutional neural network processor of the foregoing embodiment, the input weight parameters further include at least one second input weight parameter, and the output weight parameters further include at least one second output weight parameter. The parallel sub-memories further include at least one second parallel sub-memory, which receives and stores the at least one second input weight parameter separately and in parallel. The parallel sub-processors further include at least one second parallel sub-processor, which is electrically connected to the at least one second parallel sub-memory and receives the at least one second input weight parameter according to the operation instruction to output the at least one second output weight parameter.
In the convolutional neural network processor of the foregoing embodiment, the operation module includes a 3×3 operation sub-module and a 1×1 operation sub-module. The 3×3 operation sub-module is electrically connected to the first parallel sub-processors and operates on the input data according to the first output weight parameters to produce 3×3 post-processed operation data. The 1×1 operation sub-module is electrically connected to the at least one second parallel sub-processor and to the 3×3 operation sub-module, and operates on the 3×3 post-processed operation data according to the at least one second output weight parameter to produce 1×1 post-processed operation data; the output data can be the 1×1 post-processed operation data.
In the convolutional neural network processor of the foregoing embodiment, the at least one second output weight parameter includes a plurality of 1×1 weight parameters, and the 1×1 operation sub-module includes at least one 1×1 convolution distributor group, a plurality of 1×1 local convolution operation units, and a plurality of 1×1 post-processing operation units. The 1×1 convolution distributor group is electrically connected to the at least one second parallel sub-processor and receives and distributes the 1×1 weight parameters of the at least one second output weight parameter. The 1×1 local convolution operation units are electrically connected to the at least one 1×1 convolution distributor group, and each includes a 1×1 local register group and a 1×1 local filter operation unit. The 1×1 local register group is electrically connected to the at least one 1×1 convolution distributor group; it receives and stores the 1×1 weight parameters and outputs a plurality of 1×1 operation parameters accordingly. The 1×1 local filter operation unit is electrically connected to the 1×1 local register group and operates on the 3×3 post-processed operation data according to the 1×1 operation parameters to generate a plurality of 1×1 operation data. The 1×1 post-processing operation units are electrically connected to the 1×1 local convolution operation units and perform 1×1 post-processing on the 1×1 operation data to produce the 1×1 post-processed operation data.
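A 1×1 convolution combines the channels of the 3×3 post-processed feature maps point-wise: one weight per input channel, no spatial window. A minimal sketch with plain lists, where `maps` is a list of equally sized 2-D feature maps (one per channel); the data layout is an assumption for illustration.

```python
# Illustrative 1×1 local filtering: a per-pixel weighted sum across
# channels, as performed on the 3×3 post-processed operation data.

def conv1x1(maps, channel_weights, bias=0.0):
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(ch[y][x] * wt for ch, wt in zip(maps, channel_weights)) + bias
             for x in range(w)]
            for y in range(h)]
```

Chaining the 3×3 stage into the 1×1 stage this way lets the processor fuse the two layers without writing the intermediate feature maps back to external memory.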
In the convolutional neural network processor of the foregoing embodiment, each 1×1 local register group includes two sub 1×1 local register groups, which alternately store a 1×1 weight parameter or output the 1×1 operation parameters to the 1×1 local filter operation unit.
In the convolutional neural network processor of the foregoing embodiment, the input weight parameters further include a bias input weight parameter, and the output weight parameters further include a bias output weight parameter. The parallel sub-memories further include a bias parallel sub-memory, which stores the bias input weight parameter in parallel. The parallel sub-processors further include a bias parallel sub-processor, which is electrically connected to the bias parallel sub-memory and receives the bias input weight parameter according to the operation instruction to output the bias output weight parameter.
In the convolutional neural network processor of the foregoing embodiment, the bias output weight parameter includes a plurality of bias weight parameters. The operation module further includes a bias distributor electrically connected to the bias parallel sub-processor, the 3×3 operation sub-module, and the 1×1 operation sub-module. The bias distributor generates a plurality of 3×3 bias weight parameters and a plurality of 1×1 bias weight parameters from the bias output weight parameter, outputs the 3×3 bias weight parameters to the 3×3 post-processing operation units, and outputs the 1×1 bias weight parameters to the 1×1 post-processing operation units.
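The bias distributor's fan-out can be sketched as a simple split of the bias output weight parameters between the two sub-modules. The split rule used here (first the 3×3 biases, then the 1×1 biases) is an assumption; the patent specifies only that both groups are derived from the bias output weight parameter.

```python
# Illustrative bias distributor: partition the bias weight parameters into
# the group delivered to the 3×3 post-processing units and the group
# delivered to the 1×1 post-processing units.

def distribute_bias(bias_params, n_3x3_units, n_1x1_units):
    b3 = bias_params[:n_3x3_units]                        # 3×3 bias weights
    b1 = bias_params[n_3x3_units:n_3x3_units + n_1x1_units]  # 1×1 bias weights
    return b3, b1
```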
According to one embodiment of the present invention, a data processing method for a convolutional neural network processor includes a receiving step, an instruction decoding step, a parallel processing step, and an operation step. The receiving step drives a message decoding unit to receive an input program and a plurality of input weight parameters, the message decoding unit including a decoding module and a parallel processing module. The instruction decoding step drives the decoding module to receive the input program and generate an operation instruction from it. The parallel processing step drives the parallel processing module to receive the input weight parameters and process them in parallel according to the operation instruction to generate a plurality of output weight parameters. The operation step drives an operation module to receive the input data and the output weight parameters and to operate on them according to the operation instruction to generate the output data.
Thereby, through the receiving, instruction decoding, parallel processing, and operation steps, the data processing method can drive the decoding module and the parallel processing module of the message decoding unit and the operation module of the convolution determination unit to perform highly parallel operations, providing high-performance, low-power computation.
In the data processing method of the foregoing embodiment, the decoding module includes a program memory and an instruction decoder, and the instruction decoding step includes a program storing sub-step and a program decoding sub-step. The program storing sub-step drives the program memory to store the input program. The program decoding sub-step drives the instruction decoder to decode the input program to generate the operation instruction.
In the data processing method of the foregoing embodiment, the parallel processing module includes a plurality of parallel sub-memories and a plurality of parallel sub-processors, and the parallel processing step includes a weight parameter storing sub-step and a weight parameter processing sub-step. The weight parameter storing sub-step drives the parallel sub-memories to store the input weight parameters in parallel. The weight parameter processing sub-step drives the parallel sub-processors, which read the input weight parameters in parallel according to the operation instruction and perform operation processing on them to generate the output weight parameters.
In the data processing method of the foregoing embodiment, when the input weight parameters are a plurality of uncompressed input weight parameters, the operation processing is storing the uncompressed input weight parameters; when the input weight parameters are a plurality of compressed input weight parameters, the operation processing is storing and decompressing the compressed input weight parameters.
In the data processing method of the foregoing embodiment, the output weight parameters include a plurality of first output weight parameters, the operation module includes a 3×3 operation sub-module, and the operation step includes a first operation sub-step. The first operation sub-step drives the 3×3 operation sub-module to receive the input data and the first output weight parameters to produce 3×3 post-processed operation data.
In the data processing method of the foregoing embodiment, each first output weight parameter includes a plurality of 3×3 weight parameters, and the 3×3 operation sub-module includes a plurality of 3×3 convolution distributor groups, a plurality of 3×3 local convolution operation units, and a plurality of 3×3 post-processing operation units. The first operation sub-step includes a 3×3 parameter distribution procedure, a 3×3 operation parameter generation procedure, a 3×3 convolution operation procedure, and a 3×3 post-processing operation procedure. The 3×3 parameter distribution procedure drives the 3×3 convolution distributor groups to receive the 3×3 weight parameters of the first output weight parameters and distribute them to the 3×3 local convolution operation units, each of which includes a 3×3 local register group and a 3×3 local filter operation unit. The 3×3 operation parameter generation procedure drives the 3×3 local register groups of the 3×3 local convolution operation units to receive the 3×3 weight parameters and generate a plurality of 3×3 operation parameters from them. The 3×3 convolution operation procedure drives the 3×3 local filter operation units to perform a 3×3 convolution operation on the 3×3 operation parameters and the input data to generate a plurality of 3×3 operation data. The 3×3 post-processing operation procedure drives the 3×3 post-processing operation units to perform 3×3 post-processing on the 3×3 operation data to produce the 3×3 post-processed operation data, the output data being the 3×3 post-processed operation data.
In the data processing method of the foregoing embodiment, the output weight parameters further include a bias output weight parameter, the operation module further includes a bias distributor, and the operation step further includes a bias operation sub-step. The bias operation sub-step drives the bias distributor to generate a plurality of 3×3 bias weight parameters from the bias output weight parameter and provide them to the 3×3 operation sub-module.
In the data processing method of the foregoing embodiment, the output weight parameters further include at least one second output weight parameter, the operation module includes a 1×1 operation sub-module, and the operation step further includes a second operation sub-step. The second operation sub-step drives the 1×1 operation sub-module to receive the 3×3 post-processed operation data and the at least one second output weight parameter to produce 1×1 post-processed operation data.
In the data processing method of the foregoing embodiment, the at least one second output weight parameter includes a plurality of 1×1 weight parameters, and the 1×1 operation sub-module includes a plurality of 1×1 convolution distributor groups, a plurality of 1×1 local convolution operation units, and a plurality of 1×1 post-processing operation units. The second operation sub-step includes a 1×1 parameter distribution procedure, a 1×1 operation parameter generation procedure, a 1×1 convolution operation procedure, and a 1×1 post-processing operation procedure. The 1×1 parameter distribution procedure drives the 1×1 convolution distributor groups to receive the 1×1 weight parameters of the at least one second output weight parameter and distribute them to the 1×1 local convolution operation units, each of which includes a 1×1 local register group and a 1×1 local filter operation unit. The 1×1 operation parameter generation procedure drives the 1×1 local register groups of the 1×1 local convolution operation units to receive the 1×1 weight parameters and generate a plurality of 1×1 operation parameters from them. The 1×1 convolution operation procedure drives the 1×1 local filter operation units to perform a 1×1 convolution operation on the 1×1 operation parameters and the 3×3 post-processed operation data to generate a plurality of 1×1 operation data. The 1×1 post-processing operation procedure drives the 1×1 post-processing operation units to perform 1×1 post-processing on the 1×1 operation data to produce the 1×1 post-processed operation data, the output data being the 1×1 post-processed operation data.
In the data processing method of the foregoing embodiment, the output weight parameters further include a bias output weight parameter, the operation module further includes a bias distributor, and the operation step further includes a bias operation sub-step. The bias operation sub-step drives the bias distributor to generate a plurality of 3×3 bias weight parameters and a plurality of 1×1 bias weight parameters from the bias output weight parameter, the bias distributor providing the 3×3 bias weight parameters to the 3×3 operation sub-module and the 1×1 bias weight parameters to the 1×1 operation sub-module.
100‧‧‧Convolutional neural network processor
102‧‧‧Input program
104‧‧‧Input weight parameters
106‧‧‧Input data
1062‧‧‧3×3 post-processed operation data
1064‧‧‧1×1 post-processed operation data
108‧‧‧Output data
110‧‧‧Message decoding unit
111‧‧‧Decoding module
1111‧‧‧Program memory
1112‧‧‧Instruction decoder
112‧‧‧Parallel processing module
1121‧‧‧Parallel processing sub-module
1121a‧‧‧Parallel sub-memory
1121aa‧‧‧First parallel sub-memory
1121ab‧‧‧Bias parallel sub-memory
1121ac‧‧‧Second parallel sub-memory
1213d‧‧‧1×1 local filter operation unit
1213e‧‧‧1×1 post-processing operation unit
1213f‧‧‧First 1×1 convolution distributor
1213g‧‧‧Second 1×1 convolution distributor
122‧‧‧Controller
s200‧‧‧Data processing method of the convolutional neural network processor
s210‧‧‧Receiving step
s220‧‧‧Instruction decoding step
s221‧‧‧Program storage sub-step
s222‧‧‧Program decoding sub-step
s230‧‧‧Parallel processing step
s231‧‧‧Weight parameter storage sub-step
1121b‧‧‧Parallel sub-processor
1121ba‧‧‧First parallel sub-processor
1121bb‧‧‧Bias parallel sub-processor
1121bc‧‧‧Second parallel sub-processor
120‧‧‧Convolution judgment unit
121‧‧‧Operation module
1211‧‧‧3×3 operation sub-module
1211a‧‧‧3×3 operation circuit
1211b‧‧‧3×3 local convolution operation unit
1211c‧‧‧3×3 local register bank
1211ca, 1211cb‧‧‧Sub 3×3 local register banks
1211d‧‧‧3×3 local filter operation unit
1211e‧‧‧3×3 post-processing operation unit
1211f‧‧‧First 3×3 convolution distributor
1211g‧‧‧Second 3×3 convolution distributor
1212‧‧‧Bias distributor
1213‧‧‧1×1 operation sub-module
1213a‧‧‧1×1 operation circuit
1213b‧‧‧1×1 local convolution operation unit
1213c‧‧‧1×1 local register bank
1213ca, 1213cb‧‧‧Sub 1×1 local register banks
s232‧‧‧Weight parameter processing sub-step
s240‧‧‧Operation step
s241‧‧‧First operation sub-step
s2411‧‧‧3×3 parameter distribution procedure
s2412‧‧‧3×3 operation parameter generation procedure
s2413‧‧‧3×3 convolution operation procedure
s2414‧‧‧3×3 post-processing operation procedure
s242‧‧‧Bias operation sub-step
s243‧‧‧Second operation sub-step
s2431‧‧‧1×1 parameter distribution procedure
s2432‧‧‧1×1 operation parameter generation procedure
s2433‧‧‧1×1 convolution operation procedure
s2434‧‧‧1×1 post-processing operation procedure
FIG. 1 is a block diagram of a convolutional neural network processor according to an embodiment of one structural aspect of the present invention; FIG. 2 is a block diagram of a convolutional neural network processor according to an embodiment of another structural aspect of the present invention; FIG. 3 is a block diagram of the 3×3 operation sub-module of the convolutional neural network processor according to the embodiment of FIG. 2; FIG. 4 is a schematic diagram of a 3×3 local convolution operation unit of the 3×3 operation sub-module of the convolutional neural network processor according to the embodiment of FIG. 3; FIG. 5 is a block diagram of a convolutional neural network processor according to an embodiment of a further structural aspect of the present invention; FIG. 6 is a block diagram of the 1×1 operation sub-module of the convolutional neural network processor according to the embodiment of FIG. 5; FIG. 7 is a schematic diagram of a 1×1 local convolution operation unit of the 1×1 operation sub-module of the convolutional neural network processor according to the embodiment of FIG. 6; FIG. 8 is a block diagram of the steps of a data processing method of a convolutional neural network processor according to an embodiment of a method aspect of the present invention; FIG. 9 is a block diagram of the instruction decoding step of the data processing method according to the method aspect of FIG. 8; FIG. 10 is a block diagram of the parallel processing step of the data processing method according to the method aspect of FIG. 8; FIG. 11 is a block diagram of the operation step of the data processing method according to the method aspect of FIG. 8; and FIG. 12 is a block diagram of the operation step of the data processing method according to another embodiment of the method aspect of FIG. 8.
Several embodiments of the present invention will be described below with reference to the drawings. For clarity, many practical details are explained in the following description. It should be understood, however, that these practical details are not intended to limit the invention; that is, in some embodiments of the present invention, these practical details are unnecessary. In addition, to simplify the drawings, some conventional structures and elements are shown schematically, and repeated elements may be denoted by the same reference numerals.
FIG. 1 is a block diagram of a convolutional neural network processor 100 according to an embodiment of one structural aspect of the present invention. As shown in FIG. 1, the convolutional neural network processor 100 includes a message decoding unit 110 and a convolution judgment unit 120. The convolution judgment unit 120 is electrically connected to the message decoding unit 110.
The message decoding unit 110 receives an input program 102 and a plurality of input weight parameters 104. The message decoding unit 110 includes a decoding module 111 and a parallel processing module 112. The decoding module 111 receives the input program 102 and outputs an operating instruction according to the input program 102. The parallel processing module 112 is electrically connected to the decoding module 111 and receives the input weight parameters 104 and the operating instruction. The parallel processing module 112 includes a plurality of parallel processing sub-modules 1121, which generate a plurality of output weight parameters according to the operating instruction and the input weight parameters 104. The convolution judgment unit 120 includes an operation module 121. The operation module 121 is electrically connected to the parallel processing module 112 and generates output data 108 by operating on input data 106 and the output weight parameters. In detail, after the message decoding unit 110 of the convolutional neural network processor 100 receives the input program 102 and the input weight parameters 104, it uses the decoding module 111 to generate the operating instruction for processing the input weight parameters 104. Each parallel processing sub-module 1121 of the parallel processing module 112 can be electrically connected to the decoding module 111 to generate output weight parameters according to the operating instruction. The operation module 121 can operate on the input data 106 and the output weight parameters generated by the parallel processing module 112 to produce the output data 108. The input data 106 can be data stored in a block buffer bank or data from an external source. Furthermore, the convolutional neural network processor 100 can use the block buffer bank in place of an input buffer and an output buffer to save external memory bandwidth. Thereby, the convolutional neural network processor 100 can perform highly parallel operations through the message decoding unit 110 and the convolution judgment unit 120 to provide high-performance computation.
The decoding module 111 can include a program memory 1111 and an instruction decoder 1112. The program memory 1111 can store the input program 102. The instruction decoder 1112 is electrically connected to the program memory 1111 and decodes the input program 102 to output the operating instruction. That is, after receiving the input program 102, the decoding module 111 stores it in the program memory 1111 and decodes it through the instruction decoder 1112 to generate the operating instruction, which in turn drives each parallel processing sub-module 1121 to process the input weight parameters 104 and generate the output weight parameters.
When the input weight parameters 104 are non-compressed input weight parameters, the parallel processing sub-modules 1121 include a plurality of parallel sub-memories 1121a and a plurality of parallel sub-processors 1121b. The parallel sub-memories 1121a store the non-compressed input weight parameters in parallel. Each parallel sub-processor 1121b is electrically connected to the decoding module 111 and to one parallel sub-memory 1121a, and receives the non-compressed input weight parameters in parallel according to the operating instruction to generate the output weight parameters. In detail, each parallel processing sub-module 1121 can include one parallel sub-memory 1121a and one parallel sub-processor 1121b. After receiving the input weight parameters 104, the parallel processing module 112 stores them separately and in parallel in the parallel sub-memories 1121a of the parallel processing sub-modules 1121. Since each parallel processing sub-module 1121 is electrically connected to the decoding module 111, each parallel sub-processor 1121b can receive the non-compressed input weight parameters in parallel from its parallel sub-memory 1121a according to the operating instruction to generate the output weight parameters. Thereby, the parallel processing module 112 can process the input weight parameters 104 in parallel to generate the output weight parameters.
When the input weight parameters 104 are a plurality of compressed input weight parameters, the parallel processing sub-modules 1121 include a plurality of parallel sub-memories 1121a and a plurality of parallel sub-processors 1121b. The parallel sub-memories 1121a store the compressed input weight parameters in parallel. Each parallel sub-processor 1121b is electrically connected to the decoding module 111 and to one parallel sub-memory 1121a, and receives and decompresses the compressed input weight parameters in parallel according to the operating instruction to generate the output weight parameters. In detail, each parallel processing sub-module 1121 can include one parallel sub-memory 1121a and one parallel sub-processor 1121b. After receiving the input weight parameters 104, the parallel processing module 112 stores them separately and in parallel in the parallel sub-memories 1121a of the parallel processing sub-modules 1121. Since each parallel processing sub-module 1121 is electrically connected to the decoding module 111, each parallel sub-processor 1121b can read the compressed input weight parameters in parallel from its parallel sub-memory 1121a according to the operating instruction and decode them to generate the output weight parameters. Thereby, the parallel processing module 112 can process the input weight parameters 104 in parallel to generate the output weight parameters.
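The decompress-per-bank arrangement above can be sketched in software. This is a minimal illustration only: the patent does not specify the compression format, so a simple run-length coding is assumed as a stand-in, and the thread pool merely models the independent parallel sub-processors.

```python
from concurrent.futures import ThreadPoolExecutor

def decompress_rle(compressed):
    """Expand (value, count) pairs into a flat weight list.

    Run-length coding is an assumed stand-in; the actual compression
    scheme of the input weight parameters is not specified by the patent.
    """
    weights = []
    for value, count in compressed:
        weights.extend([value] * count)
    return weights

# Each entry models one parallel sub-memory (1121a) holding compressed weights.
sub_memories = [
    [(0, 3), (5, 1)],   # expands to [0, 0, 0, 5]
    [(2, 2), (0, 2)],   # expands to [2, 2, 0, 0]
]

# Each parallel sub-processor (1121b) decompresses its own bank independently.
with ThreadPoolExecutor(max_workers=len(sub_memories)) as pool:
    output_weights = list(pool.map(decompress_rle, sub_memories))
```

Because each sub-processor touches only its own sub-memory, no synchronization between banks is needed, which is what makes the fully parallel layout attractive.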
Please refer to FIGS. 1, 2, 3 and 4 together. FIG. 2 is a block diagram of a convolutional neural network processor 100 according to an embodiment of another structural aspect of the present invention. FIG. 3 is a block diagram of the 3×3 operation sub-module 1211 of the convolutional neural network processor 100 according to the embodiment of FIG. 2. FIG. 4 is a schematic diagram of the 3×3 local convolution operation unit 1211b of the 3×3 operation sub-module 1211 according to the embodiment of FIG. 3. In the embodiments of FIGS. 2 to 4, the input weight parameters 104 can include a plurality of first input weight parameters and a bias input weight parameter. The output weight parameters include a plurality of first output weight parameters and a bias output weight parameter. The parallel processing sub-modules 1121 include a plurality of parallel sub-memories 1121a and a plurality of parallel sub-processors 1121b. The parallel sub-memories 1121a store the input weight parameters 104 in parallel and include a plurality of first parallel sub-memories 1121aa and a bias parallel sub-memory 1121ab. The first parallel sub-memories 1121aa receive and store the first input weight parameters separately and in parallel. The bias parallel sub-memory 1121ab stores the bias input weight parameter in parallel. The parallel sub-processors 1121b are electrically connected to the decoding module 111 and the parallel sub-memories 1121a, and include a plurality of first parallel sub-processors 1121ba and a bias parallel sub-processor 1121bb. Each first parallel sub-processor 1121ba is electrically connected to one of the first parallel sub-memories 1121aa and receives the first input weight parameters according to the operating instruction to output the first output weight parameters. The bias parallel sub-processor 1121bb is electrically connected to the bias parallel sub-memory 1121ab and receives the bias input weight parameter according to the operating instruction to output the bias output weight parameter. In the embodiment of FIG. 2, the numbers of first parallel sub-memories 1121aa and first parallel sub-processors 1121ba are both 9; in other embodiments, their numbers can be multiples of 9, and the present disclosure is not limited thereto. The numbers of bias parallel sub-memories 1121ab and bias parallel sub-processors 1121bb are both 1, but the present invention is not limited thereto. In detail, after receiving the input weight parameters 104, the parallel processing module 112 stores the first input weight parameters in the first parallel sub-memories 1121aa and the bias input weight parameter in the bias parallel sub-memory 1121ab. The first parallel sub-processors 1121ba read the first input weight parameters from the first parallel sub-memories 1121aa according to the operating instruction and process them to generate the first output weight parameters. The bias parallel sub-processor 1121bb reads the bias input weight parameter from the bias parallel sub-memory 1121ab according to the operating instruction and processes it to generate the bias output weight parameter.
Each first output weight parameter includes a plurality of 3×3 weight parameters. The operation module 121 can include a 3×3 operation sub-module 1211 and a bias distributor 1212. The 3×3 operation sub-module 1211 is electrically connected to the first parallel sub-processors 1121ba and operates on the first output weight parameters and the input data 106 to generate the 3×3 post-processed operation data 1062. The 3×3 operation sub-module 1211 includes 3×3 convolution distributor groups, 3×3 local convolution operation units 1211b, and 3×3 post-processing operation units 1211e. Each 3×3 convolution distributor group is electrically connected to a first parallel sub-processor 1121ba and receives and distributes the 3×3 weight parameters of the first output weight parameters. Each 3×3 local convolution operation unit 1211b is electrically connected to a 3×3 convolution distributor group and includes a 3×3 local register bank 1211c and a 3×3 local filter operation unit 1211d. The 3×3 local register bank 1211c is electrically connected to a 3×3 convolution distributor group; it receives and stores the 3×3 weight parameters of the first output weight parameters and outputs a plurality of 3×3 operation parameters according to them. The 3×3 local filter operation unit 1211d is electrically connected to the 3×3 local register bank 1211c and operates on the 3×3 operation parameters and the input data 106 to generate a plurality of 3×3 operation data. In detail, the 3×3 local filter operation unit 1211d can perform a 3×3 convolution operation. When the number of first parallel sub-processors 1121ba is 9, the spatial filter positions of the 3×3 local filter operation unit 1211d can each correspond to one first parallel sub-processor 1121ba; when the number of first parallel sub-processors 1121ba is 18, each spatial filter position can correspond to two first parallel sub-processors 1121ba, and so on. The 3×3 post-processing operation unit 1211e is electrically connected to the 3×3 local convolution operation units 1211b and performs a 3×3 post-processing operation on the 3×3 operation data to generate the 3×3 post-processed operation data 1062. The output data 108 of the convolutional neural network processor 100 can be the 3×3 post-processed operation data 1062. The bias distributor 1212 is electrically connected to the bias parallel sub-processor 1121bb and the 3×3 operation sub-module 1211. The bias distributor 1212 generates a plurality of 3×3 bias weight parameters from the bias output weight parameter and outputs them to the 3×3 post-processing operation unit 1211e.
In FIG. 3, the 3×3 operation sub-module 1211 includes a plurality of 3×3 operation circuits 1211a; the number of 3×3 operation circuits 1211a can be 32. Each 3×3 operation circuit 1211a is composed of a plurality of 3×3 local convolution operation units 1211b and one 3×3 post-processing operation unit 1211e; the number of 3×3 local convolution operation units 1211b per circuit can be 32. That is, the 3×3 operation sub-module 1211 contains 1024 3×3 local convolution operation units 1211b and 32 3×3 post-processing operation units 1211e.
Referring to FIGS. 3 and 4, after receiving the 3×3 weight parameters of the first output weight parameters, the 3×3 operation sub-module 1211 can distribute them to the 3×3 local convolution operation units 1211b through the 3×3 convolution distributor groups. In FIG. 4, the 3×3 convolution distributor group adopts a two-stage distribution scheme and includes a first 3×3 convolution distributor 1211f and a plurality of second 3×3 convolution distributors 1211g. The first 3×3 convolution distributor 1211f is electrically connected to a first parallel sub-processor 1121ba to receive the 3×3 weight parameters of the first output weight parameters and distribute them to the second 3×3 convolution distributors 1211g; after receiving the 3×3 weight parameters, the second 3×3 convolution distributors 1211g distribute them to the 3×3 local convolution operation units 1211b. Although the present invention uses a two-stage distribution method, the distribution scheme is not limited thereto. The 3×3 local register bank 1211c can include two sub 3×3 local register banks 1211ca and 1211cb. Combined with a multiplexer, the two sub banks alternately store the 3×3 weight parameters or output the 3×3 operation parameters to the 3×3 local filter operation unit 1211d. That is, while sub 3×3 local register bank 1211ca stores the 3×3 weight parameters, sub 3×3 local register bank 1211cb outputs the 3×3 operation parameters to the 3×3 local filter operation unit 1211d, and vice versa; in other words, the 3×3 local register bank 1211c stores the 3×3 weight parameters and outputs the 3×3 operation parameters in a ping-pong manner.
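The ping-pong register bank described above can be sketched as a small double-buffer model: one sub-bank is written while the other feeds the filter unit, and the roles swap each round. The class name, granularity, and interface here are illustrative only, not the patent's hardware interface.

```python
class PingPongRegisterBank:
    """Software sketch of the ping-pong scheme of register banks
    1211ca/1211cb: one sub-bank is written while the other is read,
    and a multiplexer-style select swaps the roles each round."""

    def __init__(self):
        self.banks = [[], []]  # models sub banks 1211ca and 1211cb
        self.write_sel = 0     # which bank is currently being written

    def store(self, weights):
        """Load new weight parameters into the write-side bank."""
        self.banks[self.write_sel] = list(weights)

    def operation_parameters(self):
        """The bank not being written feeds the local filter operation unit."""
        return self.banks[1 - self.write_sel]

    def swap(self):
        self.write_sel = 1 - self.write_sel

bank = PingPongRegisterBank()
bank.store([1] * 9)        # load the first 3x3 kernel into bank 0
bank.swap()                # bank 0 now feeds the filter unit
bank.store([2] * 9)        # load the next kernel into bank 1 concurrently
assert bank.operation_parameters() == [1] * 9
```

The point of the scheme is that weight loading for the next round overlaps with computation on the current round, so the filter units never stall waiting for weights.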
The 3×3 local filter operation unit 1211d can perform a 3×3 convolution operation on the 3×3 operation parameters and the input data 106 to generate the 3×3 operation data. For example, the tile size of the input data 106 can be 6×4, and the 3×3 local filter operation unit 1211d performs the 3×3 convolution over this tile with the 3×3 operation parameters. To achieve highly parallel operation, the convolutional neural network processor 100 can deploy multiple multipliers in the 3×3 operation sub-module 1211; the number of multipliers in the 3×3 local filter operation units 1211d can be 73728. After receiving the 3×3 operation data generated by the 3×3 local filter operation units 1211d and the 3×3 bias weight parameters generated by the bias distributor, the 3×3 post-processing operation unit 1211e can perform a 3×3 post-processing operation on them to generate the 3×3 post-processed operation data 1062. In the embodiments of FIGS. 3 and 4, the 3×3 post-processed operation data 1062 is the output data 108 of the convolutional neural network processor 100.
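Functionally, the computation one 3×3 local filter operation unit performs over an input tile can be sketched as a plain 3×3 convolution. The sketch below assumes "valid" output positions (no padding) on a 6×4 tile; the patent does not specify the hardware's padding or stride handling.

```python
def conv3x3(tile, kernel):
    """3x3 'valid' convolution (no padding, stride 1) over one input tile.
    A functional sketch of what a 3x3 local filter operation unit computes;
    padding and stride behavior of the actual hardware are assumptions."""
    rows, cols = len(tile), len(tile[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += tile[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out

tile = [[1] * 6 for _ in range(4)]          # a 4-row by 6-column tile of ones
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # sums the main diagonal
result = conv3x3(tile, kernel)              # 2 x 4 output, every entry 3
```

Each output position requires 9 multiplications, which is why the design spends its multiplier budget (73728 in the example above) on many such units operating in parallel.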
In FIG. 2, the convolution judgment unit 120 further includes a controller 122. The controller 122 is electrically connected to the message decoding unit 110. In detail, the controller 122 is electrically connected to the instruction decoder 1112 to receive the operating instruction, and controls the 3×3 operation sub-module 1211 and the bias distributor 1212 of the operation module 121 according to the operating instruction.
FIG. 5 is a block diagram of a convolutional neural network processor 100 according to an embodiment of a further structural aspect of the present invention. FIG. 6 is a block diagram of the 1×1 operation sub-module 1213 of the convolutional neural network processor 100 according to the embodiment of FIG. 5. FIG. 7 is a schematic diagram of the 1×1 local convolution operation unit 1213b of the 1×1 operation sub-module 1213 according to the embodiment of FIG. 6. The convolutional neural network processor 100 of FIG. 5 differs from that of FIG. 2 in that its parallel sub-memories 1121a further include at least one second parallel sub-memory 1121ac, its parallel sub-processors 1121b further include at least one second parallel sub-processor 1121bc, and its operation module 121 further includes a 1×1 operation sub-module 1213. In addition, the input weight parameters 104 further include at least one second input weight parameter, and the output weight parameters further include at least one second output weight parameter. The at least one second parallel sub-memory 1121ac receives and stores the at least one second input weight parameter separately and in parallel. The at least one second parallel sub-processor 1121bc is electrically connected to the at least one second parallel sub-memory 1121ac and receives the at least one second input weight parameter according to the operating instruction to output the at least one second output weight parameter. The configuration of the 3×3 operation sub-module 1211 is the same as that of the convolutional neural network processor 100 of FIG. 2 and is not repeated here. In the embodiment of FIG. 5, the numbers of first parallel sub-memories 1121aa and first parallel sub-processors 1121ba are both 9, and the numbers of second parallel sub-memories 1121ac and second parallel sub-processors 1121bc are both 1. In other embodiments, when the numbers of first parallel sub-memories 1121aa and first parallel sub-processors 1121ba are 18, the numbers of second parallel sub-memories 1121ac and second parallel sub-processors 1121bc are both 2, and so on; the present disclosure is not limited thereto. The numbers of bias parallel sub-memories 1121ab and bias parallel sub-processors 1121bb are both 1, but the present invention is not limited thereto.
In detail, after receiving the input weight parameters 104, the parallel processing module 112 stores the first input weight parameters in the first parallel sub-memories 1121aa, the second input weight parameters in the second parallel sub-memories 1121ac, and the bias input weight parameter in the bias parallel sub-memory 1121ab. The first parallel sub-processors 1121ba and the bias parallel sub-processor 1121bb of FIG. 5 operate in the same way as those of FIG. 2 and are not described again. The second parallel sub-processor 1121bc reads the second input weight parameters from the second parallel sub-memory 1121ac according to the operating instruction and processes them to generate the second output weight parameters.
The 1×1 operation sub-module 1213 is electrically connected to the at least one second parallel sub-processor 1121bc and the 3×3 operation sub-module 1211, and operates on the at least one second output weight parameter and the 3×3 post-processed operation data 1062 to generate the 1×1 post-processed operation data 1064. The 1×1 operation sub-module 1213 includes at least one 1×1 convolution distributor group, a plurality of 1×1 local convolution operation units 1213b, and a plurality of 1×1 post-processing operation units 1213e. The at least one 1×1 convolution distributor group is electrically connected to the at least one second parallel sub-processor 1121bc and receives and distributes the 1×1 weight parameters of the at least one second output weight parameter. The 1×1 local convolution operation units 1213b are electrically connected to the at least one 1×1 convolution distributor group. Each 1×1 local convolution operation unit 1213b includes a 1×1 local register bank 1213c and a 1×1 local filter operation unit 1213d. The 1×1 local register bank 1213c is electrically connected to the at least one 1×1 convolution distributor group; it receives and stores the 1×1 weight parameters of the at least one second output weight parameter and outputs the 1×1 operation parameters according to them. The 1×1 local filter operation unit 1213d is electrically connected to the 1×1 local register bank 1213c and operates on the 1×1 operation parameters and the 3×3 post-processed operation data 1062 to generate a plurality of 1×1 operation data. In detail, the 1×1 local filter operation unit 1213d can perform a 1×1 convolution operation. When the number of second parallel sub-processors 1121bc is 1, the spatial filter position of the 1×1 local filter operation unit 1213d can correspond to the second parallel sub-processor 1121bc; when the number of second parallel sub-processors 1121bc is 2, the spatial filter positions can correspond to two second parallel sub-processors 1121bc, and so on. The 1×1 post-processing operation unit 1213e is electrically connected to the 1×1 local convolution operation units 1213b and performs a 1×1 post-processing operation on the 1×1 operation data to generate the 1×1 post-processed operation data 1064. The output data 108 of the convolutional neural network processor 100 is the 1×1 post-processed operation data 1064. The bias parallel sub-memory 1121ab and the bias parallel sub-processor 1121bb of FIG. 5 are the same as those of FIG. 2, and the configuration of the bias distributor 1212 with the 3×3 operation sub-module 1211 in FIG. 5 is the same as in FIG. 2; they are not repeated here.
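Functionally, a 1×1 convolution mixes channels rather than spatial neighborhoods: each output pixel is a weighted sum across input channels at the same position. The sketch below is a minimal functional model of what a 1×1 local filter operation unit computes; the toy feature maps stand in for the 3×3 post-processed operation data, and the single-output-channel view is a simplification.

```python
def conv1x1(feature_maps, weights):
    """1x1 convolution over per-channel feature maps: each output pixel
    is a weighted sum across input channels at one spatial position.
    A functional sketch only; the hardware's channel tiling is assumed."""
    channels = len(feature_maps)
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(weights[ch] * feature_maps[ch][r][c] for ch in range(channels))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Two 2x2 input channels combined into one output channel.
maps = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
]
out = conv1x1(maps, weights=[1, 0.5])  # -> [[6.0, 12.0], [18.0, 24.0]]
```

Because every output position needs only one multiply per input channel, a 1×1 stage is far cheaper than a 3×3 stage, which matches the much smaller multiplier count the description gives it.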
In detail, the bias distributor 1212 of FIG. 5 is electrically connected to the bias parallel sub-processor 1121bb, the 3×3 operation sub-module 1211, and the 1×1 operation sub-module 1213. The bias distributor 1212 generates a plurality of 3×3 bias weight parameters and a plurality of 1×1 bias weight parameters from the bias output weight parameter. It outputs the 3×3 bias weight parameters to the 3×3 post-processing operation unit 1211e and the 1×1 bias weight parameters to the 1×1 post-processing operation unit 1213e.
In FIG. 6, the 1×1 operation sub-module 1213 includes a plurality of 1×1 operation circuits 1213a; the number of 1×1 operation circuits 1213a can be 32. Each 1×1 operation circuit 1213a is composed of a plurality of 1×1 local convolution operation units 1213b and one 1×1 post-processing operation unit 1213e; the number of 1×1 local convolution operation units 1213b per circuit can be 32. That is, the 1×1 operation sub-module 1213 contains 1024 1×1 local convolution operation units 1213b and 32 1×1 post-processing operation units 1213e.
Referring to FIGS. 6 and 7, after receiving the 1×1 weight parameters of the second output weight parameters, the 1×1 operation sub-module 1213 can distribute them to the 1×1 local convolution operation units 1213b through the 1×1 convolution distributor group. In FIG. 7, the 1×1 convolution distributor group also adopts the two-stage distribution scheme and includes a first 1×1 convolution distributor 1213f and a plurality of second 1×1 convolution distributors 1213g, which act in the same way as the 3×3 convolution distributor group and are not described again. The 1×1 local register bank 1213c can include two sub 1×1 local register banks 1213ca and 1213cb. Combined with a multiplexer, the two sub banks alternately store the 1×1 weight parameters or output the 1×1 operation parameters to the 1×1 local filter operation unit 1213d. The 1×1 local register bank 1213c acts in the same way as the 3×3 local register bank 1211c and is not described again; that is, both the 3×3 local register bank 1211c and the 1×1 local register bank 1213c operate in a ping-pong manner. The 1×1 local filter operation unit 1213d can therefore perform a 1×1 convolution operation on the 1×1 operation parameters and the 3×3 post-processed operation data 1062 to generate the 1×1 operation data. In the embodiments of FIGS. 5 to 7, the 1×1 post-processed operation data 1064 is the output data 108 of the convolutional neural network processor 100.
To achieve highly parallel operation, the convolutional neural network processor 100 can deploy multiple multipliers in the 3×3 operation sub-module 1211 and the 1×1 operation sub-module 1213; for example, the number of multipliers in the 3×3 local filter operation units 1211d can be 73728, and the number of multipliers in the 1×1 local filter operation units 1213d can be 8192. In addition, the controller 122 in FIG. 5 is the same as the controller 122 in FIG. 2 and is not described again.
FIG. 8 is a block diagram of the steps of a data processing method s200 of a convolutional neural network processor according to an embodiment of a method aspect of the present invention. In FIG. 8, the data processing method s200 of the convolutional neural network processor includes a receiving step s210, an instruction decoding step s220, a parallel processing step s230, and an operation step s240.
Referring to FIG. 1, in detail, the receiving step s210 drives the message decoding unit 110 to receive the input program 102 and the plurality of input weight parameters 104. The message decoding unit 110 includes the decoding module 111 and the parallel processing module 112. The instruction decoding step s220 drives the decoding module 111 to receive the input program 102 and generate the operating instruction according to the input program 102. The parallel processing step s230 drives the parallel processing module 112 to receive the input weight parameters 104 and process them in parallel according to the operating instruction to generate the plurality of output weight parameters. The operation step s240 drives the operation module 121 to receive the input data 106 and the output weight parameters and, according to the operating instruction, operate on the input data 106 and the output weight parameters to generate the output data 108. That is, the message decoding unit 110 of the convolutional neural network processor 100 can receive the input program 102 and the input weight parameters 104 in the receiving step s210 so as to perform the instruction decoding step s220 and the parallel processing step s230. Since the parallel processing module 112 is electrically connected to the decoding module 111, the parallel processing module 112 can generate the output weight parameters according to the operating instruction produced by the decoding module 111 in the instruction decoding step s220, which is the parallel processing step s230. Furthermore, since the operation module 121 is electrically connected to the parallel processing module 112, in the operation step s240 the operation module 121 can, after receiving the input data 106 and the output weight parameters, operate on them to generate the output data 108. Thereby, the data processing method s200 of the convolutional neural network processor can, through the receiving step s210, the instruction decoding step s220, the parallel processing step s230, and the operation step s240, drive the decoding module 111 and the parallel processing module 112 of the message decoding unit 110 and the operation module 121 of the convolution judgment unit 120 to perform highly parallel operations, providing high-performance, low-power computation.
For example, in FIG. 8, the input program 102 and the input weight parameters 104 received in the receiving step s210 of the data processing method s200 can contain the instructions and parameters corresponding to a plurality of input data 106. When the instruction decoding step s220 and the parallel processing step s230 are executed, the instructions and parameters corresponding to the plurality of input data 106 are stored in the program memory 1111 and the parallel sub-memories 1121a. The instruction decoding step s220 and the parallel processing step s230 can then process the instructions and parameters related to one of the input data 106, so that the operation step s240 operates on that input data 106; while the operation step s240 is executing, the data processing method s200 can process the instructions and parameters related to another of the input data 106, that is, perform the instruction decoding step s220 and the parallel processing step s230 for that other input data 106. In other words, the data processing method s200 first stores the instructions and parameters of all the input data 106 in the program memory 1111 and the parallel sub-memories 1121a, and then performs the instruction decoding step s220, the parallel processing step s230, and the operation step s240 corresponding to each input data 106. Moreover, while the operation step s240 operates on one input data 106, the instruction decoding step s220 and the parallel processing step s230 can process the instructions and parameters of another input data 106. Therefore, after performing the receiving step s210, the data processing method s200 can operate on each of the plurality of input data 106 in turn.
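The overlap described above amounts to a two-stage pipeline: while one input's data is computed on, the next input's instructions and parameters are decoded. A minimal software sketch of that staging follows; `decode` and `compute` are placeholders for steps s220/s230 and s240, and the loop models only the scheduling, not true hardware concurrency.

```python
def pipeline(inputs, decode, compute):
    """Two-stage pipeline sketch: while stage two computes on input i,
    stage one decodes instructions/parameters for input i + 1."""
    outputs, decoded = [], None
    for item in inputs + [None]:           # one extra tick to drain the pipe
        next_decoded = decode(item) if item is not None else None
        if decoded is not None:
            outputs.append(compute(decoded))
        decoded = next_decoded
    return outputs

# Toy stages: "decoding" doubles the value, "computing" adds one.
result = pipeline([1, 2, 3], decode=lambda x: 2 * x, compute=lambda x: x + 1)
```

With real hardware the two stages run concurrently, so decoding latency for input i+1 hides behind computation on input i.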
FIG. 9 is a block diagram of the instruction decoding step s220 of the data processing method s200 according to the embodiment of the method aspect of FIG. 8. The decoding module 111 can include the program memory 1111 and the instruction decoder 1112. The instruction decoding step s220 can include a program storage sub-step s221 and a program decoding sub-step s222. The program storage sub-step s221 drives the program memory 1111 to store the input program 102. The program decoding sub-step s222 drives the instruction decoder 1112 to decode the input program 102 to generate the operating instruction. That is, through the program storage sub-step s221 and the program decoding sub-step s222, the convolutional neural network processor 100 can drive the decoding module 111 to receive the input program 102, store it in the program memory 1111, and then decode the stored input program 102 through the instruction decoder 1112 to generate the operating instruction.
FIG. 10 is a block diagram of the parallel processing step s230 of the data processing method s200 according to the embodiment of the method aspect of FIG. 8. The parallel processing module 112 can include the plurality of parallel sub-memories 1121a and the plurality of parallel sub-processors 1121b. The parallel processing step s230 includes a weight parameter storage sub-step s231 and a weight parameter processing sub-step s232. The weight parameter storage sub-step s231 drives the parallel sub-memories 1121a to store the input weight parameters 104 in parallel. The weight parameter processing sub-step s232 drives the parallel sub-processors 1121b, which read the input weight parameters 104 in parallel according to the operating instruction and process them to generate the output weight parameters. That is, through the weight parameter storage sub-step s231 and the weight parameter processing sub-step s232, the convolutional neural network processor 100 can drive the parallel processing module 112 to receive the input weight parameters 104 and store them in the parallel sub-memories 1121a, after which the parallel sub-processors 1121b process the stored input weight parameters 104 according to the operating instruction to generate the output weight parameters. When the input weight parameters 104 are non-compressed input weight parameters, the processing can be storing the non-compressed input weight parameters; when the input weight parameters 104 are compressed input weight parameters, the processing can be storing and decompressing the compressed input weight parameters.
FIG. 11 is a block diagram of the operation step s240 of the data processing method s200 according to the embodiment of the method aspect of FIG. 8. Please refer to FIGS. 2 to 4. The output weight parameters can include the plurality of first output weight parameters and the bias output weight parameter. The first output weight parameters include a plurality of 3×3 weight parameters. The operation module 121 can include the 3×3 operation sub-module 1211 and the bias distributor 1212. The 3×3 operation sub-module 1211 includes the plurality of 3×3 convolution distributor groups, the plurality of 3×3 local convolution operation units 1211b, and the plurality of 3×3 post-processing operation units 1211e. The operation step s240 can include a first operation sub-step s241 and a bias operation sub-step s242. The first operation sub-step s241 includes a 3×3 parameter distribution procedure s2411, a 3×3 operation parameter generation procedure s2412, a 3×3 convolution operation procedure s2413, and a 3×3 post-processing operation procedure s2414. The 3×3 parameter distribution procedure s2411 drives the 3×3 convolution distributor groups to receive the 3×3 weight parameters of the first output weight parameters and distribute them to the 3×3 local convolution operation units 1211b, where each 3×3 local convolution operation unit 1211b includes a 3×3 local register bank 1211c and a 3×3 local filter operation unit 1211d. The 3×3 operation parameter generation procedure s2412 drives the 3×3 local register banks 1211c of the 3×3 local convolution operation units 1211b to receive the 3×3 weight parameters and generate the plurality of 3×3 operation parameters from them. The 3×3 convolution operation procedure s2413 drives the 3×3 local filter operation units 1211d of the 3×3 local convolution operation units 1211b to perform a 3×3 convolution operation on the 3×3 operation parameters and the input data 106, producing a plurality of 3×3 operation data. The 3×3 post-processing operation procedure s2414 drives the 3×3 post-processing operation units 1211e to perform a 3×3 post-processing operation on the 3×3 operation data, producing the 3×3 post-processed operation data 1062. The bias operation sub-step s242 drives the bias distributor 1212 to generate the plurality of 3×3 bias weight parameters from the bias output weight parameter; the bias distributor 1212 provides the 3×3 bias weight parameters to the 3×3 operation sub-module 1211. That is, the convolutional neural network processor 100 can generate the 3×3 post-processed operation data 1062 through the first operation sub-step s241 and the bias operation sub-step s242. In detail, the 3×3 operation sub-module 1211 can perform the first operation sub-step s241: its 3×3 convolution distributor groups perform the 3×3 parameter distribution procedure s2411 to distribute the 3×3 weight parameters to the 3×3 local register banks 1211c in the different 3×3 local convolution operation units 1211b, so that the 3×3 local register banks 1211c can perform the 3×3 operation parameter generation procedure s2412. The 3×3 local register bank 1211c can include the two sub 3×3 local register banks 1211ca and 1211cb, which act in a ping-pong manner to receive the 3×3 weight parameters and output the 3×3 operation parameters to the 3×3 local filter operation unit 1211d. In the 3×3 convolution operation procedure s2413, the 3×3 local filter operation unit 1211d performs the 3×3 convolution operation on the 3×3 operation parameters and the input data 106 to produce the 3×3 operation data. In the 3×3 post-processing operation procedure s2414, the 3×3 post-processing operation unit 1211e performs the 3×3 post-processing operation according to the 3×3 bias weight parameters output by the bias distributor 1212 in the bias operation sub-step s242 and the 3×3 operation data, producing the 3×3 post-processed operation data 1062. In the embodiments of FIGS. 2 to 4 and FIG. 11, the 3×3 post-processed operation data 1062 can be the output data 108 of the convolutional neural network processor 100.
FIG. 12 is a block diagram of the operation step s240 of the data processing method s200 according to another embodiment of the method aspect of FIG. 8. Please refer to FIGS. 5 to 7. The output weight parameters can include the plurality of first output weight parameters, the at least one second output weight parameter, and the bias output weight parameter. The first output weight parameters include a plurality of 3×3 weight parameters; the at least one second output weight parameter includes a plurality of 1×1 weight parameters. The operation module 121 can include the 3×3 operation sub-module 1211, the 1×1 operation sub-module 1213, and the bias distributor 1212. The 3×3 operation sub-module 1211 includes the plurality of 3×3 convolution distributor groups, the plurality of 3×3 local convolution operation units 1211b, and the plurality of 3×3 post-processing operation units 1211e. The 1×1 operation sub-module includes the plurality of 1×1 convolution distributor groups, the plurality of 1×1 local convolution operation units 1213b, and the plurality of 1×1 post-processing operation units 1213e. The operation step s240 can include the first operation sub-step s241, a second operation sub-step s243, and the bias operation sub-step s242. The first operation sub-step s241 of FIG. 12 is the same as that of FIG. 11 and is not repeated here. The second operation sub-step s243 drives the 1×1 operation sub-module 1213 to receive the 3×3 post-processed operation data 1062 and the at least one second output weight parameter to generate the 1×1 post-processed operation data 1064. The second operation sub-step s243 includes a 1×1 parameter distribution procedure s2431, a 1×1 operation parameter generation procedure s2432, a 1×1 convolution operation procedure s2433, and a 1×1 post-processing operation procedure s2434. The 1×1 parameter distribution procedure s2431 drives the at least one 1×1 convolution distributor group to receive the 1×1 weight parameters of the at least one second output weight parameter and distribute them to the 1×1 local convolution operation units 1213b, where each 1×1 local convolution operation unit 1213b includes a 1×1 local register bank 1213c and a 1×1 local filter operation unit 1213d. The 1×1 operation parameter generation procedure s2432 drives the 1×1 local register banks 1213c of the 1×1 local convolution operation units 1213b to receive the 1×1 weight parameters and generate the plurality of 1×1 operation parameters from them. The 1×1 convolution operation procedure s2433 drives the 1×1 local filter operation units 1213d of the 1×1 local convolution operation units 1213b to perform a 1×1 convolution operation on the 1×1 operation parameters and the 3×3 post-processed operation data 1062, producing a plurality of 1×1 operation data. The 1×1 post-processing operation procedure s2434 drives the 1×1 post-processing operation units 1213e to perform a 1×1 post-processing operation on the 1×1 operation data, producing the 1×1 post-processed operation data 1064. That is, the convolutional neural network processor 100 can generate the 1×1 post-processed operation data 1064 through the first operation sub-step s241, the second operation sub-step s243, and the bias operation sub-step s242. In detail, the 1×1 operation sub-module 1213 can perform the second operation sub-step s243: its 1×1 convolution distributor group performs the 1×1 parameter distribution procedure s2431 to distribute the 1×1 weight parameters to the 1×1 local register banks 1213c in the different 1×1 local convolution operation units 1213b, so that the 1×1 local register banks 1213c can perform the 1×1 operation parameter generation procedure s2432. The 1×1 local register bank 1213c can include the two sub 1×1 local register banks 1213ca and 1213cb, which act in a ping-pong manner to receive the 1×1 weight parameters and output the 1×1 operation parameters to the 1×1 local filter operation unit 1213d. In the 1×1 convolution operation procedure s2433, the 1×1 local filter operation unit 1213d performs the 1×1 convolution operation on the 1×1 operation parameters and the 3×3 post-processed operation data 1062 to produce the 1×1 operation data. In the 1×1 post-processing operation procedure s2434, the 1×1 post-processing operation unit 1213e performs the 1×1 post-processing operation according to the 1×1 bias weight parameters output by the bias distributor 1212 in the bias operation sub-step s242 and the 1×1 operation data, producing the 1×1 post-processed operation data 1064. In the embodiments of FIGS. 5 to 7 and FIG. 12, the 1×1 post-processed operation data 1064 can be the output data 108 of the convolutional neural network processor 100.
Please refer to FIGS. 5 to 10 and FIG. 12. In detail, the convolutional neural network processor 100 can execute the data processing method s200 of the convolutional neural network processor, and the convolutional neural network processor 100 includes the message decoding unit 110 and the convolution judgment unit 120. The message decoding unit 110 can perform the receiving step s210, the instruction decoding step s220, and the parallel processing step s230. After receiving the input program 102 in the receiving step s210, the decoding module 111 stores the input program 102 in the program memory 1111, which is the program storage sub-step s221, and then, in the program decoding sub-step s222, the instruction decoder 1112 decodes the input program 102 stored in the program memory 1111 to output the operating instruction to the parallel processing module 112 and the controller 122 of the convolution judgment unit 120, where the input program 102 can include instructions corresponding to a plurality of input data 106. Simply put, in the program decoding sub-step s222, the instruction decoder 1112 decodes the instructions corresponding to one of the input data 106 to output the operating instruction. After receiving the operating instruction, the controller 122 can control the operation module 121 according to the operating instruction. The parallel processing module 112 receives the input weight parameters 104 in the receiving step s210 and performs the parallel processing step s230. The input weight parameters 104 include the first input weight parameters, the second input weight parameters, and the bias input weight parameters; the number of first input weight parameters can be a multiple of 9216, the number of second input weight parameters can be a multiple of 1024, and the number of bias input weight parameters can be a multiple of 64. In other words, the input weight parameters 104 include the parameters corresponding to the plurality of input data 106. In the weight parameter storage sub-step s231, the first parallel sub-memories 1121aa, the second parallel sub-memory 1121ac, and the bias parallel sub-memory 1121ab store the first input weight parameters, the second input weight parameters, and the bias input weight parameters, respectively, where the number of first parallel sub-memories 1121aa is 9 and the numbers of second parallel sub-memories 1121ac and bias parallel sub-memories 1121ab are each 1. In addition, the number of first parallel sub-processors 1121ba in the parallel processing module 112 is 9, and the numbers of second parallel sub-processors 1121bc and bias parallel sub-processors 1121bb are each 1. In the weight parameter processing sub-step s232, each first parallel sub-processor 1121ba and each second parallel sub-processor 1121bc can process 4 first or second input weight parameters per cycle. The first parallel sub-processors 1121ba and the second parallel sub-processor 1121bc therefore each need 256 cycles to process the first input weight parameters and the second input weight parameters corresponding to one of the input data 106, outputting the first output weight parameters and the second output weight parameters, respectively, while the bias parallel sub-processor 1121bb uses 64 cycles to process the bias input weight parameters corresponding to that input data 106 to output the bias output weight parameters. Thereby, the convolutional neural network processor 100 can process the input weight parameters 104 in parallel by performing the receiving step s210, the instruction decoding step s220, and the parallel processing step s230.
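The quoted cycle counts can be checked with a small arithmetic sanity check, assuming the weights for one input data are split evenly across the parallel sub-processors (an assumption; the description gives the counts but not the distribution rule explicitly).

```python
# Sanity check of the cycle counts quoted above: weights per input data,
# divided across the parallel sub-processors, at 4 parameters per cycle.
first_weights, first_processors, per_cycle = 9216, 9, 4
second_weights, second_processors = 1024, 1

first_cycles = first_weights // first_processors // per_cycle   # 9216 / 9 / 4
second_cycles = second_weights // second_processors // per_cycle  # 1024 / 1 / 4
```

Both quotients come out to 256, matching the 256 cycles stated for the first and second parallel sub-processors.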
The operation module 121 of the convolution judgment unit 120 can perform the operation step s240, and the operation module 121 includes the 3×3 operation sub-module 1211, the bias distributor 1212, and the 1×1 operation sub-module 1213. The bias distributor 1212 can perform the bias operation sub-step s242. In the bias operation sub-step s242, the bias distributor 1212 receives the 3×3 bias weight parameters and the 1×1 bias weight parameters, distributes the 3×3 bias weight parameters to the 3×3 post-processing operation units 1211e of the 3×3 operation sub-module 1211 so that they can perform the 3×3 post-processing operation procedure s2414, and distributes the 1×1 bias weight parameters to the 1×1 post-processing operation units 1213e of the 1×1 operation sub-module 1213 so that they can perform the 1×1 post-processing operation procedure s2434.
The 3×3 operation sub-module 1211 can perform the first operation sub-step s241, and the 3×3 operation sub-module 1211 includes the plurality of 3×3 convolution distributor groups, the plurality of 3×3 local convolution operation units 1211b, and the plurality of 3×3 post-processing operation units 1211e. The 3×3 convolution distributor groups are electrically connected to the first parallel sub-processors 1121ba and, in the 3×3 parameter distribution procedure s2411, receive and distribute the 3×3 weight parameters to the 3×3 local convolution operation units 1211b so that the 3×3 local convolution operation units 1211b can perform the 3×3 operation parameter generation procedure s2412 and the 3×3 convolution operation procedure s2413. Each 3×3 local convolution operation unit 1211b includes a 3×3 local register bank 1211c and a 3×3 local filter operation unit 1211d. The 3×3 local register bank 1211c can perform the 3×3 operation parameter generation procedure s2412; it includes the two sub 3×3 local register banks 1211ca and 1211cb and performs the 3×3 operation parameter generation procedure s2412 in a ping-pong manner to output the 3×3 operation parameters to the 3×3 local filter operation unit 1211d. In the 3×3 convolution operation procedure s2413, the 3×3 local filter operation unit 1211d performs the 3×3 convolution operation on the 3×3 operation parameters and the input data 106 to produce the 3×3 operation data, where the spatial filter positions of the 3×3 convolution operation can each correspond to one of the first parallel sub-processors 1121ba. In the 3×3 post-processing operation procedure s2414, the 3×3 post-processing operation unit 1211e performs the 3×3 post-processing operation according to the 3×3 operation data and the 3×3 bias weight parameters to output the 3×3 post-processed operation data 1062.
The 1×1 operation sub-module 1213 can perform the second operation sub-step s243, and the 1×1 operation sub-module 1213 includes the at least one 1×1 convolution distributor group, the plurality of 1×1 local convolution operation units 1213b, and the plurality of 1×1 post-processing operation units 1213e. The 1×1 convolution distributor group is electrically connected to the at least one second parallel sub-processor 1121bc and, in the 1×1 parameter distribution procedure s2431, receives and distributes the 1×1 weight parameters to the 1×1 local convolution operation units 1213b so that the 1×1 local convolution operation units 1213b can perform the 1×1 operation parameter generation procedure s2432 and the 1×1 convolution operation procedure s2433. Each 1×1 local convolution operation unit 1213b includes a 1×1 local register bank 1213c and a 1×1 local filter operation unit 1213d. The 1×1 local register bank 1213c can perform the 1×1 operation parameter generation procedure s2432; it includes the two sub 1×1 local register banks 1213ca and 1213cb and performs the 1×1 operation parameter generation procedure s2432 in a ping-pong manner to output the 1×1 operation parameters to the 1×1 local filter operation unit 1213d. In the 1×1 convolution operation procedure s2433, the 1×1 local filter operation unit 1213d performs the 1×1 convolution operation on the 1×1 operation parameters and the 3×3 post-processed operation data 1062 produced in the 3×3 post-processing operation procedure s2414 to generate the 1×1 operation data, where the spatial filter positions of the 1×1 convolution operation can correspond to the at least one second parallel sub-processor 1121bc. In the 1×1 post-processing operation procedure s2434, the 1×1 post-processing operation unit 1213e performs the 1×1 post-processing operation according to the 1×1 operation data and the 1×1 bias weight parameters to output the 1×1 post-processed operation data 1064. The 1×1 post-processed operation data 1064 output in the 1×1 post-processing operation procedure s2434 is the output data 108 produced by the convolutional neural network processor 100 executing the data processing method s200 of the convolutional neural network processor.
In summary, by executing the data processing method s200 of the convolutional neural network processor, the convolutional neural network processor 100 can perform highly parallel operations and thereby provide high-performance, low-power computation.
Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention; therefore, the scope of protection of the present invention shall be defined by the appended claims.
Claims (24)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/701,172 US11494645B2 (en) | 2018-12-06 | 2019-12-03 | Convolutional neural network processor and data processing method thereof |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201862776426P | 2018-12-06 | 2018-12-06 | |
| US62/776,426 | 2018-12-06 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202022710A TW202022710A (en) | 2020-06-16 |
| TWI766193B true TWI766193B (en) | 2022-06-01 |
Family
ID=71029040
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW108136729A TWI766193B (en) | 2018-12-06 | 2019-10-09 | Convolutional neural network processor and data processing method thereof |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN111291874B (en) |
| TW (1) | TWI766193B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11610134B2 (en) * | 2019-07-08 | 2023-03-21 | Vianai Systems, Inc. | Techniques for defining and executing program code specifying neural network architectures |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW201706871A (en) * | 2015-05-21 | 2017-02-16 | 咕果公司 | Calculate convolution using a neural network-like processor |
| TW201706872A (en) * | 2015-05-21 | 2017-02-16 | 咕果公司 | Prefetch weights for use in class-like neural network processors |
| US20180336165A1 (en) * | 2017-05-17 | 2018-11-22 | Google Llc | Performing matrix multiplication in hardware |
| US20180336164A1 (en) * | 2017-05-17 | 2018-11-22 | Google Llc | Low latency matrix multiply unit |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP3210319B2 (en) * | 1990-03-01 | 2001-09-17 | 株式会社東芝 | Neurochip and neurocomputer using the chip |
| CN105681628B (en) * | 2016-01-05 | 2018-12-07 | 西安交通大学 | A kind of convolutional network arithmetic element and restructural convolutional neural networks processor and the method for realizing image denoising processing |
| WO2017177446A1 (en) * | 2016-04-15 | 2017-10-19 | 北京中科寒武纪科技有限公司 | Discrete data representation-supporting apparatus and method for back-training of artificial neural network |
| US10726583B2 (en) * | 2016-12-30 | 2020-07-28 | Intel Corporation | System and method of encoding and decoding feature maps and weights for a convolutional neural network |
| CN108763191B (en) * | 2018-04-16 | 2022-02-11 | 华南师范大学 | A method and system for generating text summaries |
-
2019
- 2019-10-09 TW TW108136729A patent/TWI766193B/en active
- 2019-10-09 CN CN201910953878.XA patent/CN111291874B/en active Active
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW201706871A (en) * | 2015-05-21 | 2017-02-16 | 咕果公司 | Calculate convolution using a neural network-like processor |
| TW201706872A (en) * | 2015-05-21 | 2017-02-16 | 咕果公司 | Prefetch weights for use in class-like neural network processors |
| US20180336165A1 (en) * | 2017-05-17 | 2018-11-22 | Google Llc | Performing matrix multiplication in hardware |
| US20180336164A1 (en) * | 2017-05-17 | 2018-11-22 | Google Llc | Low latency matrix multiply unit |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111291874B (en) | 2023-12-01 |
| CN111291874A (en) | 2020-06-16 |
| TW202022710A (en) | 2020-06-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3624020B1 (en) | Computation method and product thereof | |
| US11531540B2 (en) | Processing apparatus and processing method with dynamically configurable operation bit width | |
| Nag et al. | ViTA: A vision transformer inference accelerator for edge applications | |
| CN110163362B (en) | A computing device and method | |
| US20190026626A1 (en) | Neural network accelerator and operation method thereof | |
| US11494645B2 (en) | Convolutional neural network processor and data processing method thereof | |
| US10908916B2 (en) | Apparatus and method for executing a plurality of threads | |
| US11468600B2 (en) | Information processing apparatus, information processing method, non-transitory computer-readable storage medium | |
| Kyrkou et al. | An embedded hardware-efficient architecture for real-time cascade support vector machine classification | |
| Meloni et al. | A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC | |
| TWI766193B (en) | Convolutional neural network processor and data processing method thereof | |
| JP2025138911A (en) | Processing method and computing system | |
| CN116796812A (en) | Programmable parallel processing device, neural network chip and electronic equipment | |
| Ryu et al. | A 44.1 TOPS/W precision-scalable accelerator for quantized neural networks in 28nm CMOS | |
| Dong et al. | UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA | |
| CN115345287B (en) | Methods for computing macro arrangements in memory, computer-readable media and electronic devices | |
| CN113407238A (en) | Many-core architecture with heterogeneous processors and data processing method thereof | |
| Shan et al. | Design of Approximate Multi-Granularity Multiply-Accumulate Unit for Convolutional Neural Network | |
| Beyer et al. | Exploiting subword permutations to maximize CNN compute performance and efficiency | |
| Zhao et al. | A microcode-based control unit for deep learning processors | |
| Rai et al. | Accelerating Automated Driving and ADAS Using HW/SW Codesign | |
| Modiboyina et al. | Accelerating U-Net: A patchwise memory optimization approach for image segmentation | |
| US20240411517A1 (en) | Data processing device, data processing method, and chip | |
| Piyasena et al. | Lowering dynamic power of a stream-based cnn hardware accelerator | |
| Im et al. | Energy-efficient Dense DNN Acceleration with Signed Bit-slice Architecture |