
TWI766193B - Convolutional neural network processor and data processing method thereof - Google Patents

Convolutional neural network processor and data processing method thereof Download PDF

Info

Publication number
TWI766193B
TWI766193B
Authority
TW
Taiwan
Prior art keywords
sub
parallel
output
bias
input
Prior art date
Application number
TW108136729A
Other languages
Chinese (zh)
Other versions
TW202022710A (en)
Inventor
黃朝宗
Original Assignee
神盾股份有限公司
Priority date
Filing date
Publication date
Application filed by 神盾股份有限公司
Priority to US16/701,172 (granted as US11494645B2)
Publication of TW202022710A
Application granted
Publication of TWI766193B

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

A convolutional neural network processor includes an Information Decode Unit (IDU) and a Convolutional Neural Network Inference Unit (CIU). The IDU receives a program input and a plurality of parameter inputs, and includes a decoding module and a parallel processing module. The decoding module receives the program input and produces an operating command according to it. The parallel processing module is electrically connected to the decoding module and receives the parameter inputs; it includes a plurality of parallel processing sub-modules, which produce a plurality of output weight parameters. The CIU is electrically connected to the IDU and includes a computing module. The computing module is electrically connected to the parallel processing module and produces output data according to input data and the output weight parameters. The convolutional neural network processor can therefore perform highly parallel computation.
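The abstract's data flow can be sketched in a few lines of Python. Everything here (the function names and the toy arithmetic) is illustrative only; the patent defines the roles of the units, not a software API.

```python
# Illustrative sketch of the IDU -> CIU data flow described above.
# All names are hypothetical; the hardware units are modeled as functions.

def decode(program):
    """Decoding module: turn the program input into an operating command."""
    return {"op": program}

def parallel_process(command, parameter_inputs):
    """Parallel processing sub-modules: each yields one output weight parameter."""
    return [p for p in parameter_inputs]  # identity here; hardware may also decompress

def compute(input_data, weights):
    """Computing module: combine input data with the output weight parameters."""
    return [x * w for x, w in zip(input_data, weights)]

command = decode("conv3x3")
weights = parallel_process(command, [2, 3, 4])
output = compute([1, 1, 1], weights)
```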

Description

Convolutional neural network processor and data processing method thereof

The present invention relates to a convolutional neural network processor and a data processing method thereof, and more particularly to a convolutional neural network processor having an information decoding unit and a convolutional neural network inference unit, and to its data processing method.

Convolutional neural networks (CNNs) have recently been widely applied in computer vision and image processing. Recent applications, however, focus mostly on object recognition and object detection, so CNN hardware designs have not been optimized for image-processing networks: those applications do not have to consider (1) spatial resolution that is not heavily downsampled and (2) the failure of model sparsity, which together lead to extremely high memory-bandwidth and computing-power requirements.

In view of this, the present invention provides a convolutional neural network processor capable of highly parallel computation, and a data processing method thereof, to deliver high-performance computation.

The convolutional neural network processor and data processing method provided by the present invention perform highly parallel computation through an information decoding unit and a convolutional neural network inference unit.

According to one embodiment of the present invention, a convolutional neural network processor for computing input data includes an information decoding unit and a convolutional neural network inference unit. The information decoding unit receives an input program and a plurality of input weight parameters, and includes a decoding module and a parallel processing module. The decoding module receives the input program and outputs an operation command according to it. The parallel processing module is electrically connected to the decoding module and receives the input weight parameters; it includes a plurality of parallel processing sub-modules, which generate a plurality of output weight parameters according to the operation command and the input weight parameters. The convolutional neural network inference unit is electrically connected to the information decoding unit and includes a computing module. The computing module is electrically connected to the parallel processing module and generates output data by computing on the input data and the output weight parameters.

Thereby, the convolutional neural network processor can perform highly parallel computation through the decoding module and the parallel processing module of the information decoding unit, together with the computing module of the convolutional neural network inference unit, providing high-performance, low-power computation.

In the convolutional neural network processor of the preceding embodiment, the decoding module includes a program memory and an instruction decoder. The program memory stores the input program. The instruction decoder is electrically connected to the program memory and decodes the input program to output the operation command.

In the convolutional neural network processor of the preceding embodiment, when the input weight parameters are a plurality of uncompressed input weight parameters, each parallel processing sub-module includes a plurality of parallel sub-memories and a plurality of parallel sub-processors. The parallel sub-memories store the uncompressed input weight parameters in parallel. The parallel sub-processors are electrically connected to the decoding module and to the parallel sub-memories, respectively; according to the operation command, they receive the uncompressed input weight parameters in parallel and generate the output weight parameters.

In the convolutional neural network processor of the preceding embodiment, when the input weight parameters are a plurality of compressed input weight parameters, each parallel processing sub-module includes a plurality of parallel sub-memories and a plurality of parallel sub-processors. The parallel sub-memories store the compressed input weight parameters in parallel. The parallel sub-processors are electrically connected to the decoding module and to the parallel sub-memories, respectively; according to the operation command, they receive and decompress the compressed input weight parameters in parallel and generate the output weight parameters.
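The embodiment does not fix a compression scheme. Purely as an illustration, the sketch below has each parallel sub-processor expand a zero run-length encoded weight stream, a common encoding for sparse CNN weights; the format and all names are assumptions, not taken from the patent.

```python
# Hypothetical weight decompression: in each stream, the token 0 is a marker
# followed by a run length of zeros; any other token is a literal weight.

def decompress(stream):
    weights = []
    it = iter(stream)
    for token in it:
        if token == 0:                  # zero-run marker: next token is the run length
            weights.extend([0] * next(it))
        else:                           # literal non-zero weight
            weights.append(token)
    return weights

# Two sub-memories holding compressed streams, expanded "in parallel"
# (sequentially here; concurrently in the described hardware).
sub_memories = [[5, 0, 3, 7], [0, 2, 9]]
output_weights = [decompress(s) for s in sub_memories]
```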

In the convolutional neural network processor of the preceding embodiment, the input weight parameters include a plurality of first input weight parameters, and the output weight parameters include a plurality of first output weight parameters. The parallel processing sub-module includes a plurality of parallel sub-memories and a plurality of parallel sub-processors. The parallel sub-memories store the input weight parameters in parallel and include a plurality of first parallel sub-memories, which respectively receive and store the first input weight parameters in parallel. The parallel sub-processors are electrically connected to the decoding module and to the parallel sub-memories, respectively, and include a plurality of first parallel sub-processors. The first parallel sub-processors are electrically connected to the first parallel sub-memories, respectively, and receive the first input weight parameters according to the operation command to output the first output weight parameters.

In the convolutional neural network processor of the preceding embodiment, the first output weight parameters include a plurality of 3×3 weight parameters, and the computing module includes a 3×3 operation sub-module. The 3×3 operation sub-module is electrically connected to the first parallel sub-processors and computes on the input data according to the first output weight parameters to generate 3×3 post-processing operation data; it includes a plurality of 3×3 convolution distributor groups, a plurality of 3×3 local convolution operation units, and a plurality of 3×3 post-processing operation units. Each 3×3 convolution distributor group is electrically connected to one first parallel sub-processor and receives and distributes the 3×3 weight parameters of the first output weight parameters. The 3×3 local convolution operation units are each electrically connected to one 3×3 convolution distributor group, and each includes a 3×3 local register group and a 3×3 local filter operation unit. The 3×3 local register group is electrically connected to a 3×3 convolution distributor group; it receives and stores the 3×3 weight parameters of the first output weight parameters and outputs a plurality of 3×3 operation parameters according to them. The 3×3 local filter operation unit is electrically connected to the 3×3 local register group and computes on the 3×3 operation parameters and the input data to generate a plurality of 3×3 operation data. The 3×3 post-processing operation units are electrically connected to the 3×3 local convolution operation units and perform 3×3 post-processing operations on the 3×3 operation data to generate the 3×3 post-processing operation data, where the output data is the 3×3 post-processing operation data.
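A minimal model of what one 3×3 local filter operation unit computes: a multiply-accumulate of a 3×3 weight window over each 3×3 input patch. Pure-Python lists stand in for the hardware registers and data paths; this is a sketch of the arithmetic, not of the circuit.

```python
# One 3x3 multiply-accumulate, and a valid (no-padding) 3x3 convolution
# built from it over a 2-D list of pixels.

def conv3x3_patch(patch, weights):
    return sum(patch[i][j] * weights[i][j] for i in range(3) for j in range(3))

def conv3x3(image, weights):
    h, w = len(image), len(image[0])
    return [[conv3x3_patch([row[x:x + 3] for row in image[y:y + 3]], weights)
             for x in range(w - 2)] for y in range(h - 2)]

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]      # picks the centre of each window
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
out = conv3x3(image, identity)
```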

In the convolutional neural network processor of the preceding embodiment, each 3×3 local register group includes two sub 3×3 local register groups, which alternately either store a 3×3 weight parameter or output the 3×3 operation parameters to the 3×3 local filter operation unit.
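The two alternating sub-register groups behave like a ping-pong (double) buffer: one bank feeds the filter unit while the other is being loaded with the next weights, and then the roles swap. A toy sketch with illustrative names:

```python
# Ping-pong register pair: bank `active` feeds the filter unit while the
# other bank is loaded; swap() exchanges the roles.

class PingPongRegisters:
    def __init__(self):
        self.banks = [None, None]
        self.active = 0                        # bank currently feeding the filter

    def load(self, weights):
        self.banks[1 - self.active] = weights  # fill the idle bank

    def read(self):
        return self.banks[self.active]         # feed the filter unit

    def swap(self):
        self.active = 1 - self.active

regs = PingPongRegisters()
regs.load("weights_layer0")
regs.swap()                  # layer-0 weights become active
regs.load("weights_layer1")  # preload the next weights while layer 0 computes
current = regs.read()
```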

In the convolutional neural network processor of the preceding embodiment, the input weight parameters further include bias input weight parameters, and the output weight parameters further include bias output weight parameters. The parallel sub-memories further include a bias parallel sub-memory, which stores the bias input weight parameters in parallel. The parallel sub-processors further include a bias parallel sub-processor, which is electrically connected to the bias parallel sub-memory and, according to the operation command, receives the bias input weight parameters to output the bias output weight parameters.

In the convolutional neural network processor of the preceding embodiment, the bias output weight parameters include a plurality of bias weight parameters, and the computing module further includes a bias distributor. The bias distributor is electrically connected to the bias parallel sub-processor and to the 3×3 operation sub-module; it generates a plurality of 3×3 bias weight parameters according to the bias output weight parameters and outputs them to the 3×3 post-processing operation units.

In the convolutional neural network processor of the preceding embodiment, the input weight parameters further include at least one second input weight parameter, and the output weight parameters further include at least one second output weight parameter. The parallel sub-memories further include at least one second parallel sub-memory, which receives and stores the at least one second input weight parameter in parallel. The parallel sub-processors further include at least one second parallel sub-processor, which is electrically connected to the at least one second parallel sub-memory and receives the at least one second input weight parameter according to the operation command to output the at least one second output weight parameter.

In the convolutional neural network processor of the preceding embodiment, the computing module includes a 3×3 operation sub-module and a 1×1 operation sub-module. The 3×3 operation sub-module is electrically connected to the first parallel sub-processors and computes on the input data according to the first output weight parameters to generate 3×3 post-processing operation data. The 1×1 operation sub-module is electrically connected to the at least one second parallel sub-processor and to the 3×3 operation sub-module, and computes on the 3×3 post-processing operation data according to the at least one second output weight parameter to generate 1×1 post-processing operation data, where the output data can be the 1×1 post-processing operation data.
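The 3×3 stage followed by the 1×1 stage can be pictured as a two-stage pipeline, much like the spatial/pointwise split in separable convolutions. The sketch below collapses each stage to scalar arithmetic and is illustrative only; real feature maps keep their spatial extent.

```python
# Two-stage toy pipeline: a 3x3-style stage reduces each input channel to one
# feature value, then a 1x1 stage mixes the channels with scalar weights.

def stage_3x3(channels, w3):
    """One scalar feature per channel: sum of the channel times its weight."""
    return [sum(sum(row) for row in ch) * w for ch, w in zip(channels, w3)]

def stage_1x1(features, w1):
    """A 1x1 convolution is a weighted mix across channels."""
    return sum(f * w for f, w in zip(features, w1))

channels = [[[1, 1], [1, 1]], [[2, 2], [2, 2]]]   # two 2x2 input channels
feat = stage_3x3(channels, [1, 1])                # per-channel features
out = stage_1x1(feat, [0.5, 0.25])                # channel mix
```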

In the convolutional neural network processor of the preceding embodiment, the at least one second output weight parameter includes a plurality of 1×1 weight parameters. The 1×1 operation sub-module includes at least one 1×1 convolution distributor group, a plurality of 1×1 local convolution operation units, and a plurality of 1×1 post-processing operation units. The 1×1 convolution distributor group is electrically connected to the at least one second parallel sub-processor and receives and distributes the 1×1 weight parameters of the at least one second output weight parameter. The 1×1 local convolution operation units are electrically connected to the at least one 1×1 convolution distributor group, and each includes a 1×1 local register group and a 1×1 local filter operation unit. The 1×1 local register group is electrically connected to the at least one 1×1 convolution distributor group; it receives and stores the 1×1 weight parameters of the at least one second output weight parameter and outputs a plurality of 1×1 operation parameters according to them. The 1×1 local filter operation unit is electrically connected to the 1×1 local register group and computes on the 1×1 operation parameters and the 3×3 post-processing operation data to generate a plurality of 1×1 operation data. The 1×1 post-processing operation units are electrically connected to the 1×1 local convolution operation units and perform 1×1 post-processing operations on the 1×1 operation data to generate the 1×1 post-processing operation data.

In the convolutional neural network processor of the preceding embodiment, each 1×1 local register group includes two sub 1×1 local register groups, which alternately either store a 1×1 weight parameter or output the 1×1 operation parameters to the 1×1 local filter operation unit.

In the convolutional neural network processor of the preceding embodiment, the input weight parameters further include bias input weight parameters, and the output weight parameters further include bias output weight parameters. The parallel sub-memories further include a bias parallel sub-memory, which stores the bias input weight parameters in parallel. The parallel sub-processors further include a bias parallel sub-processor, which is electrically connected to the bias parallel sub-memory and, according to the operation command, receives the bias input weight parameters to output the bias output weight parameters.

In the convolutional neural network processor of the preceding embodiment, the bias output weight parameters include a plurality of bias weight parameters, and the computing module further includes a bias distributor. The bias distributor is electrically connected to the bias parallel sub-processor, the 3×3 operation sub-module, and the 1×1 operation sub-module; according to the bias output weight parameters, it generates a plurality of 3×3 bias weight parameters and a plurality of 1×1 bias weight parameters, outputting the 3×3 bias weight parameters to the 3×3 post-processing operation units and the 1×1 bias weight parameters to the 1×1 post-processing operation units.
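A toy model of the bias distributor: split the bias output weight parameters between the 3×3 and 1×1 post-processing units, which add the bias after convolution. The bias-add followed by ReLU is shown purely as an example of a post-processing operation; the patent does not commit to a specific activation.

```python
# Hypothetical bias distribution: the first n_3x3 values go to the 3x3
# post-processing units, the next n_1x1 values to the 1x1 units.

def distribute_bias(bias_params, n_3x3, n_1x1):
    return bias_params[:n_3x3], bias_params[n_3x3:n_3x3 + n_1x1]

def post_process(value, bias):
    return max(value + bias, 0)   # bias add + ReLU, as an illustrative example

b3, b1 = distribute_bias([1, 2, 3], n_3x3=2, n_1x1=1)
out = post_process(-2, b3[0])
```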

According to one embodiment of the present invention, a data processing method of a convolutional neural network processor includes a receiving step, an instruction decoding step, a parallel processing step, and an operation step. The receiving step drives the information decoding unit to receive an input program and a plurality of input weight parameters, where the information decoding unit includes a decoding module and a parallel processing module. The instruction decoding step drives the decoding module to receive the input program and generate an operation command according to it. The parallel processing step drives the parallel processing module to receive the input weight parameters and process them in parallel according to the operation command to generate a plurality of output weight parameters. The operation step drives the computing module to receive the input data and the output weight parameters, and to compute on them according to the operation command to generate the output data.

Thereby, through the receiving step, the instruction decoding step, the parallel processing step, and the operation step, the data processing method drives the decoding module and the parallel processing module of the information decoding unit, together with the computing module of the convolutional neural network inference unit, to perform highly parallel computation, providing high-performance, low-power computation.

In the data processing method of the preceding embodiment, the decoding module includes a program memory and an instruction decoder, and the instruction decoding step includes a program storage sub-step and a program decoding sub-step. The program storage sub-step drives the program memory to store the input program. The program decoding sub-step drives the instruction decoder to decode the input program to generate the operation command.

In the data processing method of the preceding embodiment, the parallel processing module includes a plurality of parallel sub-memories and a plurality of parallel sub-processors, and the parallel processing step includes a weight parameter storage sub-step and a weight parameter processing sub-step. The weight parameter storage sub-step drives the parallel sub-memories to store the input weight parameters in parallel. The weight parameter processing sub-step drives the parallel sub-processors to read the input weight parameters in parallel according to the operation command and process them to generate the output weight parameters.

In the data processing method of the preceding embodiment, when the input weight parameters are a plurality of uncompressed input weight parameters, the processing is to store the uncompressed input weight parameters; when the input weight parameters are a plurality of compressed input weight parameters, the processing is to store and decompress the compressed input weight parameters.

In the data processing method of the preceding embodiment, the output weight parameters include a plurality of first output weight parameters, the computing module includes a 3×3 operation sub-module, and the operation step includes a first operation sub-step. The first operation sub-step drives the 3×3 operation sub-module to receive the input data and the first output weight parameters to generate 3×3 post-processing operation data.

In the data processing method of the preceding embodiment, each first output weight parameter includes a plurality of 3×3 weight parameters. The 3×3 operation sub-module includes a plurality of 3×3 convolution distributor groups, a plurality of 3×3 local convolution operation units, and a plurality of 3×3 post-processing operation units; each 3×3 local convolution operation unit includes a 3×3 local register group and a 3×3 local filter operation unit. The first operation sub-step includes a 3×3 parameter allocation procedure, a 3×3 operation parameter generation procedure, a 3×3 convolution operation procedure, and a 3×3 post-processing operation procedure. The 3×3 parameter allocation procedure drives the 3×3 convolution distributor groups to receive the 3×3 weight parameters of the first output weight parameters and allocate them to the 3×3 local convolution operation units. The 3×3 operation parameter generation procedure drives the 3×3 local register groups to receive the 3×3 weight parameters and generate a plurality of 3×3 operation parameters according to them. The 3×3 convolution operation procedure drives the 3×3 local filter operation units to perform a 3×3 convolution operation on the 3×3 operation parameters and the input data to generate a plurality of 3×3 operation data. The 3×3 post-processing operation procedure drives the 3×3 post-processing operation units to perform 3×3 post-processing operations on the 3×3 operation data to generate the 3×3 post-processing operation data, where the output data is the 3×3 post-processing operation data.
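The four procedures of the first operation sub-step can be chained as follows. Sizes are collapsed to toy values, the 3×3 window is flattened to nine scalars, and all names are illustrative rather than taken from the patent.

```python
# Allocate -> generate parameters -> convolve -> post-process, in order.

def distribute(first_output_weights, n_units):
    """3x3 parameter allocation: deal weight sets round-robin to the units."""
    return [first_output_weights[i::n_units] for i in range(n_units)]

def generate_params(stored):
    """3x3 operation parameter generation: next weight set for the filter."""
    return stored[0]

def convolve(params, data):
    """3x3 convolution over one flattened 3x3 window."""
    return sum(p * d for p, d in zip(params, data))

def post(value, bias=0):
    """3x3 post-processing: bias add + ReLU, shown only as an example."""
    return max(value + bias, 0)

per_unit = distribute([[1] * 9, [2] * 9], n_units=2)
params = generate_params(per_unit[0])
result = post(convolve(params, [1] * 9))
```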

In the data processing method of the preceding embodiment, the output weight parameters further include bias output weight parameters, the computing module further includes a bias distributor, and the operation step further includes a bias operation sub-step. The bias operation sub-step drives the bias distributor to generate a plurality of 3×3 bias weight parameters according to the bias output weight parameters and provide them to the 3×3 operation sub-module.

In the data processing method of the preceding embodiment, the output weight parameters further include at least one second output weight parameter, the computing module includes a 1×1 operation sub-module, and the operation step further includes a second operation sub-step. The second operation sub-step drives the 1×1 operation sub-module to receive the 3×3 post-processing operation data and the at least one second output weight parameter to generate 1×1 post-processing operation data.

According to the data processing method of the convolutional neural network processor of the embodiment described in the preceding paragraph, the at least one second output weight parameter includes a plurality of 1×1 weight parameters. The 1×1 operation sub-module includes a plurality of 1×1 convolution distributor groups, a plurality of 1×1 local convolution operation units and a plurality of 1×1 post-processing operation units. The second operation sub-step includes a 1×1 parameter distributing procedure, a 1×1 operation parameter generating procedure, a 1×1 convolution operation procedure and a 1×1 post-processing operation procedure. The 1×1 parameter distributing procedure drives the 1×1 convolution distributor groups to receive the 1×1 weight parameters of the at least one second output weight parameter and distribute them to the 1×1 local convolution operation units, each of which includes a 1×1 local register group and a 1×1 local filtering operation unit. The 1×1 operation parameter generating procedure drives the 1×1 local register groups of the 1×1 local convolution operation units to receive the 1×1 weight parameters of the at least one second output weight parameter and to generate a plurality of 1×1 operation parameters therefrom. The 1×1 convolution operation procedure drives the 1×1 local filtering operation units of the 1×1 local convolution operation units to perform a 1×1 convolution operation on the 1×1 operation parameters and the 3×3 post-processing operation data so as to generate a plurality of 1×1 operation data. The 1×1 post-processing operation procedure drives the 1×1 post-processing operation units to perform a 1×1 post-processing operation on the 1×1 operation data so as to generate the 1×1 post-processing operation data, wherein the output data is the 1×1 post-processing operation data.
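Functionally, the 1×1 convolution operation described above reduces, at each pixel position, to a weighted sum across the input channels. The following is a minimal sketch in Python/NumPy; the channel count, tile size and function name are illustrative assumptions and are not part of the disclosed hardware:

```python
import numpy as np

def conv1x1(feature_maps, weights, bias=0.0):
    """1x1 convolution: a per-pixel weighted sum across input channels.

    feature_maps: array of shape (C_in, H, W), e.g. the 3x3 post-processing
    operation data produced upstream.
    weights: array of shape (C_in,), the 1x1 weight parameters of one
    output channel.
    Returns one (H, W) output channel.
    """
    return np.tensordot(weights, feature_maps, axes=([0], [0])) + bias

# Hypothetical sizes: 4 input channels over a 6x4 tile.
x = np.ones((4, 6, 4))
w = np.array([0.5, 0.5, 1.0, 1.0])
y = conv1x1(x, w)  # every output pixel is 0.5 + 0.5 + 1.0 + 1.0 = 3.0
```

In hardware, one such weighted sum is computed by each 1×1 local filtering operation unit; the sketch only shows the arithmetic it performs.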

According to the data processing method of the convolutional neural network processor of the embodiment described in the preceding paragraph, the output weight parameters further include a bias output weight parameter, the operation module further includes a bias distributor, and the operation step further includes a bias operation sub-step. The bias operation sub-step drives the bias distributor to generate a plurality of 3×3 bias weight parameters and a plurality of 1×1 bias weight parameters according to the bias output weight parameter, wherein the bias distributor provides the 3×3 bias weight parameters to the 3×3 operation sub-module and provides the 1×1 bias weight parameters to the 1×1 operation sub-module.

100‧‧‧Convolutional neural network processor

102‧‧‧Input program

104‧‧‧Input weight parameters

106‧‧‧Input data

1062‧‧‧3×3 post-processing operation data

1064‧‧‧1×1 post-processing operation data

108‧‧‧Output data

110‧‧‧Message decoding unit

111‧‧‧Decoding module

1111‧‧‧Program memory

1112‧‧‧Instruction decoder

112‧‧‧Parallel processing module

1121‧‧‧Parallel processing sub-module

1121a‧‧‧Parallel sub-memory

1121aa‧‧‧First parallel sub-memory

1121ab‧‧‧Bias parallel sub-memory

1121ac‧‧‧Second parallel sub-memory

1213d‧‧‧1×1 local filtering operation unit

1213e‧‧‧1×1 post-processing operation unit

1213f‧‧‧First 1×1 convolution distributor

1213g‧‧‧Second 1×1 convolution distributor

122‧‧‧Controller

s200‧‧‧Data processing method of a convolutional neural network processor

s210‧‧‧Receiving step

s220‧‧‧Instruction decoding step

s221‧‧‧Program storing sub-step

s222‧‧‧Program decoding sub-step

s230‧‧‧Parallel processing step

s231‧‧‧Weight parameter storing sub-step

1121b‧‧‧Parallel sub-processor

1121ba‧‧‧First parallel sub-processor

1121bb‧‧‧Bias parallel sub-processor

1121bc‧‧‧Second parallel sub-processor

120‧‧‧Convolution determination unit

121‧‧‧Operation module

1211‧‧‧3×3 operation sub-module

1211a‧‧‧3×3 operation circuit

1211b‧‧‧3×3 local convolution operation unit

1211c‧‧‧3×3 local register group

1211ca, 1211cb‧‧‧Sub 3×3 local register groups

1211d‧‧‧3×3 local filtering operation unit

1211e‧‧‧3×3 post-processing operation unit

1211f‧‧‧First 3×3 convolution distributor

1211g‧‧‧Second 3×3 convolution distributor

1212‧‧‧Bias distributor

1213‧‧‧1×1 operation sub-module

1213a‧‧‧1×1 operation circuit

1213b‧‧‧1×1 local convolution operation unit

1213c‧‧‧1×1 local register group

1213ca, 1213cb‧‧‧Sub 1×1 local register groups

s232‧‧‧Weight parameter processing sub-step

s240‧‧‧Operation step

s241‧‧‧First operation sub-step

s2411‧‧‧3×3 parameter distributing procedure

s2412‧‧‧3×3 operation parameter generating procedure

s2413‧‧‧3×3 convolution operation procedure

s2414‧‧‧3×3 post-processing operation procedure

s242‧‧‧Bias operation sub-step

s243‧‧‧Second operation sub-step

s2431‧‧‧1×1 parameter distributing procedure

s2432‧‧‧1×1 operation parameter generating procedure

s2433‧‧‧1×1 convolution operation procedure

s2434‧‧‧1×1 post-processing operation procedure

FIG. 1 is a block diagram of a convolutional neural network processor according to an embodiment of a structural aspect of the present invention; FIG. 2 is a block diagram of a convolutional neural network processor according to an embodiment of another structural aspect of the present invention; FIG. 3 is a block diagram of a 3×3 operation sub-module of the convolutional neural network processor according to the embodiment of the structural aspect of FIG. 2; FIG. 4 is a schematic view of a 3×3 local convolution operation unit of the 3×3 operation sub-module of the convolutional neural network processor according to the embodiment of the structural aspect of FIG. 3; FIG. 5 is a block diagram of a convolutional neural network processor according to an embodiment of a further structural aspect of the present invention; FIG. 6 is a block diagram of a 1×1 operation sub-module of the convolutional neural network processor according to the embodiment of the structural aspect of FIG. 5; FIG. 7 is a schematic view of a 1×1 local convolution operation unit of the 1×1 operation sub-module of the convolutional neural network processor according to the embodiment of the structural aspect of FIG. 6; FIG. 8 is a block diagram of the steps of a data processing method of a convolutional neural network processor according to an embodiment of a method aspect of the present invention; FIG. 9 is a block diagram of the instruction decoding step of the data processing method according to the embodiment of the method aspect of FIG. 8; FIG. 10 is a block diagram of the parallel processing step of the data processing method according to the embodiment of the method aspect of FIG. 8; FIG. 11 is a block diagram of the operation step of the data processing method according to the embodiment of the method aspect of FIG. 8; and FIG. 12 is a block diagram of the operation step of the data processing method according to another embodiment of the method aspect of FIG. 8.

Several embodiments of the present invention will be described below with reference to the drawings. For the sake of clarity, many practical details are set forth in the following description. It should be understood, however, that these practical details are not intended to limit the present invention; that is, in some embodiments of the present invention, these practical details are unnecessary. In addition, in order to simplify the drawings, some conventional structures and elements are shown in a simple, schematic manner, and repeated elements may be denoted by the same reference numerals.

FIG. 1 is a block diagram of a convolutional neural network processor 100 according to an embodiment of a structural aspect of the present invention. As shown in FIG. 1, the convolutional neural network processor 100 includes a message decoding unit 110 and a convolution determination unit 120, and the convolution determination unit 120 is electrically connected to the message decoding unit 110.

The message decoding unit 110 receives an input program 102 and a plurality of input weight parameters 104. The message decoding unit 110 includes a decoding module 111 and a parallel processing module 112. The decoding module 111 receives the input program 102 and outputs an operation command according to the input program 102. The parallel processing module 112 is electrically connected to the decoding module 111 and receives the input weight parameters 104 and the operation command. The parallel processing module 112 includes a plurality of parallel processing sub-modules 1121, which generate a plurality of output weight parameters according to the operation command and the input weight parameters 104. The convolution determination unit 120 includes an operation module 121. The operation module 121 is electrically connected to the parallel processing module 112 and generates output data 108 by operating on input data 106 and the output weight parameters. In detail, after receiving the input program 102 and the input weight parameters 104, the message decoding unit 110 of the convolutional neural network processor 100 uses the decoding module 111 to generate the operation command for processing the input weight parameters 104. Each of the parallel processing sub-modules 1121 of the parallel processing module 112 can be electrically connected to the decoding module 111 so as to generate the output weight parameters according to the operation command. The operation module 121 can perform operations according to the input data 106 and the output weight parameters generated by the parallel processing module 112 so as to generate the output data 108. The input data 106 can be data stored in a block buffer bank or data from an external source. In addition, the convolutional neural network processor 100 can use the block buffer bank in place of an input buffer and an output buffer so as to save the bandwidth of the external memory. Accordingly, the convolutional neural network processor 100 can perform highly parallel operations through the message decoding unit 110 and the convolution determination unit 120 so as to provide high-performance computation.

The decoding module 111 can include a program memory 1111 and an instruction decoder 1112. The program memory 1111 can store the input program 102. The instruction decoder 1112 is electrically connected to the program memory 1111 and decodes the input program 102 to output the operation command. In other words, after receiving the input program 102, the decoding module 111 stores the input program 102 in the program memory 1111 and decodes it through the instruction decoder 1112 to generate the operation command, which then drives each of the parallel processing sub-modules 1121 to process the input weight parameters 104 so as to generate the output weight parameters.
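The fetch-and-decode path above (program memory to instruction decoder to operation command) can be pictured as a minimal loop. The opcode names and command format below are invented for illustration only; the disclosure does not enumerate the instruction set:

```python
# Hypothetical contents of the program memory: (opcode, operand) pairs.
PROGRAM_MEMORY = [
    ("LOAD_W", 0),    # load weight parameters (illustrative opcode)
    ("CONV3x3", 1),   # run the 3x3 operation sub-module (illustrative)
    ("CONV1x1", 2),   # run the 1x1 operation sub-module (illustrative)
]

def decode(instruction):
    """Turn one stored instruction into an 'operation command' record."""
    opcode, operand = instruction
    return {"op": opcode, "arg": operand}

# The instruction decoder walks the program memory and emits commands
# that drive the parallel processing sub-modules.
commands = [decode(ins) for ins in PROGRAM_MEMORY]
```

The sketch only conveys the control-flow shape; the actual decoder is a hardware block, not software.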

When the input weight parameters 104 are uncompressed input weight parameters, the parallel processing sub-modules 1121 include a plurality of parallel sub-memories 1121a and a plurality of parallel sub-processors 1121b. The parallel sub-memories 1121a store the uncompressed input weight parameters in parallel. Each of the parallel sub-processors 1121b is electrically connected to the decoding module 111 and one of the parallel sub-memories 1121a, and receives the uncompressed input weight parameters in parallel according to the operation command so as to generate the output weight parameters. In detail, each of the parallel processing sub-modules 1121 can include one parallel sub-memory 1121a and one parallel sub-processor 1121b. After receiving the input weight parameters 104, the parallel processing module 112 stores the input weight parameters 104 respectively and in parallel in the parallel sub-memories 1121a of the parallel processing sub-modules 1121. Since each of the parallel processing sub-modules 1121 is electrically connected to the decoding module 111, each of the parallel sub-processors 1121b can receive the uncompressed input weight parameters from its parallel sub-memory 1121a in parallel according to the operation command so as to generate the output weight parameters. Accordingly, the parallel processing module 112 can process the input weight parameters 104 in parallel to generate the output weight parameters.

When the input weight parameters 104 are a plurality of compressed input weight parameters, the parallel processing sub-modules 1121 include a plurality of parallel sub-memories 1121a and a plurality of parallel sub-processors 1121b. The parallel sub-memories 1121a store the compressed input weight parameters in parallel. Each of the parallel sub-processors 1121b is electrically connected to the decoding module 111 and one of the parallel sub-memories 1121a, and receives and decompresses the compressed input weight parameters in parallel according to the operation command so as to generate the output weight parameters. In detail, each of the parallel processing sub-modules 1121 can include one parallel sub-memory 1121a and one parallel sub-processor 1121b. After receiving the input weight parameters 104, the parallel processing module 112 stores the input weight parameters 104 respectively and in parallel in the parallel sub-memories 1121a of the parallel processing sub-modules 1121. Since each of the parallel processing sub-modules 1121 is electrically connected to the decoding module 111, each of the parallel sub-processors 1121b can receive the compressed input weight parameters from its parallel sub-memory 1121a in parallel according to the operation command and decompress them to generate the output weight parameters. Accordingly, the parallel processing module 112 can process the input weight parameters 104 in parallel to generate the output weight parameters.
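The disclosure does not fix a compression format for the weight parameters. Purely for illustration, the sketch below assumes a simple zero run-length encoding (a pair `(0, n)` stands for `n` zero weights) and shows that, because each parallel sub-processor owns its own sub-memory, the per-memory streams can be expanded independently and concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def decompress_weights(stream):
    """Expand a zero run-length encoded weight stream.

    Encoding (an illustrative assumption, not the patented format):
    a 0 followed by a count n expands to n zero weights; any other
    value is a literal weight.
    """
    out = []
    it = iter(stream)
    for v in it:
        if v == 0:
            out.extend([0] * next(it))  # (0, n) -> n zeros
        else:
            out.append(v)               # literal weight
    return out

# Two parallel sub-memories, each holding its own compressed stream;
# each "sub-processor" decompresses its stream independently.
streams = [[3, 0, 4, 7], [0, 2, 5]]
with ThreadPoolExecutor() as pool:
    output_weights = list(pool.map(decompress_weights, streams))
```

The point of the sketch is the data-parallel structure, not the codec: any per-stream decompression scheme fits the same pattern.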

Please refer to FIG. 1, FIG. 2, FIG. 3 and FIG. 4. FIG. 2 is a block diagram of a convolutional neural network processor 100 according to an embodiment of another structural aspect of the present invention. FIG. 3 is a block diagram of the 3×3 operation sub-module 1211 of the convolutional neural network processor 100 according to the embodiment of the structural aspect of FIG. 2. FIG. 4 is a schematic view of the 3×3 local convolution operation unit 1211b of the 3×3 operation sub-module 1211 of the convolutional neural network processor 100 according to the embodiment of the structural aspect of FIG. 3. In the embodiments of FIG. 2 to FIG. 4, the input weight parameters 104 can include a plurality of first input weight parameters and a bias input weight parameter, and the output weight parameters include a plurality of first output weight parameters and a bias output weight parameter. The parallel processing sub-modules 1121 include a plurality of parallel sub-memories 1121a and a plurality of parallel sub-processors 1121b. The parallel sub-memories 1121a store the input weight parameters 104 in parallel and include a plurality of first parallel sub-memories 1121aa and a bias parallel sub-memory 1121ab. The first parallel sub-memories 1121aa respectively and in parallel receive and store the first input weight parameters. The bias parallel sub-memory 1121ab stores the bias input weight parameter in parallel. The parallel sub-processors 1121b are electrically connected to the decoding module 111 and the parallel sub-memories 1121a, and include a plurality of first parallel sub-processors 1121ba and a bias parallel sub-processor 1121bb. Each of the first parallel sub-processors 1121ba is electrically connected to one of the first parallel sub-memories 1121aa and receives the first input weight parameters according to the operation command so as to output the first output weight parameters. The bias parallel sub-processor 1121bb is electrically connected to the bias parallel sub-memory 1121ab and receives the bias input weight parameter according to the operation command so as to output the bias output weight parameter. In the embodiment of FIG. 2, the numbers of the first parallel sub-memories 1121aa and the first parallel sub-processors 1121ba are both 9; in other embodiments, these numbers can be multiples of 9, and the present disclosure is not limited thereto. The numbers of the bias parallel sub-memory 1121ab and the bias parallel sub-processor 1121bb are both 1, but the present invention is not limited thereto. In detail, after receiving the input weight parameters 104, the parallel processing module 112 stores the first input weight parameters of the input weight parameters 104 in the first parallel sub-memories 1121aa and stores the bias input weight parameter in the bias parallel sub-memory 1121ab. The first parallel sub-processors 1121ba read the first input weight parameters from the first parallel sub-memories 1121aa according to the operation command and process them to generate the first output weight parameters. The bias parallel sub-processor 1121bb reads the bias input weight parameter from the bias parallel sub-memory 1121ab according to the operation command and processes it to generate the bias output weight parameter.

Each of the first output weight parameters includes a plurality of 3×3 weight parameters. The operation module 121 can include a 3×3 operation sub-module 1211 and a bias distributor 1212. The 3×3 operation sub-module 1211 is electrically connected to the first parallel sub-processors 1121ba and operates on the input data 106 according to the first output weight parameters so as to generate 3×3 post-processing operation data 1062. The 3×3 operation sub-module 1211 includes a plurality of 3×3 convolution distributor groups, a plurality of 3×3 local convolution operation units 1211b and a plurality of 3×3 post-processing operation units 1211e. Each of the 3×3 convolution distributor groups is electrically connected to one of the first parallel sub-processors 1121ba and is configured to receive and distribute the 3×3 weight parameters of the first output weight parameters. Each of the 3×3 local convolution operation units 1211b is electrically connected to one of the 3×3 convolution distributor groups and includes a 3×3 local register group 1211c and a 3×3 local filtering operation unit 1211d. The 3×3 local register group 1211c is electrically connected to one of the 3×3 convolution distributor groups; the 3×3 local register groups 1211c of the 3×3 local convolution operation units 1211b receive and store the 3×3 weight parameters of the first output weight parameters and output a plurality of 3×3 operation parameters according thereto. The 3×3 local filtering operation unit 1211d is electrically connected to the 3×3 local register group 1211c; the 3×3 local filtering operation units 1211d of the 3×3 local convolution operation units 1211b operate on the input data 106 according to the 3×3 operation parameters so as to generate a plurality of 3×3 operation data. In detail, the 3×3 local filtering operation units 1211d can perform a 3×3 convolution operation. When the number of the first parallel sub-processors 1121ba is 9, the spatial filter positions of the 3×3 local filtering operation units 1211d can respectively correspond to the first parallel sub-processors 1121ba; when the number of the first parallel sub-processors 1121ba is 18, each spatial filter position of the 3×3 local filtering operation units 1211d can correspond to two of the first parallel sub-processors 1121ba, and so on, which is not further described herein. The 3×3 post-processing operation unit 1211e is electrically connected to the 3×3 local convolution operation units 1211b and performs a 3×3 post-processing operation according to the 3×3 operation data so as to generate the 3×3 post-processing operation data 1062. The output data 108 of the convolutional neural network processor 100 can be the 3×3 post-processing operation data 1062. The bias distributor 1212 is electrically connected to the bias parallel sub-processor 1121bb and the 3×3 operation sub-module 1211. The bias distributor 1212 generates a plurality of 3×3 bias weight parameters according to the bias output weight parameter and outputs the 3×3 bias weight parameters to the 3×3 post-processing operation units 1211e.
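The post-processing units therefore combine the accumulated convolution results with the distributed bias weight parameters. The disclosure does not enumerate the exact post-processing operations; the sketch below assumes the common case of a bias-add followed by a ReLU-style activation, and both the function name and the activation choice are illustrative assumptions:

```python
def postprocess_3x3(conv_outputs, bias):
    """Bias-add plus ReLU clamp over one channel's convolution results.

    The ReLU activation is an assumption; the disclosure only states
    that the 3x3 bias weight parameter enters the 3x3 post-processing
    operation.
    """
    return [max(v + bias, 0.0) for v in conv_outputs]

# Three accumulated 3x3 convolution results and one distributed bias.
data = postprocess_3x3([1.5, -2.0, 0.25], bias=0.5)
```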

In FIG. 3, the 3×3 operation sub-module 1211 includes a plurality of 3×3 operation circuits 1211a, and the number of the 3×3 operation circuits 1211a can be 32. Each of the 3×3 operation circuits 1211a is composed of a plurality of 3×3 local convolution operation units 1211b and one 3×3 post-processing operation unit 1211e, and the number of the 3×3 local convolution operation units 1211b in each circuit can be 32. In other words, the number of the 3×3 local convolution operation units 1211b in the 3×3 operation sub-module 1211 is 1024, and the number of the 3×3 post-processing operation units 1211e is 32.

Please refer to FIG. 3 and FIG. 4. After receiving the 3×3 weight parameters of the first output weight parameters, the 3×3 operation sub-module 1211 can distribute the 3×3 weight parameters to the 3×3 local convolution operation units 1211b through the 3×3 convolution distributor groups. In FIG. 4, the 3×3 convolution distributor groups are configured as a two-stage distribution scheme: each 3×3 convolution distributor group includes a first 3×3 convolution distributor 1211f and a plurality of second 3×3 convolution distributors 1211g. The first 3×3 convolution distributor 1211f is electrically connected to the first parallel sub-processors 1121ba so as to receive the 3×3 weight parameters of the first output weight parameters and distribute them to the second 3×3 convolution distributors 1211g; after receiving the 3×3 weight parameters, the second 3×3 convolution distributors 1211g distribute them to the 3×3 local convolution operation units 1211b. Although the present invention adopts a two-stage distribution scheme, the distribution scheme is not limited thereto. The 3×3 local register group 1211c can include two sub 3×3 local register groups 1211ca and 1211cb. Combined with a multiplexer, the two sub 3×3 local register groups 1211ca and 1211cb alternately store the 3×3 weight parameters or output the 3×3 operation parameters to the 3×3 local filtering operation unit 1211d. In other words, when the sub 3×3 local register group 1211ca stores the 3×3 weight parameters, the sub 3×3 local register group 1211cb outputs the 3×3 operation parameters to the 3×3 local filtering operation unit 1211d; when the sub 3×3 local register group 1211cb stores the 3×3 weight parameters, the sub 3×3 local register group 1211ca outputs the 3×3 operation parameters to the 3×3 local filtering operation unit 1211d. That is, the 3×3 local register group 1211c stores the 3×3 weight parameters and outputs the 3×3 operation parameters in a ping-pong manner.
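The ping-pong behaviour of the paired register groups can be sketched as a small double-buffer model: while the filtering unit reads from the active bank, the distributor writes the next weights into the idle bank, and a multiplexer flip swaps the roles. The class and method names below are illustrative, not part of the disclosure:

```python
class PingPongRegisterBank:
    """Two banks behind a multiplexer: one bank feeds operation
    parameters to the local filtering unit while the other is loaded
    with the next 3x3 weight parameters."""

    def __init__(self):
        self.banks = [[], []]
        self.active = 0            # bank currently read by the filter unit

    def load(self, weights):
        self.banks[1 - self.active] = list(weights)   # write the idle bank

    def read(self):
        return self.banks[self.active]                # read the active bank

    def swap(self):
        self.active = 1 - self.active                 # flip the multiplexer

bank = PingPongRegisterBank()
bank.load([1, 2, 3])    # preload while the active bank is "in use"
bank.swap()             # newly loaded weights become readable
current = bank.read()
bank.load([4, 5, 6])    # next weights stream in without disturbing reads
```

The design choice this models is latency hiding: weight loading overlaps with convolution, so the filter units never stall waiting for parameters.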

The 3×3 local filtering operation unit 1211d can perform a 3×3 convolution operation according to the 3×3 operation parameters and the input data 106 so as to generate the 3×3 operation data. For example, the tile size of the input data 106 can be 6×4, and the 3×3 local filtering operation unit 1211d can perform a 3×3 convolution operation on the input data 106 according to the 3×3 operation parameters. In order to achieve highly parallel operations, the convolutional neural network processor 100 can deploy a large number of multipliers in the 3×3 operation sub-module 1211; the number of multipliers in the 3×3 local filtering operation units 1211d can be 73728. After receiving the 3×3 operation data generated by the 3×3 local filtering operation units 1211d and the 3×3 bias weight parameters generated by the bias distributor, the 3×3 post-processing operation unit 1211e can perform a 3×3 post-processing operation according to the 3×3 operation data and the 3×3 bias weight parameters so as to generate the 3×3 post-processing operation data 1062. In the embodiments of FIG. 3 and FIG. 4, the 3×3 post-processing operation data 1062 is the output data 108 of the convolutional neural network processor 100.
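The arithmetic performed by one 3×3 local filtering operation unit on a 6×4 input tile can be sketched as a plain sliding-window computation. The "valid" boundary handling (no padding, stride 1, yielding a 4×2 output) is an illustrative assumption; the hardware's padding and striding are not specified here:

```python
import numpy as np

def conv3x3_tile(tile, kernel):
    """Valid 3x3 convolution (as cross-correlation) over one input tile.

    With the 6x4 tile size mentioned above, the output is 4x2.
    Padding and striding choices are illustrative assumptions.
    """
    h, w = tile.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            # Multiply-accumulate over one 3x3 window: the work one
            # bank of the unit's multipliers performs per output pixel.
            out[i, j] = np.sum(tile[i:i + 3, j:j + 3] * kernel)
    return out

tile = np.arange(24, dtype=float).reshape(6, 4)   # a 6x4 input tile
kernel = np.full((3, 3), 1.0 / 9.0)               # averaging filter
result = conv3x3_tile(tile, kernel)               # shape (4, 2)
```

In the processor, the 4×2 = 8 window sums of a tile are computed by parallel multiplier arrays rather than by the sequential loops shown here.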

In FIG. 2, the convolution determination unit 120 further includes a controller 122. The controller 122 is electrically connected to the message decoding unit 110. In detail, the controller 122 is electrically connected to the instruction decoder 1112 so as to receive the operation command, and controls the 3×3 operation sub-module 1211 and the bias distributor 1212 of the operation module 121 according to the operation command.

FIG. 5 is a block diagram of a convolutional neural network processor 100 according to an embodiment of a further structural aspect of the present invention. FIG. 6 is a block diagram of the 1×1 operation sub-module 1213 of the convolutional neural network processor 100 according to the embodiment of the structural aspect of FIG. 5. FIG. 7 is a schematic view of the 1×1 local convolution operation unit 1213b of the 1×1 operation sub-module 1213 of the convolutional neural network processor 100 according to the embodiment of the structural aspect of FIG. 6. The convolutional neural network processor 100 of FIG. 5 differs from that of FIG. 2 in that its parallel sub-memories 1121a further include at least one second parallel sub-memory 1121ac, its parallel sub-processors 1121b further include at least one second parallel sub-processor 1121bc, and its operation module 121 further includes a 1×1 operation sub-module 1213. In addition, the input weight parameters 104 further include at least one second input weight parameter, and the output weight parameters further include at least one second output weight parameter. The at least one second parallel sub-memory 1121ac respectively and in parallel receives and stores the at least one second input weight parameter. The at least one second parallel sub-processor 1121bc is respectively electrically connected to the at least one second parallel sub-memory 1121ac and receives the at least one second input weight parameter according to the operation command so as to output the at least one second output weight parameter. The configuration of the 3×3 operation sub-module 1211 is the same as that of the 3×3 operation sub-module 1211 in the convolutional neural network processor 100 of FIG. 2 and is not further described herein. In the embodiment of FIG. 5, the numbers of the first parallel sub-memories 1121aa and the first parallel sub-processors 1121ba are both 9, and the numbers of the second parallel sub-memory 1121ac and the second parallel sub-processor 1121bc are both 1. In other embodiments, when the numbers of the first parallel sub-memories 1121aa and the first parallel sub-processors 1121ba are 18, the numbers of the second parallel sub-memories 1121ac and the second parallel sub-processors 1121bc are both 2, and so on; the present disclosure is not limited thereto. The numbers of the bias parallel sub-memory 1121ab and the bias parallel sub-processor 1121bb are both 1, but the present invention is not limited thereto.

詳細來說,平行處理模組112於接收輸入權重參數104後,將輸入權重參數104中的第一輸入權重參數儲存於第一平行子記憶體1121aa中,將輸入權重參數104中的第二輸入權重參數儲存於第二平行子記憶體1121ac中,以及將偏壓輸入權重參數儲存於偏壓平行子記憶體1121ab中。第5圖之第一平行子處理器1121ba及偏壓平行子處理器1121bb的運作方式與第2圖之第一平行子處理器1121ba及偏壓平行子處理器1121bb相同,在此不另贅述。第二平行子處理器1121bc根據運作指令從第二平行子記憶體1121ac中讀取第二輸入權重參數,並進行處理以產生第二輸出權重參數。 Specifically, after receiving the input weight parameters 104, the parallel processing module 112 stores the first input weight parameters of the input weight parameters 104 in the first parallel sub-memories 1121aa, stores the second input weight parameters in the second parallel sub-memory 1121ac, and stores the bias input weight parameters in the bias parallel sub-memory 1121ab. The first parallel sub-processors 1121ba and the bias parallel sub-processor 1121bb of FIG. 5 operate in the same way as those of FIG. 2 and are not repeated here. The second parallel sub-processor 1121bc reads the second input weight parameter from the second parallel sub-memory 1121ac according to the operation instruction and processes it to generate the second output weight parameter.
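The storage split described above — first weight parameters striped across the nine first parallel sub-memories 1121aa, second weight parameters to 1121ac, and bias weight parameters to 1121ab — can be sketched as follows. This is an illustrative model only: the tag-based routing and the round-robin striping across the first sub-memories are assumptions, not interfaces stated in the text.

```python
def partition_weights(weights, n_first_banks=9):
    """Route tagged weight parameters into first, second and bias banks.

    weights: iterable of (kind, value) pairs, kind in {"first", "second", "bias"}.
    Returns (first_banks, second_bank, bias_bank), with the first weights
    striped round-robin across n_first_banks sub-memories.
    """
    first, second, bias = [], [], []
    for kind, value in weights:
        if kind == "first":
            first.append(value)
        elif kind == "second":
            second.append(value)
        else:
            bias.append(value)
    # Stripe the first (3x3) weights across the parallel sub-memories.
    first_banks = [first[i::n_first_banks] for i in range(n_first_banks)]
    return first_banks, second, bias
```

Each bank can then be drained independently by its own sub-processor, which is what makes the parallel layout useful.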

1×1運算子模組1213與至少一第二平行子處理器1121bc及3×3運算子模組1211電性連接,並根據至少一第二輸出權重參數與3×3後處理運算資料1062進行運算以產生1×1後處理運算資料1064。1×1運算子模組1213包含至少一1×1卷積分配器組、複數1×1本地卷積運算單元及複數1×1後處理運算單元1213e。至少一1×1卷積分配器組與至少一第二平行子處理器1121bc電性連接,用以接收及分配至少一第二輸出權重參數之1×1權重參數。1×1本地卷積運算單元1213b與至少一1×1卷積分配器電性連接。各1×1本地卷積運算單元1213b包含1×1本地暫存器組1213c及1×1本地濾波運算單元1213d。1×1本地暫存器組1213c與至少一1×1卷積分配器組電性連接。1×1本地卷積運算單元1213b之1×1本地暫存器組1213c接收並儲存至少一第二輸出權重參數之1×1權重參數,並根據至少一第二輸出權重參數之1×1權重參數,以輸出1×1運算參數。1×1本地濾波運算單元1213d與1×1本地暫存器組1213c電性連接。1×1本地卷積運算單元1213b之1×1本地濾波運算單元1213d根據1×1運算參數與3×3後處理運算資料1062進行運算以產生複數1×1運算資料。詳細來說,1×1本地濾波運算單元1213d可執行1×1卷積運算,當第二平行子處理器1121bc的數量為1時,1×1本地濾波運算單元1213d之空間濾波位置可對應第二平行子處理器1121bc;當第二平行子處理器1121bc的數量為2時,1×1本地濾波運算單元1213d之空間濾波位置可對應二第二平行子處理器1121bc,以此類推,本案不另贅述。1×1後處理運算單元1213e與1×1本地卷積運算單元1213b電性連接,並依據1×1運算資料進行1×1後處理運算以產生1×1後處理運算資料1064。卷積神經網路處理器100之輸出資料108為1×1後處理運算資料1064。第5圖之偏壓平行子記憶體1121ab及偏壓平行子處理器1121bb與第2圖之偏壓平行子記憶體1121ab及偏壓平行子處理器1121bb相同,在此不另贅述。第5圖之偏壓分配器1212與3×3運算子模組1211的配置關係與第2圖之偏壓分配器1212與3×3運算子模組1211的配置關係相同,在此不另贅述。 The 1×1 operation sub-module 1213 is electrically connected to the at least one second parallel sub-processor 1121bc and the 3×3 operation sub-module 1211, and performs operations on the 3×3 post-processing operation data 1062 according to the at least one second output weight parameter to generate the 1×1 post-processing operation data 1064. The 1×1 operation sub-module 1213 includes at least one 1×1 convolution distributor group, a plurality of 1×1 local convolution operation units, and a plurality of 1×1 post-processing operation units 1213e. The at least one 1×1 convolution distributor group is electrically connected to the at least one second parallel sub-processor 1121bc for receiving and distributing the 1×1 weight parameters of the at least one second output weight parameter. The 1×1 local convolution operation units 1213b are electrically connected to the at least one 1×1 convolution distributor group. Each 1×1 local convolution operation unit 1213b includes a 1×1 local register group 1213c and a 1×1 local filter operation unit 1213d. The 1×1 local register group 1213c is electrically connected to the at least one 1×1 convolution distributor group; it receives and stores the 1×1 weight parameters of the at least one second output weight parameter and outputs 1×1 operation parameters accordingly. The 1×1 local filter operation unit 1213d is electrically connected to the 1×1 local register group 1213c, and performs operations on the 3×3 post-processing operation data 1062 according to the 1×1 operation parameters to generate a plurality of 1×1 operation data. Specifically, the 1×1 local filter operation unit 1213d can perform a 1×1 convolution operation. When the number of second parallel sub-processors 1121bc is 1, the spatial filtering position of the 1×1 local filter operation unit 1213d can correspond to the second parallel sub-processor 1121bc; when the number of second parallel sub-processors 1121bc is 2, the spatial filtering positions can correspond to the two second parallel sub-processors 1121bc, and so on, which is not repeated here. The 1×1 post-processing operation units 1213e are electrically connected to the 1×1 local convolution operation units 1213b, and perform a 1×1 post-processing operation on the 1×1 operation data to generate the 1×1 post-processing operation data 1064. The output data 108 of the convolutional neural network processor 100 is the 1×1 post-processing operation data 1064. The bias parallel sub-memory 1121ab and the bias parallel sub-processor 1121bb of FIG. 5 are the same as those of FIG. 2 and are not repeated here. The configuration relationship between the bias distributor 1212 and the 3×3 operation sub-module 1211 in FIG. 5 is the same as that in FIG. 2 and is not repeated here.
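The 1×1 convolution performed by the local filter operation units is, functionally, a dot product over the input channels at each spatial position. A minimal reference model (plain nested lists, no hardware parallelism) is sketched below; the function name and data layout are illustrative assumptions.

```python
def conv1x1(feature_map, weights):
    """1x1 convolution: per-position dot product over input channels.

    feature_map: nested lists indexed [c_in][y][x].
    weights: nested lists indexed [c_out][c_in].
    Returns output indexed [c_out][y][x].
    """
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [[[sum(weights[o][c] * feature_map[c][y][x] for c in range(c_in))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weights))]
```

Because there is no spatial window, every output pixel depends only on the co-located input pixels, which is why the 1×1 stage can consume the 3×3 post-processed data stream directly.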

詳細來說,第5圖之偏壓分配器1212與偏壓平行子處理器1121bb、3×3運算子模組1211及1×1運算子模組1213電性連接。偏壓分配器1212根據偏壓輸出權重參數以產生複數3×3偏壓權重參數及複數1×1偏壓權重參數。偏壓分配器1212將3×3偏壓權重參數輸出至3×3後處理運算單元1211e。偏壓分配器1212將1×1偏壓權重參數輸出至1×1後處理運算單元1213e。 Specifically, the bias distributor 1212 of FIG. 5 is electrically connected to the bias parallel sub-processor 1121bb, the 3×3 operation sub-module 1211 and the 1×1 operation sub-module 1213. The bias distributor 1212 generates a plurality of 3×3 bias weight parameters and a plurality of 1×1 bias weight parameters according to the bias output weight parameter. The bias distributor 1212 outputs the 3×3 bias weight parameters to the 3×3 post-processing operation units 1211e, and outputs the 1×1 bias weight parameters to the 1×1 post-processing operation units 1213e.

在第6圖中,1×1運算子模組1213包含複數1×1運算電路1213a,1×1運算電路1213a的數量可為32。各1×1運算電路1213a是由複數1×1本地卷積運算單元1213b及一1×1後處理運算單元1213e所組成,1×1本地卷積運算單元1213b的數量可為32。也就是說,1×1運算子模組1213中之1×1本地卷積運算單元1213b的數量為1024,1×1後處理運算單元1213e的數量為32。 In FIG. 6, the 1×1 operation sub-module 1213 includes a plurality of 1×1 operation circuits 1213a, and the number of 1×1 operation circuits 1213a may be 32. Each 1×1 operation circuit 1213a is composed of a plurality of 1×1 local convolution operation units 1213b and one 1×1 post-processing operation unit 1213e, and the number of 1×1 local convolution operation units 1213b per circuit may be 32. That is, the 1×1 operation sub-module 1213 contains 1024 1×1 local convolution operation units 1213b and 32 1×1 post-processing operation units 1213e.

請配合參照第6圖及第7圖,1×1運算子模組1213於接收第二輸出權重參數之1×1權重參數後可藉由1×1卷積分配器組將1×1權重參數分配至1×1本地卷積運算單元1213b。在第7圖中,1×1卷積分配器組的配置是採用二階段分配法,並包含第一1×1卷積分配器1213f及複數第二1×1卷積分配器1213g,其作動方式與3×3卷積分配器組相同,在此不另贅述。1×1本地暫存器組1213c可包含二子1×1本地暫存器組1213ca、1213cb。二子1×1本地暫存器組1213ca、1213cb結合一個多工器可交替地儲存1×1權重參數或輸出1×1運算參數給1×1本地濾波運算單元1213d。1×1本地暫存器組1213c的作動方式與3×3本地暫存器組1211c相同,在此不另贅述。也就是說,本案之3×3本地暫存器組1211c及1×1本地暫存器組1213c皆是以乒乓(ping-pong)的方式作動。因此,1×1本地濾波運算單元1213d可根據1×1運算參數及3×3後處理運算資料1062進行1×1卷積運算以產生1×1運算資料。在第5圖至第7圖實施方式中,1×1後處理運算資料1064即為卷積神經網路處理器100之輸出資料108。 Referring to FIG. 6 and FIG. 7 together, after receiving the 1×1 weight parameters of the second output weight parameter, the 1×1 operation sub-module 1213 can distribute the 1×1 weight parameters to the 1×1 local convolution operation units 1213b through the 1×1 convolution distributor group. In FIG. 7, the 1×1 convolution distributor group adopts a two-stage distribution scheme and includes a first 1×1 convolution distributor 1213f and a plurality of second 1×1 convolution distributors 1213g; it operates in the same way as the 3×3 convolution distributor group and is not repeated here. The 1×1 local register group 1213c may include two sub 1×1 local register groups 1213ca, 1213cb. Combined with a multiplexer, the two sub 1×1 local register groups 1213ca, 1213cb alternately store the 1×1 weight parameters or output the 1×1 operation parameters to the 1×1 local filter operation unit 1213d. The 1×1 local register group 1213c operates in the same way as the 3×3 local register group 1211c and is not repeated here. That is, both the 3×3 local register groups 1211c and the 1×1 local register groups 1213c operate in a ping-pong manner. Therefore, the 1×1 local filter operation unit 1213d can perform the 1×1 convolution operation according to the 1×1 operation parameters and the 3×3 post-processing operation data 1062 to generate the 1×1 operation data. In the embodiments of FIG. 5 to FIG. 7, the 1×1 post-processing operation data 1064 is the output data 108 of the convolutional neural network processor 100.
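The ping-pong behavior of the paired register groups can be modeled with a small class: one bank feeds the filter unit while the other is being loaded with the next weights, and a swap exchanges their roles. The class and method names below are illustrative, not taken from the text.

```python
class PingPongRegisters:
    """Two register banks used alternately, as the sub register groups
    1213ca/1213cb (and 1211ca/1211cb) are described to do: while the
    active bank supplies operation parameters, the idle bank is loaded."""

    def __init__(self):
        self.banks = [None, None]
        self.active = 0  # index of the bank currently read by the filter unit

    def load(self, weights):
        """Fill the idle bank without disturbing the active one."""
        self.banks[1 - self.active] = weights

    def swap(self):
        """Exchange roles; the freshly loaded bank becomes active."""
        self.active = 1 - self.active

    def read(self):
        """What the filter unit sees this round."""
        return self.banks[self.active]
```

This is the mechanism that lets weight loading overlap with computation: a `load` never stalls a concurrent `read`.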

為了實現高度平行的運算,卷積神經網路處理器100可在3×3運算子模組1211及1×1運算子模組1213中佈署多個乘法器,舉例來說,3×3本地濾波運算單元1211d中之乘法器的數量可為73728,1×1本地濾波運算單元1213d中之乘法器的數量可為8192。此外,第5圖中之控制器122與第2圖中之控制器122相同,在此不另贅述。 To achieve highly parallel operations, the convolutional neural network processor 100 may deploy multiple multipliers in the 3×3 operation sub-module 1211 and the 1×1 operation sub-module 1213. For example, the number of multipliers in the 3×3 local filter operation units 1211d may be 73728, and the number of multipliers in the 1×1 local filter operation units 1213d may be 8192. In addition, the controller 122 of FIG. 5 is the same as the controller 122 of FIG. 2 and is not repeated here.
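One consistent way to account for these multiplier totals is shown below. The taps-per-3×3-kernel factor of 9 and the unit count of 1024 come from the text; the resulting per-1×1-unit parallelism factor of 8 is an inference from the arithmetic, not a figure stated in the source.

```python
# Accounting for the multiplier totals quoted above.
MUL_3X3_TOTAL = 73728   # multipliers in the 3x3 local filter operation units
MUL_1X1_TOTAL = 8192    # multipliers in the 1x1 local filter operation units
TAPS_3X3 = 9            # a 3x3 kernel needs 9 multiplies per output sample
UNITS_1X1 = 1024        # 32 circuits x 32 local units (FIG. 6)

# Number of 3x3 filter instances that can fire every cycle.
filters_3x3 = MUL_3X3_TOTAL // TAPS_3X3       # 8192
# Inferred multipliers available to each 1x1 local convolution unit.
par_per_1x1 = MUL_1X1_TOTAL // UNITS_1X1      # 8
```

The exact division (no remainder in either case) is what makes the stated totals plausible as a regular, fully tiled array.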

第8圖繪示依照本發明一方法態樣之一實施方式的卷積神經網路處理器的資料處理方法s200之步驟方塊圖。在第8圖中,卷積神經網路處理器的資料處理方法s200包含接收步驟s210、指令解碼步驟s220、平行處理步驟s230及運算步驟s240。 FIG. 8 is a block diagram showing the steps of the data processing method s200 of the convolutional neural network processor according to an embodiment of a method aspect of the present invention. In FIG. 8, the data processing method s200 of the convolutional neural network processor includes a receiving step s210, an instruction decoding step s220, a parallel processing step s230, and an operation step s240.

請配合參照第1圖,詳細來說,接收步驟s210驅動訊息解碼單元110接收輸入程式102及複數輸入權重參數104。訊息解碼單元110包含解碼模組111及平行處理模組112。指令解碼步驟s220驅動解碼模組111接收輸入程式102,並根據輸入程式102產生運作指令。平行處理步驟s230驅動平行處理模組112接收輸入權重參數104,並根據運作指令以平行地處理輸入權重參數104以產生複數輸出權重參數。運算步驟s240驅動運算模組121接收輸入資料106及輸出權重參數,並根據運作指令以將輸入資料106與輸出權重參數進行運算以產生輸出資料108。也就是說,卷積神經網路處理器100之訊息解碼單元110可透過接收步驟s210接收輸入程式102及輸入權重參數104以執行指令解碼步驟s220及平行處理步驟s230。由於平行處理模組112與解碼模組111電性連接,因此,平行處理模組112可根據解碼模組111於指令解碼步驟s220中所產生之運作指令產生輸出權重參數,即平行處理步驟s230。此外,運算模組121與平行處理模組112電性連接,因此,於運算步驟s240中,運算模組121可於接收輸入資料106及輸出權重參數後,根據輸入資料106及輸出權重參數進行運算以產生輸出資料108。藉此,卷積神經網路處理器的資料處理方法s200可藉由接收步驟s210、指令解碼步驟s220、平行處理步驟s230及運算步驟s240驅動訊息解碼單元110的解碼模組111及平行處理模組112,以及卷積判斷單元120的運算模組121執行高度平行運算,進而提供高性能且低功耗的運算。 Referring to FIG. 1, in detail, the receiving step s210 drives the message decoding unit 110 to receive the input program 102 and the plurality of input weight parameters 104. The message decoding unit 110 includes a decoding module 111 and a parallel processing module 112. The instruction decoding step s220 drives the decoding module 111 to receive the input program 102 and generate an operation instruction according to the input program 102. The parallel processing step s230 drives the parallel processing module 112 to receive the input weight parameters 104 and process them in parallel according to the operation instruction to generate a plurality of output weight parameters. The operation step s240 drives the operation module 121 to receive the input data 106 and the output weight parameters, and to operate on the input data 106 and the output weight parameters according to the operation instruction to generate the output data 108. That is, the message decoding unit 110 of the convolutional neural network processor 100 can receive the input program 102 and the input weight parameters 104 in the receiving step s210 so as to execute the instruction decoding step s220 and the parallel processing step s230. Since the parallel processing module 112 is electrically connected to the decoding module 111, the parallel processing module 112 can generate the output weight parameters according to the operation instruction generated by the decoding module 111 in the instruction decoding step s220, i.e., the parallel processing step s230. In addition, the operation module 121 is electrically connected to the parallel processing module 112; therefore, in the operation step s240, after receiving the input data 106 and the output weight parameters, the operation module 121 can operate on them to generate the output data 108. Thereby, the data processing method s200 of the convolutional neural network processor can, through the receiving step s210, the instruction decoding step s220, the parallel processing step s230 and the operation step s240, drive the decoding module 111 and the parallel processing module 112 of the message decoding unit 110 and the operation module 121 of the convolution determination unit 120 to perform highly parallel operations, thereby providing high-performance, low-power computation.
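The four-step flow above can be sketched as a short function chain. The bodies of `decode`, `process` and `operate` are trivial stand-ins for the decoding module 111, the parallel sub-processors 1121b and the operation module 121 respectively; only the step ordering reflects the method s200, everything else is an illustrative assumption.

```python
def decode(program):
    """s220 stand-in: program -> operation instruction."""
    return {"op": program}

def process(weight, instruction):
    """s230 stand-in: per-sub-processor weight handling (pass-through here)."""
    return weight

def operate(data, weights, instruction):
    """s240 stand-in: combine input data with output weight parameters."""
    return [d * w for d, w in zip(data, weights)]

def cnn_processor(input_program, input_weights, input_data):
    """Steps s210-s240 in order: receive, decode, parallel-process, operate."""
    instruction = decode(input_program)                              # s220
    out_weights = [process(w, instruction) for w in input_weights]   # s230 (parallel in hardware)
    return operate(input_data, out_weights, instruction)             # s240
```

In hardware the list comprehension in the middle runs on independent sub-processors; the sequential model only fixes the data dependencies between the steps.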

舉例來說,在第8圖中,卷積神經網路處理器的資料處理方法s200之接收步驟s210所接收之輸入程式102及輸入權重參數104可包含對應複數輸入資料106之相關指令及參數。於執行指令解碼步驟s220及平行處理步驟s230時,將對應複數輸入資料106之相關指令及參數儲存於程式記憶體1111及平行子記憶體1121a中。於執行指令解碼步驟s220及平行處理步驟s230時,可針對與其中一輸入資料106的相關指令及參數進行處理,以於執行運算步驟s240時,針對所述其中一輸入資料106進行運算,並於執行運算步驟s240期間,卷積神經網路處理器的資料處理方法s200可針對其中之另一輸入資料106之相關指令及參數進行處理,即針對所述其中之另一輸入資料106執行指令解碼步驟s220以及平行處理步驟s230。換言之,卷積神經網路處理器的資料處理方法s200係先將全部的輸入資料106之相關指令及參數都儲存於程式記憶體1111及平行子記憶體1121a中,然後再執行每一個輸入資料106所對應之指令解碼步驟s220、平行處理步驟s230及運算步驟s240。此外,當運算步驟s240在針對所述其中一輸入資料106進行運算時,指令解碼步驟s220以及平行處理步驟s230可針對所述其中之另一輸入資料106之相關指令及參數進行處理。因此,卷積神經網路處理器的資料處理方法s200於執行接收步驟s210後可對複數輸入資料106各別進行運算。 For example, in FIG. 8, the input program 102 and input weight parameters 104 received in the receiving step s210 of the data processing method s200 may include the instructions and parameters corresponding to a plurality of input data 106. When the instruction decoding step s220 and the parallel processing step s230 are executed, the instructions and parameters corresponding to the plurality of input data 106 are stored in the program memory 1111 and the parallel sub-memories 1121a. The instruction decoding step s220 and the parallel processing step s230 can then process the instructions and parameters related to one of the input data 106, so that the operation step s240 operates on that input data 106; while the operation step s240 is being executed, the method s200 can process the instructions and parameters of another input data 106, i.e., execute the instruction decoding step s220 and the parallel processing step s230 for that other input data 106. In other words, the data processing method s200 first stores the instructions and parameters of all the input data 106 in the program memory 1111 and the parallel sub-memories 1121a, and then executes the instruction decoding step s220, the parallel processing step s230 and the operation step s240 corresponding to each input data 106. In addition, while the operation step s240 operates on one input data 106, the instruction decoding step s220 and the parallel processing step s230 can process the instructions and parameters of another input data 106. Therefore, after the receiving step s210 is performed, the data processing method s200 can operate on each of the plurality of input data 106.
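The overlap described above — decoding and parallel-processing the next input's parameters while the current input is being operated on — is a two-stage software pipeline. A sequential model is sketched below (hardware runs `prepare` for item i+1 concurrently with `compute` for item i; here the interleaving only shows the scheduling order). The function names are illustrative.

```python
def pipelined_run(items, prepare, compute):
    """Two-stage pipeline model: prepare() stands in for steps s220/s230,
    compute() for step s240. prepare(items[i+1]) is scheduled alongside
    compute(items[i]); modeled sequentially here."""
    prepared = prepare(items[0])
    results = []
    for nxt in items[1:]:
        next_prepared = prepare(nxt)       # s220/s230 for the next input
        results.append(compute(prepared))  # s240 for the current input
        prepared = next_prepared
    results.append(compute(prepared))      # drain the last input
    return results
```

With balanced stage latencies, this hides almost all of the decode/weight-processing time behind the convolution itself.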

第9圖繪示依照第8圖之方法態樣之實施方式的卷積神經網路處理器的資料處理方法s200的指令解碼步驟s220之步驟方塊圖。解碼模組111可包含程式記憶體1111及指令解碼器1112。指令解碼步驟s220可包含程式儲存子步驟s221及程式解碼子步驟s222。程式儲存子步驟s221驅動程式記憶體1111儲存輸入程式102。程式解碼子步驟s222驅動指令解碼器1112對輸入程式102進行解碼以產生運作指令。也就是說,卷積神經網路處理器100可透過程式儲存子步驟s221及程式解碼子步驟s222驅動解碼模組111接收輸入程式102,並將輸入程式102儲存於程式記憶體1111中,再藉由指令解碼器1112對儲存於程式記憶體1111中之輸入程式102進行解碼以產生運作指令。 FIG. 9 shows a block diagram of the steps of the instruction decoding step s220 of the data processing method s200 of the convolutional neural network processor according to the embodiment of the method aspect of FIG. 8. The decoding module 111 may include a program memory 1111 and an instruction decoder 1112. The instruction decoding step s220 may include a program storage sub-step s221 and a program decoding sub-step s222. The program storage sub-step s221 drives the program memory 1111 to store the input program 102. The program decoding sub-step s222 drives the instruction decoder 1112 to decode the input program 102 to generate the operation instruction. That is, the convolutional neural network processor 100 can, through the program storage sub-step s221 and the program decoding sub-step s222, drive the decoding module 111 to receive the input program 102 and store it in the program memory 1111, and then decode the input program 102 stored in the program memory 1111 with the instruction decoder 1112 to generate the operation instruction.

第10圖繪示依照第8圖之方法態樣之實施方式的卷積神經網路處理器的資料處理方法s200的平行處理步驟s230之步驟方塊圖。平行處理模組112可包含複數平行子記憶體1121a及複數平行子處理器1121b。平行處理步驟s230包含權重參數儲存子步驟s231及權重參數處理子步驟s232。權重參數儲存子步驟s231驅動平行子記憶體1121a以平行地儲存輸入權重參數104。權重參數處理子步驟s232驅動平行子處理器1121b。平行子處理器1121b根據運作指令平行地讀取輸入權重參數104並進行運作處理以產生輸出權重參數。也就是說,卷積神經網路處理器100可透過權重參數儲存子步驟s231及權重參數處理子步驟s232驅動平行處理模組112接收輸入權重參數104,並將輸入權重參數104儲存於平行子記憶體1121a中,平行子處理器1121b再根據運作指令對儲存於平行子記憶體1121a中之輸入權重參數104進行運作處理以產生輸出權重參數。當輸入權重參數104為非壓縮輸入權重參數時,運作處理可為儲存非壓縮輸入權重參數。當輸入權重參數104為壓縮輸入權重參數時,運作處理可為儲存及解壓縮壓縮輸入權重參數。 FIG. 10 shows a block diagram of the steps of the parallel processing step s230 of the data processing method s200 of the convolutional neural network processor according to the embodiment of the method aspect of FIG. 8. The parallel processing module 112 may include a plurality of parallel sub-memories 1121a and a plurality of parallel sub-processors 1121b. The parallel processing step s230 includes a weight parameter storage sub-step s231 and a weight parameter processing sub-step s232. The weight parameter storage sub-step s231 drives the parallel sub-memories 1121a to store the input weight parameters 104 in parallel. The weight parameter processing sub-step s232 drives the parallel sub-processors 1121b, which read the input weight parameters 104 in parallel according to the operation instruction and process them to generate the output weight parameters. That is, the convolutional neural network processor 100 can, through the weight parameter storage sub-step s231 and the weight parameter processing sub-step s232, drive the parallel processing module 112 to receive the input weight parameters 104 and store them in the parallel sub-memories 1121a; the parallel sub-processors 1121b then process the input weight parameters 104 stored in the parallel sub-memories 1121a according to the operation instruction to generate the output weight parameters. When the input weight parameters 104 are uncompressed input weight parameters, the processing may be storing the uncompressed input weight parameters. When the input weight parameters 104 are compressed input weight parameters, the processing may be storing and decompressing the compressed input weight parameters.
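The two processing cases — pass uncompressed weights through, or decompress compressed ones — can be sketched as below. The text does not name a compression scheme, so `zlib` here is a stand-in codec chosen only to make the sketch runnable; the function name is also an assumption.

```python
import zlib

def process_weights(blob, compressed):
    """Sub-step s232 sketch: uncompressed weight blobs pass through
    unchanged; compressed blobs are decompressed first (zlib stands in
    for whatever codec the hardware actually implements)."""
    return zlib.decompress(blob) if compressed else blob
```

Shipping weights compressed and decompressing them next to the compute units is what keeps the off-chip bandwidth for weight loading low.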

第11圖繪示依照第8圖之方法態樣之實施方式的卷積神經網路處理器的資料處理方法s200的運算步驟s240之步驟方塊圖。請配合參照第2圖至第4圖。輸出權重參數可包含複數第一輸出權重參數及偏壓輸出權重參數。第一輸出權重參數包含複數3×3權重參數。運算模組121可包含3×3運算子模組1211及偏壓分配器1212。3×3運算子模組1211包含複數3×3卷積分配器組、複數3×3本地卷積運算單元1211b及複數3×3後處理運算單元1211e。運算步驟s240可包含第一運算子步驟s241及偏壓運算子步驟s242。第一運算子步驟s241包含3×3參數分配程序s2411、3×3運算參數產生程序s2412、3×3卷積運算程序s2413及3×3後處理運算程序s2414。3×3參數分配程序s2411驅動3×3卷積分配器組接收第一輸出權重參數之3×3權重參數,並將第一輸出權重參數之3×3權重參數分配至3×3本地卷積運算單元1211b,其中各3×3本地卷積運算單元1211b包含3×3本地暫存器組1211c及3×3本地濾波運算單元1211d。3×3運算參數產生程序s2412驅動3×3本地卷積運算單元1211b之3×3本地暫存器組1211c接收第一輸出權重參數之3×3權重參數,並根據第一輸出權重參數之3×3權重參數產生複數3×3運算參數。3×3卷積運算程序s2413驅動3×3本地卷積運算單元1211b之3×3本地濾波運算單元1211d以將3×3運算參數及輸入資料106進行3×3卷積運算以產生複數3×3運算資料。3×3後處理運算程序s2414驅動3×3後處理運算單元1211e以將3×3運算資料進行3×3後處理運算以產生3×3後處理運算資料1062。偏壓運算子步驟s242驅動偏壓分配器1212根據偏壓輸出權重參數以產生複數3×3偏壓權重參數。偏壓分配器1212將3×3偏壓權重參數提供予3×3運算子模組1211。也就是說,卷積神經網路處理器100可透過第一運算子步驟s241及偏壓運算子步驟s242產生3×3後處理運算資料1062。詳細來說,3×3運算子模組1211可用以執行第一運算子步驟s241,3×3運算子模組1211之3×3卷積分配器組可執行3×3參數分配程序s2411以將3×3權重參數分配至不同的3×3本地卷積運算單元1211b中之3×3本地暫存器組1211c,以利3×3本地暫存器組1211c執行3×3運算參數產生程序s2412。3×3本地暫存器組1211c可包含二子3×3本地暫存器組1211ca、1211cb。二子3×3本地暫存器組1211ca、1211cb以乒乓的方式作動,進而接收3×3權重參數並輸出3×3運算參數至3×3本地濾波運算單元1211d。3×3本地濾波運算單元1211d於3×3卷積運算程序s2413中根據3×3運算參數及輸入資料106進行3×3卷積運算以產生3×3運算資料。於3×3後處理運算程序s2414中,3×3後處理運算單元1211e根據偏壓分配器1212於偏壓運算子步驟s242中所輸出之3×3偏壓權重參數及3×3運算資料執行3×3後處理運算以產生3×3後處理運算資料1062。在第2圖至第4圖及第11圖的實施方式中,3×3後處理運算資料1062可為卷積神經網路處理器100的輸出資料108。 FIG. 11 shows a block diagram of the steps of the operation step s240 of the data processing method s200 of the convolutional neural network processor according to the embodiment of the method aspect of FIG. 8. Please refer to FIG. 2 to FIG. 4 together. The output weight parameters may include a plurality of first output weight parameters and a bias output weight parameter. The first output weight parameters include a plurality of 3×3 weight parameters. The operation module 121 may include a 3×3 operation sub-module 1211 and a bias distributor 1212. The 3×3 operation sub-module 1211 includes a plurality of 3×3 convolution distributor groups, a plurality of 3×3 local convolution operation units 1211b and a plurality of 3×3 post-processing operation units 1211e. The operation step s240 may include a first operation sub-step s241 and a bias operation sub-step s242. The first operation sub-step s241 includes a 3×3 parameter assignment program s2411, a 3×3 operation parameter generation program s2412, a 3×3 convolution operation program s2413 and a 3×3 post-processing operation program s2414. The 3×3 parameter assignment program s2411 drives the 3×3 convolution distributor groups to receive the 3×3 weight parameters of the first output weight parameters and distribute them to the 3×3 local convolution operation units 1211b, wherein each 3×3 local convolution operation unit 1211b includes a 3×3 local register group 1211c and a 3×3 local filter operation unit 1211d. The 3×3 operation parameter generation program s2412 drives the 3×3 local register groups 1211c of the 3×3 local convolution operation units 1211b to receive the 3×3 weight parameters of the first output weight parameters and generate a plurality of 3×3 operation parameters accordingly. The 3×3 convolution operation program s2413 drives the 3×3 local filter operation units 1211d of the 3×3 local convolution operation units 1211b to perform a 3×3 convolution operation on the 3×3 operation parameters and the input data 106 to generate a plurality of 3×3 operation data. The 3×3 post-processing operation program s2414 drives the 3×3 post-processing operation units 1211e to perform a 3×3 post-processing operation on the 3×3 operation data to generate the 3×3 post-processing operation data 1062. The bias operation sub-step s242 drives the bias distributor 1212 to generate a plurality of 3×3 bias weight parameters according to the bias output weight parameter; the bias distributor 1212 provides the 3×3 bias weight parameters to the 3×3 operation sub-module 1211. That is, the convolutional neural network processor 100 can generate the 3×3 post-processing operation data 1062 through the first operation sub-step s241 and the bias operation sub-step s242. Specifically, the 3×3 operation sub-module 1211 executes the first operation sub-step s241: its 3×3 convolution distributor groups execute the 3×3 parameter assignment program s2411 to distribute the 3×3 weight parameters to the 3×3 local register groups 1211c in different 3×3 local convolution operation units 1211b, so that the 3×3 local register groups 1211c can execute the 3×3 operation parameter generation program s2412. Each 3×3 local register group 1211c may include two sub 3×3 local register groups 1211ca, 1211cb, which act in a ping-pong manner to receive the 3×3 weight parameters and output the 3×3 operation parameters to the 3×3 local filter operation unit 1211d. The 3×3 local filter operation unit 1211d performs the 3×3 convolution operation according to the 3×3 operation parameters and the input data 106 in the 3×3 convolution operation program s2413 to generate the 3×3 operation data. In the 3×3 post-processing operation program s2414, the 3×3 post-processing operation unit 1211e performs the 3×3 post-processing operation according to the 3×3 operation data and the 3×3 bias weight parameters output by the bias distributor 1212 in the bias operation sub-step s242, to generate the 3×3 post-processing operation data 1062. In the embodiments of FIGS. 2-4 and 11, the 3×3 post-processing operation data 1062 may be the output data 108 of the convolutional neural network processor 100.
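The s2413/s2414 pair — a 3×3 convolution followed by bias post-processing — can be modeled minimally as below. The text indicates the post-processing consumes the 3×3 bias weight parameter; a plain bias-add is shown here, though the actual post-processing unit may do more (the function name and single-channel layout are illustrative assumptions).

```python
def conv3x3_with_bias(image, kernel, bias):
    """Valid (no padding) 3x3 convolution over a single-channel image,
    followed by a bias-add post-processing step.

    image: nested lists [H][W]; kernel: [3][3]; bias: scalar.
    Returns output of shape [H-2][W-2].
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            acc = sum(kernel[i][j] * image[y + i][x + j]
                      for i in range(3) for j in range(3))
            row.append(acc + bias)  # s2414: apply the 3x3 bias weight parameter
        out.append(row)
    return out
```

In the hardware, the nine multiplies inside the inner `sum` map onto the nine taps of one 3×3 local filter operation unit, so each output sample costs one cycle rather than nine.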

第12圖繪示依照第8圖之方法態樣之另一實施方式的卷積神經網路處理器的資料處理方法s200的運算步驟s240之步驟方塊圖。請配合參照第5圖至第7圖。輸出權重參數可包含複數第一輸出權重參數、至少一第二輸出權重參數及偏壓輸出權重參數。第一輸出權重參數包含複數3×3權重參數。至少一第二輸出權重參數包含複數1×1權重參數。運算模組121可包含3×3運算子模組1211、1×1運算子模組1213及偏壓分配器1212。3×3運算子模組1211包含複數3×3卷積分配器組、複數3×3本地卷積運算單元1211b及複數3×3後處理運算單元1211e。1×1運算子模組 包含複數1×1卷積分配器組、複數1×1本地卷積運算單元1213b及複數1×1後處理運算單元1213e。運算步驟s240可包含第一運算子步驟s241、第二運算子步驟s243及偏壓運算子步驟s242。第12圖之第一運算子步驟s241與第11圖之第一運算子步驟s241相同,在此不另贅述。第二運算子步驟s243驅動1×1運算子模組1213接收3×3後處理運算資料1062及至少一第二輸出權重參數以產生1×1後處理運算資料1064。第二運算子步驟s243包含1×1參數分配程序s2431、1×1運算參數產生程序s2432、1×1卷積運算程序s2433及1×1後處理運算程序s2434。1×1參數分配程序s2431驅動至少一1×1卷積分配器組以接收至少一第二輸出權重參數之1×1權重參數,並將至少一第二輸出權重參數之1×1權重參數分配至1×1本地卷積運算單元s1213b,其中各1×1本地卷積運算單元s1213b包含1×1本地暫存器組s1213c及1×1本地濾波運算單元s1213d。1×1運算參數產生程序s2432驅動1×1本地卷積運算單元1213b之1×1本地暫存器組1213c接收至少一第二輸出權重參數之1×1權重參數,並根據至少一第二輸出權重參數之1×1權重參數產生複數1×1運算參數。1×1卷積運算程序s2433驅動1×1本地卷積運算單元1213b之1×1本地濾波運算單元1213d以將1×1運算參數及3×3後處理運算資料1062進行1×1卷積運算以產生複數1×1運算資料。1×1後處理運算程序s2434驅動1×1後處理運算單元1213e以將1×1運算資料進行1×1後處理運算以產生1×1後處理運算資料1064。也就是說,卷 積神經網路處理器100可透過第一運算子步驟s241、第二運算子步驟s243及偏壓運算子步驟s242產生1×1後處理運算資料1064。詳細來說,1×1運算子模組1213可用以執行第二運算子步驟s243,1×1運算子模組1213之1×1卷積分配器組可執行1×1參數分配程序s2431以將1×1權重參數分配至不同的1×1本地卷積運算單元1213b中的1×1本地暫存器組1213c以利1×1本地暫存器組1213c執行1×1運算參數產生程序s2432。1×1本地暫存器組1213c可包含二子1×1本地暫存器組1213ca、1213cb。二子1×1本地暫存器組1213ca、1213cb以乒乓的方式作動,進而接收1×1權重參數並輸出1×1運算參數至1×1本地濾波運算單元1213d。1×1本地濾波運算單元1213d於1×1卷積運算程序s2433中根據1×1運算參數及3×3後處理運算資料1062進行1×1卷積運算以產生1×1運算資料。於1×1後處理運算程序s2434中,1×1後處理運算單元1213e根據偏壓分配器1212於偏壓運算子步驟s242中所輸出之1×1偏壓權重參數及1×1運算資料執行1×1後處理運算以產生1×1後處理運算資料1064。在第5圖至第7圖及第12圖的實施方式中,1×1後處理運算資料1064可為卷積神經網路處理器100的輸出資料108。 FIG. 12 is a block diagram illustrating the steps of the operation step s240 of the data processing method s200 of the convolutional neural network processor according to another embodiment of the method aspect of FIG. 8 . Please refer to Figure 5 to Figure 7 together. 
The output weight parameters may include a plurality of first output weight parameters, at least one second output weight parameter, and a bias output weight parameter. The first output weight parameter includes a complex 3×3 weight parameter. The at least one second output weight parameter includes a complex 1×1 weight parameter. The arithmetic module 121 may include a 3×3 arithmetic sub-module 1211 , a 1×1 arithmetic sub-module 1213 and a bias distributor 1212 . The 3×3 arithmetic sub-module 1211 includes a complex 3×3 convolutional distributor group, a complex 3 A ×3 local convolution operation unit 1211b and a complex 3×3 post-processing operation unit 1211e. 1×1 arithmetic submodule It includes a complex 1×1 convolution distributor group, a complex 1×1 local convolution operation unit 1213b and a complex 1×1 post-processing operation unit 1213e. The operation step s240 may include a first operation sub-step s241, a second operation sub-step s243 and a bias operation sub-step s242. The first operation sub-step s241 in FIG. 12 is the same as the first operation sub-step s241 in FIG. 11 , and details are not described here. The second operation sub-step s243 drives the 1×1 operation sub-module 1213 to receive the 3×3 post-processing operation data 1062 and at least one second output weight parameter to generate the 1×1 post-processing operation data 1064 . The second operation sub-step s243 includes a 1×1 parameter assignment program s2431, a 1×1 operation parameter generation program s2432, a 1×1 convolution operation program s2433 and a 1×1 post-processing operation program s2434. 
The 1×1 parameter assignment program s2431 drives the at least one 1×1 convolution distributor group to receive the 1×1 weight parameter of the at least one second output weight parameter, and distribute the 1×1 weight parameter of the at least one second output weight parameter to the 1×1 local convolution operation unit s1213b, wherein each 1×1 local convolution operation unit s1213b includes a 1×1 local register group s1213c and a 1×1 local filter operation unit s1213d. The 1×1 operation parameter generation program s2432 drives the 1×1 local register group 1213c of the 1×1 local convolution operation unit 1213b to receive the 1×1 weight parameter of the at least one second output weight parameter, and according to the at least one second output The 1×1 weight parameter of the weight parameter generates a complex 1×1 operation parameter. The 1×1 convolution operation program s2433 drives the 1×1 local filter operation unit 1213d of the 1×1 local convolution operation unit 1213b to perform a 1×1 convolution operation on the 1×1 operation parameters and the 3×3 post-processing operation data 1062 to generate complex 1×1 arithmetic data. The 1×1 post-processing operation program s2434 drives the 1×1 post-processing operation unit 1213e to perform a 1×1 post-processing operation on the 1×1 operation data to generate the 1×1 post-processing operation data 1064 . That is, the volume The product neural network processor 100 can generate 1×1 post-processing operation data 1064 through the first operation sub-step s241 , the second operation sub-step s243 and the bias operation sub-step s242 . 
Specifically, the 1×1 operation sub-module 1213 can be used to execute the second operation sub-step s243, and the 1×1 convolution distributor group of the 1×1 operation sub-module 1213 can execute the 1×1 parameter assignment program s2431 to allocate the 1×1 weight parameters to the 1×1 local register groups 1213c in the different 1×1 local convolution operation units 1213b, so that the 1×1 local register groups 1213c can execute the 1×1 operation parameter generation program s2432. Each 1×1 local register group 1213c may include two sub-1×1 local register groups 1213ca and 1213cb. The two sub-1×1 local register groups 1213ca and 1213cb act in a ping-pong manner, alternately receiving the 1×1 weight parameters and outputting the 1×1 operation parameters to the 1×1 local filter operation unit 1213d. In the 1×1 convolution operation program s2433, the 1×1 local filter operation unit 1213d performs a 1×1 convolution operation according to the 1×1 operation parameters and the 3×3 post-processing operation data 1062 to generate the 1×1 operation data. In the 1×1 post-processing operation program s2434, the 1×1 post-processing operation unit 1213e performs a 1×1 post-processing operation according to the 1×1 operation data and the 1×1 bias weight parameters output by the bias distributor 1212 in the bias operation sub-step s242, to generate the 1×1 post-processing operation data 1064. In the embodiments of FIGS. 5-7 and 12, the 1×1 post-processing operation data 1064 may be the output data 108 of the convolutional neural network processor 100.
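The ping-pong operation of the two sub-register groups described above can be illustrated with a short software sketch. This is a minimal model for exposition only: the class and method names are our own, and the actual design exchanges hardware register banks rather than Python lists. While the filter operation unit reads operation parameters from the active bank, the distributor loads the next weight parameters into the shadow bank, and the two banks then swap roles.

```python
class PingPongRegisterBank:
    """Illustrative model of a two-bank (ping-pong) local register group.

    One bank is 'active' and feeds the filter operation unit; the other is
    the 'shadow' bank being filled with the next weight parameters.
    """

    def __init__(self):
        self.banks = [[], []]   # the two sub local register groups
        self.active = 0         # index of the bank the filter unit reads

    def load(self, weights):
        # The distributor writes into the inactive (shadow) bank.
        self.banks[1 - self.active] = list(weights)

    def read(self):
        # The filter operation unit reads from the active bank.
        return self.banks[self.active]

    def swap(self):
        # Roles exchange once the next weights are fully loaded, so
        # loading and computing overlap instead of alternating.
        self.active = 1 - self.active
```

Loading the next parameters never disturbs the set currently being read, which is the point of the scheme: parameter updates hide behind ongoing convolution work.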

Please refer to FIGS. 5 to 10 and FIG. 12 together. Specifically, the convolutional neural network processor 100 can execute the data processing method s200 of the convolutional neural network processor, and the convolutional neural network processor 100 includes a message decoding unit 110 and a convolution determination unit 120. The message decoding unit 110 can perform the receiving step s210, the instruction decoding step s220 and the parallel processing step s230.

After receiving the input program 102 in the receiving step s210, the decoding module 111 stores the input program 102 in the program memory 1111 (the program storage sub-step s221); then, in the program decoding sub-step s222, the instruction decoder 1112 decodes the input program 102 stored in the program memory 1111 to output operation instructions to the parallel processing module 112 and to the controller 122 of the convolution determination unit 120, wherein the input program 102 may include instructions corresponding to a plurality of input data 106. In short, in the program decoding sub-step s222, the instruction decoder 1112 decodes the instructions corresponding to one of the input data 106 to output the operation instruction. After receiving the operation instruction, the controller 122 can control the computing module 121 accordingly. The parallel processing module 112 receives the input weight parameters 104 in the receiving step s210 and executes the parallel processing step s230. The input weight parameters 104 include first input weight parameters, second input weight parameters and bias input weight parameters; the number of first input weight parameters may be a multiple of 9216, the number of second input weight parameters a multiple of 1024, and the number of bias input weight parameters a multiple of 64. In other words, the input weight parameters 104 include the parameters corresponding to the plurality of input data 106.

In the weight parameter storage sub-step s231, the first parallel sub-memories 1121aa, the second parallel sub-memory 1121ac and the bias parallel sub-memory 1121ab store the first input weight parameters, the second input weight parameters and the bias input weight parameters, respectively, wherein the number of first parallel sub-memories 1121aa is 9, and the numbers of second parallel sub-memories 1121ac and bias parallel sub-memories 1121ab are each 1. In addition, the number of first parallel sub-processors 1121ba in the parallel processing module 112 is 9, and the numbers of second parallel sub-processors 1121bc and bias parallel sub-processors 1121bb are each 1. In the weight parameter processing sub-step s232, the first parallel sub-processors 1121ba and the second parallel sub-processor 1121bc can each process 4 first input weight parameters and 4 second input weight parameters per cycle. The first parallel sub-processors 1121ba and the second parallel sub-processor 1121bc therefore each need 256 cycles to process the first input weight parameters and the second input weight parameters corresponding to one of the input data 106, so as to output the first output weight parameters and the second output weight parameters, respectively, while the bias parallel sub-processor 1121bb uses 64 cycles to process the bias input weight parameters corresponding to that input data 106 to output the bias output weight parameters. Thus, the convolutional neural network processor 100 can process the input weight parameters 104 in parallel by executing the receiving step s210, the instruction decoding step s220 and the parallel processing step s230.
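The cycle counts stated above follow from simple arithmetic, which can be checked with a short sketch. The figures (9216 first weights over 9 sub-processors, 1024 second weights over 1, 64 bias weights over 1, and 4 weights per cycle for the first and second sub-processors) are taken from the text; the function itself is illustrative.

```python
def cycles(total_weights, num_subprocessors, weights_per_cycle):
    """Cycles each sub-processor needs, assuming weights are split evenly."""
    per_processor = total_weights // num_subprocessors
    return per_processor // weights_per_cycle

first = cycles(9216, 9, 4)   # first weights: 9216 / 9 = 1024, 4 per cycle
second = cycles(1024, 1, 4)  # second weights: 1024 / 1 = 1024, 4 per cycle
bias = cycles(64, 1, 1)      # bias weights: one per cycle on one sub-processor

print(first, second, bias)   # 256 256 64
```

The equal 256-cycle counts mean the 3×3 and 1×1 weight streams finish together, which keeps the two convolution sub-modules supplied at the same rate.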

The operation module 121 of the convolution determination unit 120 can execute the operation step s240, and the operation module 121 includes the 3×3 operation sub-module 1211, the bias distributor 1212 and the 1×1 operation sub-module 1213. The bias distributor 1212 can execute the bias operation sub-step s242. In the bias operation sub-step s242, the bias distributor 1212 receives the 3×3 bias weight parameters and the 1×1 bias weight parameters; it distributes the 3×3 bias weight parameters to the 3×3 post-processing operation units 1211e in the 3×3 operation sub-module 1211 so that the 3×3 post-processing operation units 1211e can execute the 3×3 post-processing operation program s2414, and distributes the 1×1 bias weight parameters to the 1×1 post-processing operation units 1213e in the 1×1 operation sub-module 1213 so that the 1×1 post-processing operation units 1213e can execute the 1×1 post-processing operation program s2434.
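The routing role of the bias distributor 1212 can be sketched as a split of the bias output weight parameters into a 3×3 portion and a 1×1 portion. The split-by-index scheme below is an assumption for illustration only; the patent does not specify how the distributor addresses the two groups of post-processing units.

```python
def distribute_bias(bias_output_weights, num_3x3):
    """Split the bias output weight parameters into the 3x3 bias weights
    (routed to the 3x3 post-processing units) and the 1x1 bias weights
    (routed to the 1x1 post-processing units). The contiguous-prefix
    layout is a hypothetical choice, not taken from the patent."""
    bias_3x3 = bias_output_weights[:num_3x3]
    bias_1x1 = bias_output_weights[num_3x3:]
    return bias_3x3, bias_1x1
```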

The 3×3 operation sub-module 1211 can execute the first operation sub-step s241, and the 3×3 operation sub-module 1211 includes a plurality of 3×3 convolution distributor groups, a plurality of 3×3 local convolution operation units 1211b, and a plurality of 3×3 post-processing operation units 1211e. The 3×3 convolution distributor groups are electrically connected to the first parallel sub-processors 1121ba and, in the 3×3 parameter assignment program s2411, receive and distribute the 3×3 weight parameters to the 3×3 local convolution operation units 1211b, so that the 3×3 local convolution operation units 1211b can execute the 3×3 operation parameter generation program s2412 and the 3×3 convolution operation program s2413. Each 3×3 local convolution operation unit 1211b includes a 3×3 local register group 1211c and a 3×3 local filter operation unit 1211d. The 3×3 local register group 1211c can execute the 3×3 operation parameter generation program s2412; it includes two sub-3×3 local register groups 1211ca and 1211cb and executes the program in a ping-pong manner to output the 3×3 operation parameters to the 3×3 local filter operation unit 1211d.

In the 3×3 convolution operation program s2413, the 3×3 local filter operation unit 1211d performs a 3×3 convolution operation according to the 3×3 operation parameters and the input data 106 to generate the 3×3 operation data, wherein the spatial filtering positions of the 3×3 convolution operation may each correspond to one of the first parallel sub-processors 1121ba. In the 3×3 post-processing operation program s2414, the 3×3 post-processing operation unit 1211e performs a 3×3 post-processing operation according to the 3×3 operation data and the 3×3 bias weight parameters to output the 3×3 post-processing operation data 1062.
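As a rough illustration of what the 3×3 local filter operation unit and the 3×3 post-processing operation unit compute together, the sketch below performs a single-channel 3×3 convolution (stride 1, no padding) followed by a bias addition. The single-channel simplification and the function name are assumptions for exposition; the patent's 3×3 post-processing operation may involve more than adding a bias weight.

```python
def conv3x3(image, kernel, bias):
    """3x3 convolution over a 2-D list 'image' with a 3x3 'kernel',
    then add 'bias' to every output (a stand-in for the 3x3
    post-processing operation). Stride 1, no padding."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            # Multiply-accumulate over the 3x3 spatial window.
            acc = sum(image[y + dy][x + dx] * kernel[dy][dx]
                      for dy in range(3) for dx in range(3))
            row.append(acc + bias)
        out.append(row)
    return out
```

In the hardware, the nine spatial filtering positions of this window are what the text associates with the nine first parallel sub-processors: one weight stream per tap position rather than a nested software loop.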

The 1×1 operation sub-module 1213 can execute the second operation sub-step s243, and the 1×1 operation sub-module 1213 includes at least one 1×1 convolution distributor group, a plurality of 1×1 local convolution operation units 1213b, and a plurality of 1×1 post-processing operation units 1213e. The at least one 1×1 convolution distributor group is electrically connected to the at least one second parallel sub-processor 1121bc and, in the 1×1 parameter assignment program s2431, receives and distributes the 1×1 weight parameters to the 1×1 local convolution operation units 1213b, so that the 1×1 local convolution operation units 1213b can execute the 1×1 operation parameter generation program s2432 and the 1×1 convolution operation program s2433. Each 1×1 local convolution operation unit 1213b includes a 1×1 local register group 1213c and a 1×1 local filter operation unit 1213d. The 1×1 local register group 1213c can execute the 1×1 operation parameter generation program s2432; it includes two sub-1×1 local register groups 1213ca and 1213cb and executes the program in a ping-pong manner to output the 1×1 operation parameters to the 1×1 local filter operation unit 1213d.

In the 1×1 convolution operation program s2433, the 1×1 local filter operation unit 1213d performs a 1×1 convolution operation according to the 1×1 operation parameters and the 3×3 post-processing operation data 1062 generated in the 3×3 post-processing operation program s2414 to generate the 1×1 operation data, wherein the spatial filtering positions of the 1×1 convolution operation may each correspond to the at least one second parallel sub-processor 1121bc. In the 1×1 post-processing operation program s2434, the 1×1 post-processing operation unit 1213e performs a 1×1 post-processing operation according to the 1×1 operation data and the 1×1 bias weight parameters to output the 1×1 post-processing operation data 1064. The 1×1 post-processing operation data 1064 output in the 1×1 post-processing operation program s2434 is the output data 108 generated by the convolutional neural network processor 100 executing the data processing method s200.
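Unlike the 3×3 stage, a 1×1 convolution has no spatial window: each output pixel is a weighted sum across input channels at the same position. The sketch below models the 1×1 stage consuming the 3×3 post-processing data, with a bias addition standing in for the 1×1 post-processing operation. Shapes and names are illustrative assumptions.

```python
def conv1x1(feature_maps, weights, bias):
    """1x1 convolution producing one output channel.

    feature_maps: list of input channels, each a 2-D list (here, the
                  3x3 post-processing operation data).
    weights:      one scalar per input channel (a 1x1 kernel).
    bias:         added per output pixel (stand-in for the 1x1
                  post-processing operation).
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(wc * feature_maps[c][y][x]
                 for c, wc in enumerate(weights)) + bias
             for x in range(w)]
            for y in range(h)]
```

Because the 1×1 stage only mixes channels, it can stream the 3×3 stage's output pixel by pixel, which is why the text can chain the two sub-modules without an intermediate full-frame buffer being implied.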

To sum up, the convolutional neural network processor 100 can perform highly parallel operations by executing the data processing method s200 of the convolutional neural network processor, thereby providing high-performance, low-power computation.

Although the present invention has been disclosed in the above embodiments, it is not intended to limit the present invention. Anyone skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be determined by the appended claims.

100‧‧‧Convolutional Neural Network Processor

102‧‧‧Input Program

104‧‧‧Input Weight Parameters

106‧‧‧Input Data

112‧‧‧Parallel Processing Module

1121‧‧‧Parallel Processing Sub-module

1121a‧‧‧Parallel Sub-memory

1121b‧‧‧Parallel Sub-processor

108‧‧‧Output Data

110‧‧‧Message Decoding Unit

111‧‧‧Decoding Module

1111‧‧‧Program Memory

1112‧‧‧Instruction Decoder

120‧‧‧Convolution Determination Unit

121‧‧‧Operation Module

Claims (24)

一種卷積神經網路處理器,用以運算一輸入資料,該卷積神經網路處理器包含:一訊息解碼單元,用以接收一輸入程式及複數輸入權重參數,且包含:一解碼模組,接收該輸入程式,並根據該輸入程式輸出一運作指令;及一平行處理模組,與該解碼模組電性連接,並接收該些輸入權重參數,且該平行處理模組包含複數平行處理子模組,該些平行處理子模組根據該運作指令及該些輸入權重參數產生複數輸出權重參數,其中該些平行處理子模組包含:複數平行子記憶體;及複數平行子處理器,分別與該解碼模組及一該平行子記憶體電性連接;以及一卷積判斷單元,與該訊息解碼單元電性連接,且包含:一運算模組,與該平行處理模組電性連接,該運算模組依據該輸入資料與該些輸出權重參數運算而產生一輸出資料, 其中當該些輸入權重參數為複數非壓縮輸入權重參數,該些平行子記憶體平行地儲存該些非壓縮輸入權重參數,該些平行子處理器根據該運作指令平行地接收該些非壓縮輸入權重參數,產生該些輸出權重參數。 A convolutional neural network processor for computing an input data, the convolutional neural network processor comprising: a message decoding unit for receiving an input program and a complex input weight parameter, and comprising: a decoding module , receiving the input program, and outputting an operation command according to the input program; and a parallel processing module, electrically connected with the decoding module, and receiving the input weight parameters, and the parallel processing module includes complex parallel processing sub-modules, the parallel processing sub-modules generate complex output weight parameters according to the operation command and the input weight parameters, wherein the parallel processing sub-modules include: a plurality of parallel sub-memory; and a plurality of parallel sub-processors, respectively electrically connected to the decoding module and a parallel sub-memory; and a convolution determination unit electrically connected to the message decoding unit, and comprising: an operation module electrically connected to the parallel processing module , the operation module generates an output data according to the operation of the input data and the output weight parameters, When the input weight parameters are complex uncompressed input weight parameters, the parallel sub-memory stores the uncompressed input weight parameters in parallel, and the parallel sub-processors receive the uncompressed input in parallel according to the operation command Weight parameters, which generate these output weight parameters. 
如申請專利範圍第1項所述之卷積神經網路處理器,其中該解碼模組包含:一程式記憶體,儲存該輸入程式;及一指令解碼器,與該程式記憶體電性連接,該指令解碼器將該輸入程式解碼以輸出一運作指令。 The convolutional neural network processor as described in claim 1, wherein the decoding module comprises: a program memory, which stores the input program; and an instruction decoder, which is electrically connected to the program memory, The command decoder decodes the input program to output an operation command. 如申請專利範圍第1項所述之卷積神經網路處理器,其中當該些輸入權重參數為複數壓縮輸入權重參數,該些平行子記憶體平行地儲存該些壓縮輸入權重參數,該些平行子處理器根據該運作指令平行地接收並解壓縮該些壓縮輸入權重參數,產生該些輸出權重參數。 The convolutional neural network processor described in claim 1, wherein when the input weight parameters are complex compressed input weight parameters, the parallel sub-memory stores the compressed input weight parameters in parallel, the The parallel sub-processors receive and decompress the compressed input weight parameters in parallel according to the operation command to generate the output weight parameters. 如申請專利範圍第1項所述之卷積神經網路處理器,其中,該些輸入權重參數包含複數第一輸入權重參 數;該些輸出權重參數包含複數第一輸出權重參數;該些平行子記憶體平行地儲存該些輸入權重參數,該些平行子記憶體包含:複數第一平行子記憶體,分別且平行地接收並儲存一該第一輸入權重參數;及該些平行子處理器包含:複數第一平行子處理器,分別與一該第一平行子記憶體電性連接,根據該運作指令接收一該第一輸入權重參數,以輸出一該第一輸出權重參數。 The convolutional neural network processor as described in claim 1, wherein the input weight parameters comprise a plurality of first input weight parameters The output weight parameters include a plurality of first output weight parameters; the parallel sub-memory stores the input weight parameters in parallel, and the parallel sub-memory includes: a plurality of first parallel sub-memory, respectively and in parallel receiving and storing a first input weight parameter; and the parallel sub-processors include: a plurality of first parallel sub-processors, respectively electrically connected with a first parallel sub-memory, and receiving a first parallel sub-processor according to the operation command an input weight parameter to output the first output weight parameter. 
如申請專利範圍第4項所述之卷積神經網路處理器,其中,各該第一輸出權重參數包含複數3×3權重參數;及該運算模組包含:一3×3運算子模組,與該些第一平行子處理器電性連接,並根據該些第一輸出權重參數與該輸入資料進行運算,以產生一3×3後處理運算 資料,該3×3運算子模組包含:複數3×3卷積分配器組,各該3×3卷積分配器組與一該第一平行子處理器電性連接,該些3×3卷積分配器組用以接收及分配該些第一輸出權重參數之該些3×3權重參數;複數3×3本地卷積運算單元,分別與一該3×3卷積分配器組電性連接,各該3×3本地卷積運算單元包含:一3×3本地暫存器組,該3×3本地暫存器組與一該3×3卷積分配器組電性連接,該些3×3本地卷積運算單元之該些3×3本地暫存器組接收並儲存該些第一輸出權重參數之該些3×3權重參數,並根據該些第一輸出權重參數之該些3×3權重參數,以輸出複數3×3運算參數;及一3×3本地濾波運算單元,與該3×3本地暫存器組電性連接,該些3×3本地卷積運算單元之該些3×3本地濾波運算單元根據該些3×3運算參數與該輸 入資料進行運算以產生複數3×3運算資料;及複數3×3後處理運算單元,與該些3×3本地卷積運算單元電性連接,並依據該些3×3運算資料進行一3×3後處理運算,以產生該3×3後處理運算資料;其中,該輸出資料為該3×3後處理運算資料。 The convolutional neural network processor as described in claim 4, wherein each of the first output weight parameters includes a complex number of 3×3 weight parameters; and the operation module includes: a 3×3 operation sub-module , which is electrically connected to the first parallel sub-processors, and performs operations on the input data according to the first output weight parameters to generate a 3×3 post-processing operation Data, the 3×3 operation sub-module includes: a plurality of 3×3 convolution divider groups, each of the 3×3 convolution divider groups is electrically connected with a first parallel sub-processor, the 3×3 convolution dividers The distributor group is used for receiving and distributing the 3×3 weight parameters of the first output weight parameters; the complex 3×3 local convolution operation units are respectively electrically connected with one of the 3×3 convolution distributor groups, each of the The 3×3 local convolution operation unit includes: a 3×3 local register group, the 3×3 local register group is electrically connected with the 3×3 convolution allocator group, the 3×3 local volume The 3×3 local register groups of the product operation unit receive and store the 3×3 weight parameters of the first output weight parameters, and according to the 3×3 weight parameters of the first output weight parameters , to output complex 3×3 operation parameters; and a 3×3 
local filtering operation unit, which is electrically connected to the 3×3 local register group, and these 3×3 local convolution operation units of these 3×3 The local filtering operation unit is based on the 3×3 operation parameters and the input input data for operation to generate complex 3×3 operation data; and a complex 3×3 post-processing operation unit, electrically connected with the 3×3 local convolution operation units, and perform a 3×3 operation according to the 3×3 operation data ×3 post-processing operation to generate the 3×3 post-processing operation data; wherein, the output data is the 3×3 post-processing operation data. 如申請專利範圍第5項所述之卷積神經網路處理器,其中各該3×3本地暫存器組包含:二子3×3本地暫存器組,交替地儲存一該3×3權重參數或輸出該3×3運算參數給該3×3本地濾波運算單元。 The convolutional neural network processor as described in claim 5, wherein each of the 3×3 local register sets includes: two sub-3×3 local register sets, alternately storing one of the 3×3 weights parameter or output the 3×3 operation parameter to the 3×3 local filtering operation unit. 如申請專利範圍第5項所述之卷積神經網路處理器,其中,該些輸入權重參數更包含一偏壓輸入權重參數;該些輸出權重參數更包含一偏壓輸出權重參數;及該些平行子記憶體更包含: 一偏壓平行子記憶體,平行地儲存該偏壓輸入權重參數;該些平行子處理器更包含:一偏壓平行子處理器,與該偏壓平行子記憶體電性連接,根據該運作指令接收該偏壓輸入權重參數,以輸出該偏壓輸出權重參數。 The convolutional neural network processor described in claim 5, wherein the input weight parameters further comprise a bias input weight parameter; the output weight parameters further comprise a bias output weight parameter; and the Some parallel sub-memory also include: A bias parallel sub-memory stores the bias input weight parameter in parallel; the parallel sub-processors further include: a bias parallel sub-processor electrically connected to the bias parallel sub-memory, according to the operation The instruction receives the bias input weight parameter to output the bias output weight parameter. 
如申請專利範圍第7項所述之卷積神經網路處理器,其中,該偏壓輸出權重參數包含複數偏壓權重參數;及該運算模組更包含:一偏壓分配器,與該偏壓平行子處理器、該3×3運算子模組電性連接,該偏壓分配器根據該偏壓輸出權重參數以產生複數3×3偏壓權重參數,並將該些3×3偏壓權重參數輸出至該些3×3後處理運算單元。 The convolutional neural network processor described in claim 7, wherein the bias output weight parameter comprises a complex bias weight parameter; and the operation module further comprises: a bias distributor, which is connected to the bias The parallel sub-processor and the 3×3 operation sub-module are electrically connected, and the bias divider outputs the weight parameter according to the bias voltage to generate a plurality of 3×3 bias voltage weight parameters, and distributes the 3×3 bias voltage The weight parameters are output to the 3×3 post-processing operation units. 如申請專利範圍第4項所述之卷積神經網路處理器,其中,該些輸入權重參數更包含至少一第二輸入權 重參數;該些輸出權重參數更包含至少一第二輸出權重參數;及該些平行子記憶體更包含:至少一第二平行子記憶體,分別且平行地接收並儲存該至少一第二輸入權重參數;該些平行子處理器更包含:至少一第二平行子處理器,分別與該至少一第二平行子記憶體電性連接,根據該運作指令接收該至少一第二輸入權重參數,以輸出該至少一第二輸出權重參數。 The convolutional neural network processor described in claim 4, wherein the input weight parameters further include at least one second input weight weight parameters; the output weight parameters further include at least one second output weight parameter; and the parallel sub-memory further includes: at least one second parallel sub-memory, respectively and in parallel receiving and storing the at least one second input weight parameters; the parallel sub-processors further include: at least one second parallel sub-processor, respectively electrically connected to the at least one second parallel sub-memory, receiving the at least one second input weight parameter according to the operation command, to output the at least one second output weight parameter. 
如申請專利範圍第9項所述之卷積神經網路處理器,其中該運算模組包含:一3×3運算子模組,與該些第一平行子處理器電性連接,並根據該些第一輸出權重參數與該輸入資料進行運算,以產生一3×3後處理運算資料;及一1×1運算子模組,與該至少一第二平行子處理器及該3×3運算子模組電性連接,並根據該至少一第二輸出權重參數與該3×3後處理運算資料進行運算,以產生一1×1後處理運算資料;其中,該輸出資料為該1×1後處理運算資料。 The convolutional neural network processor as described in claim 9, wherein the operation module comprises: a 3×3 operation sub-module electrically connected with the first parallel sub-processors, and according to the some first output weight parameters are operated with the input data to generate a 3×3 post-processing operation data; and a 1×1 operation sub-module with the at least one second parallel sub-processor and the 3×3 operation The sub-module is electrically connected, and performs operation on the 3×3 post-processing operation data according to the at least one second output weight parameter to generate a 1×1 post-processing operation data; wherein, the output data is the 1×1 Post-processing operation data. 如申請專利範圍第10項所述之卷積神經網路處理器,其中該至少一第二輸出權重參數包含複數1×1權重參數;該1×1運算子模組包含:至少一1×1卷積分配器組,與該至少一第二平行子處理器電性連接,用以接收及分配該至少一第二輸出權重參數之該些1×1權重參數;複數1×1本地卷積運算單元,與該至少一1×1卷積分配器電性連接,各該1×1本地卷積運算單元包含:一1×1本地暫存器組,該1×1本地暫存器組與該至少一1×1卷積分配器組電性連接,該些1×1本地卷積運算單元之該1×1本地暫存器組接收並儲存該至少一第二輸出權重參數之該些1×1權重參數,並根據該至少一第二輸出權重參數之該些1×1權重參數,以輸出複數1×1運算參數;及一1×1本地濾波運算單元,與該1×1本地暫存器組電性連接,該些1×1本地卷積運算單元之該1×1本地濾波運算單元根據該 些1×1運算參數與該3×3後處理運算資料進行運算以產生複數1×1運算資料;及複數1×1後處理運算單元,與該些1×1本地卷積運算單元電性連接,並依據該些1×1運算資料進行一1×1後處理運算,以產生該1×1後處理運算資料。 The convolutional neural network processor described in claim 10, wherein the at least one second output weight parameter comprises a complex 1×1 weight parameter; the 1×1 operation sub-module comprises: at least one 1×1 A convolution distributor group, electrically connected to the at least one second parallel sub-processor, for receiving and distributing the 1×1 weight parameters of the at least one second output weight parameter; a complex 1×1 local convolution operation unit , electrically connected with the at least one 1×1 convolution distributor, each of the 1×1 local 
convolution operation units includes: a 1×1 local register group, the 1×1 local register group and the at least one A 1×1 convolution distributor group is electrically connected, and the 1×1 local register group of the 1×1 local convolution operation units receives and stores the 1×1 weight parameters of the at least one second output weight parameter , and according to the 1×1 weight parameters of the at least one second output weight parameter, to output complex 1×1 operation parameters; and a 1×1 local filtering operation unit, which is electrically connected to the 1×1 local register bank connection, the 1×1 local filtering operation unit of the 1×1 local convolution operation units is based on the some 1×1 operation parameters are operated with the 3×3 post-processing operation data to generate complex 1×1 operation data; and a complex 1×1 post-processing operation unit is electrically connected to the 1×1 local convolution operation units , and perform a 1×1 post-processing operation according to the 1×1 operation data to generate the 1×1 post-processing operation data. 如申請專利範圍第11項所述之卷積神經網路處理器,其中各該1×1本地暫存器組包含:二子1×1本地暫存器組,交替地儲存一該1×1權重參數或輸出該1×1運算參數給該1×1運算單元。 The convolutional neural network processor of claim 11, wherein each of the 1×1 local register sets includes: two sub-1×1 local register sets, alternately storing one of the 1×1 weights Parameter or output the 1×1 operation parameter to the 1×1 operation unit. 
如申請專利範圍第11項所述之卷積神經網路處理器,其中,該些輸入權重參數更包含一偏壓輸入權重參數;該些輸出權重參數更包含一偏壓輸出權重參數;及該些平行子記憶體更包含:一偏壓平行子記憶體,平行地儲存該偏壓輸入權重參數;及 該些平行子處理器更包含:一偏壓平行子處理器,與該偏壓平行子記憶體電性連接,根據該運作指令接收該偏壓輸入權重參數,以輸出該偏壓輸出權重參數。 The convolutional neural network processor of claim 11, wherein the input weight parameters further include a bias input weight parameter; the output weight parameters further include a bias output weight parameter; and the The parallel sub-memory further includes: a biased parallel sub-memory storing the bias input weight parameter in parallel; and The parallel sub-processors further include: a bias parallel sub-processor electrically connected to the bias parallel sub-memory, receiving the bias input weight parameter according to the operation command, and outputting the bias output weight parameter. 如申請專利範圍第13項所述之卷積神經網路處理器,其中該偏壓輸出權重參數包含複數偏壓權重參數;該運算模組更包含:一偏壓分配器,與該偏壓平行子處理器、該3×3運算子模組及該1×1運算子模組電性連接,該偏壓分配器根據該偏壓輸出權重參數以產生複數3×3偏壓權重參數及複數1×1偏壓權重參數;其中,該偏壓分配器將該些3×3偏壓權重參數輸出至該些3×3後處理運算單元;其中,該偏壓分配器將該些1×1偏壓權重參數輸出至該些1×1後處理運算單元。 The convolutional neural network processor described in claim 13, wherein the bias output weight parameter comprises a complex bias weight parameter; the operation module further comprises: a bias distributor parallel to the bias The sub-processor, the 3×3 arithmetic sub-module and the 1×1 arithmetic sub-module are electrically connected, and the bias divider outputs the weight parameter according to the bias to generate a complex 3×3 bias weight parameter and a complex 1 ×1 bias weight parameters; wherein, the bias divider outputs the 3×3 bias weight parameters to the 3×3 post-processing operation units; wherein the bias divider outputs the 1×1 biases The compression weight parameters are output to the 1×1 post-processing operation units. 
一種卷積神經網路處理器的資料處理方法,包含: 一接收步驟,驅動一訊息解碼單元接收一輸入程式及複數輸入權重參數,其中該訊息解碼單元包含一解碼模組及一平行處理模組;一指令解碼步驟,驅動該解碼模組接收該輸入程式,並根據該輸入程式,以產生一運作指令;一平行處理步驟,驅動該平行處理模組接收該些輸入權重參數,並根據該運作指令以平行地處理該些輸入權重參數,以產生複數輸出權重參數;以及一運算步驟,驅動一運算模組接收一輸入資料及該些輸出權重參數,並根據該運作指令以將該輸入資料與該些輸出權重參數進行運算,以產生一輸出資料,其中當該些輸入權重參數為複數非壓縮輸入權重參數,該運作處理為儲存該些非壓縮輸入權重參數。 A data processing method for a convolutional neural network processor, comprising: a receiving step, driving a message decoding unit to receive an input program and complex input weight parameters, wherein the message decoding unit includes a decoding module and a parallel processing module; an instruction decoding step, driving the decoding module to receive the input program , and generate an operation command according to the input program; a parallel processing step drives the parallel processing module to receive the input weight parameters, and process the input weight parameters in parallel according to the operation command to generate a complex output weight parameters; and an operation step for driving an operation module to receive an input data and the output weight parameters, and to operate the input data and the output weight parameters according to the operation instruction to generate an output data, wherein When the input weight parameters are complex uncompressed input weight parameters, the operation process is to store the uncompressed input weight parameters. 如申請專利範圍第15項所述之卷積神經網路處理器的資料處理方法,其中該解碼模組包含一程式記憶體及一指令解碼器,且該指令解碼步驟包含:一程式儲存子步驟,驅動該程式記憶體儲存該 輸入程式;及一程式解碼子步驟,驅動該指令解碼器對該輸入程式進行解碼,以產生該運作指令。 The data processing method of a convolutional neural network processor as described in claim 15, wherein the decoding module includes a program memory and an instruction decoder, and the instruction decoding step includes: a program storage sub-step , which drives the program memory to store the an input program; and a program decoding sub-step, driving the instruction decoder to decode the input program to generate the operation instruction. 
如申請專利範圍第15項所述之卷積神經網路處理器的資料處理方法,其中該平行處理模組包含複數平行子記憶體及複數平行子處理器,且該平行處理步驟包含:一權重參數儲存子步驟,驅動該些平行子記憶體以平行地儲存該些輸入權重參數;及一權重參數處理子步驟,驅動該些平行子處理器,該平行子處理器根據該運作指令,平行地讀取該些輸入權重參數並進行一運作處理,以產生該些輸出權重參數。 The data processing method of a convolutional neural network processor as described in claim 15, wherein the parallel processing module comprises a plurality of parallel sub-memory and a plurality of parallel sub-processors, and the parallel processing step comprises: a weight A parameter storage sub-step, driving the parallel sub-memory to store the input weight parameters in parallel; and a weight parameter processing sub-step, driving the parallel sub-processors, the parallel sub-processors, according to the operation command, in parallel The input weight parameters are read and an operation process is performed to generate the output weight parameters. 如申請專利範圍第17項所述之卷積神經網路處理器的資料處理方法,其中當該些輸入權重參數為複數壓縮輸入權重參數,該運作處理為儲存及解壓縮該些壓縮輸入權重參數。 The data processing method of a convolutional neural network processor as described in claim 17, wherein when the input weight parameters are complex compressed input weight parameters, the operation process is to store and decompress the compressed input weight parameters . 如申請專利範圍第15項所述之卷積神經網路處理器的資料處理方法,其中,該些輸出權重參數包含複數第一輸出權重參數;該運算模組包含一3×3運算子模組;及該運算步驟包含:一第一運算子步驟,驅動該3×3運算子模組接收該輸入資料及該些第一輸出權重參數,以產生一3×3後處理運算資料。 The data processing method for a convolutional neural network processor as described in claim 15, wherein the output weight parameters include plural first output weight parameters; the operation module includes a 3×3 operation sub-module ; and the operation step includes: a first operation sub-step, driving the 3×3 operation sub-module to receive the input data and the first output weight parameters to generate a 3×3 post-processing operation data. 
20. The data processing method for a convolutional neural network processor of claim 19, wherein each of the first output weight parameters comprises a plurality of 3×3 weight parameters; the 3×3 operation sub-module comprises a plurality of 3×3 convolution distributor groups, a plurality of 3×3 local convolution operation units and a plurality of 3×3 post-processing operation units; and the first operation sub-step comprises: a 3×3 parameter distributing procedure of driving the 3×3 convolution distributor groups to receive the 3×3 weight parameters of the first output weight parameters and to distribute the 3×3 weight parameters to the 3×3 local convolution operation units, wherein each of the 3×3 local convolution operation units comprises a 3×3 local register group and a 3×3 local filter operation unit; a 3×3 operation parameter generating procedure of driving the 3×3 local register groups of the 3×3 local convolution operation units to receive the 3×3 weight parameters and to generate a plurality of 3×3 operation parameters according to the 3×3 weight parameters; a 3×3 convolution operation procedure of driving the 3×3 local filter operation units of the 3×3 local convolution operation units to perform a 3×3 convolution operation on the 3×3 operation parameters and the input data, so as to generate a plurality of 3×3 operation data; and a 3×3 post-processing operation procedure of driving the 3×3 post-processing operation units to perform a 3×3 post-processing operation on the 3×3 operation data, so as to generate the 3×3 post-processed operation data, wherein the output data is the 3×3 post-processed operation data. 21. The data processing method for a convolutional neural network processor of claim 19, wherein the output weight parameters further comprise a bias output weight parameter; the operation module further comprises a bias distributor; and the operation step further comprises: a bias operation sub-step of driving the bias distributor to generate a plurality of 3×3 bias weight parameters according to the bias output weight parameter, the bias distributor providing the 3×3 bias weight parameters to the 3×3 operation sub-module. 22. The data processing method for a convolutional neural network processor of claim 19, wherein the output weight parameters further comprise at least one second output weight parameter; the operation module comprises a 1×1 operation sub-module; and the operation step further comprises: a second operation sub-step of driving the 1×1 operation sub-module to receive the 3×3 post-processed operation data and the at least one second output weight parameter, so as to generate a 1×1 post-processed operation data.
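The 3×3 datapath of claims 20-21 — a local filter unit convolving the input with its distributed kernel, followed by a post-processing unit that applies the distributed bias — can be sketched as below. This is a software model, not the claimed hardware; treating the post-processing as bias-add plus ReLU is an assumption, since the claims leave the post-processing operation unspecified.

```python
# Illustrative sketch of one 3x3 local convolution unit and its
# post-processing unit. Plain lists stand in for the local register group;
# "bias add + ReLU" is an assumed post-processing, not the patent's.

def conv3x3(image, kernel):
    # 3x3 local filter operation: valid convolution (no padding).
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out

def post_process(feature, bias):
    # 3x3 post-processing operation: bias add followed by ReLU.
    return [[max(v + bias, 0) for v in row] for row in feature]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]  # identity kernel: each output is the centre pixel
out = post_process(conv3x3(image, kernel), bias=-6)
# centre pixels are 6 and 7; after bias -6 and ReLU: [[0, 1]]
```

In the claimed architecture many such units run in parallel, each loaded with its own kernel by the 3×3 convolution distributor groups.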
23. The data processing method for a convolutional neural network processor of claim 22, wherein the at least one second output weight parameter comprises a plurality of 1×1 weight parameters; the 1×1 operation sub-module comprises a plurality of 1×1 convolution distributor groups, a plurality of 1×1 local convolution operation units and a plurality of 1×1 post-processing operation units; and the second operation sub-step comprises: a 1×1 parameter distributing procedure of driving the 1×1 convolution distributor groups to receive the 1×1 weight parameters of the at least one second output weight parameter and to distribute the 1×1 weight parameters to the 1×1 local convolution operation units, wherein each of the 1×1 local convolution operation units comprises a 1×1 local register group and a 1×1 local filter operation unit; a 1×1 operation parameter generating procedure of driving the 1×1 local register groups of the 1×1 local convolution operation units to receive the 1×1 weight parameters and to generate a plurality of 1×1 operation parameters according to the 1×1 weight parameters; a 1×1 convolution operation procedure of driving the 1×1 local filter operation units of the 1×1 local convolution operation units to perform a 1×1 convolution operation on the 1×1 operation parameters and the 3×3 post-processed operation data, so as to generate a plurality of 1×1 operation data; and a 1×1 post-processing operation procedure of driving the 1×1 post-processing operation units to perform a 1×1 post-processing operation on the 1×1 operation data, so as to generate the 1×1 post-processed operation data, wherein the output data is the 1×1 post-processed operation data. 24. The data processing method for a convolutional neural network processor of claim 22, wherein the output weight parameters further comprise a bias output weight parameter; the operation module further comprises a bias distributor; and the operation step further comprises: a bias operation sub-step of driving the bias distributor to generate a plurality of 3×3 bias weight parameters and a plurality of 1×1 bias weight parameters according to the bias output weight parameter, wherein the bias distributor provides the 3×3 bias weight parameters to the 3×3 operation sub-module and the 1×1 bias weight parameters to the 1×1 operation sub-module.
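Claims 23-24 chain a 1×1 convolution stage after the 3×3 stage and let one bias distributor feed both sub-modules. A minimal sketch of those two ideas, assuming a simple "first *n* biases to the 3×3 stage, the rest to the 1×1 stage" split, which the claims do not specify:

```python
# Sketch of claims 23-24 (illustrative only): a 1x1 convolution over the
# 3x3 post-processed channels, plus a bias distributor splitting one bias
# list between the two sub-modules. The even split is an assumption.

def conv1x1(channels, weights):
    # 1x1 convolution: a per-pixel weighted sum across input channels.
    h, w = len(channels[0]), len(channels[0][0])
    return [[sum(c[y][x] * wt for c, wt in zip(channels, weights))
             for x in range(w)] for y in range(h)]

def bias_distributor(bias_params, num_3x3):
    # Bias operation sub-step: route the first `num_3x3` biases to the
    # 3x3 sub-module and the remainder to the 1x1 sub-module.
    return bias_params[:num_3x3], bias_params[num_3x3:]

b3, b1 = bias_distributor([0.5, 0.25, -1.0], num_3x3=2)
# Two channels of 3x3 post-processed data, combined by 1x1 weights [2, 3]:
post3x3 = [[[1, 2]], [[3, 4]]]
out1x1 = conv1x1(post3x3, weights=[2, 3])
# per pixel: 1*2 + 3*3 = 11 and 2*2 + 4*3 = 16
```

A 1×1 convolution needs no spatial neighbourhood, which is why the claimed 1×1 units can consume the 3×3 stage's output stream directly, pixel by pixel.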
TW108136729A 2018-12-06 2019-10-09 Convolutional neural network processor and data processing method thereof TWI766193B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/701,172 US11494645B2 (en) 2018-12-06 2019-12-03 Convolutional neural network processor and data processing method thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862776426P 2018-12-06 2018-12-06
US62/776,426 2018-12-06

Publications (2)

Publication Number Publication Date
TW202022710A TW202022710A (en) 2020-06-16
TWI766193B true TWI766193B (en) 2022-06-01

Family

ID=71029040

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108136729A TWI766193B (en) 2018-12-06 2019-10-09 Convolutional neural network processor and data processing method thereof

Country Status (2)

Country Link
CN (1) CN111291874B (en)
TW (1) TWI766193B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11610134B2 (en) * 2019-07-08 2023-03-21 Vianai Systems, Inc. Techniques for defining and executing program code specifying neural network architectures

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201706871A (en) * 2015-05-21 2017-02-16 Google Llc Computing convolutions using a neural network processor
TW201706872A (en) * 2015-05-21 2017-02-16 Google Llc Prefetching weights for use in a neural network processor
US20180336165A1 (en) * 2017-05-17 2018-11-22 Google Llc Performing matrix multiplication in hardware
US20180336164A1 (en) * 2017-05-17 2018-11-22 Google Llc Low latency matrix multiply unit

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3210319B2 (en) * 1990-03-01 2001-09-17 株式会社東芝 Neurochip and neurocomputer using the chip
CN105681628B (en) * 2016-01-05 2018-12-07 西安交通大学 A kind of convolutional network arithmetic element and restructural convolutional neural networks processor and the method for realizing image denoising processing
WO2017177446A1 (en) * 2016-04-15 2017-10-19 北京中科寒武纪科技有限公司 Discrete data representation-supporting apparatus and method for back-training of artificial neural network
US10726583B2 (en) * 2016-12-30 2020-07-28 Intel Corporation System and method of encoding and decoding feature maps and weights for a convolutional neural network
CN108763191B (en) * 2018-04-16 2022-02-11 华南师范大学 A method and system for generating text summaries

Also Published As

Publication number Publication date
CN111291874B (en) 2023-12-01
CN111291874A (en) 2020-06-16
TW202022710A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
EP3624020B1 (en) Computation method and product thereof
US11531540B2 (en) Processing apparatus and processing method with dynamically configurable operation bit width
Nag et al. ViTA: A vision transformer inference accelerator for edge applications
CN110163362B (en) A computing device and method
US20190026626A1 (en) Neural network accelerator and operation method thereof
US11494645B2 (en) Convolutional neural network processor and data processing method thereof
US10908916B2 (en) Apparatus and method for executing a plurality of threads
US11468600B2 (en) Information processing apparatus, information processing method, non-transitory computer-readable storage medium
Kyrkou et al. An embedded hardware-efficient architecture for real-time cascade support vector machine classification
Meloni et al. A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC
TWI766193B (en) Convolutional neural network processor and data processing method thereof
JP2025138911A (en) Processing method and computing system
CN116796812A (en) Programmable parallel processing device, neural network chip and electronic equipment
Ryu et al. A 44.1 TOPS/W precision-scalable accelerator for quantized neural networks in 28nm CMOS
Dong et al. UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA
CN115345287B (en) Methods for computing macro arrangements in memory, computer-readable media and electronic devices
CN113407238A (en) Many-core architecture with heterogeneous processors and data processing method thereof
Shan et al. Design of Approximate Multi-Granularity Multiply-Accumulate Unit for Convolutional Neural Network
Beyer et al. Exploiting subword permutations to maximize CNN compute performance and efficiency
Zhao et al. A microcode-based control unit for deep learning processors
Rai et al. Accelerating Automated Driving and ADAS Using HW/SW Codesign
Modiboyina et al. Accelerating U-Net: A patchwise memory optimization approach for image segmentation
US20240411517A1 (en) Data processing device, data processing method, and chip
Piyasena et al. Lowering dynamic power of a stream-based cnn hardware accelerator
Im et al. Energy-efficient Dense DNN Acceleration with Signed Bit-slice Architecture