TWI886942B - Application specific integrated circuit for computing a convolutional neural network

Info

Publication number: TWI886942B
Application number: TW113117061A
Authority: TW
Inventors: 馬克Ａ馬修斯
Original assignee: 美商吉根托科技股份有限公司
Priority date: 2023-05-26
Filing date: 2024-05-08
Publication date: 2025-06-11
Also published as: EP4490666A1; JP7763543B2; KR20240172733A; CN119404199A; EP4490666A4; WO2024248910A1; IL312270A; TW202501278A; JP2025525265A

Abstract

An Application Specific Integrated Circuit (ASIC) for computing a convolutional neural network (CNN) has a first input bus receiving an ordered stream of values from an array, each position in the array having one or more channels, and a plurality of kernel processing tiles receiving inputs through configurable multiplexers. The kernel processing tiles and buses are arranged and connected such that the ASIC operates as a pipelined system delivering an output stream in synchronization with the input stream.

Description

Application specific integrated circuit for computing a convolutional neural network

Related Applications: This application is a continuation-in-part of co-pending application 17/742,245, filed May 11, 2022, which is a continuation-in-part of application 17/570,757, filed January 7, 2022 and issued June 7, 2022 as US 11,354,571, which is a continuation-in-part of application 17/373,497, filed July 12, 2021 and issued February 22, 2022 as US 11,256,981, which is a continuation-in-part of application 17/231,711, filed April 15, 2021 and issued August 24, 2021 as US 11,099,854, which is a continuation-in-part of co-pending application 17/071,875, filed October 15, 2020. All disclosure of the parent applications is incorporated herein at least by reference.

Field of the Invention: The present invention is in the technical field of computer operations involving matrix inputs and outputs, and pertains more particularly to circuits designed for mass multiplication in matrix operations.

Background of the Invention: The use of computers for matrix operations is well known in the art, specific examples being image processing and the development and use of neural networks. Neural networks are an important component of artificial intelligence and, at the time of filing the present patent application, are accordingly a very popular subject in the development of intellectual property. Generally speaking, in computer operations of this sort, large numbers of input values are processed in a regular pattern, the pattern being in most instances a matrix. Processing of the input values may involve biasing and applying weights, by which individual input values may be multiplied.

The present inventor believes that the complex and computationally intensive operation in neural network technology, in which an incoming value is multiplied by each of a plurality of weight values, is a step open to innovation that can provide a distinct advantage in the art. The inventor also believes that advantages can be gained by revising the order of the mathematical processes to be applied.

The present inventor believes that he has determined a general change in the order and manner of the mathematical processes to be implemented in such applications, a change that may well produce a very significant reduction in the time and cost of such operations.

Summary of the Invention: In an embodiment of the invention, an application specific integrated circuit (ASIC) for computing a convolutional neural network (CNN) is provided, comprising: a first input bus receiving an ordered stream of values from an array, each position in the array having one or more data channels; a first set of kernel processing tiles, ordered from first to last, having a fixed number of parallel input connections and a fixed number of parallel output connections, each kernel processing tile of the first ordered set coupled to the input bus through one of a first set of configurable multiplexers, the kernel processing tiles adapted to compute a convolution of a common kernel size and to pass computed values to a first output bus and to an adjacent downstream kernel processing tile of the first ordered set, the first output bus connected back as an input to each configurable multiplexer of the first set of configurable multiplexers; a second set of kernel processing tiles, ordered from first to last, having the fixed number of parallel input connections and the fixed number of parallel output connections, each kernel processing tile of the second ordered set coupled to the first output bus through one of a second set of configurable multiplexers, the kernel processing tiles of the second ordered set adapted to compute a convolution of the common kernel size and to pass computed values to a second output bus and to an adjacent downstream kernel processing tile of the second ordered set, the second output bus connected back as an input to each configurable multiplexer of the second set of configurable multiplexers; and a third set of kernel processing tiles, ordered from first to last, having the fixed number of parallel input connections and the fixed number of parallel output connections, each kernel processing tile of the third ordered set coupled to the second output bus through one of a third set of configurable multiplexers, the kernel processing tiles of the third ordered set adapted to compute a convolution of the common kernel size and to pass computed values to a third output bus and to an adjacent downstream kernel processing tile of the third ordered set, the third output bus connected back as an input to each configurable multiplexer of the third set of configurable multiplexers, the third output bus also connected through a single primary output multiplexer to a primary output circuit adapted to perform primary output processing and to provide a final output.
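As an illustrative aid only, and not as part of the disclosed circuitry, the bus and multiplexer topology recited above can be modeled in a few lines of Python. The class name `TileGroup`, the bus labels, and the number of tiles per group are assumptions introduced for this sketch. The model captures only the recited property that each configurable multiplexer of a group may select either the bus upstream of that group or the group's own output bus, which is connected back as an input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileGroup:
    """One ordered group of kernel processing tiles (hypothetical model)."""
    upstream_bus: str  # bus the group's configurable multiplexers are coupled to
    output_bus: str    # bus the group's tiles drive; also fed back to the muxes
    n_tiles: int       # tiles in the group (assumed value, not from the source)

    def mux_choices(self):
        # A mux may select the upstream bus or the group's own output bus.
        return (self.upstream_bus, self.output_bus)


def build_asic(n_tiles=4):
    # Three groups chained bus to bus, following the summary.
    return [
        TileGroup("input_bus", "output_bus_1", n_tiles),
        TileGroup("output_bus_1", "output_bus_2", n_tiles),
        TileGroup("output_bus_2", "output_bus_3", n_tiles),
    ]


def check_config(groups, selections):
    """selections[g][t] names the bus that tile t of group g reads from."""
    if len(selections) != len(groups):
        return False
    for group, picks in zip(groups, selections):
        if len(picks) != group.n_tiles:
            return False
        if any(pick not in group.mux_choices() for pick in picks):
            return False
    return True


groups = build_asic(n_tiles=2)
valid = check_config(groups, [
    ["input_bus", "output_bus_1"],      # second tile reads the fed-back bus
    ["output_bus_1", "output_bus_1"],
    ["output_bus_2", "output_bus_3"],
])
# A first-group tile cannot reach the third output bus in this model.
invalid = check_config(groups, [
    ["output_bus_3", "input_bus"],
    ["output_bus_1", "output_bus_1"],
    ["output_bus_2", "output_bus_3"],
])
```

The feedback choice is what lets a later tile in a group consume the result of an earlier tile in the same group, so that several CNN nodes can be mapped onto one group.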

In one embodiment, the ASIC further comprises an additional configurable multiplexer coupled to the first output bus, providing selected values to the first kernel processing tile of the second ordered set, and an additional configurable multiplexer coupled to the second output bus, providing selected values to the first kernel processing tile of the third ordered set. Also in one embodiment, the ASIC further comprises one or more auxiliary function tiles providing functions other than those of the kernel processing tiles. In one embodiment, the one or more auxiliary function tiles accept input from dual multiplexers connected to the input bus and to the first output bus, and provide output to the first output bus. And in one embodiment, the one or more auxiliary function tiles accept input from dual multiplexers connected to the first output bus and to the second output bus, and provide output to the second output bus.

In one embodiment of the invention, the one or more auxiliary function tiles accept input from dual multiplexers connected to the second output bus and to the third output bus, and provide output to the third output bus. Also in one embodiment, the ASIC further comprises external function circuitry that selects input from the first output bus through a configurable multiplexer and provides output to the first output bus. Also in one embodiment, the ASIC further comprises external function circuitry that selects input from the second output bus through a configurable multiplexer and provides output to the second output bus. In one embodiment, the ASIC further comprises external function circuitry that selects input from the third output bus through a configurable multiplexer and provides output to the third output bus. And in one embodiment, the common kernel size is a 3 by 3 kernel.

In one embodiment, the fixed number of parallel input connections is 16 and the fixed number of parallel output connections is 16. Also in one embodiment, the ordered stream of values is provided by one of a direct camera output of RGB values, a DMA interface adapted to be accessed from a CPU bus, or a video stream decompression circuit, producing three parallel channels of red, green, and blue (RGB) values for an image. Also in one embodiment, the ASIC further comprises combining the operation of two or more kernel processing tiles, each processing a 3 by 3 kernel with 16 parallel input and output connections, to configure a 3 by 3 kernel having more than 16 inputs, more than 16 outputs, or more than 16 inputs and outputs. In one embodiment, the ASIC is adapted by additional circuitry to compute a convolution having a kernel larger than 3 by 3 by combining the operation of a plurality of the 3 by 3 kernel processing tiles. And in one embodiment, the ASIC is adapted to compute a 5 by 5 convolution, a 7 by 7 convolution, or a 9 by 9 convolution.
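The arithmetic basis for combining tiles to handle more than 16 input channels can be checked numerically. The sketch below is an assumption-laden illustration in plain Python, not the disclosed circuit: it relies only on the fact that a convolution over 32 input channels equals the sum of two convolutions over 16 channels each. The function names, the test data, and the choice to sum the two partial results in software are all hypothetical; how the tiles actually forward and accumulate partial sums is not modeled here.

```python
TILE_CHANNELS = 16  # fixed number of parallel input connections per tile


def conv3x3_valid(image, weights):
    """3 by 3 'valid' convolution in the CNN sense (no kernel flip).

    image[c][r][col] and weights[c][i][j] are nested lists; the result is a
    single output plane summed over all input channels c.
    """
    rows, cols = len(image[0]), len(image[0][0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for c in range(len(image)):
        for r in range(rows - 2):
            for col in range(cols - 2):
                for i in range(3):
                    for j in range(3):
                        out[r][col] += image[c][r + i][col + j] * weights[c][i][j]
    return out


def add_planes(a, b):
    # Element-wise sum of two equally sized output planes.
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Deterministic integer test data: 32 channels of a 5 by 5 array.
image = [[[(c * 7 + r * 3 + col) % 11 for col in range(5)]
          for r in range(5)] for c in range(32)]
weights = [[[(c + 2 * i - j) % 5 - 2 for j in range(3)]
            for i in range(3)] for c in range(32)]

# Reference: one convolution over all 32 channels.
full = conv3x3_valid(image, weights)

# Two 16-channel "tiles", each handling half of the channels.
part_a = conv3x3_valid(image[:TILE_CHANNELS], weights[:TILE_CHANNELS])
part_b = conv3x3_valid(image[TILE_CHANNELS:], weights[TILE_CHANNELS:])
combined = add_planes(part_a, part_b)
```

Because the data are integers, `combined` and `full` agree exactly; the same linearity argument applies to splitting output channels across tiles, where no summation is needed at all.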

In one embodiment, the kernel processing tiles present each input channel value to a mass multiplier, which computes the full set of possible multiples of the input and provides the multiples, together with individual channel values from an auxiliary set of parallel connections, to individual convolution units, the resulting outputs from each convolution unit being grouped into a set of 16 parallel output connections and made available to other kernel processing tiles on the output bus. In one embodiment, kernel processing tile input values are processed by local two-input fixed multipliers. Also in one embodiment, individual ones of the auxiliary function tiles receive parallel input connections from two independent buses through a dual multiplexer and output one of a MaxPool function, an average function, a sampling function, and an expansion function, selected by an output multiplexer. In one embodiment, the auxiliary function tile sums values from the parallel input connections by individual channel and multiplexes the summed values to a lookup table adapted to provide any activation function that can be expressed in tabular form, including a ReLU, sigmoid, or hyperbolic tangent activation function. In one embodiment, the input channels received from the independent parallel input connections are inputs to a first dedicated multiplexer that concatenates the parallel input connections, effectively rerouting data channels into particular parallel output connections, and provides the concatenated output to the output multiplexer as a candidate for selection as the output of the auxiliary function tile. And in one embodiment, the input channels received from the independent parallel input connections are inputs to a second dedicated multiplexer that concatenates two parallel input connections into one parallel output connection by alternating samples from each connection, and provides the result to the output multiplexer as a candidate for selection as the output of the auxiliary function tile.
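The idea behind the multiplier stage described above, in which the full set of possible multiples of an input value is computed once and then shared, can be sketched in software as follows. This is a minimal behavioral illustration under stated assumptions: weights are taken to be unsigned 4-bit integers, the multiples are built by repeated addition, and the names `mass_multiply` and `select_products` are hypothetical. The disclosed hardware may form the multiples quite differently.

```python
def mass_multiply(x, weight_bits=4):
    """Return [0*x, 1*x, ..., (2**weight_bits - 1)*x] using additions only."""
    multiples = [0]
    for _ in range(1, 2 ** weight_bits):
        multiples.append(multiples[-1] + x)
    return multiples


def select_products(multiples, weights):
    # Each consumer picks the precomputed multiple indexed by its own weight,
    # so no per-weight multiplier is needed downstream.
    return [multiples[w] for w in weights]


x = 13                    # one input channel value
weights = [3, 0, 15, 7]   # assumed weights of four consumers of this input
multiples = mass_multiply(x)
products = select_products(multiples, weights)
```

The design point this illustrates is that the cost of forming the multiples is paid once per input value, however many weights are applied to that value.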

101, 101a: source multiplicand/source channel

101b, 101c, 101d: source channel

102a, 102b, 102c, 102d: dedicated mass multiplier

103a, 103b, 103c, 103d, 203a, 203b, 203c, 203d: computation unit

104, 204: processing

201a, 201b, 201c, 201d, 1008, 1010, 1011, 1012, 1106, 1303, 1402, 4312, 4403, 4404, 4407, 4409, 4412, 4803, 4905, 4907, 4912, 4913: multiplexer

302a, 302b, 302c, 302d: bit term/unused term/term

302e, 401: term

303a, 303b, 303c, 303d, 303e, 303f: product/unused product

304: multiple

402a, 402b, 402c, 402d, 402e, 402f, 605a, 605b, 605c, 605d, 605e, 605f: product

501a, 501b: pipeline register/reserved term

502: pair

503: reserved pair

504, 601: triplet

505: quadruplet

506a, 506b, 506c, 506d, 506e, 506f: product/multiple

507: optimal logic circuit

603: septuplet

604: octuplet

701: input channel set

702: control signal/control circuitry

703: common circuitry

704a, 704b, 704c: sub-function computation circuit

705: output channel set/output stream sink

801a, 801b, 801c, 4603: mass multiplier

802: source channel products

803: control circuitry/common control and synchronization circuitry/control

804: rowSrc counter

805: colSrc counter

806: rowDst counter

807: colDst counter

808: first row signal (ROWFST)

809: last row signal (ROWLST)

810: first column signal (COLFST)

811: last column signal (COLLST)

812: post-processing enable signal (POSTEN)

813: output enable signal (DSTEN)

814: source control signal

815: output channel control and counter

901: compositor/compositor circuit/cell

902a, 902b, 902c, 903a, 903b, 903c, 904, 905a, 905b, 905c, 906, 907a, 907b, 907c: compositor/compositor circuit

908a, 908b, 908c, 908d, 908e, 908f: delay stage/delay/intra-row delay line

909, 910a, 910b: delay stage/delay/truncated output delay line/final truncated result delay line/truncated delay line/final result delay line

911: finalization step/finalization function/finalization/final processing

1001: source input set of stream values

1002: compositor immediately to the left

1003: row of compositors immediately above

1004, 1005, 1006, 1007, 2709: circuitry

1009, 3201, 3304: compositor

1101, 1102: row

1103, 1111: row/path

1104, 1105, 3301, 3401, 3402, 3501: register

1107, 2403: first-in, first-out (FIFO) circuit

1109, 1110: parallel access register/delayed result

1112, 1113: parallel path

1114: value

1201: auxiliary output

1202, 1304, 2405, 2409, 2504, 2602, 2605, 2703, 2704, 3302, 3305, 4202, 4203, 4204, 4205, 4208, 4209, 4304, 4305, 4306, 4308, 4309, 4311, 4705, 4706, 4902, 5003: FIFO

1301: partial result

1401: truncated result delay line

1403: finalization circuitry

1404: final form of output stream

1701, 1801, 2711, 5001: input channel

1702, 1802: first 7 by 7 convolution node

1703, 1704, 1705, 1803, 1804, 1805: subsequent convolution node

1706, 1806: concatenation node

1707: MaxPool node

1708: global average node

1709: output channel

1807: first MaxPool node

1808, 1809, 1810, 1811: node

1812: second 2 by 2 MaxPool node

1813: final output

2101: set of four inputs

2102: current set of four inputs

2103, 2104, 2105: kernel row

2106, 2107, 2108: circuit/kernel copy

2109: output stream

2201, 2202, 2203, 2301, 2302, 2303, 2501, 2901, 2902, 2903: input set

2204, 2205, 2206, 2207: kernel circuit

2208, 2908: 4-up output array stream

2304, 2305, 2306, 2307: kernel processing circuit

2308: inset ("valid") convolution output/circuit

2309, 2310, 2311, 2312, 2712, 2803, 2804, 2805, 2806, 2807, 2808: circuit

2313: full ("same") convolution output

2401: 4-up data stream

2402, 2404, 2408, 2503: comparator

2406: output set

2407: 2-up data stream

2410: single set of output channels

2502, 2802, 3005: current input set

2506, 2507, 2508, 2509: comparator block

2510, 2706, 2713, 3003, 3007, 4708, 4813, 4915, 5004, 5415: output

2601: 4-up stream

2603: 2-up stream

2604: 3-up stream

2606: 5-up stream

2701, 2702: source

2705: interleave circuit

2707, 3001, 5101, 5401: input

2708: weight

2710: single 1-up set of output channels

2801: register/previous input set

2809: output array stream

2904, 2905, 2906, 2907: summing circuit

3002: routing circuitry

3004: previous input set

3006: repacking circuitry

3100: system

3101, 3105, 3106, 3107, 3108: IC

3102: input port

3103: output port

3104: functional circuit

3202: row buffer FIFO

3203: plane buffer FIFO

3204: final processing circuit

3303: bus

3403, 3502, 3503, 3504, 3505, 3506: compositor set

3601: input array stream

3602: aperture function IC circuit

3603, 4003, 4102: context management circuit

3604: final output stream

3701: first stream

3702: second stream

3703: store-and-forward multiplexer/interleaving circuitry

3704: context circuit

3705, 4103: final output

3801: first unscaled sub-stream

3802, 3803, 3804, 3901, 3902: sub-stream

4001: cascaded multi-scale sampler

4002: time-interleaved sequence/interleaved sample stream/multi-scale sample stream

4004: interleaved multi-scale array stream

4101: interleaved input stream

4201: 2:1 sampler

4206, 4210, 4307: slicing multiplexer

4207: […]:1 sampler

4301: U:1 sampler

4302: V:1 sampler

4303: W:1 sampler

4310: 1:T upsampler

4401: primary input

4402: input bus

4405, 4501, 4502: kernel processing tile

4406: output bus

4408: auxiliary function tile

4410: external function circuitry

4411: 16 parallel input connections

4413: primary output circuitry

4503: first kernel processing tile

4504: second kernel processing tile

4505, 4506, 4507, 4508: tile

4601: parallel input connection set

4602: auxiliary connection set

4604: convolution unit

4605: parallel output connection

4701: input product

4702: auxiliary input

4703: internal bus

4704: weight summing cell

4707, 4806, 4810, 4811, 4904: adder

4801: presented input channel multiples

4802: configuration register/configurable index

4804: configuration register/configurable scale

4805: variable shift register/scaler

4807: configured bias value

4808: delayed value/delayed connection

4809: forwarded partial sum/forwarded value

4812: final adder

4901: primary parallel input connection set

4903: auxiliary input

4906: common lookup table

4908: MaxPool function block

4909: average function block

4910: sampling function block

4911: expansion function block

4914, 5102: configurable multiplexer

5002: tile function circuitry

5103: external output

5402, 5404, 5406, 5408, 5410, 5413: convolution

5403: sampling

5405, 5407, 5412: MaxPool

5409: expansion

5411: concatenation

5414: average

q₀: first parallel channel set/first output/value/output position/output set

q₁: parallel channel set/value/output set

q₂, q₃: parallel channel set/value

p₀, p₁, p₂, p₃: input channel

w_0,0-w_2,2, W_0,2,2: weight

W_2,2,2: last weight

R₀-R₁₆: row

FIG. 1 illustrates an embodiment in which the mass multipliers applied to each common source are fixed and wired directly into the processing circuitry.

FIG. 2 illustrates an embodiment in which the mass multipliers applied to each common source are dynamic and routed to the processing circuitry through multiplexers.

FIG. 3 illustrates a simple embodiment in which shifted terms corresponding to the bits set in each mass multiplier are summed to form a product.

FIG. 4 illustrates an enhanced embodiment in which addition and subtraction of shifted terms from each other are mixed to form an equivalent solution of lower complexity.

FIG. 5A illustrates a pipelined embodiment that maximizes clock frequency by building sub-compositions from pairwise operations only.

FIG. 5B illustrates an embodiment in which the multiples are formed directly by a fixed set of cases without reference to standard arithmetic operations.

FIG. 6 illustrates a pipelined embodiment that maximizes circuit density by building sub-compositions from up to four operations.

FIG. 7 is a diagram illustrating structure and connectivity, in an embodiment of the invention, for receiving an input stream, preprocessing the input stream, and feeding the results through a unique digital device to produce an output stream.

FIG. 8A is a diagram illustrating structure and connectivity for producing source channel products.

FIG. 8B is a diagram illustrating additional detail of control apparatus and functions in an embodiment of the invention.

FIG. 9A is a partial illustration of a general case of pipelined operation in an embodiment of the invention.

FIG. 9B is another partial illustration of the general case of pipelined operation in an embodiment of the invention.

FIG. 9C is another partial illustration of the general case of pipelined operation in an embodiment of the invention.

FIG. 10A is a diagram illustrating the internal structure of compositors 905a, 905b, and 905c of FIG. 9A and FIG. 9B in an embodiment of the invention.

FIG. 10B is a diagram illustrating the internal structure of compositors 902a, 902b, and 902c of FIG. 9A and FIG. 9B in an embodiment of the invention.

FIG. 10C is a diagram illustrating the internal structure of compositor 904 of FIG. 9A in an embodiment of the invention.

FIG. 10D is a diagram illustrating the internal structure of compositor 901 of FIG. 9A in an embodiment of the invention.

FIG. 10E is a diagram illustrating the internal structure of compositors 903a, 903b, and 903c of FIG. 9B and FIG. 9C in an embodiment of the invention.

FIG. 10F is a diagram illustrating the internal structure of compositors 907a, 907b, and 907c of FIG. 9A and FIG. 9B in an embodiment of the invention.

FIG. 10G is a diagram illustrating the internal structure of compositor 906 of FIG. 9A in an embodiment of the invention.

FIG. 11 is a diagram describing the internal structure and function of delay stages 908a, 908b, 908c, 908d, 908e, and 908f of FIG. 9C in an embodiment of the invention.

FIG. 12 is a diagram illustrating operation of delay stage 909 of FIG. 9C in an embodiment of the invention.

FIG. 13 is a diagram illustrating operation of delay stages 910a and 910b of FIG. 9C in an embodiment of the invention.

FIG. 14 is a diagram illustrating operation of finalization step 911 of FIG. 9C.

圖15為繪示實施5乘5卷積節點的本發明之實施例中的管線操作之特定情況的圖。 FIG. 15 is a diagram showing a specific case of pipeline operation in an embodiment of the present invention implementing a 5x5 product node.

圖16繪示了本發明之實施例中的用於4×4孔徑函數之IC。 FIG. 16 shows an IC for a 4×4 aperture function in an embodiment of the present invention.

圖17A繪示了具有實施個別地串流傳輸輸入通道之深度神經網路之一部分的電路系統之IC。 FIG. 17A illustrates an IC having a circuit system that implements a portion of a deep neural network that streams input channels individually.

圖17B繪示了具有實施深度神經網路之另一部分之電路系統的IC。 FIG17B shows an IC having a circuit system that implements another portion of a deep neural network.

圖18A繪示了具有實施同時串流傳輸四個輸入通道之深度神經網路之一部分的電路系統之IC。 FIG. 18A illustrates an IC having a circuit system that implements a portion of a deep neural network that streams four input channels simultaneously.

圖18B繪示了實施圖18A之深度神經網路之另一部分的電路系統。 FIG18B shows a circuit system that implements another portion of the deep neural network of FIG18A.

圖19為繪示圖17A及圖17B之DNN之陣列串流大小的表。 FIG19 is a table showing the array stream sizes of the DNNs of FIG17A and FIG17B.

圖20為繪示圖18A及圖18B之DNN之陣列串流大小的表。 FIG. 20 is a table showing the array stream sizes of the DNNs of FIG. 18A and FIG. 18B .

圖21繪示了執行同時串流傳輸四個輸入通道之3乘3卷積節點的IC之電路系統。 Figure 21 shows the circuit system of an IC that implements a 3x3 convolution node that streams four input channels simultaneously.

圖22繪示了用於針對「相同」版本之3乘3卷積之4-up輸入通道產生輸出的電路之所需配置。 Figure 22 shows the required configuration of the circuit used to generate the output for the 4-up input channel of the "same" version of the 3x3 convolution.

圖23繪示了用於輸出同時串流傳輸四個輸入通道之1列乘7行卷積之二個變型的電路之所需配置。 Figure 23 shows the required configuration of the circuit for outputting two variations of the 1-row by 7-row product of four input channels for simultaneous streaming.

圖24A展示了在4-up資料串流上的2乘2 MaxPool節點之配置。 Figure 24A shows the configuration of a 2x2 MaxPool node on a 4-up data stream.

圖24B展示了在2-up資料串流上的圖24A之2乘2 MaxPool節點之配置。 Figure 24B shows the configuration of the 2x2 MaxPool nodes of Figure 24A on a 2-up data stream.

圖25繪示了其中不可能減小N之所設想實例。 Figure 25 shows a hypothetical example where it is not possible to reduce N.

圖26A繪示了用於將4-up串流重新封裝成2-up串流之FIFO電路。 Figure 26A shows the FIFO circuit used to repack a 4-up stream into a 2-up stream.

圖26B繪示了將3-up串流重新封裝成5-up串流。 Figure 26B shows the repackaging of a 3-up stream into a 5-up stream.

圖27A繪示了串接節點之實施方式，使得輸出包含來自所有源之所有通道。 Figure 27A shows an implementation of concatenating nodes so that the output includes all channels from all sources.

圖27B繪示了4-up密集節點之實施方式。 Figure 27B shows the implementation of a 4-up dense node.

圖27C繪示了4-up全域平均節點之實施方式。 Figure 27C shows the implementation of the 4-up global average node.

圖28繪示了3乘3局部平均節點之4-up實施方式。 Figure 28 shows a 4-up implementation of a 3x3 local average node.

圖29繪示了3乘3局部平均節點之另一4-up實施方式。 Figure 29 shows another 4-up implementation of a 3x3 local averaging node.

圖30A繪示了4-up子組節點之實施方式。 Figure 30A shows the implementation of a 4-up subgroup node.

圖30B繪示了4-up剪裁節點之典型實施方式。 Figure 30B shows a typical implementation of a 4-up pruning node.

圖31繪示了實施神經網路之互連IC之系統。 Figure 31 shows a system of interconnected ICs implementing a neural network.

圖32描繪了在積體電路上的合成器之配置，其經組態以在二十七個個別資料樣本上實施作為3D孔徑函數之3乘3乘3卷積。 Figure 32 depicts the arrangement of the synthesizer on an integrated circuit, which is configured to implement a 3x3x3 convolution as a 3D aperture function over twenty-seven individual data samples.

圖33繪示了IC，其中來自多個平面之資料可經同時緩衝且呈現，使得用於多個平面之權重可由單個合成器應用。 Figure 33 illustrates an IC in which data from multiple planes can be buffered and presented simultaneously so that weights for multiple planes can be applied by a single synthesizer.

圖34描繪了應用於4-up輸入串流之典型3乘3乘3卷積之實施方式。 Figure 34 illustrates a typical 3x3x3 convolution implementation applied to a 4-up input stream.

圖35繪示了應用於4-up資料串流之IC的經完全翻轉實施方式。 Figure 35 shows a fully inverted implementation of the IC applied to a 4-up data stream.

圖36繪示了在本發明之實施例中的孔徑函數IC電路至有序樣本之輸入陣列串流3601的應用。 FIG. 36 illustrates the application of an aperture function IC circuit to an input array stream 3601 of ordered samples in an embodiment of the present invention.

圖37繪示了本發明之實施例中的接收二個獨立輸入陣列串流之實例。 FIG. 37 illustrates an example of receiving two independent input array streams in an embodiment of the present invention.

圖38繪示了本發明之實施例中的用於一系列四次2：1資料減小的完全資料列及經縮小資料列之序列。 FIG. 38 illustrates a sequence of full data rows and reduced data rows for a series of four 2:1 data reductions in an embodiment of the present invention.

圖39繪示了本發明之實施例中的用於[…]：1之無理縮放的完全列及經縮小列之序列。 FIG. 39 illustrates a sequence of full rows and reduced rows for irrational scaling of […]:1 in an embodiment of the present invention.

圖40繪示了在本發明之實施例中的全尺度資料及經縮小資料至孔徑函數的應用。 FIG. 40 illustrates the application of full-scale data and downscaled data to an aperture function in an embodiment of the present invention.

圖41繪示了本發明之實施例中的後續CNN節點對交錯串流之處理。 Figure 41 shows the processing of interleaved streams by subsequent CNN nodes in an embodiment of the present invention.

圖42A繪示了本發明之實施例中的產生交錯串流所需之取樣及切分邏輯。 FIG. 42A illustrates the sampling and slicing logic required to generate an interleaved stream in an embodiment of the present invention.

圖42B繪示了本發明之實施例中的產生交錯串流所需之取樣及切分邏輯之另一例子。 FIG. 42B shows another example of the sampling and slicing logic required to generate an interleaved stream in an embodiment of the present invention.

圖43A繪示了本發明之實施例中的自各種縮小之串流產生交錯串流所需的子取樣及切分邏輯。 FIG. 43A illustrates the sub-sampling and slicing logic required to generate an interleaved stream from various scaled-down streams in an embodiment of the present invention.

圖43B繪示了在另一實施例中產生交錯多尺度樣本串流。 FIG. 43B illustrates the generation of an interleaved multi-scale sample stream in another embodiment.

圖43C繪示了在又一實施例中產生交錯多尺度樣本串流。 FIG. 43C illustrates the generation of an interleaved multi-scale sample stream in yet another embodiment.

圖44為本發明之實施例中的ASIC之圖。 Figure 44 is a diagram of an ASIC in an embodiment of the present invention.

圖45繪示了本發明之實施例中的用於提供更多輸入及輸出之塊之配置。 FIG. 45 illustrates an arrangement of tiles for providing more inputs and outputs in an embodiment of the present invention.

圖46繪示了本發明之實施例中的核心處理塊之內部結構。 Figure 46 shows the internal structure of a kernel processing tile in an embodiment of the present invention.

圖47展示了本發明之實施例中的卷積單元之內部結構。 Figure 47 shows the internal structure of the convolution unit in an embodiment of the present invention.

圖48描繪了本發明之實施例中的核心處理塊之單個求和胞元。 Figure 48 depicts a single summing cell of a kernel processing tile in an embodiment of the present invention.

圖49描繪了本發明之實施例中的輔助功能塊之內部結構。 Figure 49 depicts the internal structure of an auxiliary function tile in an embodiment of the present invention.

圖50描繪了本發明之實施例中的實施包括在輔助塊中之小塊函數中之任一者所需的元件。 FIG. 50 depicts the elements required to implement any of the patch functions included in the auxiliary tile in an embodiment of the present invention.

圖51描繪了本發明之實施例中的至外部電路系統之連接之配置。 FIG. 51 illustrates the configuration of the connection to the external circuit system in an embodiment of the present invention.

圖52描繪了本發明之實施例中的用於運算5乘5卷積之四個3乘3核心處理塊之配置。 FIG. 52 illustrates an arrangement of four 3x3 kernel processing tiles for computing a 5x5 convolution in an embodiment of the present invention.

圖53描繪了本發明之實施例中的以實施9乘9核心之方式配置的九個3乘3核心處理塊之配置。 FIG. 53 depicts an arrangement of nine 3x3 kernel processing tiles configured to implement a 9x9 kernel in an embodiment of the present invention.

圖54繪示了本發明之實施例中的功能完整之卷積神經網路之抽象實例。 FIG. 54 illustrates an abstract example of a fully functional convolutional neural network in an embodiment of the present invention.

較佳實施例之詳細說明：廣泛多種影像及資料演算法廣泛使用線性代數之矩陣形式來證明命題且計算結果。在本申請案中，「演算法」意謂在計算或其他問題求解運算中尤其由電腦遵守之過程或規則組。在本申請案中，演算法並不普遍地解釋為軟體。本申請案中所描述之演算法可且通常較佳地在硬體中實施。 DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS: A wide variety of image and data algorithms make extensive use of the matrix form of linear algebra to prove statements and compute results. In this application, "algorithm" means a process or set of rules followed, especially by a computer, in a computation or other problem-solving operation. In this application, an algorithm is not generally interpreted as software. The algorithms described in this application may be, and are often preferably, implemented in hardware.

矩陣運算經定義為一或多個維度之正交集合，並且通常經構想為在各給定維度之每次迭代中具有相同數目個元素。藉助於實例，M乘N矩陣經常由值陣列描繪，諸如：

Matrix operations are defined as orthogonal sets of one or more dimensions, and are usually conceived to have the same number of elements in each iteration of each given dimension. By way of example, an M by N matrix is often depicted by an array of values, such as:
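By way of illustration only, a generic M by N array of values may be written as follows. The entry names a with row and column subscripts are an assumption introduced here for notation and do not appear elsewhere in this description.

```latex
A =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,N} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,N} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{M,1} & a_{M,2} & \cdots & a_{M,N}
\end{bmatrix}
```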

在概念上，矩陣可具有任何數目個維度，並且矩陣可經描繪為展示各維度之值的表組。 Conceptually, a matrix can have any number of dimensions, and can be depicted as a set of tables showing the values of each dimension.

形式為M乘1或1乘N之矩陣之子組可稱為向量，該等向量具有其自有的特定屬性及所定義之運算且廣泛用於2D及3D圖形模擬中。 Subsets of matrices of the form M by 1 or 1 by N are called vectors, which have their own specific properties and defined operations and are widely used in 2D and 3D graphics simulations.

形式為1乘1之矩陣之簡併子組可稱為純量，並且構成熟習此項技術者相當熟悉之數字。 Degenerate subsets of matrices of the form 1 by 1 are called scalars, and form numbers that are quite familiar to those skilled in the art.

當矩陣之值為常數且矩陣具有相容維度時，諸如乘法之某些運算為明確界定的。3乘4矩陣A可乘以4乘5矩陣B以形成3乘5矩陣C，此可通常寫為：A×B=C

Certain operations, such as multiplication, are well-defined when the values of the matrices are constants and the matrices have compatible dimensions. A 3x4 matrix A can be multiplied by a 4x5 matrix B to form a 3x5 matrix C, which can be written generally as: A × B = C

然而，運算B×A並未明確界定，此係因為內部維度並不匹配(5≠3)，並且k將不具有與B及A之索引相容之單個範圍。 However, the operation B × A is not well defined because the internal dimensions do not match (5 ≠ 3) and k will not have a single range that is compatible with the indices of B and A.
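Written element by element, the product of the 3 by 4 matrix A and the 4 by 5 matrix B may be expressed as below. The index names i, j and k are assumed here for illustration; k is the summation index over the shared inner dimension.

```latex
C_{i,j} = \sum_{k=1}^{4} A_{i,k}\, B_{k,j},
\qquad 1 \le i \le 3, \quad 1 \le j \le 5
```

For B × A, the same index k would have to run from 1 to 5 over the columns of B and from 1 to 3 over the rows of A at once, which is why no single range for k exists.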

其元素為向量之矩陣或其他矩陣被稱為張量(自其導出TensorFlow之名稱)。張量之熟悉形式可為RGB影像。RGB影像之一個形式為HDMI圖框，作為RGB值之1080乘1920矩陣，其中各像素為色彩分量之3乘1向量。像素被視為真向量，此係因為紅色分量之線性運算不會影響綠色或藍色，且反之亦然。 A matrix whose elements are vectors or other matrices is called a tensor (the term from which TensorFlow takes its name). A familiar form of tensor is an RGB image. One form of RGB image is an HDMI frame, a 1080 by 1920 matrix of RGB values in which each pixel is a 3 by 1 vector of color components. A pixel is considered a true vector because a linear operation on the red component does not affect green or blue, and vice versa.

HDMI圖框通常不被視為五維矩陣，此係因為影像中之像素之位置的處理與色彩之處理不相關。藉由丟棄影像之不感興趣之部分來剪裁影像為有效且相當有意義的，但不存在對應操作來剪裁色彩分量。同樣地，可存在對色彩之許多操作，該等操作具有容易理解的效應，該等效應若應用於包含陣列之元素則將為無意義的。因此，HDMI圖框明顯為2,3張量而非5D陣列。 An HDMI frame is not usually considered a five-dimensional matrix, because the handling of pixel positions in an image is unrelated to the handling of color. Cropping an image by discarding its uninteresting parts is valid and quite meaningful, but there is no corresponding operation for cropping color components. Likewise, there may be many operations on color that have well-understood effects but would be meaningless if applied to the elements of the containing array. An HDMI frame is therefore clearly a 2,3 tensor rather than a 5D array.

存在已知可表示為矩陣運算之許多影像處理演算法。矩陣運算為表示重複運算之簡潔方式，並且矩陣數學規則有助於證明特定命題。 There are many image processing algorithms that are known to be expressible as matrix operations. Matrix operations are a concise way to represent repeated operations, and the rules of matrix mathematics help prove certain statements.

在通用電腦處理器上執行基於矩陣之演算法通常藉由循環機制來實現，並且電腦語言及硬體CPU二者可具有使此類循環有效之特徵。然而，矩陣定義之數學本質上並不要求藉由任何特定方法或規劃來執行運算以便運算正確結果。 Matrix-based algorithms on general-purpose computer processors are often implemented using looping mechanisms, and both computer languages and hardware CPUs may have features that make such loops efficient. However, the mathematical nature of the matrix definition does not require that operations be performed by any particular method or plan in order to compute correct results.

影像處理及認知之現代混合體為卷積神經網路(CNN)。雖然多年來訓練此類網路一直極具挑戰性，但實際上執行經訓練網路卻相對簡單。 The modern hybrid of image processing and cognition is the convolutional neural network (CNN). Although training such networks has been extremely challenging for many years, actually executing a trained network is relatively simple.

在CNN中，各卷積輸出元素藉由使獨立核心通過輸入張量上方來操作，以產生輸出張量之各分量。通常，當神經網路用於處理影像時，網路之第一層對影像之RGB像素之輸入陣列進行操作，並且產生相關大小之輸出陣列，該輸出陣列包含在結構上與輸入分量之RGB向量不相關的輸出分量之任意向量。輸出向量分量通常經描述為特徵或啟動，並且表示各核心之回應強度(識別程度)。CNN中之後續層將來自先前層之輸出用作其輸入，因此僅第一層作用於像素值；所有其餘部分作用於特徵以產生更多特徵。卷積之各輸出特徵為不相關的，並且與每個其他特徵不同，正如色彩分量彼此不同一樣。 In a CNN, each convolution output element operates by passing an independent kernel over the input tensor to produce one component of the output tensor. Typically, when a neural network is used to process images, the first layer of the network operates on an input array of the RGB pixels of the image and produces an output array of related size containing an arbitrary vector of output components that is structurally unrelated to the RGB vector of input components. The output vector components are generally described as features or activations, and represent the strength of the response (degree of recognition) of each kernel. Subsequent layers in the CNN take the output of the previous layer as their input, so only the first layer acts on pixel values; all the rest act on features to produce more features. Each output feature of a convolution is unrelated to and distinct from every other feature, just as the color components are distinct from one another.

CNN層之常見形式為3乘3卷積。在運算中，將恆定權重之3乘3核心逐元素地應用於輸入張量(亦即，影像)之各特定位置；亦即，將權重中之各者乘以影像中同一相對位置處之像素分量，並且對乘積求和以產生彼位置之輸出之單個分量。偏置常數(其可為零)提供初始值以促進對模型求解以達到最佳權重值。 The common form of a CNN layer is a 3x3 convolution. In the operation, a 3x3 kernel of constant weights is applied element-wise to each specific position of the input tensor (i.e., image); that is, each of the weights is multiplied by the pixel component at the same relative position in the image, and the products are summed to produce a single component of the output at that position. The bias constant (which can be zero) provides an initial value to facilitate solving the model to achieve the optimal weight value.

若存在三個輸入分量，如在RGB影像中存在的那樣，則存在待應用於各分量值(在第一層之情況下，色彩)之3乘3個權重之三個不同組，但僅存在單個初始偏置。3乘3乘3個權重加上偏置之各卷積形成對應於3×3像素小塊之中心處之位置的單個輸出分量值。各輸出通道依次應用其自有的27個權重值，直至給定小塊(在與輸出位置相同之位置處且對應於核心權重之相對位置的輸入分量之子組)之所有輸出分量已運算出。卷積通常具有64個至256個輸出分量，該等輸出分量中之各者具有一組獨特的特定27個權重加上偏置。 If there are three input components, as in an RGB image, there are three distinct sets of 3x3 weights to apply to the respective component values (colors, in the case of the first layer), but only a single initial bias. Each convolution of 3x3x3 weights plus bias forms a single output component value corresponding to the position at the center of a 3x3 pixel patch. Each output channel applies its own 27 weight values in turn, until all output components for a given patch (the subset of input components at the same position as the output position and at the relative positions corresponding to the kernel weights) have been computed. A convolution commonly has 64 to 256 output components, each of which has its own unique set of 27 weights plus a bias.

在此實例中，各核心將其27個權重與3個RGB分量之9個像素之同一小塊相乘。對於64個輸出分量之相對較小組，將各個別輸入分量乘以64個任意且不相關的權重。在運算出各小塊之輸出分量之後，自影像加載鄰近小塊，並且再次應用核心之全組權重。此過程繼續，直至到達影像之右邊緣為止，並且小塊下降一列且自左邊緣開始。 In this example, each kernel multiplies its 27 weights by the same patch of 9 pixels of 3 RGB components. For a relatively small set of 64 output components, each individual input component is multiplied by 64 arbitrary and unrelated weights. After the output components for a patch have been computed, the adjacent patch is loaded from the image and the full set of kernel weights is applied again. This process continues until the right edge of the image is reached, whereupon the patch drops down one row and starts over from the left edge.
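The conventional, output-centric ordering described above may be sketched in Python as follows, for a single output channel. This is a minimal illustration and not the circuit of the invention: the function name `conv3x3_valid`, the nested-list data layout, and the restriction to patches lying wholly inside the image are assumptions made here for brevity.

```python
def conv3x3_valid(image, weights, bias):
    """Output-centric 3x3 convolution for ONE output channel (hypothetical helper).

    image:   image[row][col][channel], an R x C x CH nested list
    weights: weights[dy][dx][channel], a 3 x 3 x CH nested list
    bias:    scalar used as the initial value of every sum
    Only patches lying wholly inside the image are evaluated (an assumption),
    so the result has (R - 2) rows and (C - 2) columns.
    """
    rows, cols, chans = len(image), len(image[0]), len(image[0][0])
    out = []
    for r in range(rows - 2):              # the patch drops down one row at a time
        out_row = []
        for c in range(cols - 2):          # the patch moves from left to right
            acc = bias                     # the bias provides the initial value
            for dy in range(3):
                for dx in range(3):
                    for ch in range(chans):
                        # each weight meets the input at the same relative position
                        acc += weights[dy][dx][ch] * image[r + dy][c + dx][ch]
            out_row.append(acc)
        out.append(out_row)
    return out
```

Note that every input value is read again for each patch that overlaps it, which is the redundancy the later paragraphs set out to remove.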

在處理第一層之後，下一卷積層處理第一層之輸出作為至第二層之輸入。因此，3乘3卷積現具有待應用於小塊之3乘3乘64個輸入分量的3乘3乘64個權重。若此層具有256個輸出，則必須針對各輸出位置執行3×3×64×256=147,456次乘法。熟習此項技術者應理解，此係針對可包含大於40層之深度神經網路中的單層。 After the first layer has been processed, the next convolutional layer processes the output of the first layer as the input to the second layer. The 3x3 convolution therefore now has 3x3x64 weights to apply to the 3x3x64 input components of the patch. If this layer has 256 outputs, then 3×3×64×256=147,456 multiplications must be performed for each output position. Those skilled in the art will understand that this is for a single layer in a deep neural network that may contain more than 40 layers.
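The multiplication counts quoted above follow directly from the layer dimensions. A short Python check is given below; the helper name `multiplies_per_output_position` is hypothetical and introduced only for this illustration.

```python
def multiplies_per_output_position(kernel_h, kernel_w, in_channels, out_channels):
    """Multiplications needed to produce every output channel at one position."""
    return kernel_h * kernel_w * in_channels * out_channels


# First layer of the example: 3x3 kernel, 3 RGB inputs, 64 output components.
first_layer = multiplies_per_output_position(3, 3, 3, 64)

# Second layer of the example: 3x3 kernel, 64 inputs, 256 outputs.
second_layer = multiplies_per_output_position(3, 3, 64, 256)
```

Under these dimensions `first_layer` evaluates to 1,728 and `second_layer` to 147,456, the figure stated in the text.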

應用於小塊之各元素的乘法之數目等於層中之通道之數目。在標準CPU中，此等乘法必須以某一序列進行。許多現代CPU具有同時執行多組乘法之能力，尤其在資料格式較小(亦即，8位元)之情況下。在GPU或TPU中，可用乘法器之數目高得多，但各乘法器經設計以自二個不同且不受限制的因數產生乘積。 The number of multiplications applied to each element of a patch is equal to the number of channels in the layer. In a standard CPU, these multiplications must be performed in some sequence. Many modern CPUs can perform several sets of multiplications simultaneously, especially when the data format is small (i.e., 8 bits). In a GPU or TPU, the number of available multipliers is much higher, but each multiplier is designed to produce a product from two distinct and unrestricted factors.

在當前技術處理器中，CPU、TPU或GPU並不利用以下簡單事實：在CNN實施方式中，用於乘法之因數中之一者對於在針對小塊之處理期間應用於輸入通道之所有權重為公共的。 In current-art processors, whether CPU, TPU or GPU, no advantage is taken of the simple fact that, in a CNN implementation, one of the factors of multiplication is common to all the weights applied to an input channel during the processing of a patch.

在本申請案中，本案發明人提出了一種大量乘法器，該大量乘法器在單個步驟中執行所有乘法，否則該等乘法習知地順序地進行。當一組乘法之權重均具有某一小精確度(對於TPU，通常為8位元)時，存在有限數目(2⁸=256)個不同權重以及公共輸入之對應數目個不同倍數(其可具有任何大小；不管公因數之精確度如何，當應用8位元權重時，仍僅存在256個可能的倍數)。在此情況下，實施用比相同數目個不受限制的乘法器少得多的元素同時產生所有所需輸出之電路存在明顯優勢。 In this application, the inventors propose a mass multiplier that performs, in a single step, all the multiplications that are otherwise conventionally performed sequentially. When the weights of a set of multiplications all have some small precision (8 bits is typical for a TPU), there is a limited number (2⁸=256) of distinct weights, and a corresponding number of distinct multiples of the common input (which may be of any size; regardless of the precision of the common factor, there are still only 256 possible multiples when 8-bit weights are applied). In this case there is a clear advantage in implementing a circuit that produces all the required outputs at once with far fewer elements than the same number of unrestricted multipliers.
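The observation that 8-bit weights admit only 256 distinct multiples of a common input may be modeled in software as below. This is a behavioral sketch only: the names `mass_multiply` and `apply_weights` are hypothetical, weights are assumed unsigned, and Python's `*` operator merely stands in for whatever circuit forms the multiples.

```python
def mass_multiply(x, weight_bits=8):
    """Return every multiple of x by an unsigned weight of weight_bits bits.

    The list has 2**weight_bits entries regardless of how large x is.
    """
    return [x * w for w in range(1 << weight_bits)]


def apply_weights(x, weights):
    """Apply many weights to one common input x.

    The multiples are formed once per input value; each weight then only
    selects an entry, so no further multiplication is performed per weight.
    """
    multiples = mass_multiply(x)
    return [multiples[w] for w in weights]
```

For example, `apply_weights(7, [0, 1, 255, 3])` selects the entries 0, 7, 1785 and 21 from a single table of 256 multiples of 7, however many weights are supplied.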

在本發明之實施例中，等效大量乘法器專用於單個輸入通道，並且並不始終共用。因此，運算具有使用若干時脈循環及多個暫存器級之選項。此使得運算採用極簡單且有效的形式，而不影響系統之總體輸貫量。 In an embodiment of the present invention, an equivalent mass multiplier is dedicated to a single input channel and is not always shared. The computation therefore has the option of using several clock cycles and multiple register stages. This allows the computation to take a very simple and efficient form without affecting the overall throughput of the system.

在其中單個動態值乘以許多常數之常見情況下，如在本發明之實施例中，用單個多級大量乘法器電路代替等效的一組獨立單級乘法器電路會產生以實質上更高輸貫量以及實質上更低功率及佔據面積執行相同計算的系統。即使該組輸出小於所使用之實際倍數之數目，仍可顯著節省功率及空間。 In the common case where a single dynamic value is multiplied by many constants, as in embodiments of the present invention, replacing an equivalent set of independent single-stage multiplier circuits with a single multi-stage mass multiplier circuit yields a system that performs the same calculations with substantially higher throughput and substantially lower power and footprint. Even if the set of outputs is smaller than the number of actual multiples used, significant savings in power and space can still be achieved.

已確立了本發明之實施例中的獨特大量乘法器優於獨立乘法器之明顯優勢，將操作序列重新排序可進一步增加優勢。 Having established the clear advantage of the unique mass multiplier in embodiments of the present invention over independent multipliers, reordering the sequence of operations can increase the advantage further.

神經網路(或其他類似影像處理)演算法之數學並不要求任何特定的操作序列。若以任何次序進行相同操作，則將進行相同的正確運算。 The mathematics of a neural network (or other similar image processing) algorithm does not require any particular sequence of operations. If the same operations are performed in any order, the same correct result is computed.

本案發明人觀察到，軟體在基於CPU、GPU或TPU之設計上執行之常用次序係藉由將權重乘以輸入且立即對它們求和來同時針對給定位置產生所有輸出通道。藉由將權重乘以輸入且立即對它們求和來同時針對給定位置產生所有輸出通道會使必須自RAM讀取輸入之次數最小化，以及限制亦必須自RAM讀取權重之次數。其不會消除多次讀取輸入，此係因為當處理下一列時，除RAM以外，沒有地方來保留該等輸入。 The inventors observe that the usual order in which software executes on CPU-, GPU- or TPU-based designs is to produce all output channels for a given position at the same time, by multiplying the weights by the inputs and summing them immediately. Doing so minimizes the number of times the inputs must be read from RAM, and limits the number of times the weights must also be read from RAM. It does not eliminate reading the inputs multiple times, because there is nowhere but RAM to keep them while the next row is processed.

然而，若在本發明之實施例中，經定義以對陣列輸入之M乘N小塊進行操作之核心或其他孔徑函數之操作序列被翻轉，亦即，有效地自內向外翻轉，則僅利用各輸入值一次，並且不需要RAM緩存器。此獨特操作僅在輸入最初呈現時一次一個地處理輸入且保留所有不完整輸出之部分和，而非藉由在孔徑函數經過各列時冗餘地讀取輸入來一次一個地產生輸出。部分和可保留在硬體移位暫存器或標準硬體先進先出暫存器(FIFO)中，並且保持所保留之值所需的暫存器之數目與核心之高度及輸入列之寬度成比例。 However, if, in an embodiment of the invention, the sequence of operations of a kernel or other aperture function defined to operate on an M by N patch of array inputs is inverted, that is, effectively turned inside out, then each input value is used only once and no RAM buffer is required. Rather than producing outputs one at a time by redundantly reading the inputs as the aperture function passes over each row, this unique operation processes the inputs one at a time, only when they are first presented, and retains partial sums for all incomplete outputs. The partial sums may be held in hardware shift registers or standard hardware first-in, first-out registers (FIFOs), and the number of registers required to hold the retained values is proportional to the height of the kernel and the width of the input rows.

由於實施孔徑函數之函數可經分解成一系列子函數，該等子函數中之各者對緊接在前之子函數之結果進行操作，因此核心之實施方式可藉由隨時間推移而依序合成子函數來實現，使得各子函數立即對所接收之資料進行操作，並且產生與在摘要中應用核心相同之操作序列。吾人將此經重新合成函數(包括任何初始化)稱為孔徑函數，並且將個別步驟稱為子函數。如本文中所使用，孔徑函數係指待在較大的R乘C輸入陣列中之M乘N個輸入之滑動窗或小塊上的多個位置處實施之任何M乘N計算。如同全CNN核心之實施方式，孔徑函數亦可包括初始化及終結操作。在CNN之情況下，初始化將偏置值預加載至累加器中，並且終結經由任意啟動函數來轉變核心之原始輸出。 Since the function implementing an aperture function can be decomposed into a series of subfunctions, each of which operates on the result of the immediately preceding subfunction, an implementation of a kernel may be accomplished by composing the subfunctions in sequence over time, such that each subfunction operates on data immediately as it is received and the result is the same sequence of operations as applying the kernel in the abstract. We refer to this recomposed function (including any initialization) as the aperture function, and to the individual steps as subfunctions. As used herein, an aperture function refers to any M by N calculation to be implemented at a plurality of positions on a sliding window, or patch, of M by N inputs in a larger R by C input array. As with an implementation of a full CNN kernel, an aperture function may also include initialization and finalization operations. In the case of a CNN, the initialization preloads a bias value into the accumulator, and the finalization transforms the raw output of the kernel via an arbitrary activation function.

在本發明之此實例中，當呈現各新輸入位置之分量時，彼位置處之分量表示小塊向下且向右之第一元素，並且同時表示小塊向上且向左之最後元素以及與當前位置相交之所有其他小塊之中間元素。此允許運算電路經開發為本發明之實施例，其始終具有固定數目個處理中元素(其中在輸入之邊緣附近具有一些可能的異常情況)，並且以與其接受輸入相同之速率產生輸出。 In this example of the invention, as the components of each new input position are presented, the components at that position represent the first element of the patch extending down and to the right, and simultaneously the last element of the patch extending up and to the left, as well as an intermediate element of every other patch that intersects the current position. This allows a computational circuit to be developed as an embodiment of the invention that always has a fixed number of elements in process (with some possible exceptions near the edges of the input) and produces outputs at the same rate at which it accepts inputs.
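The inverted ordering may be modeled in software as follows. This is a minimal single-channel sketch under stated assumptions, not the disclosed circuit: the function names are hypothetical, a Python dictionary stands in for the shift registers or FIFOs that hold partial sums, and only patches lying wholly inside the array are produced. A direct, output-centric reference is included so the two orderings can be compared.

```python
def conv3x3_direct(image, weights, bias):
    """Output-centric reference: every input is re-read for each overlapping patch."""
    rows, cols = len(image), len(image[0])
    return [[bias + sum(weights[dy][dx] * image[r + dy][c + dx]
                        for dy in range(3) for dx in range(3))
             for c in range(cols - 2)]
            for r in range(rows - 2)]


def conv3x3_inverted(stream, rows, cols, weights, bias):
    """Input-centric ordering: each streamed sample is consumed exactly once.

    stream yields samples in raster order (left to right, top to bottom).
    Completed outputs are yielded in raster order as soon as their last
    (bottom-right) element has arrived.
    """
    partial = {}  # (top, left) of a patch -> running sum; models the FIFO storage
    for index, value in enumerate(stream):
        r, c = divmod(index, cols)
        for dy in range(3):
            for dx in range(3):
                # The sample at (r, c) is element (dy, dx) of the patch whose
                # top-left corner is (r - dy, c - dx).
                top, left = r - dy, c - dx
                if 0 <= top <= rows - 3 and 0 <= left <= cols - 3:
                    key = (top, left)
                    # The first contribution starts from the bias (initialization).
                    partial[key] = partial.get(key, bias) + weights[dy][dx] * value
        if r >= 2 and c >= 2:
            # This sample was the last element of the patch up and to the left.
            yield partial.pop((r - 2, c - 2))
```

Flattening the result of `conv3x3_direct` row by row gives the same sequence that `conv3x3_inverted` yields, while the inverted form never revisits an input; the number of live entries in `partial` stays on the order of the kernel height times the row width.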

在引導演算法需要在延伸超過輸入陣列之邊緣的小塊上評估孔徑函數的情況下，會出現許多特殊情況及問題，但它們並非不可克服的。可添加特殊情況邏輯，使得重疊小塊的部分結果與正常情況相容，而不影響總體輸貫量。 There are a number of special cases and problems that arise when the guided algorithm needs to evaluate the aperture function on patches that extend beyond the edge of the input array, but they are not insurmountable. Special case logic can be added to make partial results for overlapping patches compatible with the normal case without affecting the overall throughput.

在本發明之實施例中，孔徑函數操作之此翻轉形式接受作為串流之輸入且產生作為串流之輸出。輸入不需要被緩衝在RAM中，此係因為該等輸入各自僅被引用一次。由於輸出亦呈串流形式，因此該等輸出亦可由後續層處理而無需RAM緩衝，此為可歸因於本發明之結果，其相比於許多其他必要的對RAM之讀取及寫入操作顯著增加了處理速度。 In an embodiment of the present invention, this inverted form of the aperture function operation accepts inputs as a stream and produces outputs as a stream. The inputs do not need to be buffered in RAM because they are each referenced only once. Since the outputs are also in stream form, they can also be processed by subsequent layers without RAM buffering, which is a result of the present invention, which significantly increases processing speed compared to many other necessary read and write operations to RAM.

在本發明之實施例中，代替共用單組獨立乘法器之許多層執行、儲存且隨後讀取回結果以依序處理下一層，可使用同時處理所有層之專用大量乘法器來產生管線，從而將各層之輸出串流饋送至下一層之輸入中，而無需等待任何層完成。 In an embodiment of the present invention, instead of many layers sharing a single set of independent multipliers, executing, storing and then reading back results to process the next layer in sequence, a pipeline may be produced using dedicated mass multipliers that process all layers simultaneously, feeding the output stream of each layer into the input of the next layer without waiting for any layer to complete.

因此，本發明之實施例中的完全實施之管線可達到以比習知的以輸出為中心之排序過程大二個數量級量測的有效輸貫量，並且消除了對RAM之爭用(此係因為該管線不使用RAM)。此對RAM之爭用形成了基於GPU及TPU之處理的主要瓶頸。 A fully implemented pipeline in an embodiment of the present invention may therefore reach an effective throughput measured at two orders of magnitude greater than a conventional output-centric ordering process, and eliminates contention for RAM (because the pipeline does not use RAM). It is this contention for RAM that forms the primary bottleneck for GPU- and TPU-based processing.

在本發明之實施例中，此類系統之時延經減少至自最後像素之輸入至最後結果之輸出的時間。由於按照演算法之定義，影像之最後像素必須為完成所有層之所有最終運算所需的最後資料，因此系統之時延嚴格地為時脈速率乘以管線中包括最終輸出之不同時脈級之數目。 In an embodiment of the present invention, the latency of such a system is reduced to the time from the input of the last pixel to the output of the final result. Since, by definition of the algorithm, the last pixel of the image must be the last data required to complete all final operations of all layers, the latency of the system is strictly the clock rate multiplied by the number of different clock stages in the pipeline including the final output.

在本發明之實施例中，針對整個神經網路中之各輸入通道使用單個專用大量乘法器(代替必須重新使用且動態地分配之有限組獨立乘法器)使得有可能構建像素同步管線，其中所有乘法均並行執行，此係因為其僅採用單個大量乘法器來處理所應用之任意數目個權重。 In embodiments of the present invention, the use of a single dedicated mass multiplier for each input channel throughout the neural network (instead of a limited set of independent multipliers that must be reused and dynamically assigned) makes it possible to build a pixel-synchronous pipeline in which all multiplications are performed in parallel, because only a single mass multiplier is needed to handle any number of applied weights.

已描述了大量乘法器之創新之必要特徵，並且亦描述了翻轉之優勢，本案發明人假定以下特定實例： Having described the essential features of the mass multiplier innovation, and having also described the advantages of inversion, the inventors posit the following specific examples:

圖1為繪示本發明之實施例的圖，其中多個一或多個源通道1至N(標記為101a至101d)中之各者具有所分配的專用大量乘法器102a至102d。由於各源通道在此實例中具有專用大量乘法器電路以創建彼通道之值之一組倍數，源通道格式可以對於在硬體中實施之處理演算法方便的任何精確度在帶符號、無符號、固定或浮點之間變化。各大量乘法器電路(諸如大量乘法器電路102c)之特定輸出可直接饋送至一或多個運算單元103a至103d中，該一或多個運算單元可執行需要任何或所有源通道之倍數的計算。此類運算單元可用於實施待在相同源通道上運算之單個演算法或不相關演算法之獨立輸出通道。運算之輸出可經轉發以供進一步處理，如104處所展示，如在硬體中實施之一或多個演算法可能需要的。舉例而言，當在現場可編程閘陣列(FPGA)中實施神經網路時會出現此情形，其中作為被乘數應用之權重值將不改變。 FIG. 1 is a diagram illustrating an embodiment of the present invention in which each of one or more source channels 1 to N (labeled 101a to 101d) has an assigned dedicated mass multiplier 102a to 102d. Since each source channel in this example has a dedicated mass multiplier circuit to create the set of multiples of that channel's value, the source channel formats may vary between signed, unsigned, fixed or floating point, at any precision convenient for the processing algorithm implemented in hardware. Specific outputs of each mass multiplier circuit (such as mass multiplier circuit 102c) may be fed directly into one or more computation units 103a to 103d, which may perform calculations requiring multiples of any or all of the source channels. Such computation units may be used to implement independent output channels of a single algorithm, or of unrelated algorithms to be computed on the same source channels. The outputs of the computations may be forwarded for further processing, as shown at 104, as may be required by the algorithm or algorithms implemented in hardware. This situation arises, for example, when a neural network is implemented in a field programmable gate array (FPGA), where the weight values applied as multiplicands will not change.

圖2繪示了本發明之實施例，其中各大量乘法器(諸如圖1之大量乘法器102a)之輸出透過一組多工器201a至201d饋送至運算單元203a至203d中，使得可在系統之初始化時挑選或在其操作時動態地挑選經選擇倍數。運算之輸出可隨後經轉發以供在204處進一步處理，如前所述。當在特殊應用積體電路(ASIC)中實施神經網路時會出現此情形，其中運算之結構經提交，但所使用之權重值需要為可改變的。 FIG. 2 illustrates an embodiment of the present invention in which the outputs of each mass multiplier (such as mass multiplier 102a of FIG. 1) are fed through a set of multiplexers 201a to 201d into computation units 203a to 203d, so that the selected multiples may be chosen at initialization of the system or dynamically during its operation. The outputs of the computations may then be forwarded for further processing at 204, as before. This situation arises when a neural network is implemented in an application specific integrated circuit (ASIC), where the structure of the computation is committed but the weight values used need to be changeable.

圖3繪示了在一個實施例中的圖1及圖2之大量乘法器102a之內部結構。此結構可為大量乘法器102b、102c及102d以及本發明之其他實施例中之其他大量乘法器所共用的。在此結構中，並行地產生A個位元之源通道被乘數101a乘B個位元之所有可能乘數的乘積303a至303f，並且將其遞送至倍數304。在此實例中，源被乘數101a之A個位元經複製，且藉由將0位元附加至最低有效位置來向上移位，並且藉由將0位元填補至最高有效位置來填充，使得全組所有所需移位值0至B-1以A+B個位元項302a至302d之向量的形式可用。此等項可僅藉由路由電路連接形成，並且不需要暫存器或邏輯電路系統。在時脈週期足以允許在單個週期中構成A+B個位元之最多B個項的情況下，可能不需要暫存器或子組成。求和項之個別乘積303a至303f可在本地暫存，或經轉發以作為組合邏輯進行進一步處理。可藉由在各乘數中出現1位元之位置處添加B個對應項302a至302d中之任一者或全部來形成1至2^B-1乘以源被乘數101a之各乘積。任何及所有源之倍數0為所有0位元之常量，並且可在使用多工器時為了完整性包括在倍數304中，但另外不需要電路系統。藉由將任何未使用乘積303a至303f排除在電路規範之外，從而允許合成工具刪除該等未使用乘積，或藉由任何其他方法，可省略該等未使用乘積。亦可省略未使用項302a至302d，但由於該等未使用項並不佔用邏輯，因此，此省略通常沒有效果。以此方式，源被乘數101之所有所需倍數304可形成為單級管線或組合邏輯。 FIG. 3 illustrates the internal structure of mass multiplier 102a of FIGS. 1 and 2 in one embodiment. This structure may be common to mass multipliers 102b, 102c and 102d, and to other mass multipliers in other embodiments of the present invention. In this structure, the products 303a to 303f of source channel multiplicand 101a of A bits by all possible multipliers of B bits are produced in parallel and delivered to multiples 304. In this example, the A bits of source multiplicand 101a are duplicated and shifted up by appending 0 bits at the least significant positions, and padded by prepending 0 bits at the most significant positions, so that the full set of all required shifted values from 0 to B-1 is available as a vector of terms 302a to 302d of A+B bits each. These terms may be formed purely by routing of circuit connections, and require no registers or logic circuitry. Where the clock period is sufficient to allow up to B terms of A+B bits to be composed in a single cycle, no registers or subcompositions may be needed. The individual products 303a to 303f of the summed terms may be registered locally, or forwarded as combinatorial logic for further processing. Each product of 1 to 2^B-1 times source multiplicand 101a may be formed by adding any or all of the B corresponding terms 302a to 302d at the positions where a 1 bit occurs in the respective multiplier. The multiple 0 of any and all sources is a constant of all 0 bits; it may be included in multiples 304 for completeness when multiplexers are used, but otherwise requires no circuitry. Any unused products 303a to 303f may be omitted, either by leaving them out of the circuit specification, thereby allowing the synthesis tools to delete them, or by any other method. Unused terms 302a to 302d may also be omitted, but since they occupy no logic, doing so generally has no effect. In this manner, all required multiples 304 of source multiplicand 101 may be formed as a single-stage pipeline or as combinatorial logic.
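The shift-and-add construction of FIG. 3 may be modeled behaviorally as below. This sketch assumes an unsigned multiplicand and uses the hypothetical function names `shifted_terms` and `all_products`; in hardware the shifted terms would be wiring only, whereas here a left shift stands in for that routing.

```python
def shifted_terms(a, b_bits):
    """Model of terms 302: the multiplicand a shifted up by 0 .. b_bits - 1."""
    return [a << i for i in range(b_bits)]


def all_products(a, b_bits):
    """Model of multiples 304: a times every multiplier of b_bits bits."""
    terms = shifted_terms(a, b_bits)
    products = []
    for multiplier in range(1 << b_bits):
        total = 0                          # the multiple 0 is the all-zero constant
        for i in range(b_bits):
            if (multiplier >> i) & 1:      # a 1 bit at position i selects term i
                total += terms[i]
        products.append(total)
    return products
```

With `a = 13` and `b_bits = 4`, entry 5 of `all_products` is formed as the sum of the terms shifted by 0 and by 2, that is 13 + 52 = 65.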

圖4展示了最佳化實施例，其中一組項401包含由A+B+1個位元形成的自0至B(包括端點)之所有所需個別項302a至302e。此允許乘積402a至402f包括自較大項減去較小項而非添加較小項，並且可用於減小電路之總體大小，此亦可增加最大允許時脈頻率。舉例而言，對於任何給定輸入a及乘數15，8a+4a+2a+1a=15a組合四個分量，而16a-1a=15a組合僅二個分量，並且通常可預期為更緊密且有效的。各乘積402a至402f可由產生正確結果之項302a至302e之任何加法及減法構成，並且各特定變型可基於特定實施技術之最佳折衷而挑選。舉例而言，二個N位元數量之減法可能比二個N位元數量之加法需要更多邏輯，但一般而言，三個N位元數量之加法將始終比二個N位元數量之減法需要更多邏輯。所需倍數304之處理不因合成個別乘積402a至402f之細節而改變。 FIG. 4 shows an optimized embodiment in which a set of terms 401 includes all required individual terms 302a to 302e from 0 to B (inclusive) formed by A+B+1 bits. This allows the products 402a to 402f to include subtracting smaller terms from larger terms rather than adding smaller terms, and can be used to reduce the overall size of the circuit, which can also increase the maximum allowable clock frequency. For example, for any given input a and multiplier 15, 8a+4a+2a+1a=15a combines four components, while 16a-1a=15a combines only two components and can generally be expected to be more compact and efficient. Each product 402a-402f may be constructed from any addition and subtraction of terms 302a-302e that produces the correct result, and each particular variation may be chosen based on the best compromise for a particular implementation. For example, subtraction of two N-bit quantities may require more logic than addition of two N-bit quantities, but in general, addition of three N-bit quantities will always require more logic than subtraction of two N-bit quantities. The handling of required multiples 304 does not change with the details of synthesizing individual products 402a-402f.
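The saving from allowing subtraction can be illustrated with a signed-digit recoding of the multiplier. The sketch below uses the non-adjacent form, which is a standard recoding chosen here as an assumption for illustration; the description above does not prescribe any particular way of choosing the additions and subtractions. The function names are hypothetical.

```python
def signed_terms(multiplier):
    """Recode a non-negative multiplier as signed powers of two (non-adjacent form).

    Returns a list of (sign, shift) pairs such that
    multiplier == sum(sign * 2**shift for sign, shift in result).
    """
    result = []
    shift = 0
    n = multiplier
    while n > 0:
        if n & 1:
            digit = 2 - (n & 3)    # +1 when n % 4 == 1, -1 when n % 4 == 3
            result.append((digit, shift))
            n -= digit             # leaves n divisible by 4
        n >>= 1
        shift += 1
    return result


def multiply_with_terms(a, multiplier):
    """Form a * multiplier by adding and subtracting shifted copies of a."""
    return sum(sign * (a << shift) for sign, shift in signed_terms(multiplier))
```

For the multiplier 15, `signed_terms` returns `[(-1, 0), (1, 4)]`, that is 16a - 1a, two components instead of the four in 8a + 4a + 2a + 1a. For 8-bit multipliers the largest shift produced is 8, consistent with terms running from 0 to B inclusive.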

圖5A繪示了大量乘法器之實施例，其中時脈週期使得每週期僅可能進行A+B個位元值之單次加法(若使用減法，則為A+B+1個位元值)。在此情況下，為了適應其中利用多於二個項之倍數，有必要將所需元素配置至多級管線中。項401由各源通道101形成，如前所述，但在管線暫存器501a及501b中保留一或多次以供稍後引用。所求和之二個項之對502經運算並暫存，並且隨後視需要保存503。三位元組504形成為對502及所保留項501之和。項值之四位元組505形成為對502之和。可省略任何及所有未使用元素，並且為了增加重疊，可僅指定加數之降序序列。此確保了冗餘和(例如，a+b及b+a)不會在最終電路中利用並保留。乘積506a至506f可利用滿足時序約束之任一對經暫存子組成之任何加法或減法運算。藉由一致地使用可用的最大元素，總大小及因此功率可能會減少，但產生正確結果之運算之任何組合為可接受的。 FIG. 5A illustrates an embodiment of the mass multiplier in which the clock period is such that only a single addition of A+B bit values (or A+B+1 bit values, if subtraction is used) is possible per cycle. In this case, to accommodate multiples that use more than two terms, the required elements must be arranged into a multi-stage pipeline. Terms 401 are formed from each source channel 101 as before, but are retained one or more times in pipeline registers 501a and 501b for later reference. Pairs 502 of two summed terms are computed and registered, and then preserved 503 as needed. Triples 504 are formed as sums of pairs 502 and retained terms 501. Quads 505 of term values are formed as sums of pairs 502. Any and all unused elements may be omitted, and to increase overlap only descending sequences of addends may be specified. This ensures that redundant sums (e.g., a+b and b+a) are not both used and retained in the final circuit. Products 506a to 506f may use any addition or subtraction of any pair of registered subcompositions that meets the timing constraints. By consistently using the largest available elements, the total size, and hence the power, may be reduced, but any combination of operations that produces the correct result is acceptable.

The embodiment of FIG. 5A is sufficient to generate all required multiples where B=8. For larger sets of multiples, the sub-compositions shown may be recombined in further pipeline stages, so that all required multiples 506a to 506f for any value of B can be composed by single-clock operations on an extended set of sub-compositions that includes the previously disclosed retained terms 501b, retained pairs 503, triplets 504, and quads 505, together with such other sub-compositions as are needed to form a set of terms sufficient to produce multiples 506a to 506f in a single-clock operation.

圖5B繪示了其中倍數係藉由一組固定情況直接形成而不參考標準算術運算的實施例。對於所需倍數中之各者，針對各源通道值a枚舉該組輸出值a*b。此允許硬體電路合成工具判定最佳邏輯電路507以產生全組所需倍數。任何給定輸入值所需之輸出值之規範通常係藉由Verilog「case」或「casex」語句中之枚舉來制定。此不同於查找表，在該查找表中，輸出值經由由輸入形成之索引來儲存並存取，此係因為邏輯閘用於實施產生全組輸出值所需的操作之最小子組，並且用於產生相關子表達式之冗餘邏輯將經組合。 FIG. 5B illustrates an embodiment in which multiples are formed directly from a set of fixed cases without reference to standard arithmetic operations. For each of the required multiples, the set of output values a*b is enumerated for each source channel value a. This allows the hardware circuit synthesis tool to determine the best logic circuit 507 to generate the full set of required multiples. The specification of the output values required for any given input value is typically formulated by enumeration in a Verilog "case" or "casex" statement. This is different from a lookup table, in which the output values are stored and accessed via an index formed from the input, because logic gates are used to implement the smallest subset of operations required to generate the full set of output values, and the redundant logic used to generate the associated sub-expression will be combined.

就空間、頻率及功率而言方法5A及5B中之哪一者最有效取決於A及B之特定值以及算術運算相對於任意邏輯之核心效率。使用哪種方法之挑選可基於直接觀察、模擬或其他標準。 Which of methods 5A and 5B is most efficient in terms of space, frequency, and power depends on the specific values of A and B and the core efficiency of arithmetic operations versus arbitrary logic. The choice of which method to use can be based on direct observation, simulation, or other criteria.

FIG. 6 illustrates an embodiment in which the clock period allows enough logic levels for composition by addition and/or subtraction of four elements during each single clock cycle. By selecting from a set of sub-compositions, each product 605a to 605f can be produced by combining no more than four registered elements. As before, terms are retained in registers 501a and 501b, but the triplets 601, retained in 602, are composed directly from terms 401, and no pairs are used. Septets 603 and octets 604 are formed from triplets 601 and retained terms 501a.

圖6之示例性實施例足以產生所有所需倍數，其中B=32。對於較大乘法器，可在進一步的管線級中一次重組四個所展示之子組成，以針對B之任何值產生所有所需倍數。所展示之元素之子組成為必要的，並且足以產生所有乘積，其中B=32，但其他子組成(可能針對B之不同值之間的一致性而挑選的)為可接受的。 The exemplary embodiment of FIG. 6 is sufficient to produce all required multiples where B=32. For larger multipliers, the shown subcombinations may be reorganized four at a time in further pipeline stages to produce all required multiples for any value of B. The shown subcombination of elements is necessary and sufficient to produce all products where B=32, but other subcombinations (perhaps chosen for consistency between different values of B) are acceptable.

當該組乘法器為固定的時，如對於FPGA應用常見的，甚至可有效地實施一組大的稀疏乘法器，此係因為合併了公共元素且可省略未使用元素。當合成工具自動地執行此功能時，電路之表達式可包括所有可能的元素，而無需明確聲明使用哪些倍數。 When the set of multipliers is fixed, as is common for FPGA applications, even a large set of sparse multipliers can be implemented efficiently because common elements are merged and unused elements can be omitted. When the synthesis tool performs this function automatically, the expression of the circuit can include all possible elements without explicitly stating which multiples are used.

若在單個時脈循環中無法完成對A+B或A+B+1個位元值之運算，則可在視需要插入額外管線暫存器以使得所有路徑具有相同數目個時脈週期的情況下針對任何單級組成邏輯插入多級管線加法器。管線級週期可為單個邊緣對邊緣時脈轉變之例子，或在輸貫量約束允許之情況下為多循環時脈。除上文剛剛提及之問題以外，每運算之多個時脈級以及多循環時脈之使用均不需要對任何實施例進行結構改變。 If the operation on the A+B or A+B+1 bit values cannot be completed in a single clock cycle, multiple stages of pipeline adders can be inserted for any single stage of logic, with additional pipeline registers inserted as needed to make all paths have the same number of clock cycles. The pipeline stage cycle can be a single edge-to-edge clock transition, or a multi-cycle clock if the input throughput constraints allow. The use of multiple clock stages per operation and multi-cycle clocks does not require architectural changes to any embodiment, except for the issues just mentioned above.

本發明之重要目標為提供在積體電路中實施之工業大量乘法器以用於多種應用中。為此目的，本案發明人在一個實施例中提供了實施為積體電路之大量乘法器，該積體電路具有：接收離散值串流之埠，及將在埠處接收之各值同時乘以多個權重值之電路系統，以及提供大量乘法器所產生之乘積的輸出通道。 An important object of the present invention is to provide an industrial mass multiplier implemented in an integrated circuit for use in a variety of applications. To this end, the inventors of the present invention provide, in one embodiment, a mass multiplier implemented as an integrated circuit having: a port for receiving a stream of discrete values, a circuit system for multiplying each value received at the port by a plurality of weight values simultaneously, and an output channel for providing the products produced by the mass multiplier.

In one version, the received discrete values may be unsigned binary values of fixed width, the weight values may be unsigned binary values of fixed width of two or more bits, and each multiple may be composed as a sum of bit-shifted copies of the input. In another version, the set of shifted copies may be extended to allow subtraction operations to be used to reduce the size of the circuit or otherwise optimize it. Unused outputs in the set may be omitted explicitly or implicitly.
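A minimal software sketch of the first version described above, assuming unsigned integer inputs and weights. The name `mass_multiply` and the dictionary output are hypothetical conveniences; a hardware implementation would wire shared shifted copies to adders rather than build a table.

```python
def mass_multiply(a, weights):
    """Return {w: a * w} for every weight, using only shared shifted copies of a."""
    width = max((w.bit_length() for w in weights), default=0)
    shifted = [a << j for j in range(width)]  # a, 2a, 4a, ... formed once
    products = {}
    for w in set(weights):  # duplicated weights share one product
        products[w] = sum(shifted[j] for j in range(width) if (w >> j) & 1)
    return products


# One input value multiplied by several weights at once.
print(mass_multiply(7, [3, 5, 200, 5, 0]))
```

Weights that never occur simply produce no entry, which mirrors the omission of unused outputs mentioned in the text.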

In one embodiment, the set of output products may be generated by combinational logic. In another embodiment, the set of output products may be generated by a single-stage pipeline using a single clock cycle or multiple clock cycles. In another embodiment, the set of output multiples may be generated by a multi-stage pipeline that combines no more than two addends per stage. Unused elements of the intermediate sub-compositions may be eliminated from the circuit explicitly or implicitly.

In one embodiment, the set of output products may be generated by a multi-stage pipeline that combines three or more addends per stage, with the sub-compositions adjusted accordingly. Unused elements of the intermediate sub-compositions may be eliminated from the circuit explicitly or implicitly.

本發明之另一目標為提供積體電路中之大量乘法以在深度學習及人工智慧之持續演進中實施實質上改良之卷積神經網路。本案發明人在此努力中提供了實施為積體電路之第一卷積神經網路(CNN)節點，該第一CNN節點具有經定義為陣列之元素的第一分量之離散值串流的第一輸入通道。 Another object of the present invention is to provide a large number of multiplications in an integrated circuit to implement substantially improved convolutional neural networks in the continued evolution of deep learning and artificial intelligence. In this effort, the inventors of this case provide a first convolutional neural network (CNN) node implemented as an integrated circuit, the first CNN node having a first input channel defined as a stream of discrete values of a first component of an element of an array.

在此描述中，本案發明人意圖將陣列之元素之命名法表示可具有單個分量或多個分量之元素。良好實例為可具有像素作為元素之影像，並且若影像為單色的，則各像素可具有單個分量，或者在一個實例中，若影像呈RGB色彩形式，則各像素可具有三個色彩值。在此實例中，各色彩值為元素之分量，該元素為像素。 In this description, the inventors of this case intend that the nomenclature of the elements of the array represent elements that may have a single component or multiple components. A good example is an image that may have pixels as elements, and if the image is monochrome, each pixel may have a single component, or in one example, if the image is in RGB color form, each pixel may have three color values. In this example, each color value is a component of an element, which is a pixel.

Continuing the above description of the first convolutional neural network (CNN) node implemented as an integrated circuit, which has a first input channel defined as a stream of discrete values of a first component of the elements of an array, the node further includes a first mass multiplier circuit that multiplies each received discrete value of the first component by a plurality of weight values simultaneously. An output channel provides an output stream of discrete values.

In one embodiment of the CNN node, the first output stream is formed from the products of the first mass multiplier circuit, in some cases by combining the products with constants and in some cases by applying an activation function.

In another embodiment, the CNN node further comprises a second input channel defined as a stream of discrete values of a second component of the elements of the array, and a second mass multiplier circuit that multiplies the received discrete values of the second component by a plurality of weight values simultaneously. In another embodiment, there may be a third input channel defined as a stream of discrete values of a third component of the elements of the array, and a third mass multiplier circuit that multiplies the received discrete values of the third component by a plurality of weight values simultaneously.

Having described a CNN node with one, two or three input component streams and dedicated mass multipliers, the inventors further provide a convolutional neural network (CNN) having: a first CNN node implemented as an integrated circuit, and a second CNN node having inputs that depend at least in part on the output of the first node, the first CNN node comprising: input channels defined as streams of discrete values of components of the elements of an array; mass multiplier circuits, each dedicated to an individual input channel, that multiply the received discrete values of the components by a plurality of weight values simultaneously; and an output channel providing an output stream of discrete values. This CNN may have successive nodes and may operate as a deep neural network (DNN). The successive nodes after the first node are not required to be CNN nodes.

Pipelined aperture function computation

Referring back to the earlier discussion in this specification of the order of operations when processing a CNN, or another similarly chosen aperture function that passes an array of operator sub-functions over an input array to produce a final result, a specific description is now provided of a flipped form of aperture function computation, in an embodiment of the invention, that accepts input as a stream and produces output as a stream. In this embodiment, the inputs are not, and need not be, buffered in RAM, because each input is referenced only once. The output is also produced as a stream, so the output stream can be processed by a subsequent layer without RAM buffering. The inventors believe that this innovation significantly increases processing speed by avoiding the many read and write operations that are otherwise necessary in other processing systems.

Embodiments of the invention provide an apparatus and a method in which the action of passing a two-dimensional aperture function over a two-dimensional array is accomplished by acting on an incoming stream of inputs, such that every input is processed immediately and partially completed computations are retained until all required inputs have been received and processed, and the output is produced as a conformant stream, typically at the same data rate as the input stream or a lower one. All inputs are accepted and processed at the rate at which they are provided, and need not be stored or accessed in any order other than the order presented. If the application of the aperture function is defined such that more outputs are produced than inputs, the circuit can still operate at the incoming data rate by selecting a sufficiently higher processing clock rate, so that the system never fails to accept and process an input when it is presented.

The conventional way to implement a convolution of a kernel, or of a more general aperture function, over a larger input array is to gather the required input tile, apply the function to the inputs, and output the result. As the aperture passes over the input array, each successive tile overlaps the one just processed, so some inputs can be retained and reused. Various mechanisms, such as FIFOs, may be used to avoid redundantly reading inputs from source storage as the tile advances to each new row, but the source data will still be applied to each position in the kernel in turn, to produce each output whose input tile overlaps each particular data input position.

If there are many output channels and many independent aperture functions to compute, a mass multiplier may be used to provide the products of the tile of input values under consideration to all of the aperture functions in parallel. But with this arrangement and order of operations, each position of the source data requires a set of products for each position in the kernel, because it is combined into each of the various output positions that it overlaps.

The mechanism of the invention is to flip (that is, turn inside out) the order of operations, to gain the specific advantage of using a single mass multiplier per input channel that is applied to a given input value only once. Rather than retaining or re-reading source values for later use in computing later products, the process in embodiments of the invention computes all required products of each input as it is presented and retains a running total for each element of the aperture function, complete up to the point at which the current input appears.

Any aperture function that can be mathematically decomposed into a series of sub-functions applied in sequence can be implemented in this way. Because a CNN kernel is simply a sequence of additions of products of weights and inputs, and the order of operations is compatible with the order of source inputs taken left to right and top to bottom, the mechanism is readily applied.

In an embodiment of the invention, an array of synthesizers is implemented on an IC, the synthesizers corresponding to the sub-function elements of the aperture function, each maintaining a running total of the value of the aperture function as the aperture function progresses over the input stream. The final synthesizer in the array outputs the complete value of the function, and all other synthesizers output partial values of the function.

在應用3乘3核心之簡單情況下，左上合成器之輸出反映了應用於當前輸入之核心之第一元素加上任何初始化常數，中上合成器之輸出反映了前二個步驟，並且右上合成器之輸出反映了前三個步驟。需要延遲右上合成器之輸出，直至其可再次由下一列使用為止。下一列合成器繼續接受部分完成之函數值加上各新輸入之貢獻且使其向前傳遞之模式。最後一列合成器完成函數之最後步驟，並且輸出經完成值以供任何進一步處理。 In the simple case of applying a 3x3 kernel, the output of the top left synthesizer reflects the first element of the kernel applied to the current input plus any initialization constants, the output of the top middle synthesizer reflects the first two steps, and the output of the top right synthesizer reflects the first three steps. The output of the top right synthesizer needs to be delayed until it can be used again by the next row. The next row of synthesizers continues the pattern of accepting partially completed function values plus the contribution of each new input and passing it onward. The last row of synthesizers completes the last step of the function and outputs the completed value for any further processing.

Note that partial values of the function generally progress between synthesizers from left to right in the first row, then from left to right in each subsequent row, and finally to the last synthesizer in the last row. The flow of partial values may therefore be regarded as a stream, and synthesizers and flows may be described as upstream or downstream.

Each synthesizer always maintains the partial sum of the aperture function up to and including the current source input. Each synthesizer always works on a different tile position of the output, specifically the tile in which the current input appears at the position corresponding to that synthesizer's position in the array of aperture sub-functions.
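The flow of partial sums described above can be simulated in software. The sketch below is an illustration under stated assumptions, not the circuit itself: it handles a single input channel, emits only tiles that lie wholly inside the array (no edge handling), and the names `stream_convolve` and `direct_convolve` are hypothetical. The delay between rows, C - N + 1 input intervals, is derived from the row-major order of the stream.

```python
from collections import deque


def stream_convolve(stream, C, W, k=0):
    """Apply kernel W to a row-major stream of an array with C columns.

    Each input is used once; only running partial sums are kept.
    """
    M, N = len(W), len(W[0])
    prev = [[0] * N for _ in range(M)]  # values registered on the previous input
    # delay[m] carries outputs of the right-most synthesizer of row m-1 to
    # the left-most synthesizer of row m, C - N + 1 inputs later.
    delay = [deque([0] * (C - N + 1), maxlen=C - N + 1) for _ in range(M)]
    outputs = []
    for t, a in enumerate(stream):
        new = [[0] * N for _ in range(M)]
        for m in range(M):
            for n in range(N):
                if n > 0:
                    base = prev[m][n - 1]  # partial sum from the left
                elif m == 0:
                    base = k               # initialization constant
                else:
                    base = delay[m][0]     # delayed partial sum from the row above
                new[m][n] = base + a * W[m][n]
        for m in range(1, M):
            delay[m].append(new[m - 1][N - 1])
        prev = new
        r, c = divmod(t, C)
        if r >= M - 1 and c >= N - 1:      # a complete interior tile ends here
            outputs.append(new[M - 1][N - 1])
    return outputs


def direct_convolve(A, W, k=0):
    """Conventional tile-by-tile evaluation, for comparison."""
    M, N = len(W), len(W[0])
    R, C = len(A), len(A[0])
    return [k + sum(A[r + m][c + n] * W[m][n] for m in range(M) for n in range(N))
            for r in range(R - M + 1) for c in range(C - N + 1)]


A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]
W = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
flat = [a for row in A for a in row]
print(stream_convolve(flat, 5, W, k=3) == direct_convolve(A, W, k=3))
```

At every step each of the M by N entries of `new` belongs to a different tile, which is the software counterpart of each synthesizer working on a different output position.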

If the 3×3 kernel W is expressed as a function of the input A, then

\[
u = k + a_{11}w_{11} + a_{12}w_{12} + a_{13}w_{13} + a_{21}w_{21} + a_{22}w_{22} + a_{23}w_{23} + a_{31}w_{31} + a_{32}w_{32} + a_{33}w_{33}
\]

The function implementing the kernel can be decomposed into equivalent sub-functions.

\[
\begin{aligned}
v_0(a) &= k + a\,w_{11} \\
v_1(t,a) &= t + a\,w_{12} \\
v_2(t,a) &= t + a\,w_{13} \\
v_3(t,a) &= t + a\,w_{21} \\
v_4(t,a) &= t + a\,w_{22} \\
v_5(t,a) &= t + a\,w_{23} \\
v_6(t,a) &= t + a\,w_{31} \\
v_7(t,a) &= t + a\,w_{32} \\
v_8(t,a) &= t + a\,w_{33} \\
u &= v_8(v_7(v_6(v_5(v_4(v_3(v_2(v_1(v_0(a_{11}), a_{12}), a_{13}), a_{21}), a_{22}), a_{23}), a_{31}), a_{32}), a_{33}) \\
u &= ((((((((k + a_{11}w_{11}) + a_{12}w_{12}) + a_{13}w_{13}) + a_{21}w_{21}) + a_{22}w_{22}) + a_{23}w_{23}) + a_{31}w_{31}) + a_{32}w_{32}) + a_{33}w_{33} \\
u &= k + a_{11}w_{11} + a_{12}w_{12} + a_{13}w_{13} + a_{21}w_{21} + a_{22}w_{22} + a_{23}w_{23} + a_{31}w_{31} + a_{32}w_{32} + a_{33}w_{33} = u(A, W)
\end{aligned}
\]
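The equivalence of the chained sub-functions and the direct sum can be confirmed with a short Python sketch. The helper names are hypothetical and the 3×3 size is fixed to match the equations; integer values are used so that the comparison is exact.

```python
def make_subfunctions(W, k):
    """Build v_0 and v_1..v_8 from a 3x3 kernel W and constant k."""
    weights = [w for row in W for w in row]  # w11, w12, ..., w33
    v0 = lambda a, w=weights[0]: k + a * w
    rest = [lambda t, a, w=w: t + a * w for w in weights[1:]]
    return v0, rest


def u_chained(A, W, k):
    """u = v8(v7(... v1(v0(a11), a12) ...), a33)."""
    inputs = [a for row in A for a in row]   # a11, a12, ..., a33
    v0, rest = make_subfunctions(W, k)
    t = v0(inputs[0])
    for v, a in zip(rest, inputs[1:]):
        t = v(t, a)                          # each step extends the running total
    return t


def u_direct(A, W, k):
    """u = k + sum of a_mn * w_mn over the 3x3 tile."""
    return k + sum(A[m][n] * W[m][n] for m in range(3) for n in range(3))


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
W = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(u_chained(A, W, 10), u_direct(A, W, 10))
```

The value of `t` after each step is exactly the partial sum that one synthesizer would hold for this tile.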

The circuitry required to compute those sub-functions is then arranged in a corresponding array of synthesizers,

and the partially completed sums are maintained as the output values of the synthesizers,

where a_i is the current value from the input stream, and in each case a_{i-1} through a_{i-8} are the previously processed inputs for the particular tile in which a_i appears at the position corresponding to each individual synthesizer's output. Each synthesizer computes the value of the aperture function up to and including the position to which that synthesizer corresponds in the aperture array. Each synthesizer takes the current value of the input stream and combines it with previous values to produce a different partial sum, corresponding to the partially processed tile of the input array in which the current input value appears at the relative position matching that synthesizer's position in the aperture function.
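As a hedged sketch of the arrangement just described (a reconstruction from the surrounding definitions, not a reproduction of the original drawings), the synthesizers can be laid out in the same 3×3 pattern as the sub-functions, and the value held by each can be written out term by term. Here a_{i-j} is read, following the text, as the input processed j positions earlier within the same tile; that reading of the indices is an assumption.

\[
\begin{bmatrix} v_0 & v_1 & v_2 \\ v_3 & v_4 & v_5 \\ v_6 & v_7 & v_8 \end{bmatrix}
\]

\[
\begin{aligned}
v_0 &: k + a_i w_{11} \\
v_1 &: k + a_{i-1} w_{11} + a_i w_{12} \\
v_2 &: k + a_{i-2} w_{11} + a_{i-1} w_{12} + a_i w_{13} \\
&\;\;\vdots \\
v_8 &: k + a_{i-8} w_{11} + a_{i-7} w_{12} + \cdots + a_{i-1} w_{32} + a_i w_{33}
\end{aligned}
\]

At any instant these nine values belong to nine different tiles, one per synthesizer, and only the value held by v_8 is a completed output.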

以此方式，以標準次序及精確度運算的孔徑函數之部分值將隨時間推移而維持在輸入串流上，直至經完成值準備好輸出為止。 In this way, partial values of the aperture function, computed in standard order and accuracy, are maintained on the input stream over time until the completed value is ready for output.

While this technique is quite straightforward in the interior of the input array, complications arise when it is applied to tiles that overlap the edges of the input array, because the aperture function is defined differently when not all inputs are available. In the case of a CNN kernel, the extra operations are dropped, which is equivalent to using zero as the input. The invention is concerned with maintaining a steady flow of partial sums through the synthesizers while handling these exception cases, as described below.
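The statement that dropping the extra operations is equivalent to using zero as input can be illustrated with a small Python sketch. It assumes, for illustration only, that each output position is aligned with the center of its tile; the function names are hypothetical.

```python
def truncated(A, W, k=0):
    """Evaluate every position, dropping kernel terms that fall outside A."""
    R, C, M, N = len(A), len(A[0]), len(W), len(W[0])
    out = []
    for r in range(R):
        for c in range(C):
            total = k
            for m in range(M):
                for n in range(N):
                    rr, cc = r + m - M // 2, c + n - N // 2
                    if 0 <= rr < R and 0 <= cc < C:  # skip missing inputs
                        total += A[rr][cc] * W[m][n]
            out.append(total)
    return out


def zero_padded(A, W, k=0):
    """Evaluate every position on a copy of A surrounded by zeros."""
    R, C, M, N = len(A), len(A[0]), len(W), len(W[0])
    P = [[0] * (C + N - 1) for _ in range(R + M - 1)]
    for r in range(R):
        for c in range(C):
            P[r + M // 2][c + N // 2] = A[r][c]
    return [k + sum(P[r + m][c + n] * W[m][n] for m in range(M) for n in range(N))
            for r in range(R) for c in range(C)]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
W = [[1, 2, 1], [0, 1, 0], [-1, -2, -1]]
print(truncated(A, W, k=2) == zero_padded(A, W, k=2))
```

Both functions produce one output per input position, matching the R by C output stream described later in the text.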

輸入通道組701及相關聯之控制信號702由公共電路系統703使用以產生輸入通道組與權重之任何及所有乘積以用於後續子函數。隨後將源通道乘積分佈至子函數計算電路704a、704b及704c之排組，該等子函數計算電路中之各者產生輸出通道組705中之單個通道。任何數目個獨立輸出通道可由公共電路系統703支援。 The input channel set 701 and associated control signals 702 are used by common circuitry 703 to generate any and all products of the input channel set and weights for use in subsequent subfunctions. The source channel products are then distributed to the bank of subfunction computation circuits 704a, 704b, and 704c, each of which generates a single channel in the output channel set 705. Any number of independent output channels may be supported by common circuitry 703.

FIG. 8A is a diagram illustrating mass multipliers 801a, 801b and 801c in the common circuitry 703 of FIG. 7, which take each channel of the input channel set 701 and produce either a sparse or a complete set of multiples, as required by the defined sub-functions. Note that this illustration assumes three channels in the input channel set, as may be the case for, e.g., red, green and blue pixel values when processing RGB images. In other embodiments there may be one, two, or more than three channels. Any or all of the products 802 (multiples of the source input array values constructed by the mass multipliers) are made available to the synthesizers, as shown in FIGS. 9A, 9B and 9C, described in detail below. The synthesizers are instances of hard-wired circuitry in the unique apparatus of the invention that perform the sub-functions on the source channel products produced by the mass multipliers of FIG. 8A.

FIG. 8B is a diagram illustrating the structure of synchronization circuitry that provides both normal and exception handling signals to all synthesizers of all output channels.

控制電路系統803使所有輸出及控制計數器與源輸入串流同步，並且實施每當RST或INIT經確證時將輸出及控制計數器設定為初始狀態。 The control circuit system 803 synchronizes all output and control counters with the source input stream and implements setting the output and control counters to the initial state whenever RST or INIT is asserted.

In this example, the colSrc counter 805 counts the inner dimension of the array column by column across a row, advancing as each set of source channel products is processed. At the end of each row, the colSrc counter returns to the leftmost position (0) and the rowSrc counter 804 advances by one. At the end of the source array stream, the rowSrc and colSrc counters return to their initial state, ready to receive a new input array.

在此實例中，colDst計數器807及rowDst計數器806一起以與針對所有輸出通道之計數器類似的方式起作用。colDst及rowDst計數器係由輸出賦能信號(DSTEN)813賦能，並且判定何時後處理賦能信號(POSTEN)812經確證。 In this example, the colDst counter 807 and rowDst counter 806 function together in a similar manner as the counters for all output channels. The colDst and rowDst counters are enabled by the output enable signal (DSTEN) 813 and determine when the post-processing enable signal (POSTEN) 812 is asserted.

應注意，此實例中所描繪之系統產生孔徑函數之單個輸出，但通常將用於產生與源輸入串流之維度相容的通道輸出之串流組。各獨立輸出通道將經由大量乘法器及公共控制邏輯來共用運算電路系統中之至少一些。 It should be noted that the system depicted in this example produces a single output of the aperture function, but will generally be used to produce a stream of channel outputs that is compatible with the dimensions of the source input stream. Each independent output channel will share at least some of the computational circuitry via a large number of multipliers and common control logic.

The output enable (DSTEN) signal 813 controls when the termination function accepts and processes results from the synthesizers. While the first several rows are being accepted from the source input array, no valid results are presented to the termination function (see FIG. 9C). The output enable signal 813 (DSTEN) is asserted when the rowDst and colDst counters indicate that a valid result is available, or alternatively when delayed truncated results are being processed. The POSTEN signal 812 is asserted continuously or periodically to match the timing of the SRCEN signal 801. These signals are required to sequence the final outputs of all truncated synthesizers when the last row of the source input stream array is processed. Each row of synthesizers through M-2 will produce final truncated outputs simultaneously with the last complete outputs; these must be retained and emitted in sequence after all complete-tile outputs in order to conform to the array stream format.

In this example, the POSTEN and DSTEN signals and the colDst and rowDst counter values are independent of the SRCEN signal and the colSrc and rowSrc counter values, and continue to process delayed results until all delayed results have been finalized and sent to the output stream. The system can accept new input while the previous output is being completed, which allows it to process multiple frames of the source input stream without pausing between frames. While the source stream data has not yet reached the end of the array, POSTEN is not asserted, and final results are taken from the synthesizers. Immediately after the end of the source array is reached, the POSTEN signal is asserted for each additional output, and final results are taken from the truncated-result delay lines 909, 910a and 910b, as shown in FIG. 9C described below, until the rowDst counter reaches the full number of output rows, whereupon rowDst and colDst are reset to their initial condition in preparation for the next frame of data.

The first-row signal 808 (ROWFST) is asserted when the rowSrc counter indicates that the set of source data from the stream represents the first row of the array.

The last-row signal 809 (ROWLST) is asserted when the rowSrc counter indicates that the set of source data from the stream represents the last row of the array.

The first-column signal 810 (COLFST) is asserted when the colSrc counter indicates that the set of source data from the stream represents the first column of each row of the array.

The last-column signal 811 (COLLST) is asserted when the colSrc counter indicates that the set of source data from the stream represents the last column of each row of the array.
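The behavior of the source counters and the four edge signals can be modeled in a few lines of Python. This is a behavioral sketch only: the class and method names are hypothetical, and a single `advance` call stands in for the processing of one set of source channel products.

```python
class SourceCounters:
    """Behavioral model of rowSrc/colSrc and the edge signals derived from them."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.reset()

    def reset(self):
        """Initial state, as after RST or INIT is asserted."""
        self.row_src = 0
        self.col_src = 0

    def flags(self):
        return {
            "ROWFST": self.row_src == 0,
            "ROWLST": self.row_src == self.rows - 1,
            "COLFST": self.col_src == 0,
            "COLLST": self.col_src == self.cols - 1,
        }

    def advance(self):
        """Step once per accepted set of source channel products."""
        if self.col_src < self.cols - 1:
            self.col_src += 1
        else:
            self.col_src = 0  # end of row: back to the leftmost position
            # end of array: wrap to the initial state for the next frame
            self.row_src = 0 if self.row_src == self.rows - 1 else self.row_src + 1


counters = SourceCounters(rows=2, cols=3)
for _ in range(7):
    print(counters.row_src, counters.col_src, counters.flags())
    counters.advance()
```

Because the counters wrap at the end of the array, a second frame can follow the first without any explicit reset.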

FIGS. 9A, 9B and 9C illustrate the unique apparatus mentioned above in the general case, in which the M by N sub-function elements of an aperture function are applied to each overlapping M by N tile of an array of R by C inputs, including those tiles that overlap the edges, the inputs being presented as a stream of associated components at regular or irregular time intervals, to produce a corresponding stream of R by C outputs, where each output is the aggregate effect of the M by N functional elements applied to the input tile as specified by the rules of the aperture function. The functional element applied to each position in the array is, in this apparatus, a hard-wired synthesizer for each of the M by N sub-functions, as shown in the composite of FIGS. 9A, 9B and 9C.

The effect of the circuit is to compute the recomposed value of the aperture function at each position of the array of R by C inputs, using the same sequence of operations that would be used to compute the aperture function on each tile individually. If any positions are not required in the output stream, circuitry may be added to omit them, producing tiled or spaced output rather than fully overlapping output.

源通道乘積802及源控制信號814可供合成器901、902a、902b、902c、903a、903b、903c、904、905a、905b、905c、906、907a、907b及907c中之各者使用。源控制信號亦連接至延遲908a、908b、908c、908d、908e及908f。輸出通道控制及計數器815可供延遲909、910a及910b以及終結函數911使用。在且僅在操作次序不改變之情況下，額外管線級可手動地或由自動化工具插入，以使電路佈線對於給定時脈頻率為可行的。時序控制及計數器信號可供電路之所有元件使用，並且未個別地展示。 Source channel products 802 and source control signals 814 are available to each of synthesizers 901, 902a, 902b, 902c, 903a, 903b, 903c, 904, 905a, 905b, 905c, 906, 907a, 907b, and 907c. Source control signals are also connected to delays 908a, 908b, 908c, 908d, 908e, and 908f. Output channel control and counter 815 are available to delays 909, 910a, and 910b and termination function 911. Additional pipeline stages may be inserted manually or by automated tools to make the circuit routing feasible for a given clock frequency, provided that the order of operations is not changed. Timing control and counter signals are available to all components of the circuit and are not shown individually.

Each synthesizer has a dedicated direct connection either to specific input products or, alternatively, to a programmable multiplexer that selects one of the products for each input value in the set and is preconfigured before the circuit executes. Each dedicated connection is a parallel path with enough wires to carry the bits representing the required products in a single input interval. Using the optional preconfigured multiplexers to select which product of each set element is sent to each synthesizer allows the weight values to be upgraded in the field. Fixed connections are used when the weights will not be upgraded and remain fixed for the life of the device. Because the selection of weights does not change during operation, the choice between fixed and variable product selection does not affect the operation of the circuit.

Each synthesizer receives from the mass multiplier the set of products corresponding to the weights of its sub-function, one product per input channel, and performs the sub-function computation, typically simply adding them all together, to form this synthesizer's contribution to the value of the overall aperture function. Except for those corresponding to the left column of the aperture function, each synthesizer also receives a partially completed result from the synthesizer immediately to its left. Except for those corresponding to the top row of the aperture function, each synthesizer may also receive a delayed partially completed result from a synthesizer on the row above. Each synthesizer has at most one connection from the left and one delayed connection from above, each of which is a parallel path with enough conductors to carry the bits representing the partially completed result as input to the synthesizer. As defined by the sub-function with respect to the position of the current input patch relative to the edges of the input array, each synthesizer performs one of three operations: combining its partial result with an initialization value (if any), combining its partial result with the partial result from the synthesizer to the left, or combining its partial result with a delayed partial result. The modified result is placed in an output register with enough bits to hold it, and is made available in the subsequent input interval to the synthesizer on the right and/or to the delay and finalization circuitry. This modified result may be a partial result, a complete result or a truncated result, depending on the position of the synthesizer in the aperture function and the state of the input stream position.

Synthesizer (0,0) is unique in that no synthesizer lies to its left or above it in the aperture function, so it always initializes the computation with each input set received.

Synthesizer (M-1, N-1) is unique in that the result it produces is always a final result, but it is structurally identical to all other synthesizers 903a, 903b or 903c.

The outputs of some synthesizers are tapped for delay or post-processing, in which case the path through such delay or post-processing is wide enough to carry the bits representing a partial, truncated or completed result. The outputs of other synthesizers are used only by the synthesizer to the right. The computation inside a synthesizer and its output data format do not need to change according to how the output is used.

The finalization circuit takes results from several possible sources and multiplexes them to select which source is processed in any given interval. After the finalization function (if any) is applied, the width of the final output may be reduced. It forms the output stream of this embodiment of the invention, which may serve as the input stream of a subsequent stage, may form the final output of a system containing the invention, or may be used in further processing.

Data paths on the unique device in embodiments of the invention are indicated in FIGS. 9A, 9B and 9C by bold lines with the direction shown by arrowheads, and ellipses indicate where the last column or row in a range is repeated in its entirety. Data path (a) from source channel products 802 is a set of parallel conductive paths, one path dedicated to each product of an input component, each product being the value of the input component multiplied by one of the weight values of the aperture function. A 5 by 5 aperture function has 25 weight values for each input component. For an aperture function over an R by C input array of R, G and B color pixels, there are therefore 75 weight values. In that case line (a) has 75 parallel paths, each being a set of parallel conductors wide enough to carry the number of bits required for accuracy. Line (a) is known in the art as a set of point-to-point connections, as opposed to a bus.
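The path count for line (a) follows directly from the aperture size and the number of input components. The short sketch below reproduces the 25 and 75 figures from the text; the 16-bit product width in the last line is purely an assumed value for illustration, as the text does not state a bit width.

```python
def product_paths(m, n, channels):
    """Dedicated product paths in line (a): one per weight per input component."""
    return m * n * channels


def conductors(m, n, channels, bits_per_product):
    """Total parallel conductors for an assumed product width in bits."""
    return product_paths(m, n, channels) * bits_per_product


print(product_paths(5, 5, 1))    # 25 weight values for one input component
print(product_paths(5, 5, 3))    # 75 paths for R, G and B components
print(conductors(5, 5, 3, 16))   # 1200 conductors at an assumed 16 bits per product
```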

Data paths (b) in FIGS. 9A, 9B and 9C are not extensions of line (a), but dedicated connections to a specific subset of the paths in line (a). Lines (b) are not labeled in every instance in FIGS. 9A, 9B and 9C, but every connection from line (a) directly to an individual synthesizer is a dedicated line (b). The dedication is that each synthesizer is connected to the subset of paths carrying the products of each input component with the weight values required by that synthesizer.

Data paths (c) in FIGS. 9A, 9B and 9C are point-to-point paths between the output register of each synthesizer and the next synthesizer to the right. These are dedicated paths of the width required for accuracy, and they typically carry partial sums, as described in detail elsewhere in this specification. Not every path (c) is labeled in the figures, but in this example every direct connection from one synthesizer to another may be assumed to be a path (c). Note that there are instances where an output path (c) branches to alternative circuitry.

Another distinct data path in embodiments of the invention is labeled (d) in FIGS. 9A, 9B and 9C. These are dedicated data paths from delay circuits, such as circuits 908a through 908f, either back to synthesizers one row down and to the left, or directly to other delay circuits. The delay circuits are constructed to accept partial sums at the right end of a row of synthesizers, to delay passing them on for a specific number of source intervals, and then to pass them to another synthesizer and/or other processing at the proper time. The overall functionality is described in detail elsewhere in this specification. Paths (d) between delay circuitry are similar dedicated paths, typically used for passing partial sums at certain source intervals.

If either M or N is reduced so that the last rows or columns of a range are not needed, the ending elements are omitted and the implementation of the first row or column of the range is retained. In the degenerate case where one or both of M and N is reduced to 2, the first and last rows or columns are retained and the middle rows or columns are omitted. In the degenerate case where one of M and N is reduced to 1, the implementations of the first and last synthesizers are combined, and no special initialization is required. In the specific case where both M and N are 1, no inversion of the aperture function is required, but the use of the mass multiplier still offers distinct advantages.

Source channel products 802 may be any set of binary values that are simultaneously associated with a specific position of the R by C array and presented in some predefined sequence. The source channels of the input stream may be any combination of integer or fractional values, in any format and of any nature defined for the inputs of the aperture function. One example is pixel values from one or more video frames and/or any other sensor values scaled to match the array size R by C, as well as feature component values produced as the output of CNN layers. It is emphasized that each node embodying the invention may accept outputs from other nodes in addition to, or in place of, primary source inputs. Although in embodiments of the invention one or more first nodes typically accept image pixels as the primary input of the system, there is no restriction on the nature of the data processed, provided it can be formatted into a stream representing an R by C array.

In one embodiment of the invention, sets of source stream elements are presented in row-major order, with each succeeding column presented in strict ascending order. In some embodiments the rows and columns need not correspond to horizontal or vertical axes, but may be arbitrary, as when columns are scanned upward or downward and from left to right. Rows R and columns C here refer only to the major and minor axes of the stream format. The circuitry need not be adjusted for input signals that produce the input stream in an orientation other than the standard video order of left to right, top to bottom. The orientation of the aperture sub-functions can be made to conform so as to produce the same output for each input array position.
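As an illustration of the stream order described here, the following minimal sketch emits the elements of a small array in row-major order with columns ascending within each row. The generator name and the sample `frame` are assumptions made for the example.

```python
def row_major_stream(array):
    """Yield (row, column, value) with rows as the major axis, columns ascending."""
    for r, row in enumerate(array):
        for c, value in enumerate(row):
            yield r, c, value


frame = [[10, 11, 12],
         [20, 21, 22]]
order = [(r, c) for r, c, _ in row_major_stream(frame)]
print(order)
```

A source scanned in some other orientation would feed the same kind of stream by treating its own major axis as the rows.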

In this example, the source inputs, which are the products of the source values and the weights required by the aperture function, are presented together with a signal (SRCEN, see FIG. 8B) indicating when each new set of elements is valid. Input may be paused and resumed at any time. In some instances a minimum interval between inputs may be defined, and the circuit may use multiple cycles or a higher-speed clock to reduce size or power or to gain other benefits, and the output channel set may use the same minimum interval.

Common control and synchronization circuitry 803 (FIG. 8B) provides counters and control signals that describe the current input position in the R by C array. The counters may continue running for additional rows and columns after the final input, to assist finalization function 911 (FIG. 9C) in outputting the accumulated outputs that the last row of input generates beyond the input columns (see FIGS. 12, 13 and 14 and the description below). The control signals are available to all other elements and are not shown in FIGS. 9A, 9B and 9C.

Synthesizer circuits 901, 902a, 902b, 902c, 903a, 903b, 903c, 904, 905a, 905b, 905c, 906, 907a, 907b and 907c each compute the part of the aperture function assigned to their position in the M by N function. All synthesizers operate on the same source channel set and on the row and column counter states provided by control 803. Details of the data handling of the aperture function are described further below with reference to additional figures.

As source input sets are received from the input stream, the partially completed computations of the aperture function, as applied to all patches that overlap the current position in the input stream, are passed from left to right and from top to bottom within the M by N array of synthesizers. This accumulates the complete computation of the aperture function over time and outputs the correct implementation of the aperture function over each patch of the input array, producing the same result through the same order of operations as if the aperture function were implemented by reading the input values directly from the array. Replacing random access to the array with stream access is an important feature of the invention and eliminates the need for redundant access to a random access memory.
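The equivalence between stream access and random access can be checked with a small functional model. This is a sketch under stated assumptions, not the circuit itself: it uses a single input channel, a weighted sum as the aperture function, an odd-sized aperture centered on the output position, and a dictionary of partial sums that stands in for the synthesizer registers and delay lines. The function names are hypothetical.

```python
def direct(x, w):
    """Reference: evaluate each patch by reading the array with random access."""
    R, C, M, N = len(x), len(x[0]), len(w), len(w[0])
    cm, cn = (M - 1) // 2, (N - 1) // 2       # center of an odd-sized aperture
    out = [[0] * C for _ in range(R)]
    for r in range(R):
        for c in range(C):
            acc = 0
            for i in range(M):
                for j in range(N):
                    rr, cc = r + i - cm, c + j - cn
                    if 0 <= rr < R and 0 <= cc < C:   # patch truncated at the edges
                        acc += w[i][j] * x[rr][cc]
            out[r][c] = acc
    return out


def streamed(x, w):
    """Streaming model: each input is read once and folded into every open patch."""
    R, C, M, N = len(x), len(x[0]), len(w), len(w[0])
    cm, cn = (M - 1) // 2, (N - 1) // 2
    partial = {}                               # patch origin -> running partial sum
    for r in range(R):                         # inputs arrive in row-major order
        for c in range(C):
            v = x[r][c]
            for i in range(M):
                for j in range(N):
                    origin = (r - i, c - j)    # the patch whose element (i, j) is here
                    partial[origin] = partial.get(origin, 0) + w[i][j] * v
    # output (r, c) belongs to the patch centered on input position (r, c)
    return [[partial[(r - cm, c - cn)] for c in range(C)] for r in range(R)]


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
w = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(streamed(x, w) == direct(x, w))  # True
```

In both functions each patch receives its contributions in the same order, rows first and then columns, which is why the results agree term by term. The streaming version never revisits an input value.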

In the right-hand columns […] through N-1 of the synthesizers, except in the bottom row, the partial outputs are passed to delay stages 908a, 908b, 908c, 908d, 908e and 908f, where they are held for the required number of input intervals so that they are available for further computation at the same logical patch position when the inputs corresponding to the lower rows of the patch are received.

When the last column C-1 of each input row is processed, all synthesizers from columns […] through N-1 and rows 0 through M-2 also represent the last computation for that row of the patches that include the last column of the input array. Their values are forwarded to delay stages 908a, 908b, 908c, 908d, 908e and 908f and require special processing to be inserted into the sequence, so that they are available at the correct time to continue computing the aperture function when subsequent input rows are received. See FIG. 11 and the associated description.

In this example, synthesizer 903c at position (M-1, N-1) always produces a completed accumulation of the M by N sub-function elements, but is otherwise indistinguishable from the other synthesizers of configuration 903c. As noted above, when the last column C-1 of each input row is processed, all synthesizers on row M-1 from columns […] through N-1 also represent completed but truncated accumulations of the aperture function elements, and are sent directly to finalization function 911 for processing and insertion into the output stream.

In this example, when the last row R-1 of the input is processed, the synthesizers in column N-1 from rows […] through M-1 also represent completed but truncated accumulations of the sub-function element computations. They are sent to truncated-output delay lines 909, 910a and 910b and held there until the primary outputs from row M-1 have been finalized at 911. With control signals as shown in FIG. 8B, the additional M-[…] rows of truncated outputs are transferred from delay lines 909, 910a and 910b, finalized at 911, and ultimately provided to output stream sink 705 at any required timing interval.

FIG. 15 is a diagram illustrating the specific case of pipeline operation in an embodiment of the invention that implements a 5 by 5 convolution node.

Source channel products 802 and the source control signals (not shown here) are available to each of synthesizers 901, 902a, 902b, 903a, 903b, 904, 905a, 905b, 906, 907a and 907b. The source control signals are also connected to delays 908a, 908b, 908c and 908d. Output channel control and counters are available to delays 909 and 910a and to finalization 911. Additional pipeline stages may be inserted, manually or by automated tools, to make circuit routing feasible for a given clock frequency, if and only if the order of operations is not altered. Timing control and counter signals are available to all elements of the circuit and are not shown individually.

As each set of source channel products is presented in turn, each synthesizer selects the appropriate products to compute the sub-function corresponding to its position in the aperture function. Each 5 by 5 patch that intersects the current position in the input array is amended to include the computation based on the products at that position. The net effect is that the single source input stream is transformed into a set of 5 by 5 parallel streams of partial computations, which are passed between the synthesizers until all operations on each patch are complete. Completion normally occurs in synthesizer (4,4), and occasionally elsewhere when the right or lower edge of the input array is being processed.

Note that only the width of the input array affects the size of the delay elements, because each delay element must delay a partial result by the number of source input intervals between receiving the input at one column and receiving the input at the same column on the next row.
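The dependence on width alone can be seen from the stream index of a position in a row-major stream. The sketch below is illustrative and assumes one input per source interval with no pauses; the function names are hypothetical.

```python
def stream_index(row, col, cols):
    """Interval at which input (row, col) arrives in a row-major stream of width cols."""
    return row * cols + col


def intervals_to_next_row(row, col, cols):
    """Intervals between an input and the input at the same column on the next row."""
    return stream_index(row + 1, col, cols) - stream_index(row, col, cols)


# The gap equals the array width C wherever the position lies, and the
# number of rows R does not appear in the calculation at all.
print(intervals_to_next_row(0, 0, 640))
print(intervals_to_next_row(17, 123, 640))
```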

FIG. 16 illustrates a 4×4 embodiment of the IC of the invention. Kernels are known to have either an odd or an even number of sub-functions in a row or column. This even version is degenerate in the sense that element 910*, shown in the general case in FIG. 9C and in FIG. 15 for the specific case of a 5×5 aperture function (odd in both rows and columns), does not occur at all, because the extra lines of output processing are omitted.

Odd-sized kernels are symmetric about the center in both directions, whereas the center is offset for even sizes. The IC in embodiments of the invention places the center for even sizes to the right of and below the natural division, at position ([…], […]). In an alternative embodiment of the invention, the circuit may be modified to position the center above and to the left of the natural division.

Apart from these remarks, the operation of the specific IC of FIG. 16 is as described for the other versions.

FIG. 10A is a diagram illustrating the internal structure and operation of synthesizers 905a, 905b and 905c of FIGS. 9A and 9B or FIG. 15 in an embodiment of the invention. Source input set 1001 of stream values in the channel set (which may be of a single data type or a mix of data types, as required by the aperture function) is used by circuitry 1004 to compute the contribution of each individual synthesizer.

Circuitry 1005 uses the output of 1004 to compute the initial value of the sub-function. Circuitry 1006 uses the output of 1004 and the partial value previously computed by the synthesizer immediately to the left (1002) to compute the ongoing partial value of the sub-function. Circuitry 1007 uses the output of 1004 and the partial value previously computed on the synthesizer row immediately above and delayed by one of 908a, 908b, 908c, 908d, 908e and 908f (1003) to compute the ongoing partial value of the sub-function.

Circuitry 1005, 1006 and 1007 may operate contemporaneously with circuitry 1004 (in the same clock cycle), using its shared output, or may be implemented as a series of pipeline stages synchronized by the same clock.

Multiplexer 1008 selects which variant of the partial result is forwarded as the partial value of the sub-function, as the output of the synthesizer 1009. If COLFST 811 is not asserted, the output of 1006 is selected; otherwise, if ROWFST 808 is not asserted, the output of 1007 is selected; otherwise the output of 1005 is selected.
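The behavior of a FIG. 10A style synthesizer over one input interval can be sketched as follows. This is a behavioral illustration only: it assumes the typical case in which the sub-function simply adds its products, and the function name, argument names and `base_value` parameter are hypothetical. The element numbers in the comments refer to the circuitry described above.

```python
def synthesizer_10a(products, left_partial, above_partial,
                    colfst, rowfst, base_value=0):
    """Return the synthesizer output for one input interval."""
    contribution = sum(products)                # 1004: this synthesizer's contribution
    initial = base_value + contribution         # 1005: start a new patch
    from_left = left_partial + contribution     # 1006: continue from the left neighbor
    from_above = above_partial + contribution   # 1007: continue from the delayed row above
    # 1008: selection priority as stated in the text
    if not colfst:
        return from_left
    if not rowfst:
        return from_above
    return initial


print(synthesizer_10a([1, 2, 3], 10, 100, colfst=False, rowfst=False))  # 16
print(synthesizer_10a([1, 2, 3], 10, 100, colfst=True, rowfst=False))   # 106
print(synthesizer_10a([1, 2, 3], 10, 100, colfst=True, rowfst=True))    # 6
```

The other synthesizer variants described for FIGS. 10B through 10G correspond to the same sketch with one or two of the three candidate values removed.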

This conditional processing is a natural consequence of allowing the M by N aperture function to extend over the edges of the source input stream representing the R by C array of value sets. A single position on the leftmost or topmost edge will be the first computable element of the aperture function for several patches that abut or overlap those edges. Each synthesizer located at the first computable position of an overlapping patch therefore needs to be initialized with the base value of the aperture function. In addition, each synthesizer located at the first computed position of a subsequent row of a patch must be combined with the prior partial value of the same patch computed from the immediately preceding row. In this way a single circuit ensures correct computation of all patches that overlap, abut or lie inside the topmost and leftmost edges.

In FIGS. 10B through 10G, all elements introduced in FIG. 10A and bearing the same element numbers are functionally identical to those described with reference to FIG. 10A.

FIG. 10B is a diagram illustrating the internal structure and operation of synthesizers 902a, 902b and 902c of FIGS. 9A and 9B or FIG. 15 in an embodiment of the invention. Source input set 1001 of stream values is used by circuitry 1004 to compute the synthesizer's contribution to the aperture function.

Circuitry 1005 uses the output of 1004 to compute the initial value of the sub-function, and circuitry 1006 uses the output of 1004 and the partial value previously computed by the synthesizer immediately to the left (1002) to compute the ongoing partial value of the sub-function.

Multiplexer 1010 selects which variant of the partial result is forwarded as the partial value of the sub-function, as the output of the synthesizer 1009. If COLFST 811 is not asserted, the output of 1006 is selected; otherwise the output of 1005 is selected.

FIG. 10C is a diagram illustrating the internal structure and operation of synthesizer 904 of FIG. 9A or FIG. 15 in an embodiment of the invention. Source input set 1001 of stream values is used by circuitry 1004 to compute the contribution of each individual synthesizer.

Circuitry 1005 uses the output of 1004 to compute the initial value of the sub-function, and circuitry 1007 uses the output of 1004 and the partial value previously computed on the synthesizer row immediately above and delayed by one of 908a, 908b, 908c, 908d, 908e and 908f (1003) to compute the ongoing partial value of the sub-function.

Multiplexer 1011 selects which variant of the partial result is forwarded as the partial value of the sub-function, as the output of the synthesizer 1009. If ROWFST 808 is not asserted, the output of 1007 is selected; otherwise the output of 1005 is selected.

FIG. 10D is a diagram illustrating the internal structure and operation of synthesizer 901 of FIG. 9A or FIG. 15 in an embodiment of the invention. Source input set 1001 of stream values is used by circuitry 1004 to compute the contribution of each individual synthesizer.

Circuitry 1005 uses the output of 1004 to compute the initial value of the sub-function, which is forwarded as the partial value of the sub-function, as the output of the synthesizer 1009.

Cell 901 (FIG. 9A, FIG. 15) is always the first value in any complete or truncated patch in which it is used, and therefore always produces the initialization value for the patch.

FIG. 10E is a diagram illustrating the internal structure and operation of synthesizers 903a, 903b and 903c of FIGS. 9B and 9C or FIG. 15 in an embodiment of the invention. Source input set 1001 of stream values is used by circuitry 1004 to compute the contribution of each individual synthesizer.

Circuitry 1006 uses the output of circuitry 1004 and the partial value previously computed by the synthesizer immediately to the left (1002) to compute the ongoing partial value of the sub-function, which is forwarded as the partial value of the sub-function, as the output of the synthesizer 1009.

FIG. 10F is a diagram illustrating the internal structure and operation of synthesizers 907a, 907b and 907c of FIGS. 9A and 9B or FIG. 15 in an embodiment of the invention. Source input set 1001 of stream values is used by circuitry 1004 to compute the contribution of each individual synthesizer.

Circuitry 1006 uses the output of circuitry 1004 and the partial value previously computed by the synthesizer immediately to the left (1002) to compute the ongoing partial value of the sub-function. Circuitry 1007 uses the output of 1004 and the partial value previously computed on the synthesizer row immediately above and delayed by one of 908a, 908b, 908c, 908d, 908e and 908f (1003) to compute the ongoing partial value of the sub-function.

Multiplexer 1012 selects which variant of the partial result is forwarded as the partial value of the sub-function, as the output of the synthesizer 1009. If COLFST 811 is not asserted, the output of 1006 is selected; otherwise the output of 1007 is selected.

FIG. 10G is a diagram illustrating the internal structure and operation of synthesizer 906 of FIG. 9A or FIG. 15 in an embodiment of the invention. Source input set 1001 of stream values is used by circuitry 1004 to compute the contribution of each individual synthesizer.

Circuitry 1007 uses the output of circuitry 1004 and the partial value previously computed on the synthesizer row immediately above and delayed by one of 908a, 908b, 908c, 908d, 908e and 908f (1003) to compute the ongoing partial value of the sub-function. The output of circuitry 1007 is forwarded as the partial value of the sub-function, as the output of the synthesizer 1009.

FIG. 11 is a diagram illustrating the internal structure and operation of intra-row delay lines 908a, 908b, 908c, 908d, 908e and 908f (FIG. 9C). The delay lines retain the results of the partial computations from each row of synthesizers for use in the next row.

When COLLST is asserted, the current position of the source input stream is at the rightmost edge, and the outputs of synthesizers […] (1101) through N-2 (1102) of the row are retained by registers 1104 through 1105, respectively, for future reference.

If the current position of the source input stream, colSrc, is less than […], multiplexer 1106 selects from the retained values in reverse order, from right to left, as defined by the index calculation (N-2)-colSrc; otherwise it selects the current value from the last synthesizer of row m (1103).

It should be noted that when the column position of the source input stream is less than [expression missing from the source], the rightmost synthesizer of the row does not contain valid data, which makes these time slots available for inserting the retained data.

The partial output selected by multiplexer 1106 is fed to a first-in, first-out (FIFO) circuit 1107 having C-N positions, which is configured so that for each source input stream position processed, exactly one value is inserted and exactly one value is extracted, in the same order as inserted. Because the partially completed result from one position is not needed until the source input stream returns to the same patch position on the next row, this realizes a delay such that the partial results computed by one row are presented to the next row precisely when needed.
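The insert-one, extract-one behavior of such a FIFO can be illustrated in software. The following is a minimal sketch, not the circuit of FIG. 11: the class name `DelayLine`, the fill value, and the example values of C and N are assumptions chosen only for illustration.

```python
from collections import deque


class DelayLine:
    """Fixed-depth FIFO: each step inserts one value and extracts one value.

    The value returned by step() is the one inserted `depth` steps earlier,
    so the structure behaves as a pure delay of `depth` stream positions.
    """

    def __init__(self, depth, fill=0):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        # Pre-fill so that exactly one value can be extracted from the start.
        self._slots = deque([fill] * depth, maxlen=depth)

    def step(self, value):
        oldest = self._slots[0]      # value inserted `depth` steps ago
        self._slots.append(value)    # a full deque drops its oldest entry
        return oldest


# Hypothetical sizes: array width C = 8, subfunction width N = 3.
C, N = 8, 3
line = DelayLine(C - N)
outputs = [line.step(v) for v in range(12)]
```

With these assumed sizes the first C-N outputs are the fill value, after which each input reappears exactly C-N positions later, in the order it was inserted.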

The partial output selected by multiplexer 1106 also feeds the same value (1114) into the final result delay lines 909, 910a, and 910b.

The partial output extracted from FIFO 1107 is routed at 1108 to the leftmost synthesizer of the next row (1111) and to a series of parallel-access registers 1109 to 1110, which further delay the partial output by one source input stream interval as the data passes through the register chain.

When the current position of the source input stream is at the leftmost edge, the FIFO presents the output data at 1108, and the delayed results 1109 to 1110 are available to the cells of the next row at 1111 and 1112 to 1113, respectively.

It should be noted that the additional values from the right side of the source input array stream, inserted into FIFO 1107 by multiplexer 1106, are accessed via path 1111 only when the source input array stream position is near the right edge, whereas the additional parallel paths 1112 to 1113 are used only when the source input array stream is at the leftmost position, to access data normally inserted from path 1103. The evident similarity in structure and requirements between right-edge and left-edge processing is a natural consequence of the symmetry of the subfunction's overlap with the right and left edges of the source input stream array. When the value of N is even, the numbers of additional cells processed to support the right edge and the left edge are not the same.

FIG. 12 is a diagram illustrating the internal structure and operation of the final truncated-result delay line 909 (FIG. 9C).

When the last row of the source input stream array is processed, the partial results from the auxiliary output 1201 of intra-row delay line 908d are treated as the final results of the final row of truncated patches and are retained in FIFO 1202, whose number of elements C equals the width of the source input stream array.

Immediately after the final results of the truncated patches are recorded, the output of FIFO 1202 is passed via 1203 to a further delay line 910a or, if the value of M is such that no other delay line intervenes, directly to final processing 911.

FIG. 13 is a diagram illustrating the internal structure and operation of the final truncated-result delay lines 910a and 910b.

When the last row of the source input stream array is processed, the partial results 1301 from the auxiliary outputs of intra-row delay lines 908e to 908f are treated as the final results of the final row of truncated patches and are retained in FIFO 1304, whose number of elements C equals the width of the source input stream array.

When POSTEN is asserted, multiplexer 1303 switches from taking values from 1302 to taking values from the final truncated-result delay line of the row above, which has the effect of presenting the final truncated results in row-first order, compatible with the ordering of all previous output results.

It should be noted that during the cycle of the input frame in which POSTEN is first asserted, the contents of FIFOs 1202 and 1304 are the final values of the truncated patches that overlap the last row of the source input stream array. Any data held in FIFOs 1202 and 1304 before that cycle is never processed, so any suppression of operation while the final row of the source input stream array is not being processed is optional.

Immediately after the final results of the truncated patches are recorded, the output of FIFO 1304 is passed via 1305 to a further delay line or, if the value of M is such that no other delay line intervenes, directly to final processing 911.

FIG. 14 is a diagram illustrating the internal structure and operation of the final processing of all complete and truncated results.

As in FIG. 11, and with the same structure and function, if the current position of the source input stream is at the rightmost edge, the outputs of the cells of row M-1, from [expression missing from the source] (1101) to N-2 (1102), are retained by registers 1104 to 1105, respectively, for future reference.

If the current position of the source input stream is less than [expression missing from the source], multiplexer 1106 selects from the retained values in reverse order, right to left; otherwise, the multiplexer selects the current value from the last synthesizer of row M-1 (1103).

While the source input stream array is being processed, multiplexer 1402 feeds the result selected by multiplexer 1106 directly to finalization (1403). During the post-processing phase, the output of the truncated-result delay line 1401 is selected for finalization (1403) instead.

Finalization circuitry 1403 performs all additional computation, if any, needed to produce the final form of the output stream (1404) from the synthesized patch results. This typically takes the form of a rectified linear activation (ReLU) function, whereby negative values are set to zero and over-limit values are set to the maximum acceptable value, or it may be any other desired conditioning function, such as a sigmoid or hyperbolic tangent. The post-processing function need not complete within a single source input stream cycle, but it must accept each final result at the rate of the source input stream array.
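The clamped rectifier described above can be written in a few lines. This is a sketch only: the function name `finalize` and the default ceiling of 255 (an 8-bit output) are assumptions for illustration and are not taken from the description.

```python
def finalize(value, max_value=255):
    """Clamped ReLU: negative values become 0, over-limit values become max_value.

    max_value=255 is an assumed ceiling for an 8-bit output channel.
    """
    return min(max(value, 0), max_value)


# Applied to a short stream of hypothetical synthesized patch results.
stream = [-3, 0, 42, 999]
finalized = [finalize(v) for v in stream]
```

Any other conditioning function, such as a sigmoid or hyperbolic tangent, would take the place of `finalize` without changing how the stream is consumed.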

When DSTEN is asserted, finalization circuitry 1403 presents the final result as one value of the destination output stream. Whenever DSTEN is not asserted, any partial or incorrect value produced by finalization circuitry 1403 is ignored, so any suppression of operation while results are unused is optional.

In one implementation, the destination output stream array is processed by circuitry similar to that described above. In that case, it is advantageous for the final truncated results to have the same timing as all previous final results. To that end, control of FIFOs 1202 and 1304 is coordinated by control circuitry 702 to maintain the same output rate as the primary output rate.

In another implementation, the destination output stream array is the final stage of the system, and no further processing is required. In that case, it is advantageous to deliver the final truncated results as quickly as possible. To that end, control of FIFOs 1202 and 1304 is coordinated by control circuitry 702 to output those results at the maximum supported frequency.

It should be noted that the implementation described above produces a single output element from the full set of input elements. In a complete system that produces a large set of output elements from the input set, the entire mechanism described is replicated once per output channel, except for control circuitry 702, which can be shared among the output channels because the timing of all individual subfunctions is the same for the entire output set.

The inventors have built a working prototype of the IC in an embodiment of the invention to test and verify the details and features of the invention, and operation of the prototype confirms the above description. The inventors have also developed a software-supported simulator, which has been used up to the time of filing this application to test and verify the above details and description.

In another aspect of the invention, a system is provided that accepts an input stream of three-dimensional data, as commonly presented in medical imaging, with additional circuitry and buffering that allow a three-dimensional aperture function to pass over the three-dimensional input array, with the corresponding computation correctly handling both the interior and the edge cases of the first and last planes.

In yet another aspect of the invention, a hardware-assisted neural network training system is provided for the complex process of training a deep neural network (DNN), in which most of the work is done by the forward inference engine, and the training algorithm need only use statistics gathered from forward inference to periodically adjust the weights and biases of the whole network so that the model converges to the desired state. With the addition of appropriate accumulators that sum the input states as the forward inference process is computed, the invention forms a hardware-assisted neural network training system.

In yet another aspect of the invention, with regard to the well-known problem in which limits on floating-point accuracy impede the convergence of DNN models (known in the art as the "vanishing gradient problem"), the individual mass multipliers, which have limited bit-width precision, can be cascaded with additional adders to produce floating-point products of arbitrarily large precision. Although this innovation is not generally needed for forward inference computation, it can be extremely important in a DNN trainer to avoid the problems that arise when the computed gradients become too small to measure.
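The general idea of building a wide product from narrow multipliers and adders can be sketched as follows. This is an illustration of the standard limb-by-limb technique on non-negative integer mantissas, not the circuit of the invention: the function names, the 8-bit limb width, and the limb count are assumptions, and exponent handling, signs, and rounding of a real floating-point product are omitted.

```python
def split_limbs(x, limb_bits, count):
    """Split a non-negative integer into `count` limbs, least significant first."""
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(count)]


def wide_multiply(a, b, limb_bits=8, limbs=4):
    """Multiply using only limb_bits x limb_bits products plus shifted additions."""
    width = limb_bits * limbs
    if a < 0 or b < 0 or a >> width or b >> width:
        raise ValueError("operands must be non-negative and fit in the given width")
    total = 0
    for i, a_limb in enumerate(split_limbs(a, limb_bits, limbs)):
        for j, b_limb in enumerate(split_limbs(b, limb_bits, limbs)):
            partial = a_limb * b_limb                    # one narrow multiplier
            total += partial << ((i + j) * limb_bits)    # one adder stage
    return total


product = wide_multiply(0xDEADBEEF, 0x12345678)
```

Increasing `limbs` extends the precision of the product without widening any individual multiplier, at the cost of more partial products and adder stages.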

N-up parallel processing

In the embodiments and implementations of the invention described above, the focus has been on apparatus and methods for performing mass multiplication in functions that require multiplication, and on the execution of aperture functions by the novel IC in a convolutional neural network (CNN). However, as is well known in the art, a complete deep neural network (DNN) must implement a full set of quite different aperture functions, many of which may require only minimal computation.

To conform to embodiments of the invention, each such implementation must follow the overall system-wide pipeline format, accepting input as a stream of parallel values representing an array in a consistent order and simultaneously producing output as a stream of parallel values representing an array in that same order. The final node of the DNN may return a conclusion reflecting an array of positions, or a conclusion about the input array as a whole. The embodiments of the invention described below are directed to executing a DNN in a novel IC that supports pipelined execution.

In an aspect of the invention, the inventors have developed a method and apparatus that significantly accelerate pipelined operation in CNNs and DNNs. In some embodiments, the inventors propose pipelined operation in which inputs are streamed to the IC several at a time, in parallel. In the embodiments described above, in all implementations, the input is typically streamed left to right across the columns and then top to bottom across the rows. Taking RGB data as an example, each pixel position takes the form of three individual channels, typically 8 bits each, representing the three independent RGB color values observed at that pixel position. The inventors refer to this as a 1-up implementation. 1-up means that the input values of one pixel are streamed at a time or, more generally, that the values of one input position of the input array are streamed at a time.

The inventors believe that considerable advantage can be gained by streaming the inputs more than one input position at a time, for example more than one pixel at a time in the pixel example. To do so, circuitry must be added to the novel IC that processes the input stream to produce the output stream. The change is generally one of size rather than complexity, because the circuitry implemented for the 1-up case is replicated in the IC to process the input values for the additional input positions (in this example, pixels) in parallel.

Although the circuitry is least problematic when the width of each row is an integer multiple of the number of inputs to be streamed in parallel, this is not a limitation required by the invention. In the pixel example, at a resolution of 1920×1080, the number of pixels in a row (1920) is divisible by 1, 2, 3, 4, 5, 6, 8, 10, and 12. Streaming the RGB values of two pixels at a time (referred to as 2-up) is therefore an efficient approach, as are 3-up and 4-up. As the number of pixels increases, the absolute size of the IC that handles all of the processing grows by a factor directly related to the number of pixels considered in parallel, so the user must make a reasonable choice.

However, as the stream passes down through the nodes of the DNN, the input array size typically shrinks in its dimensions, either when the stride of an aperture function is not 1 (not every input position produces an immediate output position) or when the aperture function is defined to avoid overlapping the edges of the input array. In these common cases, the width of the input array cannot be constrained to an integer multiple of any given number N of parallel positions. One solution is always to align the left edge of each row of the input array with a specific position (nominally the leftmost) in the group of N positions. The right edge may then be represented by an incomplete group, always starting from the first position in the group of N positions. Additional circuitry is then used to keep invalid data out of the computation and to suppress any output derived from that invalid data.
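The left-aligned grouping with an incomplete final group can be illustrated as follows. This is a software sketch under stated assumptions: the function name `group_n_up`, the use of `None` as padding, and the per-position validity flags are illustrative choices and are not taken from the description.

```python
def group_n_up(row, n):
    """Split one row into left-aligned groups of n parallel positions.

    Returns a list of (values, valid) pairs. The last group may be incomplete;
    its unused positions are padded with None and flagged invalid so that
    downstream logic can exclude them from computation and suppress outputs.
    """
    groups = []
    for start in range(0, len(row), n):
        values = list(row[start:start + n])
        missing = n - len(values)
        valid = [True] * len(values) + [False] * missing
        groups.append((values + [None] * missing, valid))
    return groups


# A hypothetical row of width 6 streamed 4-up: one full group, one partial group.
groups = group_n_up([10, 11, 12, 13, 14, 15], 4)
```

Here the second group carries two valid positions followed by two invalid ones, which corresponds to a right edge that starts at the first position of the group.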

In an embodiment of the invention, for a 2-up implementation in the pixel example, the R, G, and B values of each of two adjacent pixels are streamed to the IC as the pipeline input. The first two pixels are the first two pixels from the left in the top row. In the RGB example, there are six input values, namely the R, G, and B values of each of the first two pixels. The next two pixels in the row are next in the stream, and so on across the top row, followed by the R, G, and B values of the first two pixels in the second row, and so on through the input array. The same general protocol is followed for 3-up or 4-up.

FIGS. 17A and 17B illustrate a 1-up pipeline solution for a well-formed minimal DNN model that can be used to understand an image and respond with the relative activation strengths of the various objects the model has been trained to recognize. Input channels 1701 are presented as the input values of individual pixels in a specific order, typically left to right across each row and then top to bottom, as just described above. For RGB data, this takes the form of three individual channels, typically 8 bits each, representing the three independent color values observed at that position. Eight-bit channels do not limit the scope of the invention.

If the input to this DNN circuit is the output of another DNN circuit, as would naturally occur when a large DNN is partitioned into smaller pieces to ease processing, the channels presented are one channel for each feature passed into the DNN. For example, if a particular segment of the model requires 64 feature channels as input, each value is presented in parallel in a specified format, as an unsigned or signed integer or a floating-point value with the desired number of bits of precision.

It is important to understand that the blocks depicted in FIG. 17A (and in the other figures described) do not represent steps performed in sequence. Each block represents an input channel or circuitry that performs a function such as an aperture function. The arrows between blocks represent sets of parallel conductors that pass values between processing circuits. All processes are simultaneously active whenever input is presented to the block. When the input stream begins, the circuitry represented by the blocks becomes active one block after another until all processes are active and the output stream is being produced in its multiple channels. Emission of the final output for the first corner of the input array (nominally the upper left) begins while input is still being accepted.

The first 7 by 7 convolution node 1702 in the model is typical for RGB input in a DNN used for visual understanding. The 7 by 7 kernel may be applied only where the kernel patch fits within the bounds of the input array (typical for RGB input), or it may be applied at every input position with missing values synthesized (typical when reprocessing features). In general, a large number of output channels is produced (typically 64), and the number of channels in the rest of the system usually increases as the feature values pass through additional nodes.

Each of the subsequent convolution nodes 1703, 1704, and 1705 likewise accepts and produces a multi-channel array stream of the same dimensions as its input. The number of output channels of each convolution node is arbitrary and may be greater than, less than, or the same as the number of input channels.

Concatenation node 1706 in this model accepts the parallel input array streams produced by nodes 1704 and 1705 and synchronizes them to produce one combined set of channels. The values of the channels from the convolution nodes are not changed. However, because the nature of the pipeline is such that each output of the 1 by 1 convolution corresponding to a given array position is produced before the corresponding output of the 3 by 3 convolution, the concatenation function must provide buffering in the form of a first-in, first-out (FIFO) circuit so that all channels output data corresponding to the same position at the same time.
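The role of the FIFO in the concatenation node can be illustrated with a small cycle-by-cycle model. This is a sketch under assumptions: the function name `concatenate`, the representation of each position as a tuple of channel values, and the fixed `lag` between the two streams are all illustrative and are not taken from the description.

```python
from collections import deque


def concatenate(early, late, lag):
    """Align two streams where `late` trails `early` by `lag` cycles.

    `early` and `late` are equal-length lists of channel tuples, one tuple per
    array position. Values from the early stream wait in a FIFO until the
    matching position arrives on the late stream; channel values are unchanged.
    """
    fifo = deque()
    combined = []
    for cycle in range(len(early) + lag):
        if cycle < len(early):
            fifo.append(early[cycle])              # early node emits a position
        if cycle >= lag:
            combined.append(fifo.popleft() + late[cycle - lag])
    return combined


# Hypothetical data: one early channel, two late channels, a lag of two cycles.
early = [(1,), (2,), (3,)]
late = [(10, 20), (30, 40), (50, 60)]
combined = concatenate(early, late, lag=2)
```

In this model the FIFO never holds more than `lag + 1` entries, which suggests how the buffer depth relates to the latency difference between the two convolution paths.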

MaxPool node 1707 in this model uses an aperture function that compares all values of a patch and outputs only the maximum value, independently for each channel. The number of channels is unaffected, but the array dimensions of the input stream are reduced in the output stream. If, as is typical, the MaxPool node reduces the horizontal dimension by a factor of two and the vertical dimension by a factor of two, the output array is one quarter the size of the input array.

Because the frame rates of the input stream and the output stream must be the same (outputs cannot be produced faster than the inputs on which they are based, and outputs cannot be produced more slowly than the inputs or data would be lost), the net effect is that the clock rate for the reduced output array stream is reduced proportionally.

In this MaxPool example, because only one output is produced for each patch of four input positions, the required output rate is only one quarter of the input rate. All subsequent nodes in the pipeline therefore operate at a reduced effective throughput. As the number of channels grows larger, the reduced effective throughput can be advantageous. When more cycles are available for the required computation, some resources that could otherwise be dedicated to each channel can be shared among channels, giving an overall reduction in circuit size with only a small increase in power. The reduction in dimensions also forms an important basis for the invention.
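The one-output-per-four-inputs relationship can be checked with a small single-channel model. This is a sketch only: the function name `maxpool_2x2` and the assumption of a 2 by 2 patch with a stride of two in both directions are illustrative, and a real node would apply the same operation independently to every channel.

```python
def maxpool_2x2(array):
    """2x2 max pooling with stride 2 over one channel (a list of equal rows)."""
    rows, cols = len(array), len(array[0])
    return [
        [
            max(array[r][c], array[r][c + 1],
                array[r + 1][c], array[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical 4x4 single-channel input.
source = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = maxpool_2x2(source)
input_count = len(source) * len(source[0])
output_count = len(pooled) * len(pooled[0])
```

Sixteen input positions yield four output positions, so at equal frame rates the output stream needs only one quarter of the input clock rate.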

Subsequent nodes of the depicted model may use similar or dissimilar connection patterns, provided that each pattern supports the system-wide interface, in which all channels of data corresponding to a given position in any input array stream are presented simultaneously.

After MaxPool node 1707, values are streamed in this model to additional convolution, concatenation, and MaxPool nodes as depicted, but because these nodes are functionally identical to those already described, they do not have element numbers.

Global average node 1708 in FIG. 17B differs in that the aperture function of node 1708 covers the entire remaining dimensions of the preceding input array stream and simply returns the average of each channel over the whole array. The output array dimensions are therefore 1 by 1, forming the output channels 1709 of the entire circuit.

FIGS. 18A and 18B illustrate the overall structure and flow of a 4-up pipeline implementing a DNN model of the same form as that shown in FIGS. 17A and 17B.

Input channels 1801 are presented in parallel as four sets of data for each channel. For RGB data, this takes the form of four individual pixels representing four adjacent columns of the input array, each pixel containing three values (R, G, and B), for a total of 12 inputs accepted in parallel at the same time. Alternatively, the input channels may come from another DNN circuit, in which case they take the form of four complete sets of input channels representing four adjacent columns of the input array. For example, if the model requires 64 feature channels as input, the four sets contain a total of 256 parallel inputs.

The first 7 by 7 convolution node 1802 is typical for RGB input in a DNN used for visual understanding. In this 4-up implementation, node 1802 accepts input four pixels at a time and produces output four pixels at a time. The number of output channels is usually quite large compared with the number of input channels, 64 or more, and the channels no longer represent color information. Throughout the remainder of the DNN in this model, a channel represents the detected strength of a feature, or combination of features, found in the input array and has an independent value at each position. Each of the subsequent convolution nodes 1803, 1804, and 1805 likewise accepts and processes input four pixels at a time for each channel. Concatenation node 1806 accepts four sets of channels from convolution nodes 1804 and 1805 and outputs the combined channels in sets of four.

The first MaxPool node 1807 is labeled 4-up to 2-up. Node 1807 takes the maximum of the four samples of a patch of input array positions comprising two adjacent columns on two consecutive rows. Because the effect is to reduce the dimensions of the input array stream, producing an output array stream of half the width and half the height, the effective throughput of all subsequent nodes is reduced by a net factor of four. When single-input (1-up) processing is used, the subsequent processing clock can be reduced, which can be exploited by using more compact circuitry.

When N-up parallel input processing is used, the reduction in output array width is instead used to reduce the number of parallel outputs. Because the parallel inputs represent adjacent columns on the same row of the input array stream, only the reduction in width is relevant. Although it is possible to retain N-up parallel outputs at a reduced frequency, doing so offers no advantage in size or power. The net effect of MaxPool node 1807 is to reduce the parallelism in the horizontal dimension from 4-up to 2-up (as labeled) and to reduce the processing frequency by a factor of two, rather than by a factor of four as in the 1-up case described above.
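The bookkeeping of parallelism and clock rate through successive pooling nodes can be sketched as simple arithmetic. The function name `pool_stage`, its parameters, and the representation of the clock as a divisor of the input clock are assumptions made for illustration; the sketch assumes the pooling width divides the current parallelism whenever parallelism is reduced.

```python
def pool_stage(n_up, clock_divisor, pool_width=2, pool_height=2):
    """Return (n_up, clock_divisor) after a pooling node.

    The total rate reduction is pool_width * pool_height. When the stream is
    wide enough, the width reduction is absorbed by lowering the parallelism,
    and only the remaining factor lowers the clock.
    """
    reduction = pool_width * pool_height
    if n_up >= pool_width and n_up % pool_width == 0:
        return n_up // pool_width, clock_divisor * (reduction // pool_width)
    return n_up, clock_divisor * reduction


after_first = pool_stage(4, 1)               # 4-up input at the full clock
after_second = pool_stage(*after_first)      # second 2 by 2 pooling node
one_up_case = pool_stage(1, 1)               # 1-up input at the full clock
```

Under these assumptions the 4-up stream becomes 2-up at half the clock and then 1-up at one quarter of the clock, whereas a 1-up stream drops directly to one quarter of the clock after a single pooling node.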

Nodes 1808, 1809, 1810, and 1811 process data in 2-up parallel tracks, and each of these nodes is roughly half the size of its 4-up counterpart. This does not correspond to a reduction in power, because the total number of operations required is the same for a 4-up, 2-up, or 1-up circuit, and only the overhead of managing N-up coordination is reduced.

The second 2 by 2 MaxPool node 1812 likewise takes the maximum of the four samples of a patch of input array positions comprising two adjacent columns on two consecutive rows. The net effect of node 1812 is to reduce the parallelism in the horizontal dimension from 2-up to 1-up and to reduce the processing frequency by a factor of two. All subsequent nodes shown in FIG. 18B operate on a single set of their respective input and output channels, and the final output 1813 takes the form of a single sample of each channel, presented simultaneously in parallel.

FIGS. 19 and 20 are tables describing the array stream sizes of a typical small DNN applied to an input stream compatible with images in HD RGB format. The table of FIG. 19 describes a DNN implemented with 1-up processing only, as depicted in FIGS. 17A and 17B, and the table of FIG. 20 describes the same DNN initially implemented with 4-up processing and transitioning to 1-up processing in subsequent nodes, as depicted in FIGS. 18A and 18B.

Having described the nomenclature and general process of N-up parallel processing, the inventors now provide a specific example of an apparatus and method for applying a 3 by 3 convolution function to an input array using 4-up parallel processing. In this example, the input array is an array of RGB color pixels, as used in many other examples in this specification. It should be noted that this is not a limitation on the scope of the invention, because a 3 by 3 convolution with 4-up parallel processing can be used with many other input array formats. In this example, it should likewise be understood that the 3 by 3 block shown represents circuitry that executes the kernel function on the input stream.

FIG. 21 illustrates an example of circuitry on an IC that executes a 3 by 3 convolution node on a 4-up data stream. In FIG. 21, a set of four inputs 2101 is the set retained from the immediately preceding input interval, and it is held together with the current set of four inputs 2102 to provide all of the inputs required for all four output channels of the 3 by 3 convolution. Using the inputs from the immediately preceding input interval together with those from the current interval is necessary to compute the outputs completely in pipelined processing, as described in detail above.

p₀, p₁, p₂, and p₃ denote the input channel values at positions 0, 1, 2, and 3, respectively, in the first row of the input array. For brevity only a single symbol is used, but each symbol represents all channels of the input position. In the pixel case, each data point pₓ represents the R, G, and B values of that pixel.

w₀,₀ through w₂,₂ denote the set of weights to be applied to the values in the input channels. Because each weight is applied to one and only one input channel, the number of input channels does not affect the structure of the circuit, so multiple channels are not shown.

The weights in kernel rows 2103, 2104, and 2105 are applied in parallel (simultaneously) to input channels p₀, p₁, and p₂, and the partial products of each row's set of weights are summed directly according to the rules of the aperture function of a 3 by 3 convolution. As described in detail above for pipelined processing, partial sums are passed from each functional circuit to the next, and an output is produced when all necessary parts are complete. Applying the weights of row 2105 produces the final output of the kernel for the current row by combining its partial products with the sum, carried from the previous row, of the products of applying the weights of row 2104. Applying the weights of row 2104 produces an intermediate value by combining its partial products with the sum, carried from the previous row, of applying the weights of row 2103. Applying the weights of row 2103 produces the initial value by summing its partial products and retaining the sum for later use. A bias, if present, may be introduced at any stage. An activation function, if present, is applied to the final output of 2105.

The complete circuit implementing the weights of rows 2103, 2104, and 2105, including any bias and activation function, produces the first output channel of the 4-up set.

When the first 4-up set of the input array stream is presented, there is not enough data to compute all four required outputs, so the computation of all outputs is delayed until the second 4-up set has been acquired and valid data is available to compute with inputs from both sets 2101 and 2102.

Circuits 2106, 2107, and 2108 apply the weights in circuits that are replicas of the preceding circuit, and their functions differ only in the positions of the inputs to which the weights are applied. Note that the set of weights w₀,₀ through w₂,₂ is the same for all output channels, but each combination of one weight and one input channel is unique.

The output computed with the weights of row 2105 of the first kernel replica produces the first parallel channel set q₀ of the output array stream, and the outputs of the other kernel replicas 2106, 2107, and 2108 produce the remaining parallel channel sets q₁, q₂, and q₃ of output stream 2109, respectively.

Because the first output q₀ corresponds to a 3 by 3 kernel centered on p₁, the circuit corresponding to the arrangement of FIG. 21 is a solution for the inset, or "valid", version of the 3 by 3 convolution. The output array stream is therefore two positions narrower than the input array stream, consistent with the definition of that variation of the aperture function. (The height is normally also reduced by two rows, but that is unrelated to horizontal processing.)
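The horizontal mapping of FIG. 21 can be sketched in software. The following Python is a minimal illustration, not the disclosed circuit: it applies a single kernel row of three weights to one row of values delivered four at a time, retaining the preceding set so that q₀ through q₃ each see three consecutive positions. The function names, the zero-filled flush set at the end of the row, and the zeroing of invalid tail entries are assumptions made for the sketch, and vertical accumulation across the three kernel rows is omitted.

```python
def conv_row_valid_4up(row, weights):
    """Apply one 3-tap kernel row to `row` using 4-up sets; return the "valid" outputs."""
    width = len(row)
    padded = list(row) + [0] * ((-width) % 4)     # zero the invalid tail entries
    sets = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    sets.append([0, 0, 0, 0])                     # hypothetical flush set (assumption)
    outputs = []
    previous = None                               # plays the role of retained set 2101
    for current in sets:                          # plays the role of current set 2102
        if previous is not None:
            window = previous + current           # eight consecutive positions
            for i in range(4):                    # four kernel replicas, q0..q3
                outputs.append(sum(weights[j] * window[i + j] for j in range(3)))
        previous = current
    return outputs[:width - 2]                    # valid output is two positions narrower


def conv_row_valid_reference(row, weights):
    """Direct 1-up computation, used only to check the 4-up sketch."""
    return [sum(weights[j] * row[i + j] for j in range(3)) for i in range(len(row) - 2)]
```

In this sketch q₂ and q₃ straddle the retained and current sets, which is why no output is produced until the second set has arrived.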

FIG. 22 illustrates the arrangement of circuits required to produce outputs from 4-up input channels for the "same" version of the 3 by 3 convolution, that is, the version in which the dimensions of the output array stream are not reduced and one output position is produced for every distinct position of the input array stream. In this variation, input set 2203 presents the current values of the 4-up input array stream, and input sets 2202 and 2201 present the values retained from the two preceding sets.

Kernel circuits 2204, 2205, 2206, and 2207 produce the values q₀, q₁, q₂, and q₃ of the 4-up output array stream 2208, respectively, and are now aligned so that the center of each kernel corresponds to one position of the 4-up input array stream.

When the first 4-up channel set of the input array stream is presented, there is not enough data to compute all four required outputs, so computation is delayed until the second 4-up set is presented and valid data is available for both sets 2202 and 2203. Valid data is not yet available for set 2201, so kernel circuit 2204 either suppresses the terms that include the weights applied to p₃ of 2201 or forces the uninitialized values to zero, consistent with applying the 3 by 3 convolution aperture function at a position where the kernel overlaps the edge of the input array. This suppression is triggered for the first set of each row; later sets on that row use the p₃ values of set 2201 to compute the full kernel for output position q₀.

In processing a complete DNN, cases arise in which the 4-up streaming technique is applied to an input array stream whose width is not an integer multiple of four. In such cases the invalid values in the final 4-up set are suppressed, by forcing them to zero or by other means, and the final output positions in the last 4-up set of the row are ignored. This is consistent with both the inset ("valid") and the full ("same") variations of the 3 by 3 aperture function.
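The "same" alignment of FIG. 22, together with the handling of a row width that is not a multiple of four, can be sketched as follows. This is an illustrative Python model only: it treats one kernel row of three weights, forces the not-yet-valid p₃ of the oldest set and the invalid tail entries to zero, and discards the trailing outputs of the last set. The function names and the zero-filled flush set are assumptions of the sketch.

```python
def conv_row_same_4up(row, weights):
    """Apply one 3-tap kernel row to `row` using 4-up sets; return the "same" outputs."""
    width = len(row)
    padded = list(row) + [0] * ((-width) % 4)     # invalid tail entries forced to zero
    sets = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    sets.append([0, 0, 0, 0])                     # hypothetical flush set (assumption)
    outputs = []
    older = [0, 0, 0, 0]                          # role of set 2201; zero at row start
    previous = None                               # role of set 2202
    for current in sets:                          # role of set 2203
        if previous is not None:
            # p3 of the oldest set, the four middle positions, p0 of the current set
            window = [older[3]] + previous + [current[0]]
            for i in range(4):                    # kernels centered on previous[0..3]
                outputs.append(sum(weights[j] * window[i + j] for j in range(3)))
            older = previous
        previous = current
    return outputs[:width]                        # ignore outputs for invalid positions


def conv_row_same_reference(row, weights):
    """Direct zero-padded computation, used only to check the 4-up sketch."""
    padded = [0] + list(row) + [0]
    return [sum(weights[j] * padded[i + j] for j in range(3)) for i in range(len(row))]
```

With a width of 10, three 4-up sets cover the row, and the last two outputs of the third set are dropped, leaving ten outputs.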

In all cases, the first position of each row of the input array stream is always presented in the first position of a 4-up input set.

Where the input row length is not an integer multiple of the processing set width, the processing clock is increased so that the overall throughput of the N-up processing is compatible with the throughput of the 1-up input source, and special buffering is needed to pack incoming values into N-up sets. This special buffering is described below.

FIG. 23 illustrates the arrangement of circuits required for two variants of a 1-row by 7-column convolution on 4-up data. From the preceding discussion of the 3 by 3 convolution, those skilled in the art will recognize that the number of rows in the kernel affects only the number of partial sums retained over time, not the mapping of kernel weight columns to input set columns. The data arrangement shown in FIG. 23 therefore applies equally to a 7 by 7 kernel, a 3 by 7 kernel, or any other kernel whose width is 7.

As described above, input set 2303 is the 4-up data set currently presented from the input array stream, and sets 2302 and 2301 hold the previously presented and retained data of the immediately preceding set and the second preceding set, respectively.

Kernel processing circuits 2304, 2305, 2306, and 2307 show the alignment required to produce the inset ("valid") convolution output 2308, and circuits 2309, 2310, 2311, and 2312 show the alignment required to produce the full ("same") convolution output 2313.

w₀,₀ of circuit 2304 is aligned with p₀ of input set 2301 to produce the inset variant, and w₀,₃ of circuit 2309 is aligned with p₀ of input set 2302 to produce the full variant, with circuits 2304 and 2309 each producing output q₀ for its respective use case.

Those skilled in the art will understand that the two sets of kernels overlap considerably in function, and that it is straightforward to configure a single circuit, using kernel circuits with only five unique mappings, to produce either variant on demand. Those skilled in the art will also understand that any M-up streaming data set (including 1-up) may be repacked as needed into any other N-up streaming format (where M≠N) to keep the overall throughput of the system high enough to accept and process the input array stream at the rate presented. The cost is that N replicas of some kernel processing circuits are required, but the overall effect is to let the circuit hold the processing clock within reasonable limits for the implementation technology while still accepting the input stream at full speed.

FIG. 24A and FIG. 24B illustrate typical implementations of a 2 by 2 MaxPool node, in which the maximum value of each channel is selected over distinct patches of two adjacent column positions on two adjacent rows.

FIG. 24A shows the arrangement of a 2 by 2 MaxPool node on a 4-up data stream 2401. When the first row of each pair is presented, comparator 2402 evaluates inputs p₀ and p₁ and passes the larger to FIFO circuit 2403, where it is retained for use when the second row is presented. Comparator 2404 and FIFO 2405 do the same for inputs p₂ and p₃ at the same time. When the second row of each pair is presented, comparator 2402 takes from FIFO 2403 the maximum retained for the same column position of the first row, compares it with inputs p₀ and p₁, and outputs the largest of the three values as output q₀, while comparator 2404 and FIFO 2405 do the same for inputs p₂ and p₃ to produce output q₁.

Output set 2406 contains two sets of channels, each individual value of which is the maximum of four samples of that particular channel (in this aperture function, values from different channels do not interact). Output 2406 of FIG. 24A is therefore a 2-up output data stream produced from a 4-up input data stream.
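The comparator and FIFO behavior described for FIG. 24A can be modeled in a few lines. The Python below is a sketch under simplifying assumptions: a single channel per position, an even number of rows, and a row width that is a multiple of four. The function name and data layout (a list of rows, each a list of 4-tuples) are hypothetical.

```python
from collections import deque


def maxpool2x2_4up(rows):
    """2 by 2 MaxPool over rows of 4-up sets; returns rows of 2-up sets."""
    fifo_a = deque()        # role of FIFO 2403: retained max(p0, p1) of the first row
    fifo_b = deque()        # role of FIFO 2405: retained max(p2, p3) of the first row
    pooled = []
    for r, sets in enumerate(rows):
        first_of_pair = (r % 2 == 0)
        out_row = []
        for p0, p1, p2, p3 in sets:
            if first_of_pair:
                # First row of the pair: keep the larger of each column pair.
                fifo_a.append(max(p0, p1))
                fifo_b.append(max(p2, p3))
            else:
                # Second row: largest of the retained value and the two new inputs.
                q0 = max(fifo_a.popleft(), p0, p1)
                q1 = max(fifo_b.popleft(), p2, p3)
                out_row.append((q0, q1))
        if not first_of_pair:
            pooled.append(out_row)
    return pooled
```

Each 4-up input set on the second row of a pair yields one 2-up output set, so the parallelism is halved in this model while the number of sets per row is unchanged.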

FIG. 24B shows the arrangement of the same 2 by 2 MaxPool node on a 2-up data stream 2407. Comparator 2408 and FIFO 2409 are functionally identical to those described above, but only a single set is needed to accept the 2-up inputs p₀ and p₁ and produce a single set of output channels 2410. Output 2410 of this second example is therefore a 1-up output data stream produced from a 2-up input data stream, and all downstream nodes can take the smaller 1-up form.

A tiled MaxPool function, like any other aperture function with a 2 by 2 stride, reduces the size of the input array by a factor of 2 in each dimension. Because the total width of an N-up array stream is N times the number of sets presented, the reduction can be effected either by reducing the width in sets or by reducing N, provided N is divisible by the horizontal stride. Because N is the replication factor for the copies of the circuits that execute in parallel, it is preferable to reduce N wherever possible.

FIG. 25 illustrates a contrived example in which N cannot be reduced. It applies a 2 by 2 MaxPool node, in this case to a 5-up input stream. As in the 3 by 3 convolution case described earlier, input set 2501 is retained and used together with the current input set 2502 to present the minimum set of values such that all outputs can be produced on the same clock cycle. (Other arrangements are possible, such as switching the first comparator to process p₀ and p₁ or p₁ and p₂ on alternating input sets, while setting a middle comparator to process p₄ and p₀ of alternating inputs. This would reduce the number of required copies of the aperture function from five to three and would be advantageous where the aperture function implementation is significantly more complex than a simple comparison.)

In this example, comparator 2503 and FIFO 2504 operate on the retained values of p₀ and p₁, comparator block 2506 operates on the retained values of p₂ and p₃, and comparator block 2507 operates on the retained value of p₄ and the current value of p₀. Comparator block 2508 operates on the current values of p₁ and p₂, and comparator block 2509 operates on the current values of p₃ and p₄.

Because a 2.5-up data stream cannot be implemented within the constraints of the pipeline, in this example the reduction in dimension must be applied to the width of the input array, and output 2510 is therefore a 5-up output mirroring the 5-up input stream.

As described above, in some cases it is reasonable to repack an M-up stream as an N-up stream with the same array dimensions. A specialized FIFO circuit can perform this function. FIG. 26A illustrates such a FIFO for repacking a 4-up stream 2601 into a 2-up stream 2603. FIFO 2602 accepts inputs four at a time and stores them as individual input entries. Whenever two entries are available in the FIFO, outputs are produced two at a time. Data in FIG. 26 (and in the following figures) flows from the inputs downward through the circuitry to the outputs.

In the common case where the width of the input stream is not an integer multiple of the input stream set size, a counter must be included to track the number of valid entries presented for each row. For example, if the input array width is 10 and 4-up input sets are used, so that three sets of four are needed to cover a full row, the FIFO must ignore the last two entries of the third input set presented and output five 2-up sets rather than six. After each row the counter is reset and begins counting the entries of the next row. The array width limit may be fixed or supplied through a preloaded register. If the array width is known always to be an integer multiple of both the input set size and the output set size, this logic may be omitted.

FIG. 26B illustrates repacking a 3-up stream 2604 into a 5-up stream 2606. FIFO 2605 accepts inputs three at a time but stores them as individual input entries. Whenever five entries are available in storage, the FIFO produces outputs five at a time.

As described above, additional operations must be implemented to account for the invalid entries that may appear at the end of a row when the array width is not an integer multiple of the input set size. A similar problem arises when the array width is not an integer multiple of the output set size. In that case a final set must be issued once each row has been completely received; the final set contains the last entries of the row in its first outputs and invalid entries with no particular value in the remaining channel sets. For convenience, the practice of placing all zeros in invalid entries may be used to reduce the total circuit size in later nodes where zeros have no effect, such as convolution and MaxPool.

The FIFO must be large enough to hold as many input sets as may be necessary to guarantee that no data is lost. To maintain the throughput of the system as a whole, an output is issued as soon as enough entries are available to form an output set.
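The repacking FIFO of FIG. 26A and FIG. 26B, including the per-row counter and the padded final set, can be illustrated with the following Python sketch. It is a behavioral model only; the function name `repack`, its parameters, and the use of zero as the fill value are assumptions made for illustration.

```python
from collections import deque


def repack(sets, m, n, width, fill=0):
    """Repack M-up input sets into N-up output sets for rows of `width` valid entries."""
    fifo = deque()
    out = []
    count = 0                                  # valid entries seen so far in this row
    for group in sets:
        assert len(group) == m
        for value in group:
            if count < width:
                fifo.append(value)             # store as an individual entry
                count += 1
            # otherwise: invalid entry at the end of the row, ignored
        row_done = (count == width)
        while len(fifo) >= n:                  # issue output as soon as n entries exist
            out.append([fifo.popleft() for _ in range(n)])
        if row_done:
            if fifo:                           # partial final set: pad the invalid slots
                tail = list(fifo)
                fifo.clear()
                out.append(tail + [fill] * (n - len(tail)))
            count = 0                          # reset the counter for the next row
    return out
```

For a width of 10 delivered as three 4-up sets, this model emits five 2-up sets, matching the example in the text; for a width of 7 delivered as three 3-up sets, it emits one full 5-up set and one final set padded with three zeros.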

Although any set size can be repacked into any other set size, the required processing frequency changes in proportion to the ratio of the sizes. For any M-up input repacked as an N-up output, the required processing frequency can be described as

$$f_N = f_M \cdot \frac{M}{N}$$

where $f_M$ denotes the rate at which M-up input sets are presented and $f_N$ the rate at which N-up output sets must be processed, so that the number of positions handled per unit time is unchanged.

Throughout the system, for the simplest operation, each circuit that accepts rows should supply, and ignore, unused invalid entries at the end of every row whose width is not an integer multiple of the set size. This is not a strict limitation, since the circuits could in any case work with additional logic not shown here. It guarantees that every column position maps to the same channel set within the parallel set presented for every row, and it minimizes the complexity of operations that combine values at the same column position across multiple rows.

FIG. 27A illustrates an implementation of a concatenation node, in which the channels from one source 2701 are concatenated, position by position, with the channels from another source 2702 or from more sources (not shown), so that output 2706 contains all channels from all sources. This node does not mix or alter channel values. In the common case where the sources have different timing, one or both of FIFOs 2703 and 2704 hold input channel values until the full set of output channels is available. Interleaving circuit 2705 concatenates all channels of set p₀ from each source to produce q₀, all channels of set p₁ to produce q₁, and so on.

A common example requiring this solution is combining the output of a 3 by 3 convolution node with the output of a 1 by 1 convolution node, each applied to the same input array stream. Although the two nodes process the stream at the same rate, the output of the 3 by 3 node cannot be finalized until the third row of the input stream has been presented, whereas the output of the 1 by 1 node can be finalized as soon as any data from the input stream is presented. The net effect is that the 1 by 1 node's outputs for particular positions of the input array stream are presented to the concatenation node well before the 3 by 3 node's outputs for those same positions. Because the node following the concatenation node needs all channels of any given position to be presented before any computation can proceed, the concatenation node must buffer the earlier input stream and wait for the later input stream to reach the same position before it can present the complete set of channels for that position on its output. This holds equally for 1-up and N-up data streams.
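The buffering behavior of the concatenation node can be illustrated with a small event-driven model. The Python below is a sketch with two sources only; the event format, a time-ordered list of `(source_index, channels)` pairs, and the function name are hypothetical, and FIFO depth limits are not modeled.

```python
from collections import deque


def concat_node(events):
    """Concatenate per-position channel lists from two sources with different timing."""
    fifos = (deque(), deque())                 # roles of FIFOs 2703 and 2704
    out = []
    for source, channels in events:
        fifos[source].append(list(channels))   # hold values until the other source arrives
        while fifos[0] and fifos[1]:           # both sources have reached this position
            out.append(fifos[0].popleft() + fifos[1].popleft())
    return out
```

In a usage where source 0 stands for the early 1 by 1 node and source 1 for the later 3 by 3 node, source 0 entries simply accumulate in their FIFO until the matching source 1 entries arrive, and the channel values themselves pass through unchanged.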

If every input array position on the slowest path is always presented after the same position on all other paths, the FIFO for that path may be omitted. If under some conditions (typically at the final positions of the stream) the slowest path will not be the last to present, the FIFO on that path must retain the minimum number of entries needed to prevent data loss under those special conditions.

If the data path widths of the various sources differ, the path widths may be repacked to match one another, as in FIG. 26A and FIG. 26B, or that function may be merged with the FIFOs used for concatenation buffering. Those skilled in the art will understand that any number of paths can be concatenated in a single operation by sizing the FIFO of each earlier path to hold as many values as that path can present, under worst-case timing, before the corresponding position is presented by the slowest path.

FIG. 27B illustrates an implementation of a 4-up dense node. A dense node is mathematically equivalent to a convolution whose kernel size equals the size of the input array. To create each output channel there is therefore one distinct weight applied to each input position of each input channel. The number of output channels is unrelated to the number of input channels, and the output array produced is always a 1 by 1 array. Because in this exemplary implementation the inputs 2707 are presented in sets of four, the weights 2708 specific to each input position are loaded from local storage (not shown) and multiplied by the current inputs in circuitry 2709 to form partial products of the full kernel. All partial products from all input channels presented are summed to produce a single 1-up set 2710 of output channels.
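The accumulation performed by the dense node can be written out directly. The Python below is a minimal sketch, assuming a hypothetical weight layout `weights[position][input_channel][output_channel]` and an input delivered as a list of 4-up sets in which each position is a list of input channel values; bias and activation are omitted.

```python
def dense_4up(sets, weights, n_out):
    """Accumulate a dense (fully connected) output from 4-up input sets."""
    acc = [0] * n_out                          # one running sum per output channel
    position = 0
    for group in sets:                         # one 4-up set per processing interval
        for channels in group:                 # four positions, parallel in hardware
            for c_in, value in enumerate(channels):
                for c_out in range(n_out):
                    # One distinct weight per position, input channel, output channel.
                    acc[c_out] += weights[position][c_in][c_out] * value
            position += 1
    return acc                                 # a single 1-up set of output channels
```

The output length depends only on `n_out`, which reflects the statement that the number of output channels is unrelated to the number of input channels.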

FIG. 27C illustrates an implementation of a 4-up global average node, which takes all values at all positions of each input channel and averages them to produce the same number of output channels. A global average node is mathematically equivalent to a convolution whose kernel size equals the size of the input array, applied to each input channel individually (rather than to all input channels together, as just described), with a common constant weight equal to the reciprocal of the number of elements in the kernel. Because multiplying by the reciprocal before or after the summation is mathematically equivalent, circuit 2712 simply sums all values at every position of each input channel 2711 and, once all input values have been summed, multiplies the sum by the reciprocal of the number of elements to produce each output channel. Because all input positions are merged into a single value, output 2713 is a single 1-up set of channels with an array size of 1 by 1.
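A short sketch of the sum-then-scale ordering described for the global average node follows. It is illustrative Python only; the function name and input layout (4-up sets of per-position channel lists) are assumptions, and every entry is assumed to be valid, that is, the array width is taken to be a multiple of four.

```python
def global_average_4up(sets, n_channels):
    """Average every input channel over all positions delivered as 4-up sets."""
    sums = [0] * n_channels
    count = 0
    for group in sets:
        for channels in group:                 # four positions per set
            for c, value in enumerate(channels):
                sums[c] += value               # channels are never mixed
            count += 1
    reciprocal = 1.0 / count                   # applied once, after all sums are complete
    return [s * reciprocal for s in sums]
```

Applying the reciprocal once at the end needs a single multiplication per channel, whereas scaling every input would need one per sample.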

FIG. 28 illustrates a 4-up implementation of a 3 by 3 local average node, which uses a sliding aperture function to compute the average of each input channel over a subset of positions to produce the output channels. This implementation forms the inset, or "valid", output set, in which the aperture does not overlap the edges of the input array and the number of samples is the same for every output position. Each output channel corresponds to a single input channel, and data is not mixed between channels. As with the implementation of the convolution node of similar size and input mapping shown above in FIG. 21, the current input set 2802 is retained in register 2801, so that the current input set and the immediately preceding input set can be accessed at the same time. Each of circuits 2803, 2804, and 2805 applies the same summation to each input channel of sets p₀, p₁, and p₂, but over time applies that sum to three different partial sums to produce set q₀ of output array stream 2809. Circuit 2803 initializes the running sum for the first row, circuit 2804 uses the output of circuit 2803, delayed by a FIFO (not shown), to produce the running sum for the middle row, and circuit 2805 uses the delayed output of circuit 2804 to produce each final sum. Circuit 2805 then multiplies the final sum by the reciprocal of the number of elements (1/9 in this case) to produce output set q₀. An activation function may be integrated into the circuit or, equivalently, placed between nodes.

Equivalent circuit 2806 produces output set q₁ from channel sets p₁, p₂, and p₃ of the preceding input set. Likewise, circuit 2807 produces q₂ from p₂ and p₃ of the preceding input set 2801 together with p₀ of the current input set 2802, and circuit 2808 produces q₃ from p₃ of the preceding input set 2801 together with p₀ and p₁ of the current input set 2802.

If the local average aperture function is to be produced for every valid position, output 2809 has a reduced array size relative to the input, in this case two positions fewer in both width and height, but this is generally not enough to reduce the 4-up stream meaningfully. If a horizontal step size other than one is used, that is, if not every possible output position is used, the reduction in the horizontal dimension can be implemented in the circuit as a reduction in N. For example, if the horizontal step size is 2, only every other value is needed, and the circuit can produce 2-up output channels by computing only q₀ and q₂ and omitting the unused circuitry for q₁ and q₃. Similarly, if the horizontal step size is greater than 4, the various circuits for computing q₀ through q₃ can be used in turn to produce a 1-up output stream.

FIG. 29 illustrates another 4-up implementation of a 3 by 3 local average node, one that forms the full, or "same", output set, in which the aperture overlaps the edges of the input array and the output array dimensions equal the input array dimensions. In this case the number of input positions sampled at the edges differs from the full set of samples taken in the interior, so the final reciprocal for each output position must reflect the number of samples used for that output position.

In a manner similar to the exemplary circuit shown in FIG. 22, the variation of FIG. 29 uses input set 2903 to present the current values of the 4-up input array stream, while input sets 2902 and 2901 present the values retained from the two preceding input sets.

Summation circuits 2904, 2905, 2906, and 2907 produce the values q₀, q₁, q₂, and q₃ of the 4-up output array stream 2908, respectively, and are now aligned so that the center of each summation corresponds to one position of the 4-up input array stream. In this example, when the first 4-up input is presented at the start of each row, only summation circuit 2904 intersects the left edge of the input array, but any of the four summation circuits may intersect the right edge of the input array, depending on the number of sets populated at the end of the row, so the selection of the reciprocal reflecting the number of samples taken varies accordingly.
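The position-dependent reciprocal of the "same" local average can be made concrete with a direct, non-streaming reference computation. The Python below is a sketch for a single channel; it does not model the 4-up sets or the partial-sum pipeline, and the function names are hypothetical. It only shows how the sample count, and therefore the reciprocal, changes at edges and corners.

```python
def sample_count(r, c, height, width):
    """Number of in-bounds positions covered by a 3 by 3 aperture centered at (r, c)."""
    rows = sum(1 for dr in (-1, 0, 1) if 0 <= r + dr < height)
    cols = sum(1 for dc in (-1, 0, 1) if 0 <= c + dc < width)
    return rows * cols


def local_average_same(image):
    """3 by 3 local average with output dimensions equal to input dimensions."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        total += image[rr][cc]     # out-of-bounds samples are skipped
            # The reciprocal reflects the number of samples actually taken.
            out[r][c] = total * (1.0 / sample_count(r, c, height, width))
    return out
```

On an array at least three positions wide and high, interior positions use 1/9, edge positions 1/6, and corner positions 1/4.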

Observing the close correspondence between the exemplary circuits of FIG. 21 and FIG. 28, and likewise between those of FIG. 22 and FIG. 29, the skilled person will understand that the structure and replication of the computation are unaffected by the nature of the aperture function implemented, and further that the apparatus and method apply equally to any aperture function defined over a similar sliding window.

FIG. 30A illustrates an implementation of a 4-up subset node, which passes only specific channels to the next node, with equivalent array dimensions and timing. This node type is typically used to split incoming channels so that different kinds of processing can be applied to each group of incoming channels. If the set of channels routed to the output is fixed, the connections between input 3001 and output 3003 can be made by direct wiring of physical conductors. Otherwise, routing circuitry 3002 effects the required selection of channels using multiplexers.

FIG. 30B illustrates a typical implementation of a 4-up crop node, which presents a subset of the positions of the input array stream to the output array stream. Typically, entire rows at the top edge, the bottom edge, or both are omitted, along with columns at the left edge, the right edge, or both. To allow the number of columns omitted at the left edge to be other than an integer multiple of the data set size N, current input set 3005 is combined with previous input set 3004 in repacking circuitry 3006 to produce the channel sets q₀, q₁, q₂ and q₃ of output 3007, such that q₀ always carries the first column of each row. When no omission is required at the left edge of the input array stream, or the number of omitted columns is an integer multiple of N, previous input set 3004 can be omitted for a simpler circuit. If the output array is sufficiently reduced relative to the input array, the N-up input stream can be repacked into an M-up output stream within the position selection circuitry.
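The repacking step can be sketched in Python as below, for a single row. The function `crop_row` and its parameters are hypothetical; the sketch assumes every input set is fully populated and keeps one previous set, in the spirit of input sets 3004 and 3005, so that the first kept column always lands in lane q₀.

```python
def crop_row(sets, left, width, n_up=4):
    """Drop `left` columns from one row of N-up sets and keep `width` columns.

    `sets` is a list of lists, each holding `n_up` column values. The result
    is again a list of sets whose first lane holds the first kept column.
    """
    skip_sets, offset = divmod(left, n_up)
    out, emitted, previous = [], 0, None
    for current in sets[skip_sets:]:
        if offset == 0:
            window = list(current)          # aligned: previous set not needed
        elif previous is None:
            previous = current              # first set only primes the register
            continue
        else:
            # Tail of the previous set followed by the head of the current one.
            window = list(previous[offset:]) + list(current[:offset])
        previous = current
        window = window[: width - emitted]
        if not window:
            break
        out.append(window)
        emitted += len(window)
    # Flush columns still held in the previous-set register at the end of the row.
    if offset and previous is not None and emitted < width:
        out.append(list(previous[offset:])[: width - emitted])
    return out


row = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(crop_row(row, left=2, width=10))
print(crop_row(row, left=4, width=8))
```

When `left` is a multiple of `n_up` the `previous` register is never consulted, which mirrors the simplification noted in the paragraph above.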

In any of the nodes described above, mass multipliers or individual multipliers may be used with equal facility. Where many weights are applied to each input, a mass multiplier has advantages over individual multipliers, depending on the bit widths of the multiplicand and the products. In other cases, individual multipliers of equivalent precision may be smaller or use less power. The N-up pipeline does not depend on the type of multiplier used.

In another aspect of the invention, ICs may be provided with one or more interconnected functional circuits and with input and output ports, each IC implementing a portion of a neural network, as described above with reference to FIGS. 17A and 17B and FIGS. 18A and 18B. In a system embodiment, individual ones of such ICs may be connected, output port to input port, from a first IC that receives the primary input from a source array to the other ICs, either in linear order or in an interconnected chain with parallel connections. The output port of the last IC in the connected set then provides the output of the neural network comprising the functionality of all of the ICs.

FIG. 31 illustrates such a system 3100 of ICs interconnected to implement a neural network. IC 3101 has an input port 3102 that receives a stream of input values. The input values may follow any of the protocols described above for input arrays, which may have a single value per position in the array or multiple values per position, as in the example of an HDMI image with R, G and B values for each position in the input array, or the input stream may be ordered as an N-up stream as described in the embodiments above.

In FIG. 31, five ICs 3101, 3105, 3106, 3107 and 3108 are shown interconnected between input ports and output ports. IC 3101 is illustrated with functional circuits 3104 interconnected on the IC, leading to an output port 3103 connected to the input port of IC 3105. The functional circuits implement aperture functions as described above in the various embodiments. In this example, ICs 3105, 3106, 3107 and 3108 are drawn with the same interconnected functional circuits as IC 3101, but it is emphasized that the ICs are different and that the interconnections between functional circuits are not the same. The figure is representative.

IC 3105 is connected by its output port to the input ports of both ICs 3106 and 3107, to illustrate that the connections among the ICs need not be simply linear. The output ports of ICs 3106 and 3107 are both shown connected to the input port of IC 3108. Again, the figure is representative; in any system of interconnected ICs the interconnections may be more complex. IC 3108, as the last IC in the system, outputs the output stream of the neural network implemented by the system of interconnected ICs. The connections between output ports and input ports are parallel paths of conductors that deliver the bits of the values for each output interval. The system of ICs implements a neural network of some depth. In this aspect of the invention, an unlimited variety of neural networks can be implemented by interconnecting individual ICs that are provided with different nodes and interconnections on the individual ICs.

Application of 3D Image Data

In the embodiments and examples of the invention described in detail above, the primary input data source is, in many examples, in the form of values of the pixels of a two-dimensional image, such as an RGB image in an HDMI frame, which is a 1080 by 1920 matrix of RGB values, each pixel being a 3 by 1 vector of color components. However, in the operation of the apparatus and methods in the various embodiments of the invention, not all input data sets will take the form of a two-dimensional (2D) array such as an HDMI frame. Indeed, page 41 above introduces the case in which the primary input data source is a three-dimensional array of data points.

As an example of three-dimensional images that can serve as input to a neural network, many medical devices, magnetic resonance imaging (MRI) devices for example, capture three-dimensional (3D) image data. The following description in this application extends the unique 2D processing apparatus and methods described in detail above to process a stream of values representing 3D image data.

To understand the application of embodiments of the invention to 3D data, one may reconsider the 2D data samples used in the preceding description, such as an HDMI frame, which has been depicted above as a series of pixels arranged in a 2D plane. The aperture function has been described as based on a patch, such as a 3 by 3 patch, meaning three pixels wide and three pixels high. A 3 by 3 aperture function governs a computation involving the data values of nine pixels, or of features from a previous node, determined by the position of the patch centered on a particular pixel of the image. If one now considers a third dimension orthogonal to the plane of the 2D array, one may consider a 3 by 3 by 3 3D patch having 27 data points. A 3D patch must have more than one data point in each dimension, and common sizes such as 3 by 3 by 3 are typical. In a 3D image, the data points are called voxels rather than pixels. The three dimensions of a 3D aperture function need not be equal, and this is not limiting in embodiments of the invention. Therefore, for the purposes of this specification, since 2D patches are referred to above as M by N, a 3D aperture function has dimensions L by M by N.

FIG. 32 depicts an arrangement of compositors on an integrated circuit configured to implement a 3 by 3 by 3 convolution, as a 3D aperture function, over twenty-seven individual data samples. In this 3D aperture function L, M and N all equal three. As in the examples using HDMI frames, the data samples may be monochrome, may have three color values, or may be features with more than three values per sample. In embodiments of the invention operation is always pipelined, so the values of the data samples are input as a stream in a predetermined order. In one protocol, data values are presented in the input stream first across the columns of the data array, then down through the rows, and then through the planes in the third dimension. The value at each data point is processed just once, and the partial sums representing the intersection of that data point with each 3D aperture position overlapping it are computed and forwarded for further processing, as in the operation of the apparatus in the 2D implementations.

Each compositor in the 3 by 3 by 3 array of compositors 3201 applies one of the 27 weights of a single output channel to the data point and forwards a partial sum for further processing. Row buffer FIFOs 3202 present appropriately delayed partial sums from previous rows, and plane buffer FIFOs 3203 present appropriately delayed partial sums from previous planes. Once the last weight, $W_{2,2,2}$, has been applied, the summation is complete and the full sum is passed to final processing circuit 3204. The skilled person will understand that the subscripts of the weight at each compositor refer to plane, row and column, in that order.
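The arithmetic of this arrangement, in which each voxel is read once and contributes to every aperture position that overlaps it, can be sketched in Python as below. This is a behavioural sketch only: a dictionary of partial sums stands in for the row and plane FIFOs, no timing is modelled, only valid (non-overlapping-edge) outputs are produced, and the names `stream_conv3d` and `direct_conv3d` are hypothetical.

```python
from itertools import product


def stream_conv3d(volume, w):
    """Stream a P x R x C volume once, column fastest, then row, then plane.

    `w[l][m][n]` is the weight indexed by plane, row and column. Returns the
    finished outputs as (origin, value) pairs in the order they complete.
    """
    P, R, C = len(volume), len(volume[0]), len(volume[0][0])
    partial = {}    # stands in for the partial sums held in the FIFOs
    finished = []
    for p, r, c in product(range(P), range(R), range(C)):
        v = volume[p][r][c]                      # each voxel is read exactly once
        for l, m, n in product(range(3), repeat=3):
            op, orow, oc = p - l, r - m, c - n   # origin of the overlapping aperture
            if not (0 <= op <= P - 3 and 0 <= orow <= R - 3 and 0 <= oc <= C - 3):
                continue
            key = (op, orow, oc)
            partial[key] = partial.get(key, 0) + w[l][m][n] * v
            if (l, m, n) == (2, 2, 2):           # last weight applied: sum complete
                finished.append((key, partial.pop(key)))
    return finished


def direct_conv3d(volume, w):
    """Reference result computed position by position."""
    P, R, C = len(volume), len(volume[0]), len(volume[0][0])
    return {
        (p, r, c): sum(
            w[l][m][n] * volume[p + l][r + m][c + n]
            for l, m, n in product(range(3), repeat=3)
        )
        for p, r, c in product(range(P - 2), range(R - 2), range(C - 2))
    }
```

Because the input arrives in raster order, the contributions to any one output are applied in weight order from index (0,0,0) to index (2,2,2), so the sum is known to be complete as soon as the weight with subscripts 2,2,2 has been applied.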

It is immaterial whether an input data point comprises a single scalar value or multiple values, such as features from a previous neural network node. Any number of parallel output channels may be produced for the same input source array by extending the weights and the summations in parallel throughout the circuit.

Special-case logic is embedded in compositors 3201 and FIFOs 3202 to handle the edge cases of the first and last columns, the first and last rows, and the first and last planes. The skilled person will understand that the edge cases for the third dimension correspond closely to those for the second dimension, and that similar solutions suffice.

It should also be understood that a fully functional IC having the compositor array shown in FIG. 32 will also have an input port, an output port, and control circuitry that operates at least one counter and produces control signals coupled to the compositors, delay circuits and finalization circuits, as described above for systems operating on 2D data arrays.

FIG. 33 illustrates an embodiment in which data from multiple planes is buffered and presented simultaneously, so that a single compositor can apply weights to multiple planes. This implementation may be preferred when the number of input channels is much lower than the number of output channels, reducing the total size of the plane buffers required at the cost of requiring more shared mass multipliers.

For the exemplary 3 by 3 by 3 convolution, input voxels for each of the three most recently presented planes are held in registers 3301 by means of FIFOs 3302, each of size RC-1. Data from these registers is merged onto bus 3303, which distributes the source input data for the same 2D position across the three planes to all compositors 3304.

Each compositor 3304 applies one weight for each plane of the 3 by 3 by 3 convolution to the corresponding source input for that plane and sums the results. Partial sums are passed to the other compositors, appropriately delayed by FIFOs 3305, as in the 2D implementations. Edge-case conditions, including those for the first and last planes, are embedded in the various compositors and closely match those implemented by the fully flipped form shown in FIG. 32.

It should be noted that using the partially flipped form of FIG. 33 breaks the strict ordering of the decomposed aperture function, because the source values from the upper-left corners of three different planes are combined as the first computational act. If the aperture function is a convolution, in which all of the products may be added in any order with the same result, this technique is equivalent to the fully flipped variant. For any aperture function that relies on the temporal sequence of operations to compute a valid result, the fully flipped variant of FIG. 32 must be used instead.

Considering again the 3 by 3 by 3 example over a P by R by C input array volume, where the input is a single scalar value per voxel, the first form, in FIG. 32, requires only a single mass multiplier for all output channels but requires two R by C FIFOs per channel to forward partial sums. In contrast, the alternative embodiment of FIG. 33 requires three mass multipliers but only two R by C FIFOs in total to buffer the original scalar data. In this case, the second form may be preferred if the product of R and C is large enough that the size of the FIFOs dominates the size of the mass multipliers. If, however, the number of input channels is large compared with the number of output channels, both the total number of mass multipliers required and the total size of the FIFOs are reduced by using the embodiment of FIG. 32 rather than that of FIG. 33. The choice between the two embodiments may be made solely on the basis of the numbers of input and output channels and the total sizes of the buffers and multipliers. Where the order of operations is unimportant the two embodiments are numerically equivalent, and the choice between them is one of overall cost and convenience only, provided the aperture function is computed correctly and completely.

FIG. 34 depicts an implementation of a typical 3 by 3 by 3 convolution applied to a 4-up input stream. As previously described, a 4-up input stream presents data for four input array positions in parallel. A source input set is presented by registers 3402 and subsequently retained by registers 3401, so that two sets, in order of presentation, are presented simultaneously to all four compositor groups 3403. Each compositor group sequences data between rows and between planes (FIFOs not shown), as in the 1-up embodiment described with reference to FIG. 32, the primary difference being that columns are processed simultaneously rather than sequentially.

Each compositor group 3403 has unique edge conditions embedded, because each group may be exposed to a different subset in the horizontal direction. Even though the horizontal operations are done simultaneously, the sequence of operations can still be the same, so this embodiment is suitable for any aperture function.

FIG. 35 illustrates a fully flipped embodiment of the same 3 by 3 by 3 convolution example, applied to a 4-up data stream, that was described above with reference to FIG. 34. In this embodiment each source voxel is handled just once, and a total of four mass multipliers per input component is required to support all output channels in a parallel set.

Source input data is presented in registers 3501 and distributed to all compositor groups. Compositor group 3502 is similar to compositor groups 3403 described with reference to FIG. 34. The compositors in group 3504 accept data from P₂ and P₃ and forward partial sums to the compositors in group 3503, where data from P₀ of the next source input set is applied. The compositors in group 3506 accept data from P₃ and forward partial sums to the compositors in group 3505, where data from P₀ and P₁ of the next source input set is applied. Row-to-row FIFOs provide the standard delay function in compositor groups 3503 and 3505 and are not included in compositor groups 3504 and 3506. Plane-to-plane FIFOs provide the equivalent delay function for compositor groups 3502, 3503 and 3505.

The final outputs are produced by the compositors implementing weight $W_{2,2,2}$, depicted here as the compositors farthest below weight $W_{0,2,2}$ for all four outputs.

It should be noted that compositor group 3502 produces its output one input source interval ahead of compositor groups 3503 and 3505, and that output must therefore be delayed so that all four outputs are produced simultaneously.

The skilled person will understand that, subject to suitable restrictions to meet the requirements of a particular aperture function, any such function that can be decomposed into a series of sequential steps can be computed continuously over time by such circuits. Moreover, although the number and form of the compositor groups are dictated by the particular combination of N-up (including 1-up) data presentation and the three-dimensional array of sample sizes per output, the basic form, timing and exception rules are common to all combinations.

Other aperture functions typically used in deep neural networks, such as MaxPool, can for example be accommodated within the general form described.

Synthetic Scaling Applied to Shared Neural Networks

In yet another aspect of the invention, a system is provided that uses a single instance of an aperture function, implemented as an IC circuit, to process multiple input data sources presented either as independent parallel input array streams or as dynamically synthesized scales of one input array stream.

The nature of applying an aperture function across a data array of more than one dimension is such that partially completed sub-function values from each column position on each row must be retained and combined with the sub-function values at the corresponding column position on subsequent rows. If the number of input dimensions is greater than two, corresponding values must likewise be retained from a first plane to the next plane, from a first volume to the next volume, and so on. In this discussion, the particular data items that must be retained for recombination during later processing are termed the context of the position under consideration. In the descriptions above, the context required for processing an ordered stream of input values row by row, the circuitry for ordered retention of context values (such as FIFO circuitry), and other registers are described as part of the IC that also implements the aperture function.

In the following description of applying an aperture function to multiple independent parallel input array streams, or to multiple dynamically synthesized scales of an input array stream, the context for the current row of the input stream being processed by the aperture function circuit (based on results retained from previous rows) is supplied to the aperture function circuit, in synchronization, by independent IC circuitry.

Where the data is divided into rows, presenting the input data as a continuous stream means there is a discontinuity at the end of each row, where the implementing circuit must set aside the sub-function values computed for the right edge and resume processing with values at the left edge. When this context switch occurs, it does not matter which row is processed next. The only requirement for correct operation is that the context presented corresponds to the input values currently presented.

Thus, a single aperture function circuit can continuously process rows of data presented as a series of input positions, and those rows may come from different sources, subject only to the constraint that the context for those rows is also presented. Moreover, the rows need not all have the same width.
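A minimal Python sketch of this idea follows. It uses a deliberately simple aperture, the column-wise sum of three consecutive rows, and a dictionary keyed by a source identifier as a stand-in for the independently supplied context. The class `SharedColumnSum` and its method names are hypothetical and chosen only for illustration.

```python
class SharedColumnSum:
    """One aperture-function instance shared by several interleaved row streams.

    The aperture here is a 3-row column sum. The per-source context is kept
    outside the arithmetic, so rows from different sources, of different
    widths, can be presented in any interleaving.
    """

    def __init__(self):
        self.contexts = {}  # source id -> rows retained from that source

    def process_row(self, source, row):
        """Process one row; return an output row once three rows are available."""
        context = self.contexts.setdefault(source, [])
        context.append(list(row))
        if len(context) < 3:
            return None
        output = [a + b + c for a, b, c in zip(*context)]
        context.pop(0)  # the oldest row is no longer needed
        return output


shared = SharedColumnSum()
interleaved = [
    ("a", [1, 2]), ("b", [10, 20, 30]), ("a", [3, 4]), ("a", [5, 6]),
    ("b", [1, 1, 1]), ("b", [2, 2, 2]), ("a", [7, 8]),
]
for source, row in interleaved:
    print(source, shared.process_row(source, row))
```

Source "a" has rows two positions wide and source "b" has rows three positions wide; each receives the same results it would have received from a dedicated instance, because the only state consulted is the context belonging to the row being presented.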

A common technique in computer graphics is to use a set of pre-computed, downscaled 2D textures to reduce workload and to enhance visual quality when they are applied to 3D models. This is commonly called the MIPMAP technique and uses a series of sample sets in which each set has ½ the width and ½ the height of the previous set. In this form of MIPMAP, both height and width are reduced by a factor of 2 for each subsequent set, and the area of each subsequent scaled image is reduced by a factor of 4. The sum of the infinite series shown below is $\tfrac{4}{3}$, and any truncated finite sequence of the series is always smaller:

$$\sum_{n=0}^{\infty} r^{n} = \frac{1}{1-r}, \quad \text{where } r = \tfrac{1}{4}.$$

Thus, for a single circuit computing an aperture function over interleaved sets of input arrays half-scaled in both dimensions, the circuit need only operate at a frequency 33% higher to ensure that computation for all scales completes in synchronization with presentation of the 1:1 scale data.

A typical example is processing an HD-compatible input video stream running at 60 frames per second (FPS) and presenting individual RGB pixel samples at 148.5 MHz. If a processing frequency of 150 MHz suffices to process the 1:1 original scale of the stream, a frequency of 200 MHz suffices to process any finite series of half-scaled images at the same input rate.
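The figures in this example can be checked with a short Python sketch. The function name `frequency_factor` and its parameters are hypothetical; the area ratio of one quarter per scale follows from halving both width and height.

```python
from fractions import Fraction


def frequency_factor(area_ratio, levels=None):
    """Sum of 1 + r + r**2 + ... for area ratio r.

    `levels` is the number of reduced scales processed after the 1:1 scale;
    None returns the limit of the infinite series, 1 / (1 - r).
    """
    if levels is None:
        return 1 / (1 - area_ratio)
    return sum(area_ratio ** k for k in range(levels + 1))


quarter = Fraction(1, 4)
print(frequency_factor(quarter))           # limit of the infinite series
print(frequency_factor(quarter, 3))        # 1:1 plus three half-scaled levels
print(150 * frequency_factor(quarter))     # MHz needed if 150 MHz handles 1:1
print(Fraction(297, 2) * frequency_factor(quarter))  # from the 148.5 MHz sample rate
```

Any finite number of levels gives a factor below the limit of 4/3, so the limit is a safe upper bound when choosing the processing frequency.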

When the aperture function is an implementation of a node of a CNN that provides image understanding, 2:1 scaling requires the model to learn to recognize each object class over a 2:1 range of sizes. Where a narrower range is more accurate or otherwise desired, the model may learn the recognition task over a $\sqrt{2}$:1 range of sizes, and with area reduced by a factor of $2$ at each scale, the required increase in frequency is given by

$$\sum_{n=0}^{\infty} r^{n} = \frac{1}{1-r} = 2, \quad \text{where } r = \tfrac{1}{2}.$$

Thus, a processing frequency of 300 MHz suffices to complete the computation for the $\sqrt{2}$:1 set of scales in the example referenced above.
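As a check on the 300 MHz figure, the arithmetic can be written out as follows, assuming the narrower range is a linear ratio of $\sqrt{2}$:1 (an inference from the doubling of the frequency requirement, since the ratio itself is not legible in this copy) and reusing the 150 MHz base frequency from the earlier example:

```latex
\[
r = \left(\frac{1}{\sqrt{2}}\right)^{2} = \frac{1}{2},
\qquad
\sum_{n=0}^{\infty} r^{n} = \frac{1}{1-r} = 2,
\qquad
150\ \text{MHz} \times 2 = 300\ \text{MHz}.
\]
```

Under the same assumption, the corresponding factor for 2:1 scaling is $1/(1-\tfrac{1}{4}) = \tfrac{4}{3}$, which gives the 200 MHz figure stated earlier.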

In a CNN processing image data, the first layer accepts input in the form of an array of pixels or samples, such as the typical RGB or grayscale formats. Subsequent layers accept the output of upstream layers in the form of feature strengths, typically in the range [0.0, 1.0).

Multiple sets of scaled image arrays for the first layer can be synthesized on the fly as the data is presented. For 2:1 ratio scaling there are several valid sampling schemes, all compatible with the proposed circuit. For smooth scaling, as each pair of pixels is received, the individual components (e.g., R, G and B) are separately summed and retained. On each odd row, the sum for the current pair is combined with the sum from the previous row, and the final sum is divided by four to conform to the original value range. This scheme avoids sampling artifacts and is generally preferred for images presented to human viewers. Simply discarding every other row and every other column is also valid; this yields a smaller and simpler circuit and may be preferred so long as it matches the training regime of the model. Subsequent layers receive features that have already been created in downscaled form and need no modification or manipulation. The circuit operates by interleaving the data streams of the various scales on the fly at each layer.
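Both 2:1 sampling schemes described above can be sketched in Python for a single color component. The function names `downscale_2to1` and `decimate_2to1` are hypothetical, rows are numbered from zero so that the second row of each pair is the "odd" row, and even row widths are assumed.

```python
def downscale_2to1(rows):
    """Smooth 2:1 scaling: yield one half-scale row for every two input rows."""
    held = None  # pair sums retained from the preceding (even) row
    for index, row in enumerate(rows):
        # Sum each horizontal pair of samples as it is received.
        pair_sums = [row[i] + row[i + 1] for i in range(0, len(row) - 1, 2)]
        if index % 2 == 0:
            held = pair_sums
        else:
            # Combine with the previous row and divide by four to restore range.
            yield [(a + b) / 4 for a, b in zip(held, pair_sums)]


def decimate_2to1(rows):
    """Simpler alternative: discard every other row and every other column."""
    return [row[::2] for row in rows[::2]]


image = [[1, 3, 5, 7], [1, 3, 5, 7], [2, 2, 8, 8], [4, 4, 0, 0]]
print(list(downscale_2to1(image)))
print(decimate_2to1(image))
```

The smooth variant needs one retained row of pair sums per component, while the decimating variant needs no retained state at all, which is consistent with the remark that it gives a smaller and simpler circuit.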

One such arrangement, taking the 2:1 scaling case as an example, switches scale as required at each context change, that is, at the end of each row at a given scale. Initially, full-scale pixels are presented to the circuit implementing the first layer, and at the end of the first row a context switch is made to continue processing at the left edge of the second row. The scaling circuit samples the original data as required while the aperture function processes the full-scale samples. At the end of the second row, instead of switching context to the third row at 1:1 scale, context is switched to the first row at 2:1 scale, and the scaler now begins subsampling the 2:1 data as it is fed into the aperture function, in preparation for computing the 4:1 scale.

After the first row of the 2:1 scale input array is processed, context switches back to the third row at 1:1 scale, and after the fourth row of 1:1 data is processed, context switches to the second row at 2:1 scale, after which the first row at 4:1 scale is available. After the first row at 4:1 scale is processed, context switches back to the fifth and sixth rows at 1:1 scale, followed by the third row at 2:1 scale, followed by the seventh and eighth rows at 1:1 scale, followed by the fourth row at 2:1 scale, followed by the second row at 4:1 scale and the first row at 8:1 scale.
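The row ordering just described follows a simple rule: whenever an even-numbered row finishes at one scale, a row at the next coarser scale has become available and is processed immediately. A Python sketch of that rule is given below; the generator `row_schedule` and its parameters are hypothetical, and row numbers start at 1 to match the text.

```python
def row_schedule(full_rows, max_levels):
    """Yield (scale, row_number) pairs in processing order.

    A scale of s stands for s:1, so 1 is full scale, 2 is half scale, and so
    on. `max_levels` is the number of reduced scales below full scale.
    """
    done = [0] * (max_levels + 1)  # rows processed so far at each level

    def emit(level):
        done[level] += 1
        yield (2 ** level, done[level])
        # Every second row completes one row of input for the next coarser scale.
        if level < max_levels and done[level] % 2 == 0:
            yield from emit(level + 1)

    for _ in range(full_rows):
        yield from emit(0)


for scale, row in row_schedule(full_rows=8, max_levels=3):
    print(f"{scale}:1 row {row}")
```

For eight full-scale rows and three reduced scales the generator reproduces the sequence in the two paragraphs above, ending with the second row at 4:1 scale and the first row at 8:1 scale.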

The above may continue to any desired finite degree of subsampling. A fixed buffer is used to retain incoming samples at 1:1 scale while lower scales are being processed, and sampling buffers perform this function for all of the lower scales. The circuit implementation of the aperture function must operate at a frequency high enough to process pending results and pending inputs before any individual buffer overflows. The increases indicated by the formulas above are the minimum possible; operation at higher speeds may be used to further simplify the buffering logic.

The output of the described arrangement is a single stream of interleaved sets of feature rows in the ordering described above. Subsequent layers need not subsample or buffer the 1:1 feature input; the lower-scale features are delivered at the expected times, so that context switching occurs in exactly the same pattern as that in which the features were produced.

實施[…]：1縮放要複雜得多。各像素經組合成加權子取樣以產生C/[…]個樣本每列及R/[…]列每圖框，而非組合或捨棄離散像素以產生經縮小輸入陣列的固定模式。呈現給孔徑函數電路之列的切分排序將不再以簡單規則模式發生；替代地，當各列經取樣資料完成時，處理列。後續層必須以完全相同之次序針對各經取樣尺度處理資料。此可藉由複製用於產生取樣模式之計數器邏輯或藉由發送與複合特徵串流並行之信號以指示上下文何時將切換及切換至何種尺度或藉由其他手段來達成。 Implementing […]:1 scaling is considerably more complicated. Rather than combining or discarding discrete pixels in a fixed pattern to produce the reduced input array, pixels are combined into weighted subsamples to produce C/[…] samples per row and R/[…] rows per frame. The syncopated ordering of the rows presented to the aperture function circuit no longer follows a simple regular pattern; instead, each row is processed when its sampled data is complete. Subsequent layers must process the data for each sampled scale in exactly the same order. This can be accomplished by duplicating the counter logic used to generate the sampling pattern, by sending a signal in parallel with the composite feature stream to indicate when the context is to be switched and to which scale, or by other means.
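
To illustrate why rows at a non-integer scale do not complete in a simple regular pattern, the following Python sketch lists the order in which full-scale and reduced rows become ready. It is an illustration under stated assumptions, not the disclosed circuit: the function name is hypothetical, a reduced row is treated as complete once every full-scale row it overlaps has arrived, and the ratio is a free parameter.

```python
import math


def irregular_row_schedule(full_rows, ratio):
    """Return ('full', i) and ('scaled', j) tuples in completion order.

    Assumption: scaled row j covers the full-scale span
    [j * ratio, (j + 1) * ratio) and is complete once the last
    full-scale row overlapping that span has been received.
    """
    order = []
    next_scaled = 0
    for i in range(full_rows):
        order.append(("full", i))
        # Emit every scaled row whose span ends at or before row i's end.
        while (next_scaled + 1) * ratio <= i + 1:
            order.append(("scaled", next_scaled))
            next_scaled += 1
    return order


# Example with an irrational ratio chosen purely for illustration.
schedule = irregular_row_schedule(8, math.sqrt(2))
```

With `ratio = math.sqrt(2)` and eight full-scale rows, reduced rows complete after full-scale rows 1, 2, 4, 5 and 7 (zero-based), so the spacing alternates between one and two rows without settling into the fixed pattern of the 2:1 case.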

可以類似方式實施其他合理或無理化取樣模式用於子尺度排序。可使電路設計符合CNN訓練過程中發現之任何縮放配置或其他考慮因素。 Other rational or irrational sampling patterns can be implemented in a similar manner for ordering the subscales. The circuit design can be tailored to any scaling configuration, or to other considerations, identified during CNN training.

雖然迄今為止所描述之實例產生經縮小樣本串流，但此在本發明中並非為限制性的。在一些實施例中，取樣可產生經放大串流。 Although the examples described thus far produce a downscaled sample stream, this is not limiting in the present invention. In some embodiments, sampling may produce an upscaled stream.

尺度之配置並不改變孔徑函數之核心實施方式。上文所描述之此變化與先前單尺度實施方式之間的差異在於基於行計數器自右邊緣至左邊緣之單個上下文切換由針對多個尺度管理多個上下文之外部機制替換。將當前輸入值與針對當前位置之上下文組合所用的孔徑函數之運算及邊緣規則保持不變，並且總體電路大小最低限度地增加。 The arrangement of scales does not change the core implementation of the aperture function. The difference between the variation described above and the earlier single-scale implementation is that the single context switch from the right edge to the left edge, driven by the column counter, is replaced by an external mechanism that manages multiple contexts for multiple scales. The computation and edge rules by which the aperture function combines the current input value with the context for the current position remain unchanged, and the overall circuit size increases only minimally.

使用上下文切換之另一應用為使用同一孔徑函數處理來自不相關串流之交錯列。如同上文所描述之縮放實例，用於各種陣列之輸入列不必皆具有相等寬度。其處理必須經延遲之所有輸入經緩衝，並且孔徑函數實施方式之總體頻率必須經增加，使得在任何單個緩衝器溢出之前處理整組輸入串流。 Another application of context switching is to use the same aperture function to process interleaved rows from unrelated streams. As in the scaling example described above, the input rows for the various arrays do not have to all be of equal width. All inputs whose processing must be delayed are buffered, and the overall frequency of the aperture function implementation must be increased so that the entire set of input streams is processed before any single buffer overflows.

可自由地混合不同串流之單尺度及多尺度處理。且由於任何給定尺度之任何給定位置之全上下文包括彼位置在頂部列抑或底部列上，因此不同輸入串流不必經同步以一起開始及結束，甚至亦無需具有相同高度。 Single-scale and multi-scale processing of different streams can be mixed freely. And because the full context of any given position at any given scale includes whether that position is on the top row or the bottom row, different input streams do not need to be synchronized to start and end together, or even to be the same height.

圖36繪示了孔徑函數IC電路3602至諸如由上游CNN節點運算之影像像素或特徵的有序樣本之輸入陣列串流3601的應用。對於當前呈現之列之任何行位置，當前列之上下文藉由單獨上下文管理電路3603同步地呈現給孔徑函數電路3602，並且與針對緊接在前的行位置運算之值組合以產生用於孔徑函數之最終輸出串流3604的資料，假定存在先前列，該當前列之該上下文為針對先前列上之對應行位置連同由上下文管理器維持之異常旗標(包括第一行及最後行以及第一列及最後列信號)運算的若干值。用於當前位置之部分子函數值亦由上下文管理電路3603運算並保持作為待在後續列上組合之上下文。在針對個別樣本的自左至右、自上至下之標準感測器呈現次序中，立即使用來自先前行之孔徑子函數之個別元素的值；在處理各列之第一行的邊緣情況中，根據孔徑函數之定義自運算省略或合成該等值。 FIG. 36 illustrates the application of an aperture function IC circuit 3602 to an input array stream 3601 of ordered samples, such as image pixels or features computed by an upstream CNN node. For any column position of the currently presented row, the context of the current row is presented synchronously to the aperture function circuit 3602 by a separate context management circuit 3603. Assuming a previous row exists, this context consists of a number of values computed for the corresponding column position on the previous row, together with exception flags maintained by the context manager (including first-column, last-column, first-row and last-row signals). The context is combined with the values computed for the immediately preceding column position to produce the data for the final output stream 3604 of the aperture function. The partial sub-function values for the current position are also computed by context management circuit 3603 and retained as context to be combined on subsequent rows. In the standard sensor presentation order for individual samples, left to right and top to bottom, the values of the individual elements of the aperture sub-function from the previous column are used immediately; in the edge case of processing the first column of each row, these values are omitted from the computation or synthesized, according to the definition of the aperture function.

其他基於非標準之呈現次序同樣可行，諸如以交錯列或蛇形方式(自左至右與自右至左交替)遞送資料之感測器，其需要上下文及運算電路系統以允許此呈現次序。 Other, non-standard presentation orders are equally feasible, such as sensors that deliver data in interlaced rows or in serpentine fashion (alternating left-to-right and right-to-left); these require context and computation circuitry that accommodates the presentation order.

用於各行位置的經部分運算之子函數值經保留且變為用於後續列之上下文。保留及呈現二者可經組合至孔徑函數3602之IC電路系統的實施方式中，但可在整體上不影響系統之大小或效率的情況下與孔徑函數之實施方式分離。 The partially computed sub-function values for each column position are retained and become the context for subsequent rows. Retention and presentation can both be incorporated into the IC circuitry that implements aperture function 3602, but they can also be separated from the aperture function implementation without affecting the size or efficiency of the system as a whole.
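
The interplay between the value carried from the preceding column and the context retained from the previous row can be sketched with a toy aperture function. The Python below is an assumption-laden illustration, not the circuit of FIG. 36: it uses a 2 by 2 sum as the aperture function, treats missing neighbours at the top and left edges as omitted, and the name `stream_2x2_sum` is hypothetical.

```python
def stream_2x2_sum(samples, width):
    """Process samples in raster order, one at a time.

    Each output is the sum of the current sample, its left neighbour,
    and the two samples directly above them (edge samples omit the
    neighbours that do not exist).
    """
    context = [0] * width  # per-column partial sums retained from the previous row
    left = 0               # sample from the immediately preceding column
    outputs = []
    for n, sample in enumerate(samples):
        col = n % width
        if col == 0:
            left = 0       # first-column edge rule: nothing to the left
        partial = sample + left           # sub-function value for this position
        outputs.append(partial + context[col])
        context[col] = partial            # becomes context for the next row
        left = sample
    return outputs
```

Only one row of partial sums is held at any time, and the per-sample computation never needs to know which row it is on; that knowledge lives entirely in the retained context.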

每當輸入樣本之給定跨度已經處理且不同的不連續跨度之處理必須開始時，所處理之最後行之上下文必須經保留並擱置，並且下一跨度之第一行之上下文必須變得可存取。在標準單樣本自左至右、自上至下呈現次序中，在行自最右邊緣轉變至最左邊緣時發生。為清楚起見，在此論述中，可能任意跨度將在下文被稱為列。在此最簡單實例中，作為先前列上之上下文保留的值按捕獲次序呈現為下一列之上下文。 Whenever a given span of input samples has been processed and processing of a different, non-contiguous span must begin, the context of the last column processed must be retained and set aside, and the context for the first column of the next span must be made accessible. In the standard single-sample left-to-right, top-to-bottom presentation order, this occurs when the column wraps from the rightmost edge to the leftmost edge. For clarity, such possibly arbitrary spans are referred to as rows in the remainder of this discussion. In this simplest example, the values retained as context on the previous row are presented, in the order captured, as the context for the next row.

由於不存在處理當前呈現之列所需的其他資料，因此在孔徑函數之實施方式中不需要或不依賴於接下來處理緊接在後的列，因此可在任何時間呈現任何列，只要同時呈現用於彼列之上下文即可。 Since there is no additional data required to process the currently presented row, the implementation of the aperture function does not require or depend on the subsequent processing of the immediately following row, so any row can be presented at any time as long as the context for that row is also presented.

圖37繪示了簡單實例，該簡單實例接收二個獨立輸入陣列串流，亦即第一串流3701及第二串流3702，並且藉助於保留來自一個串流之輸入同時處理另一串流之儲存及轉發多工器3703而交替地呈現來自各串流之列。由於來自一個串流之值決不與來自另一串流之值共混，因此孔徑函數電路3602自第一實例未改變(圖36)，其中上下文管理在孔徑函數電路外部。在圖37之系統中，藉由與孔徑函數電路分離之上下文電路3704進行上下文管理。 FIG. 37 illustrates a simple example that receives two independent input array streams, a first stream 3701 and a second stream 3702, and presents rows from each stream alternately by means of a store-and-forward multiplexer 3703 that holds the input from one stream while the other is being processed. Because values from one stream are never mixed with values from the other, the aperture function circuit 3602 is unchanged from the first example (FIG. 36), in which context management is external to the aperture function circuit. In the system of FIG. 37, context management is performed by a context circuit 3704 that is separate from the aperture function circuit.

經保留上下文現包含替代地捕獲之各串流之值，並且經呈現上下文交替地呈現用於各串流之上下文。若二個串流在維度上相同，則捕獲及呈現可經組合至單個FIFO中，該FIFO之寬度為孔徑函數所需之二倍且其列數與孔徑函數所需相同，此將符合各列之上下文與用於彼列之輸入陣列樣本同時呈現的要求。最終輸出3705隨後與隨時間推移而交錯之二個串流一起呈現，並且任何後續節點必須接受此時間交錯之輸入陣列形式。由於輸入串流為獨立的，因此運算交錯列之單個串流之孔徑函數所需的複雜度將不適用於任何後續孔徑節點。 The retained context now contains values captured alternately from each stream, and the presented context alternately supplies the context for each stream. If the two streams have identical dimensions, capture and presentation can be combined into a single FIFO that is twice as wide as the aperture function requires and has the same number of rows, which satisfies the requirement that the context for each row be presented at the same time as the input array samples for that row. The final output 3705 is then presented with the two streams interleaved in time, and any subsequent node must accept this time-interleaved form of input array. Because the input streams are independent, the complexity required to compute the aperture function on a single stream of interleaved rows does not apply to any subsequent aperture node.

二個串流無需為相同維度，亦無需一起開始及結束。陣列維度以及圖框之開始及結束的任何變化必須藉由可在上下文電路3704中操作之捕獲上下文及呈現上下文來適應，該捕獲上下文及呈現上下文將採用足以滿足同時性要求之形式，但孔徑函數3602保持不變。串流不必以相同圖框或像素速率呈現，在此情況下交錯電路系統3703將緩衝傳入資料且以足以保持所有輸入串流而無資料丟失之處理頻率在完成列上通過。 The two streams need not have the same dimensions, nor need they start and end together. Any variation in array dimensions or in the start and end of frames must be accommodated by the capture and presentation of context in context circuit 3704, which will take whatever form is sufficient to satisfy the simultaneity requirement, but aperture function 3602 remains unchanged. The streams need not be presented at the same frame rate or pixel rate, in which case interleaving circuitry 3703 buffers incoming data and passes it through in complete rows, at a processing frequency sufficient to keep up with all input streams without data loss.
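
A minimal software analogue of this arrangement is sketched below. It is not the circuit of FIG. 37: the aperture function is a toy (each sample plus the sample directly above it), the names `process_row` and `interleave_streams` are hypothetical, and a dictionary stands in for the separate context circuit. The point it illustrates is that the per-row function is identical for every stream and only the context handed to it changes.

```python
def process_row(row, context):
    """Toy aperture function applied to one row.

    `context` is the previous row of the same stream, or None on the
    top row. Returns (output_row, context_for_next_row).
    """
    if context is None:
        output = list(row)  # top-row edge rule: nothing above
    else:
        output = [a + b for a, b in zip(row, context)]
    return output, list(row)


def interleave_streams(streams):
    """Alternate rows across streams, switching context at every row.

    `streams` maps a stream name to its list of rows. Streams may
    differ in width and in height.
    """
    contexts = {name: None for name in streams}
    outputs = {name: [] for name in streams}
    longest = max(len(rows) for rows in streams.values())
    for r in range(longest):
        for name, rows in streams.items():
            if r < len(rows):
                out, contexts[name] = process_row(rows[r], contexts[name])
                outputs[name].append(out)
    return outputs
```

Because `process_row` receives only a row and that row's own context, the streams can have different widths and heights, and one may finish before the other, without any change to the function itself.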

組合輸入陣列之二個或更多個獨立串流的此方法可應用於多種其他處理情形。特別值得注意的為接受單個輸入視訊串流、將各圖框合成為一組經取樣尺度及藉由相同孔徑函數處理全部尺度的系統。 This method of combining two or more independent streams of input arrays can be applied to a variety of other processing situations. Of particular note is a system that accepts a single input video stream, synthesizes a set of sampled scales from each frame, and processes all of the scales with the same aperture function.

圖38繪示了用於一系列四次2：1減小的完全資料列及經縮小資料列之序列。第一未縮放子串流3801描繪了經標記為1：1之16列像素資料，此意謂無減小。在類似於由圖37表示之系統的系統操作中，子串流3801中來自列R₀至列R₁₆(及以外)之資料可作為第一資料串流呈現給多工器3703。當呈現來自子串流 3801之資料時，可藉由例如對子串流3801之各對列上的數對行求平均而減小彼資料(2：1減小)，從而產生一組8個半寬資料列作為子串流3802。由子串流3802表示為R'₀至R'₇的所得資料陣列為由子串流3801表示的資料陣列之面積的四分之一。當針對子串流3802創建經縮小資料時，可將彼資料作為第二並行資料串流呈現給多工器3703。子串流3803表示按2：1比率進一步縮小為R"₀至R"₃的來自子串流3802之資料，並且子串流3804表示按2：1比率進一步縮小為R'''₀至R'''₁的來自子串流3803之資料。亦將來自子串流3803及3804之資料作為單獨資料串流呈現給多工器3703。 FIG. 38 illustrates the sequence of full data rows and downscaled data rows for a series of four 2:1 reductions. The first, unscaled substream 3801 depicts 16 rows of pixel data labeled 1:1, meaning no reduction. In the operation of a system similar to that represented by FIG. 37, the data of rows R₀ through R₁₆ (and beyond) in substream 3801 may be presented to multiplexer 3703 as a first data stream. As the data of substream 3801 is presented, it may be reduced 2:1, for example by averaging pairs of columns across each pair of rows of substream 3801, producing a set of 8 half-width data rows as substream 3802. The resulting data array, shown in substream 3802 as R'₀ through R'₇, has one quarter the area of the data array represented by substream 3801. As the downscaled data for substream 3802 is created, it may be presented to multiplexer 3703 as a second, parallel data stream. Substream 3803 represents the data of substream 3802 further reduced 2:1, as R"₀ through R"₃, and substream 3804 represents the data of substream 3803 further reduced 2:1, as R'''₀ and R'''₁. The data of substreams 3803 and 3804 is likewise presented to multiplexer 3703 as separate data streams.
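
The averaging example mentioned above can be written out as a short sketch. The function name `downscale_2to1` is hypothetical, and plain averaging over each 2 by 2 block is only the example reduction named in the text; other weightings are possible.

```python
def downscale_2to1(rows):
    """Average each 2x2 block of samples into one output sample.

    Consumes rows in pairs and columns in pairs, so the result has
    half the width and half the height of the input.
    """
    reduced = []
    for r in range(0, len(rows) - 1, 2):
        top, bottom = rows[r], rows[r + 1]
        reduced.append([
            (top[c] + top[c + 1] + bottom[c] + bottom[c + 1]) / 4
            for c in range(0, len(top) - 1, 2)
        ])
    return reduced


# Cascading the same reduction yields the successive substreams.
full = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
half = downscale_2to1(full)        # 8 rows of 8 samples
quarter = downscale_2to1(half)     # 4 rows of 4 samples
```

Each application quarters the area, which is why the lower scales together add only a bounded amount of extra work to the full-scale stream.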

圖39繪示了用於[…]：1之無理縮放的完全列(子串流3901)及經縮小列(子串流3902)之序列。應注意，各列呈現之不規則性反映了列之無理取樣。後續尺度在相同意義上將不會為不規則的：下一子尺度將為1：1尺度之2：1減小，並且此後之該子尺度將為[…]：1尺度之2：1減小。同樣，經縮小子串流經動態地產生且作為獨立資料串流呈現給多工器3703。藉由直接傳信始終呈現之特定子尺度，將在後續節點中消除複製列之初始不規則定序的需要。 FIG. 39 illustrates the sequence of full rows (substream 3901) and downscaled rows (substream 3902) for an irrational scaling of […]:1. Note that the irregularity with which the rows are presented reflects the irrational sampling of the rows. Subsequent scales are not irregular in the same sense: the next subscale is a 2:1 reduction of the 1:1 scale, and the subscale after that is a 2:1 reduction of the […]:1 scale. As before, the downscaled substreams are generated dynamically and presented to multiplexer 3703 as independent data streams. By directly signaling which subscale is being presented at any given time, the need to replicate the initial irregular ordering of the rows in subsequent nodes is eliminated.

圖40繪示了在資料經呈現為如在圖36中之全尺度樣本(通常為像素)之陣列3601且經減小以由公共孔徑函數處理的情形下全尺度資料及經縮放資料至孔徑函數3602之應用。孔徑函數之輸出為呈與上文詳細描述之交錯子串流相同之模式的不同尺度之輸出之時間交錯串流。 Figure 40 illustrates the application of full scale data and scaled data to an aperture function 3602 where the data is presented as an array 3601 of full scale samples (typically pixels) as in Figure 36 and is reduced for processing by a common aperture function. The output of the aperture function is a time interleaved stream of outputs at different scales in the same pattern as the interleaved substreams described in detail above.

單個輸入陣列串流3601由級聯多尺度取樣器4001接受，該級聯多尺度取樣器將樣本之任何及所有所要尺度呈現為時間交錯序列4002。無需改變孔徑函數3602；視需要維持樣本列及其相關聯之上下文之呈現。上下文管理電路4003管理由孔徑函數生成之部分輸出，使得在對應於彼同一尺度之後續列的時間接受並呈現各種尺度之上下文資料。此可藉由諸如針對2：1情況嵌入預期序列之知識或藉由諸如針對任何無理縮放比率自取樣器之直接傳信(未展示)來達成。來自孔徑函數之最終輸出經呈現為函數特定運算(通常為所偵測之特徵)之交錯多尺度陣列串流4004。 A single input array stream 3601 is accepted by a cascaded multi-scale sampler 4001 which presents samples at any and all desired scales as a time-interleaved sequence 4002. There is no need to change the aperture function 3602; the presentation of the sample row and its associated context is maintained as needed. Context management circuitry 4003 manages the partial output generated by the aperture function so that context data at various scales is accepted and presented at times corresponding to subsequent rows at the same scale. This can be accomplished by embedding knowledge of the expected sequence, such as for the 2:1 case, or by direct signaling from the sampler (not shown), such as for any irrational scaling ratio. The final output from the aperture function is presented as an interleaved multi-scale array stream 4004 of the function specific operation (usually the detected feature).

圖41繪示了後續CNN節點對不再需要任何縮放邏輯之交錯串流的處理。串流可為獨立串流及經合成、經子縮放圖框之任何混合。上下文管理電路系統4102基於列呈現之所建立模式而實施切分上下文切換。來自先前節點的輸出之交錯輸入串流4101經饋送至未改變之孔徑函數3602中，並且同時，上下文管理電路4102藉由直接嵌入呈現序列或藉由接受描述列維度及轉變之信號(未展示)來管理孔徑函數之部分輸出。最終輸出4103為可具有或可不具有與輸入串流4101相同之維度且可不按相同次序呈現的相容交錯串流。 FIG. 41 illustrates the processing of the interleaved stream by a subsequent CNN node, which no longer requires any scaling logic. The stream can be any mixture of independent streams and synthesized, subscaled frames. Context management circuitry 4102 implements syncopated context switching based on the established pattern of row presentation. The interleaved input stream 4101, output by the previous node, is fed into the unchanged aperture function 3602, while context management circuit 4102 manages the partial outputs of the aperture function, either by directly embedding the presentation sequence or by accepting signals (not shown) that describe row dimensions and transitions. The final output 4103 is a compatible interleaved stream that may or may not have the same dimensions as input stream 4101 and may not be presented in the same order.

作為實例，若孔徑函數為2乘2平鋪MaxPool節點，則輸出串流將產生用於各二個輸入列之半寬輸出列。對於2：1縮放情況，串流元素之序列將為相同的；對於任何無理情況，特定尺度之序列將不相同。 As an example, if the aperture function is a 2x2 tiled MaxPool node, the output stream will produce half-width output rows for every two input rows. For the 2:1 scaling case, the sequence of stream elements will be the same; for any irrational case, the sequence will be different for a particular scale.
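
As a hedged illustration of the MaxPool case, the sketch below processes rows one at a time and emits one half-width row for every two input rows. It is a software analogue only, and the name `maxpool_2x2_stream` is hypothetical; the horizontally pooled first row of each pair plays the role of the retained context.

```python
def maxpool_2x2_stream(rows):
    """Yield one half-width output row for every two input rows."""
    context = None  # horizontally pooled maxima of the first row of a pair
    for row in rows:
        pooled = [max(row[c], row[c + 1]) for c in range(0, len(row) - 1, 2)]
        if context is None:
            context = pooled  # first row of the pair: retain, emit nothing
        else:
            # Second row of the pair: combine with the context and emit.
            yield [max(a, b) for a, b in zip(context, pooled)]
            context = None


result = list(maxpool_2x2_stream([
    [1, 5, 2, 0],
    [3, 4, 9, 1],
    [0, 0, 0, 0],
    [7, 1, 1, 8],
]))
```

Here `result` is `[[5, 9], [7, 8]]`: four input rows of four samples produce two output rows of two samples.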

圖42A繪示了產生1：1、2：1、4：1及8：1尺度之交錯串流所需的取樣及切分邏輯。將可縮放值之完全輸入陣列串流3601直接傳遞至FIFO 4202，該FIFO保留在處理其他尺度時所接收之任何值。2：1取樣器4201接受來自多個列之多個樣本以產生單個取樣值，該單個取樣值隨後在完成時轉發至FIFO 4203以供保留直至處理為止。另一2：1取樣器4201接受2：1樣本且將經完成4：1樣本轉發至FIFO 4204。另一2：1取樣器4201接受4：1樣本且將經完成8：1樣本轉發至FIFO 4205。切分多工器4206表面上在各經縮放列變得可用時自可用資料進行選擇，並且將該資料連同在交錯模式並未嵌入在保留邏輯中時描述該交錯模式所需之任何信號轉發至交錯樣本串流4002。 FIG. 42A illustrates the sampling and syncopation logic required to produce an interleaved stream of 1:1, 2:1, 4:1 and 8:1 scales. The full input array stream 3601 of scalable values is passed directly to FIFO 4202, which holds any values received while other scales are being processed. A 2:1 sampler 4201 accepts multiple samples from multiple rows to produce a single sampled value, which on completion is forwarded to FIFO 4203 to be held until it is processed. Another 2:1 sampler 4201 accepts the 2:1 samples and forwards completed 4:1 samples to FIFO 4204. A further 2:1 sampler 4201 accepts the 4:1 samples and forwards completed 8:1 samples to FIFO 4205. Syncopating multiplexer 4206 nominally selects from the available data as each scaled row becomes available and forwards that data to the interleaved sample stream 4002, together with any signals needed to describe the interleaving pattern when that pattern is not embedded in the retention logic.

圖42B繪示了產生1：1、[…]：1、2：1、2[…]：1及4：1尺度之交錯串流所需的取樣及切分邏輯。將可縮放值之完全輸入陣列串流3601直接傳遞至FIFO 4202，該FIFO保留在處理其他尺度時所接收之任何值。[…]：1取樣器4207接受來自多個列之多個樣本以產生單個取樣值，該單個取樣值隨後在完成時轉發至FIFO 4208以供保留直至處理為止。用於構成經縮放值之列及樣本的實際數目將根據經取樣區域之無理寬度及高度而變化；一些輸入列及行用於產生多於一個輸出。2：1取樣器4201接受1：1樣本(繞過取樣器4207)且將經完成樣本轉發至FIFO 4203。另一2：1取樣器4201接受來自4207之[…]：1樣本且將經完成2[…]：1樣本轉發至FIFO 4209。另一2：1取樣器4201接受2：1樣本且將經完成4：1樣本轉發至FIFO 4204。切分多工器4210表面上在各經縮放列變得可用時自可用資料進行選擇，並且將該資料連同在交錯模式並未嵌入在保留邏輯中時描述該交錯模式所需之任何信號轉發至交錯樣本串流4002。 FIG. 42B illustrates the sampling and syncopation logic required to produce an interleaved stream of 1:1, […]:1, 2:1, 2[…]:1 and 4:1 scales. The full input array stream 3601 of scalable values is passed directly to FIFO 4202, which holds any values received while other scales are being processed. A […]:1 sampler 4207 accepts multiple samples from multiple rows to produce a single sampled value, which on completion is forwarded to FIFO 4208 to be held until it is processed. The actual number of rows and samples that make up a scaled value varies according to the irrational width and height of the sampled area; some input rows and columns contribute to more than one output. A 2:1 sampler 4201 accepts the 1:1 samples (bypassing sampler 4207) and forwards completed samples to FIFO 4203. Another 2:1 sampler 4201 accepts the […]:1 samples from sampler 4207 and forwards completed 2[…]:1 samples to FIFO 4209. A further 2:1 sampler 4201 accepts the 2:1 samples and forwards completed 4:1 samples to FIFO 4204. Syncopating multiplexer 4210 nominally selects from the available data as each scaled row becomes available and forwards that data to the interleaved sample stream 4002, together with any signals needed to describe the interleaving pattern when that pattern is not embedded in the retention logic.

圖43A繪示了自各種縮小之串流產生交錯串流所需的子取樣及切分邏輯。將可縮放值之完全輸入陣列串流3601直接傳遞至FIFO 4202，該FIFO保留在處理其他尺度時所接收之值。U：1取樣器4301接受來自1：1串流之多個列的多個樣本以產生單個值，該單個值隨後在完成時轉發至FIFO 4304以供保留直至處理為止。V：1取樣器4302接受U：1樣本且將經完成UV：1樣本轉發至FIFO 4305。W：1取樣器4303接受UV：1樣本且將經完成UVW：1樣本轉發至FIFO 4306。 FIG. 43A illustrates the subsampling and syncopation logic required to produce an interleaved stream from streams at various reductions. The full input array stream 3601 of scalable values is passed directly to FIFO 4202, which holds the values received while other scales are being processed. U:1 sampler 4301 accepts multiple samples from multiple rows of the 1:1 stream to produce a single value, which on completion is forwarded to FIFO 4304 to be held until it is processed. V:1 sampler 4302 accepts the U:1 samples and forwards completed UV:1 samples to FIFO 4305. W:1 sampler 4303 accepts the UV:1 samples and forwards completed UVW:1 samples to FIFO 4306.

切分多工器4307表面上在各尺度列變得可用時自可用資料進行選擇，並且將該資料連同在交錯模式並未嵌入在保留邏輯中時描述該交錯模式所需之任何信號轉發至交錯樣本串流4002。 Syncopating multiplexer 4307 nominally selects from the available data as the rows of each scale become available and forwards that data to the interleaved sample stream 4002, together with any signals needed to describe the interleaving pattern when that pattern is not embedded in the retention logic.

交錯資料可採用方便用於下游處理之任何形式或排序。經完成列之跨度可減小保留邏輯之複雜度，但在它們變得可用時產生個別經縮放樣本之序列亦為可行的；必須滿足之唯一要求為必須捕獲並呈現個別樣本之上下文，此將額外電路系統添加至保留區段(因為來自緊接在前的行之結果將並不始終可用且必須包括在上下文中)，但若對於特定節點類型存在一些優勢，則其可達成。若需要在進行至其他尺度之前針對全尺度輸入陣列產生輸出，則在FIFO 4202由立即直通替換的同時由FIFO 4304、4305及4306保留個別輸入陣列亦為可行的。 The interleaved data may take any form or ordering that is convenient for downstream processing. Spans of complete rows may reduce the complexity of the retention logic, but it is also feasible to emit a sequence of individual scaled samples as they become available. The only requirement is that the context for each individual sample be captured and presented; this adds circuitry to the retention section (because the result from the immediately preceding column will not always be available and must be included in the context), but it can be done if it offers some advantage for a particular node type. If output for the full-scale input array must be produced before proceeding to the other scales, it is also feasible for FIFOs 4304, 4305 and 4306 to retain their individual input arrays while FIFO 4202 is replaced by an immediate pass-through.

圖43B繪示了在另一實施例中產生交錯多尺度樣本串流。在圖43B中，W：1取樣器4303對U：1經縮小串流而非V：1經縮小串流進行取樣，從而產生至FIFO 4309之UW：1經縮小串流。V：1取樣器4302對1：1串流而非U：1經縮小串流進行取樣，從而產生至FIFO 4308之V：1經縮小串流。經縮小串流由切分多工器4307處理，從而產生交錯串流4002。 FIG. 43B illustrates the generation of an interleaved multi-scale sample stream in another embodiment. In FIG. 43B, W:1 sampler 4303 samples the U:1 downscaled stream rather than the V:1 downscaled stream, producing a UW:1 downscaled stream to FIFO 4309. V:1 sampler 4302 samples the 1:1 stream rather than the U:1 downscaled stream, producing a V:1 downscaled stream to FIFO 4308. The downscaled streams are processed by syncopating multiplexer 4307 to produce interleaved stream 4002.
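
The difference between the two sampler topologies can be made concrete by computing the overall reduction that reaches each FIFO. The sketch below is illustrative: the function name `effective_scales`, the dictionary layout, and the numeric ratios 2, 3 and 5 standing in for U, V and W are all hypothetical.

```python
def effective_scales(samplers):
    """Compute the overall reduction delivered by each sampler.

    `samplers` maps a sampler name to (source_name, ratio), where the
    source is either another sampler or "full" for the 1:1 stream.
    """
    scales = {"full": 1}

    def resolve(name):
        if name not in scales:
            source, ratio = samplers[name]
            scales[name] = resolve(source) * ratio
        return scales[name]

    for name in samplers:
        resolve(name)
    return scales


U, V, W = 2, 3, 5  # placeholder ratios for illustration only

# Chained arrangement: each sampler feeds the next (U, then UV, then UVW).
chained = effective_scales({"U": ("full", U), "V": ("U", V), "W": ("V", W)})

# Branched arrangement: V samples the 1:1 stream, W samples the U:1 stream.
branched = effective_scales({"U": ("full", U), "V": ("full", V), "W": ("U", W)})
```

With these placeholder values the chained arrangement yields reductions of 2, 6 and 30, while the branched arrangement yields 2, 3 and 10, matching the U:1, UV:1, UVW:1 and U:1, V:1, UW:1 streams described for the two figures.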

圖43C繪示了在又一實施例中產生交錯多尺度樣本串流。圖43C繪示了圖43A之所有元件，但添加額外1：T放大取樣器4310。因此，在圖43C之實施例中，存在保留在FIFO 4202中之1：1全尺度串流、由取樣器4301、4302及4303產生之三個經縮小串流以及由取樣器4310產生且保留在FIFO 4311中之一個經放大串流，所有五個串流藉由多工器4312交錯，從而產生多尺度樣本串流4002。熟習此項技術者應理解，在其他實施例中，可存在更多經放大串流，以及單個經放大或經縮小串流之多個取樣。 FIG. 43C illustrates the generation of an interleaved multi-scale sample stream in yet another embodiment. FIG. 43C illustrates all of the elements of FIG. 43A, but with the addition of an additional 1:T upsampling sampler 4310. Thus, in the embodiment of FIG. 43C, there is a 1:1 full-scale stream retained in FIFO 4202, three downscaled streams generated by samplers 4301, 4302, and 4303, and one upscaled stream generated by sampler 4310 and retained in FIFO 4311, all five streams being interleaved by multiplexer 4312 to generate multi-scale sample stream 4002. Those skilled in the art will appreciate that in other embodiments, there may be more upscaled streams, as well as multiple samples of a single upscaled or downscaled stream.

可組態元件之固定ASIC上的CNN模型之實施方式 Implementation of a CNN Model on a Fixed ASIC of Configurable Elements

在本發明之又一態樣中，提供一種系統，藉此利用數個重複管線電路元件的固定形式之電路可經組態以在輸入陣列串流上運算CNN。具有特定核心大小之一或多個卷積節點可與某一特定數目個並行輸入及輸出連接組合且經複製，以為並行輸入及輸出連接以及除直接實施之大小外的核心大小之其他組合提供基礎。 In yet another aspect of the invention, a system is provided whereby a fixed form circuit utilizing a number of repeated pipeline circuit elements can be configured to operate a CNN on a stream of input arrays. One or more convolution nodes having a particular core size can be combined with a particular number of parallel input and output connections and replicated to provide a basis for other combinations of parallel input and output connections and core sizes other than the size directly implemented.

在本發明之一個實施例中，提供一種特殊應用積體電路(ASIC)，其包含相同核心處理塊之許多複本，該等複本以所挑選之模式配置以允許根據實施庫存深度神經網路及定製深度神經網路二者所需的無阻礙前向連接。在本發明之其他實施例中，對存在於核心處理塊組中之核心大小之選擇經最佳化以支援特定模型形式。可包括額外輔助塊以執行除卷積外之計算。由於不需要結果向上游流動，因此並非所有塊皆需要可由所有其他塊存取。為了效率，塊以群組形式配置於ASIC上，該等群組可為線性列或其他形式以便於高效佈局。 In one embodiment of the invention, an application specific integrated circuit (ASIC) is provided that includes many copies of the same core processing block, which are configured in a pattern selected to allow unimpeded forward connections required for implementing both stock deep neural networks and custom deep neural networks. In other embodiments of the invention, the selection of core sizes present in the core processing block group is optimized to support specific model forms. Additional auxiliary blocks may be included to perform calculations other than convolutions. Because results are not required to flow upstream, not all blocks need to be accessible to all other blocks. For efficiency, blocks are configured on the ASIC in groups, which may be linear rows or other forms for efficient layout.

在圖44中描繪了本發明之實施例中的ASIC之一個實例，其中ASIC經圖解化，其中核心處理塊包含用於公共核心大小(在此實例中為3乘3)之管線卷積電路，經構造具有經選擇並行輸入及輸出連接大小，亦即在此實例中各自為16。至系統之主要輸入4401經呈現為多通道輸入串流。在一些實施例中，通道為RGB或YUV視訊之三個通道，並且經提供至輸入匯流排4402，且可經由多工器4403及4404來供塊4405之第一群組。 An example of an ASIC in an embodiment of the present invention is depicted in FIG. 44 , where the ASIC is diagrammatically illustrated where the core processing block includes pipeline convolution circuitry for a common core size (3 by 3 in this example), constructed with selected parallel input and output connection sizes, namely 16 each in this example. The primary input 4401 to the system is presented as a multi-channel input stream. In some embodiments, the channels are three channels of RGB or YUV video and are provided to input bus 4402 and may be provided to a first group of blocks 4405 via multiplexers 4403 and 4404.

熟習此項技術者應理解，互連展示為單線之匯流排實際上為具有至少部分地由實施在系統中之所要準確度判定之數目的並行導體。在連接匯流排之分支的情況下，連接由放大點展示。展示匯流排交叉之其他位置在匯流排交叉之間不存在連接。 Those skilled in the art will understand that a bus whose interconnection is shown as a single line is in fact a number of parallel conductors, the number being determined at least in part by the accuracy desired in the system as implemented. Where branches of a bus connect, the connection is shown by an enlarged dot. At other locations where buses are shown crossing, there is no connection between the crossing buses.

主要輸入4401可包含以下各者中的至少一者：直接攝影機介面、適合於自CPU匯流排存取之DMA介面、視訊串流解壓縮電路或僅限於並行通道之串流經串流傳輸至輸入匯流排4402中的其他串流介面。輸入串流無需為標稱16並行連接大小之倍數，此係因為未使用連接在不利用連接之電路之任何部分中被停用及忽略。 Primary input 4401 may comprise at least one of the following: a direct camera interface, a DMA interface suitable for access from a CPU bus, video stream decompression circuitry, or any other streaming interface by which a stream confined to the parallel channels is streamed into input bus 4402. The input stream need not be a multiple of the nominal parallel connection size of 16, because unused connections are disabled and ignored in every part of the circuit that does not make use of them.

在圖44中，在此實例中在實體構造方面相同之核心處理塊4405經展示配置成四列。在此實例中，存在展示於第一列(上部)中之六個核心處理塊。此第一數目個核心處理塊與ASIC上之其他核心處理塊群組的不同之處在於，此第一數目各自透過多工器4403耦接至輸入匯流排4402，並且針對CNN之第一層進行處理。應理解，透過多工器連接至輸入匯流排4402的此上部列中之核心處理塊之數目為任意數目。在一些最低限要求實施例中，一個便足夠了，但本發明之實施例中的ASIC之目標係能夠將ASIC用於具有廣泛多種輸入及輸出通道之情形且用於不同核心大小。 In FIG. 44 , core processing blocks 4405 that are identical in physical construction in this example are shown configured in four rows. In this example, there are six core processing blocks shown in the first row (upper). This first number of core processing blocks differs from other groups of core processing blocks on the ASIC in that each of these first number of core processing blocks is coupled to input bus 4402 via multiplexer 4403 and processes the first layer of the CNN. It should be understood that the number of core processing blocks in this upper row connected to input bus 4402 via multiplexer is arbitrary. In some minimum requirement embodiments, one is sufficient, but the goal of the ASIC in embodiments of the present invention is to enable the ASIC to be used in situations with a wide variety of input and output channels and for different core sizes.

在此實例中，核心處理塊與16個並行輸入連接、16個並行輸出連接相同，並且塊實施3乘3核心。下文描述了互連及方法，其中可存在任何數目個輸入及輸出通道，並且核心大小可不為3乘3，諸如5乘5、7乘7或9乘9。可透過組合多個3乘3核心之功能而提供適應性。 In this example, the core processing blocks are identical, each with 16 parallel input connections and 16 parallel output connections, and each block implements a 3 by 3 core. Interconnections and methods are described below by which any number of input and output channels can be accommodated and by which the core size can be other than 3 by 3, such as 5 by 5, 7 by 7 or 9 by 9. This adaptability is provided by combining the functions of multiple 3 by 3 cores.
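
That a larger core can in principle be assembled from 3 by 3 pieces follows from the linearity of convolution, and can be checked numerically. The sketch below is an assumption, not the interconnect scheme of the disclosure: it zero-pads the larger core to a multiple of three, cuts it into 3 by 3 pieces, and sums the partial results at the corresponding offsets. All function names are hypothetical.

```python
def correlate(image, kernel, row, col):
    """Weighted sum of `kernel` over `image`, anchored at (row, col)."""
    return sum(
        kernel[i][j] * image[row + i][col + j]
        for i in range(len(kernel))
        for j in range(len(kernel[0]))
    )


def split_into_3x3(kernel):
    """Zero-pad a square kernel to a multiple of 3 and cut it into 3x3 pieces.

    Returns a list of (row_offset, col_offset, piece) tuples.
    """
    size = len(kernel)
    padded = -(-size // 3) * 3  # round up to a multiple of 3
    grid = [
        [kernel[i][j] if i < size and j < size else 0 for j in range(padded)]
        for i in range(padded)
    ]
    return [
        (r, c, [line[c:c + 3] for line in grid[r:r + 3]])
        for r in range(0, padded, 3)
        for c in range(0, padded, 3)
    ]


def composite_response(image, kernel, row, col):
    """Sum of the 3x3 partial results; the image must cover the padded size."""
    return sum(
        correlate(image, piece, row + r, col + c)
        for r, c, piece in split_into_3x3(kernel)
    )
```

Under this particular decomposition a 5 by 5 core needs four 3 by 3 pieces and a 7 by 7 or 9 by 9 core needs nine; the padded positions carry zero weights and contribute nothing to the sum.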

第一列中之各核心處理塊4405透過多工器4403自輸入匯流排4402獲得呈16個並行輸入連接之組形式的主要輸入。熟習此項技術者應理解，將輸入提供至各核心處理塊4405之多工器相同，並且因此並非皆以元件編號標註。各核心處理塊亦自緊接在前的核心處理塊接受相同形式之輸入，其可或可不實體地在如圖44中所示之左側。各列之第一核心處理塊自以上列(除了上方不具有列之第一列以外)之輸出匯流排接受來自多工器4404之輔助輸入，此允許跨多列拆分複合核心。除了列中之最後一者以外，各核心處理塊4405將主要輸出提供至輸出匯流排4406，並且將同一輸出提供至下一鄰近核心處理塊。 Each core processing block 4405 in the first row receives its primary input from input bus 4402 via multiplexer 4403 in the form of a set of 16 parallel input connections. Those skilled in the art will appreciate that the multiplexers providing input to each core processing block 4405 are identical and therefore are not all labeled with component numbers. Each core processing block also accepts the same form of input from the immediately preceding core processing block, which may or may not be physically to the left as shown in FIG. 44 . The first core processing block in each row receives an auxiliary input from multiplexer 4404 from the output bus of the row above (except the first row which has no row above), which allows a composite core to be split across multiple rows. Each core processing block 4405, except the last one in the row, provides a primary output to an output bus 4406 and provides the same output to the next adjacent core processing block.

輸入匯流排4402及所有輸出匯流排4406為與用於各連接組之單個驅動器的單向連接，並且將輸入提供至多個多工器4403、4407、4409及4412且因此提供至核心處理塊及其他組件。匯流排決不用於雙向資料流，並且圖44中之描繪僅標示一列之輸出可用作至同一列之輸入。 Input bus 4402 and all output buses 4406 are unidirectional connections to a single driver for each connection set and provide input to multiple multiplexers 4403, 4407, 4409 and 4412 and hence to the core processing blocks and other components. The buses are never used for bidirectional data flow and the depiction in Figure 44 only indicates that the outputs of one row can be used as inputs to the same row.

輸出匯流排4406上之值經時間多工。當經組態模型使得呈現在連接之實體組上的值與系統之處理頻率相同時，各實體連接針對各陣列串流位置僅攜載單個資料通道。當經組態模型以小於處理頻率之某一倍數在連接之實體組上呈現值時，各實體連接可針對各陣列串流位置攜載多個經時間多工值或可針對多個處理循環保持單個值常量。時間多工之各形式對於具有大量輸入及輸出通道之模型節點的實施方式為有利的。 The values on output bus 4406 are time multiplexed. When the configured model presents values on a physical group of connections at the same rate as the processing frequency of the system, each physical connection carries only a single data channel for each array stream position. When the configured model presents values on a physical group of connections at a rate lower than the processing frequency by some multiple, each physical connection may carry multiple time-multiplexed values for each array stream position, or may hold a single value constant for multiple processing cycles. Each form of time multiplexing is advantageous for implementing model nodes with large numbers of input and output channels.

作為例示性實例，其中處理路徑流經四個2乘2 MaxPool節點之模型具有以256之準確因數減小的資料速率，並且可在各實體連接上攜載256個資料通道。藉由添加組態邏輯及多個索引將利用此潛在配置之規定建置至各核心處理塊以及輔助功能塊中使得各塊可適應所呈現之組態。最終結果為實施大型模型所需之該組實體塊減小多個數量級。在下文關於圖54給出可組態時間多工之使用的完整描述。 As an illustrative example, a model in which the processing path flows through four 2x2 MaxPool nodes has a data rate reduced by an exact factor of 256 and can carry 256 data channels on each physical connection. Provision is built into each core processing block and auxiliary function blocks to exploit this potential configuration by adding configuration logic and multiple indexes so that each block can adapt to the configuration presented. The end result is that the set of physical blocks required to implement large models is reduced by multiple orders of magnitude. A full description of the use of configurable time multiplexing is given below with respect to Figure 54.
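
The factor of 256 follows from multiplying the window areas of the pooling nodes on the path, as the sketch below shows. The second function is a rough, hypothetical estimate of how many 16-wide physical connection groups a given channel count would occupy if every connection were fully time multiplexed; it is an assumption for illustration and not a formula from the disclosure.

```python
import math


def rate_reduction(pool_windows):
    """Data-rate reduction along a path of tiled pooling nodes.

    `pool_windows` is a list of (rows, cols) window sizes; a tiled
    rows x cols pool emits one value for every rows * cols inputs.
    """
    factor = 1
    for rows, cols in pool_windows:
        factor *= rows * cols
    return factor


def connection_groups_needed(channels, group_width, reduction):
    """Estimate of physical connection groups required (assumption).

    Each physical connection is assumed to carry up to `reduction`
    time-multiplexed channels per array stream position.
    """
    return math.ceil(channels / (group_width * reduction))


factor = rate_reduction([(2, 2)] * 4)  # four 2x2 MaxPool nodes
```

Here `factor` is 256. Under the stated assumption, 512 channels would occupy 32 groups of 16 connections at the full data rate but a single group after the four pooling nodes.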

雙多工器4407在此實例中將二組16個並行輸入連接提供至輔助功能塊4408，並且等效於4403個多工器中之二者。16個並行連接之輸出由各輔助功能塊4408提供至輸出匯流排4406上的16個連接之組中之一者。 Dual multiplexer 4407 in this example provides two groups of 16 parallel input connections to auxiliary function block 4408 and is equivalent to two of the multiplexers 4403. Each auxiliary function block 4408 provides an output of 16 parallel connections to one of the groups of 16 connections on output bus 4406.

16個並行輸出連接之一或多個組(描繪為3)由多工器4409提供至可或可不駐存於同一ASIC晶粒上或同一IC封裝中之外部功能電路系統4410。16個並行輸入連接4411之一或多個組從外部功能電路系統4410返回作為相異之匯流排連接，並且在輸出匯流排4406上可用。 One or more sets of 16 parallel output connections (depicted as 3) are provided by multiplexer 4409 to external functional circuitry 4410 which may or may not reside on the same ASIC die or in the same IC package. One or more sets of 16 parallel input connections 4411 are returned from external functional circuitry 4410 as distinct bus connections and are available on output bus 4406.

各列核心處理塊之輸出匯流排4406將輸入提供至下一列，從而將輸入匯流排4402之功能替換為第一列，並且與彼列之輸出匯流排相結合，可經由多工器4403、4404及4407來供彼列之核心處理塊使用，且可經由多工器4409來供彼列之外部功能介面使用。 The output bus 4406 of each row of core processing blocks provides input to the next row, taking over the role that input bus 4402 plays for the first row. Combined with that next row's own output bus, it is available to the core processing blocks of that row through multiplexers 4403, 4404 and 4407, and to the external function interface of that row through multiplexer 4409.

由於處理資料串流之時間管線之固有本質為資料並不向上游流動，因此並非所有可能連接必需為可用的。若藉由以自左至右次序分配使用來完成各特定塊之分派，則可省略各匯流排4406之不可用連接，同時實現以無功能成本減小總體電路大小。各核心處理塊之輸出隨後可供下方列上之所有核心處理塊及在同一列上產生輸出的核心處理塊右側之任何核心處理塊使用。 Because it is inherent in a temporal pipeline processing a data stream that data does not flow upstream, not all possible connections need to be available. If blocks are assigned to specific uses in left-to-right order, the unusable connections of each bus 4406 can be omitted, reducing overall circuit size at no cost in functionality. The output of each core processing block is then available to all core processing blocks on the row below and to any core processing block to the right of the producing block on the same row.

多工器4412選擇16個並行輸入連接之一或多個組且將其提供至主要輸出電路系統4413，該主要輸出電路系統可包含最終輸出處理(通常為SoftMax)且可經由直接並行連接、標準SERDES(串列器/解串列器)、直接記憶體存取(DMA)或產生模型之已知結果所需的任何其他介面輸出結果。 Multiplexer 4412 selects one or more groups of the 16 parallel input connections and provides them to the main output circuitry 4413, which may include final output processing (typically SoftMax) and may output results via direct parallel connections, standard SERDES (serializer/deserializer), direct memory access (DMA), or any other interface required to produce a known result for the model.

圖45描繪了將處理16個並行輸入及輸出連接之固定3乘3核心的核心處理塊組合以組態具有更多輸入、輸出或二者之3乘3核心的配置。左上方之前二個核心處理塊4501及4502經組態以(藉由經由多工器進行選擇，未展示)獲得16個輸入連接之同一組，以產生16個並行輸出連接之二個不同組中的32個輸出通道。下二個核心處理塊4503及4504經組態以藉由將來自第一核心處理塊4503之部分和傳遞至第二核心處理塊4504中而自應用於16個輸入通道之二個不同組的不同權重運算16個輸出通道，其中部分和與剩餘部分和組合以產生單個並行輸出連接組上之輸出。最終四個塊4505、4506、4507及4508經組態為自16個並行輸入連接之二個組獲得輸入以產生具有總共32個輸入及32個輸出之3乘3卷積的部分和對之二個組。 Figure 45 depicts arrangements that combine core processing blocks, each a fixed 3 by 3 kernel handling 16 parallel input and output connections, to configure 3 by 3 kernels with more inputs, more outputs, or both. The first two core processing blocks 4501 and 4502 at the upper left are configured (by selection through multiplexers, not shown) to take the same set of 16 input connections and produce 32 output channels on two different sets of 16 parallel output connections. The next two core processing blocks 4503 and 4504 are configured to compute 16 output channels from different weights applied to two different sets of 16 input channels: partial sums from the first block 4503 are passed into the second block 4504, where they are combined with the remaining partial sums to produce outputs on a single set of parallel output connections. The final four blocks 4505, 4506, 4507 and 4508 are configured to take inputs from two sets of 16 parallel input connections and produce two sets of paired partial sums for a 3 by 3 convolution with a total of 32 inputs and 32 outputs.

圖45中之3乘3核心處理塊之組合在實務中藉由將在各配置中之核心處理塊連接至匯流排的多工器之操作而達成。熟習此項技術者應理解，3乘3卷積管線節點可藉由將一組任意大核心處理塊成群在一起而經組態有任何數目個輸入及輸出通道。其輸入並非資料通道組大小(標稱地為16)之整數倍的任何節點將藉由在運算中將未使用通道之權重設定成零或以其他方式停用該等未使用通道而簡單地忽略該等未使用通道。其輸出並非通道組大小之整數倍的任何節點可避免運算未使用通道之值，該等未使用通道將由下游節點忽略。 The combination of 3x3 core processing blocks in Figure 45 is achieved in practice by the operation of multiplexers that connect the core processing blocks in each configuration to a bus. Those skilled in the art will appreciate that a 3x3 convolution pipeline node can be configured with any number of input and output channels by grouping together an arbitrarily large set of core processing blocks. Any node whose inputs are not integer multiples of the data channel group size (nominal 16) will simply ignore the unused channels by setting their weights to zero or otherwise disabling them in computations. Any node whose outputs are not integer multiples of the channel group size can avoid computing the values of the unused channels, which will be ignored by downstream nodes.
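The grouping rule described above can be illustrated with a short calculation. This is a sketch under stated assumptions: it covers a single 3 by 3 node with no time multiplexing, it assumes one core processing block per pairing of an input group with an output group (which matches the three Figure 45 arrangements), and the function name `tiles_for_node` is hypothetical.

```python
from math import ceil

GROUP_SIZE = 16  # nominal channel group size used in this example

def tiles_for_node(in_channels, out_channels, group_size=GROUP_SIZE):
    """Return (blocks_needed, unused_input_channels) for one 3 by 3 node.

    Assumes one block per (input group, output group) pair and no
    time multiplexing. Unused input channels are those whose weights
    would be set to zero or otherwise disabled.
    """
    in_groups = ceil(in_channels / group_size)
    out_groups = ceil(out_channels / group_size)
    unused_inputs = in_groups * group_size - in_channels
    return in_groups * out_groups, unused_inputs

print(tiles_for_node(16, 32))  # same inputs, two output groups (blocks 4501, 4502)
print(tiles_for_node(32, 16))  # two input groups, one output group (blocks 4503, 4504)
print(tiles_for_node(32, 32))  # blocks 4505 to 4508
print(tiles_for_node(3, 32))   # RGB input: 13 of 16 input connections unused
```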

並行連接組大小(在此實例中為16)之挑選為任意的，並且經選擇粒度之唯一效應為給定模型之完整組態中的未使用連接之分率。對於將組大小限於二之冪，並未提出任何要求，並且並未預見任何優勢。實體佈局之一些較高效率可藉由使所有連接具有相同寬度來達成，但甚至此亦並非要求，並且連接大小之不規則或變化的寬度在一些情況下可為最佳的。 The choice of the parallel connection group size (16 in this example) is arbitrary, and the only effect of the chosen granularity is the fraction of unused connections in the full configuration of a given model. No requirement is made, and no advantage is foreseen, for limiting the group size to a power of two. Some greater efficiency of physical layout can be achieved by making all connections the same width, but even this is not a requirement, and irregular or varying widths of connection sizes may be optimal in some cases.

圖46繪示了核心處理塊4405之內部結構。由多工器4403(未展示，參見圖44)選擇之並行輸入連接組4601經拆分成不同的單連接(在此實例中為16)，並且各個別連接經提供至大量乘法器4603，該大量乘法器運算可能來自彼輸入之全組倍數。所有輸入之所有倍數連同來自輔助連接組4602之單個連接一起提供至各卷積單元4604。各卷積單元之所得輸出經分組在一組16個並行輸出連接4605中，並且可供輸出匯流排4406(未展示，參見圖44)上之其他塊使用。 Figure 46 illustrates the internal structure of core processing block 4405. The set of parallel input connections 4601 selected by multiplexer 4403 (not shown, see Figure 44) is split into individual connections (16 in this example), and each individual connection feeds a bulk multiplier 4603, which computes the full set of multiples that may be needed from that input. All multiples of all inputs are provided to each convolution unit 4604, together with a single connection from the auxiliary connection set 4602. The resulting outputs of the convolution units are grouped into a set of 16 parallel output connections 4605 and made available to other blocks on output bus 4406 (not shown, see Figure 44).

圖47展示了各卷積單元4604之內部結構。全組輸入乘積4701由內部匯流排4703提供至此3乘3實例中所使用之九個權重求和胞元4704中之各者。如同用於卷積之管線電路系統之先前所揭露版本，一對陣列串流寬度FIFO 4705延遲部分乘積和以適應對於單個電路處理如由輸入串流所呈現之3乘3小塊之資料的要求。在此實例中，各卷積之最終和與加法器4707中之偏置值連同在由FIFO 4706進行之延遲之後自輔助輸入4702接收之任擇值組合。一旦最終和與偏置及任擇輔助輸入組合，和便輸出4708至包含塊，其中它將形成並行輸出連接組之一個元素。 FIG. 47 shows the internal structure of each convolution unit 4604. The full set of input products 4701 is provided by internal bus 4703 to each of the nine weighted sum cells 4704 used in this 3 by 3 example. As with previously disclosed versions of pipeline circuitry for convolution, a pair of array stream width FIFOs 4705 delay the partial sums of products to accommodate the requirement for a single circuit to process 3 by 3 blocks of data as presented by the input stream. In this example, the final sum of each convolution is combined with a bias value in adder 4707 together with an optional value received from auxiliary input 4702 after delay by FIFO 4706. Once the final sum is combined with the bias and optional auxiliary inputs, the sum is output 4708 to the containing block, where it forms an element of the parallel output connection set.

所有FIFO 4705及4706按輸入串流位置而非按時脈循環延遲值，並且傳遞給定的經時間多工資料。在上文圖45中所描述之實例中，FIFO 4706可經組態以在無延遲的情況下傳遞資料。呈現FIFO 4706以支援建構複合核心，在圖51中所繪示且在下文所描述。 All FIFOs 4705 and 4706 delay values by input stream position rather than by clock cycle, and pass through whatever time-multiplexed data they are given. In the example described above for Figure 45, FIFO 4706 may be configured to pass data with no delay. FIFO 4706 is provided to support the construction of compound kernels, illustrated in Figure 51 and described below.

圖48描繪了應用所呈現輸入通道倍數4801中之各者且經由組態暫存器4802及多工器4803來選擇一個特定倍數的單個求和胞元4704。經選擇倍數隨後在經轉發至加法器4806之前傳遞通過由組態暫存器4804控制之可變移位暫存器4805，其中經選擇倍數與個別輸入通道之所有其他經選擇且經縮放倍數組合。 Figure 48 depicts a single summing cell 4704 applying each of the presented input channel multiples 4801 and selecting a particular multiple via configuration register 4802 and multiplexer 4803. The selected multiple is then passed through a variable shift register 4805 controlled by configuration register 4804 before being forwarded to adder 4806 where it is combined with all other selected and scaled multiples for the individual input channels.

可組態索引4802及可組態尺度4804之組合允許將浮點權重值應用於固定點特徵或像素值。與經組態偏置值4807組合的每輸入通道一個權重之求和完成卷積之此步驟隨時間推移的運算。求和胞元4704亦可將經由FIFO 4705來自先前列之經延遲值4808與來自左側胞元之經轉發部分和4809相加。當二者存在時，它們藉由加法器4810組合。在第一列胞元上，上方不存在列，因此經延遲連接4808將不存在，並且將省略加法器4810，且將經轉發值4809直接連接至加法器4811，其中值與偏置值4807組合。在第一行胞元上，左側不存在行，經轉發值不存在，並且再次省略加法器4810。在單個左上胞元中，既不存在經延遲連接亦不存在經轉發連接，省略加法器4810及4811二者，並且將偏置值4807直接路由至最終加法器4812，其中偏置值與加法器樹之和組合以形成輸出4813。 The combination of configurable index 4802 and configurable scale 4804 allows floating-point weight values to be applied to fixed-point feature or pixel values. Summing one weighted value per input channel, combined with the configured bias value 4807, completes this step of the convolution computed over time. Summing cell 4704 may also add the delayed value 4808, arriving from the previous row via FIFO 4705, to the partial sum 4809 forwarded from the cell to its left. When both are present, they are combined by adder 4810. In the first row of cells there is no row above, so delayed connection 4808 is absent, adder 4810 is omitted, and forwarded value 4809 is connected directly to adder 4811, where it is combined with bias value 4807. In the first column of cells there is no column to the left, so the forwarded value is absent and adder 4810 is again omitted. In the single top-left cell there is neither a delayed nor a forwarded connection, so both adders 4810 and 4811 are omitted and bias value 4807 is routed directly to final adder 4812, where it is combined with the sum from the adder tree to form output 4813.
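The index-and-scale mechanism can be modelled in a few lines. The sketch below is illustrative only and makes several assumptions not stated in the text: weights are non-negative, the scale is a left shift, the table of multiples holds 16 entries, and the names `bulk_multiples` and `summing_cell` are hypothetical. In the hardware the multiples would be computed once per input and shared; here they are recomputed for simplicity.

```python
def bulk_multiples(x, index_bits=4):
    """Every multiple of x that an index of `index_bits` bits can select."""
    return [x * m for m in range(1 << index_bits)]

def summing_cell(inputs, indices, shifts, bias=0, carried=0):
    """Model one summing cell: select a multiple per channel, scale it, and sum.

    indices : per-channel index (role of configuration register 4802)
    shifts  : per-channel left shift (role of configuration register 4804)
    bias    : configured bias value (role of 4807)
    carried : delayed and/or forwarded partial sums, if any
    """
    total = bias + carried
    for x, idx, shift in zip(inputs, indices, shifts):
        multiple = bulk_multiples(x)[idx]  # multiplexer picks one multiple
        total += multiple << shift         # shifter applies the scale
    return total

# Effective weights are 5 * 2**1 = 10 and 2 * 2**0 = 2.
result = summing_cell([3, 5], indices=[5, 2], shifts=[1, 0], bias=7)
print(result)
```

The same result is obtained by direct arithmetic, 3 * 10 + 5 * 2 + 7, which is the sense in which an index plus a shift stands in for a weight with a small mantissa and an exponent.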

本發明不要求對待使用之大量乘法器起作用。若大匯流排寬度及大多工器之組合為繁重的，則實際輸入值可經分散(經由小得多的匯流排)，並且可使用本地雙輸入固定乘法器，其中組態要相乘之實際值而非索引。移位器及尺度暫存器經保持以允許全浮點權重待自固定點乘積應用。 The invention does not depend on the use of a bulk multiplier in order to function. If the combination of wide buses and large multiplexers proves burdensome, the actual input values can be distributed instead (over a much smaller bus) and local two-input fixed multipliers used, configured with the actual value to multiply by rather than an index. The shifter and scale register are retained so that full floating-point weights can still be applied to the fixed-point products.

若系統之經選擇精確度極大，例如，尾數中之12位元精確度，則大量乘法器及輸送匯流排可為過大的。在此情況下，可使用具有相對應地較小匯流排之較小大量乘法器，隨後對其進行多次索引並求和以產生各所需乘積。在此情況下，可將6位元大量乘法器與二個索引值及二個(小得多的)多工器一起使用以選擇隨後與適當縮放組合以產生與12位元大量乘法器相同之結果的二個獨立倍數。 If the chosen precision of the system is very large, e.g. 12 bits of precision in the mantissa, then the bulk multiplier and transport bus may be oversized. In this case, a smaller bulk multiplier with a correspondingly smaller bus may be used, which is then indexed multiple times and summed to produce each desired product. In this case, a 6-bit bulk multiplier may be used with two index values and two (much smaller) multiplexers to select two independent multiples which are then combined with appropriate scaling to produce the same result as the 12-bit bulk multiplier.
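The equivalence between one 12-bit lookup and two 6-bit lookups can be checked numerically. This is a minimal sketch assuming an unsigned 12-bit weight mantissa split into a high and a low 6-bit index; the function name `split_multiply` is hypothetical.

```python
def split_multiply(x, weight12):
    """Multiply x by a 12-bit weight using only a 6-bit table of multiples.

    The weight is split into two 6-bit indices. Each index selects one
    multiple from the same 64-entry table, and the high part is scaled
    by 2**6 before the two are summed.
    """
    if not 0 <= weight12 < (1 << 12):
        raise ValueError("weight must fit in 12 bits")
    multiples = [x * m for m in range(1 << 6)]  # 6-bit bulk multiplier output
    hi, lo = weight12 >> 6, weight12 & 0x3F     # the two index values
    return (multiples[hi] << 6) + multiples[lo]

print(split_multiply(37, 2741) == 37 * 2741)
```

The table shrinks from 4096 entries to 64 at the cost of a second multiplexer, one shift and one addition per product.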

系統之操作整體上並不受乘法方法之挑選影響，並且僅自塊之輸入4601至各求和胞元之縮放器4805的路徑受影響。可推遲乘法方法之挑選及大量乘法器匯流排寬度之組態(若使用)，直至ASIC呈佈局形式且就大小及功率而言之最佳形式經判定為止。 The operation of the system as a whole is not affected by the choice of multiplication method, and only the path from the block input 4601 to the scaler 4805 of each summing cell is affected. The choice of multiplication method and the configuration of the bulk multiplier bus width (if used) can be deferred until the ASIC is laid out and the optimum form in terms of size and power is determined.

圖49描繪了輔助功能塊4408之內部結構。在此實例中，如由雙多工器4407選擇之主要並行輸入連接組4901透過可變FIFO 4902路由，藉由組態設定延遲，並且任擇地藉由加法器4904與亦來自多工器4407之輔助輸入4903組合。若在當前組態中不需要和，則多工器4905可將經延遲輸入值直接饋送至公共查找表4906中。 Figure 49 depicts the internal structure of auxiliary function block 4408. In this example, the main parallel input connection set 4901 as selected by dual multiplexer 4407 is routed through variable FIFO 4902, delayed by configuration, and optionally combined with auxiliary input 4903 also from multiplexer 4407 by adder 4904. If the sum is not required in the current configuration, multiplexer 4905 can feed the delayed input value directly into common lookup table 4906.

公共查找表4906為經組態有用於各可能索引值至某一任意輸出值之映射的一組靜態暫存器。輸入通道中之各者控制多工器進入此組暫存器輸出，使得相同功能個別地應用於所有輸入通道。此允許實施模型所需之任何啟動函數，諸如但不限於RELU、S型或雙曲正切，而不必包括各種電路系統來運算各此類函數。 Common lookup table 4906 is a set of static registers configured with a mapping from each possible index value to an arbitrary output value. Each input channel controls its own multiplexer onto the outputs of this register set, so that the same function is applied to every input channel individually. This allows any activation function a model requires, such as but not limited to RELU, sigmoid, or hyperbolic tangent, to be implemented without including separate circuitry to compute each such function.
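A software analogue of the shared lookup table is sketched below. It assumes 8-bit two's-complement values indexed by their raw bit pattern; the bit width and the names `build_lut` and `apply_lut` are assumptions made for illustration and are not taken from the disclosure.

```python
def build_lut(fn, bits=8):
    """Tabulate fn over every signed value of the given width.

    Entry i holds fn applied to the signed value whose unsigned
    bit pattern is i.
    """
    size = 1 << bits
    half = size >> 1
    return [fn(i - size if i >= half else i) for i in range(size)]

def apply_lut(lut, channels, bits=8):
    """Apply the one shared table to every channel independently."""
    mask = (1 << bits) - 1
    return [lut[v & mask] for v in channels]

relu_lut = build_lut(lambda v: max(v, 0))
out = apply_lut(relu_lut, [-3, 0, 5, -128, 127])
print(out)
```

Swapping the activation function means loading different table contents; the lookup path itself is unchanged.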

若查找表不用於此塊之組態中，則輸入可經由多工器4907來繞過該查找表，並且查找4906可經停用以節省功率。 If the lookup table is not used in the configuration of this block, the input can bypass the lookup table via multiplexer 4907 and lookup 4906 can be disabled to save power.

MaxPool 4908、平均4909、取樣4910、擴展4911及旁路或通道串接函數中之僅一者可由可組態多工器4914選擇以用於所有通道，並且形成驅動輸出匯流排4406上之一組並行輸出連接的輸出4915。 Only one of the MaxPool 4908, average 4909, sample 4910, expand 4911, and bypass or channel concatenation functions can be selected by configurable multiplexer 4914 for all channels, forming output 4915, which drives one set of parallel output connections on output bus 4406.

MaxPool功能區塊4908允許計算輸入串流之可組態小塊隨時間推移的最大值，以及直接實施具有減小下游資料速率之效應的步幅機制。 MaxPool function block 4908 allows the maximum value over a configurable patch of the input stream to be computed over time, and directly implements a stride mechanism that has the effect of reducing the downstream data rate.

平均功能區塊4909提供可組態小塊上之平均值之計算，並且可與MaxPool 4908共用電路系統。取樣功能區塊4910提供輸入陣列之剪裁以及跳過步幅，並且亦可與MaxPool 4908及平均區塊4909共用電路系統。 Average function block 4909 computes the average over a configurable patch and can share circuitry with MaxPool 4908. Sample function block 4910 provides cropping of the input array as well as stride skipping, and can also share circuitry with MaxPool 4908 and average block 4909.

實施非統一步幅機制之各功能區塊有效地減小輸入陣列串流之維度。由於串流之圖框速率整體上恆定，故此按比例增加由下游節點處理各輸入串流位置所允許之時間。此為匯流排連接之時間多工使用提供了基礎，並且規定了可用的時間多工程度。核心處理塊可利用經增加時間間隔來在固定並行連接組上傳送並處理多個資料通道。權重經擴展以適應額外值，但乘法器及累加器被重複使用，從而允許節點之輸入及/或輸出通道之擴展而不需要使用多個核心處理塊。控制邏輯座標隨時間推移而協調經連接塊之間的值之部署。 Each function block that implements a non-unity stride effectively reduces the dimensions of the input array stream. Since the frame rate of the stream as a whole is constant, this proportionally increases the time a downstream node is allowed for processing each input stream position. This provides the basis for time-multiplexed use of bus connections and determines the degree of time multiplexing available. A core processing block can use the increased interval to transfer and process multiple data channels over a fixed set of parallel connections. The weight storage is expanded to hold the additional values, but the multipliers and accumulators are reused, allowing a node's input and/or output channels to be expanded without using multiple core processing blocks. Control logic coordinates the placement of values between connected blocks over time.

擴展功能區塊4911提供輸入陣列之填充以及複製值，以將經減小輸入陣列串流與未減小輸入陣列串流對準。 Extension function block 4911 provides padding of input arrays and copying of values to align the reduced input array stream with the unreduced input array stream.

專用多工器4912將自主要並行輸入連接4901獲得之通道與自輔助並行輸入連接4903獲得之通道串接，以有效地將通道重新路由至特定並行輸出連接中。在每塊使用固定16個輸入通道之本實例中，多工器4912可自輸入4901獲得1至15個最低通道，並且自輸入4903獲得對應的15至1個最低通道，以產生待由多工器4914選擇之單組16個通道。由於當所有所使用通道置放於各組16個並行連接中可用之最低位置處時資料通道至核心之特定位置的分配為任意的(任何輸入資料通道及對應權重可在不改變結果之情況下佔據核心中之任何位置)，因此，此配置足以執行待實施的CNN模型所需之所有重新路由。此多工器之使用限於具有等效時間多工之資料通道。 Dedicated multiplexer 4912 concatenates channels obtained from primary parallel input connection 4901 with channels obtained from auxiliary parallel input connection 4903 to effectively reroute channels to specific parallel output connections. In this example using a fixed 16 input channels per block, multiplexer 4912 can obtain 1 to 15 lowest channels from input 4901 and the corresponding 15 to 1 lowest channels from input 4903 to produce a single set of 16 channels to be selected by multiplexer 4914. Since the assignment of data channels to specific locations in the core is arbitrary (any input data channel and corresponding weight can occupy any location in the core without changing the result) when all used channels are placed at the lowest location available in each set of 16 parallel connections, this configuration is sufficient to perform all the rerouting required by the CNN model to be implemented. The use of this multiplexer is limited to data channels with equivalent time multiplexing.
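The selection performed by the concatenation multiplexer can be sketched as list slicing. This is a simplified model assuming each connection set is represented as a 16-element list with the used channels packed at the lowest positions; the function name `concat_lowest` is hypothetical.

```python
GROUP = 16  # fixed number of channels per block in this example

def concat_lowest(primary, auxiliary, n_primary):
    """Take the n_primary lowest channels of `primary` and fill the
    remaining positions with the lowest channels of `auxiliary`.
    """
    if not 1 <= n_primary <= GROUP - 1:
        raise ValueError("n_primary must be between 1 and 15")
    return primary[:n_primary] + auxiliary[:GROUP - n_primary]

primary = list(range(16))         # channels 0..15 from input 4901
auxiliary = list(range(100, 116)) # channels 100..115 from input 4903
merged = concat_lowest(primary, auxiliary, 3)
print(merged)
```

Because the position of a channel within a group does not affect the result once its weight is placed at the matching position, packing from the lowest position upward is sufficient.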

另一專用多工器4913藉由使來自各並行輸入連接組之樣本交替而將二個輸入通道串接成一個輸出通道。使用此電路需要將二個通道上之各輸入值保留二個或更多個串流處理循環。輸出值將保留一半，只要傳入值即可，並且僅佔據匯流排上之單個並行輸出連接組。 Another dedicated multiplexer 4913 concatenates two input channels into one output channel by alternating samples from each set of parallel input connections. Using this circuit requires that each input value on both channels be held for two or more stream processing cycles. Each output value is held for half as long as the incoming values, and the result occupies only a single set of parallel output connections on the bus.

若不藉由可組態多工器4914選擇小塊函數或串接，則塊在旁路模式中進行操作，該旁路模式可適用於在並行輸入連接組與其將用於後續列上之核心處理塊之間提供經延遲路徑。 If neither a patch function nor concatenation is selected by configurable multiplexer 4914, the block operates in bypass mode, which can be useful for providing a delayed path between a set of parallel input connections and the core processing block on a subsequent row that will consume it.

圖50描繪了實施包括在輔助塊中之小塊函數中之任一者所需的元件。單個輸入通道5001連同由小塊功能電路系統5002自先前列發射之延遲值一起饋送至彼小塊電路系統中，並且經輸出5004作為各小塊之最終值。此允許運算均勻地處理所有輸入串流位置之任何函數。 Figure 50 depicts the elements needed to implement any of the patch functions included in the auxiliary block. A single input channel 5001 is fed into patch function circuitry 5002 together with the delayed values that circuitry emitted for the previous row, and the final value of each patch is output 5004. This allows computation of any function that treats all input stream positions uniformly.

對於MaxPool 4908，當小塊函數經定義為橫跨多個列時，函數選擇所見之最大值，並且FIFO 5003用於使來自先前列之小塊之值與當前列相關聯。對於平均4909，函數累加各小塊內之所有值之和，並且FIFO 5003用於呈現來自先前列之和。對於取樣4910，在各小塊內之一個經選擇位置將作為小塊整體之值輸出的情況下，當經選擇值不落在所處理小塊之最後列上時，FIFO 5003用於保留來自先前列之經選擇值。此等小塊函數中之各者皆僅為每小塊產生單個值，並且具有減小傳遞至下游之輸入串流陣列之面積及維持串流處理所需之輸貫量二者的最終效應。 For MaxPool 4908, where the patch function is defined to span multiple rows, the function selects the maximum value seen, and FIFO 5003 is used to associate the values of patches from the previous row with the current row. For average 4909, the function accumulates the sum of all values within each patch, and FIFO 5003 presents the sums from the previous row. For sample 4910, where one selected position within each patch is output as the value of the whole patch, FIFO 5003 holds the selected value from the previous row when that value does not fall on the last row of the patch being processed. Each of these patch functions produces only a single value per patch, with the net effect of reducing both the area of the input stream array passed downstream and the throughput required to sustain stream processing.
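The role of the row FIFO in a streaming MaxPool can be illustrated in software. The sketch below handles a single channel arriving in row-major order and assumes the array width is a multiple of the patch size; the function name `stream_maxpool` is hypothetical, and the `deque` stands in for FIFO 5003.

```python
from collections import deque

def stream_maxpool(stream, width, patch=2):
    """Yield one maximum per patch-by-patch tile from a row-major stream."""
    fifo = deque()   # per-patch maxima carried from earlier rows of the tile
    running = None   # maximum seen so far within the current row of a patch
    for pos, value in enumerate(stream):
        row, col = divmod(pos, width)
        running = value if col % patch == 0 else max(running, value)
        if col % patch == patch - 1:       # last column of this patch
            if row % patch != 0:           # merge with the earlier rows
                running = max(running, fifo.popleft())
            if row % patch == patch - 1:   # last row: emit the final value
                yield running
            else:                          # otherwise carry to the next row
                fifo.append(running)

# A 4 by 4 array holding 1..16 in row-major order.
pooled = list(stream_maxpool(range(1, 17), width=4))
print(pooled)
```

Sixteen input positions produce four outputs, which is the reduction in downstream data rate that makes time multiplexing available. Replacing `max` with addition gives the accumulating form described for the average function.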

對於擴展4911，將來自輸入串流之各個別位置之值重複至輸出串流一或多次，並且FIFO 5003用於保留先前遇到之值以在後續列上再次重複，其中該函數經定義為產生多於一個列。此函數具有如下效應：增加傳遞至下游之輸入串流陣列之面積及維持串流處理所需之輸貫量，但當通過該模型之不同路徑具有不同大小但需要重新組合時需要匹配陣列大小。 For expand 4911, the value at each individual position of the input stream is repeated one or more times in the output stream, and FIFO 5003 retains previously encountered values so they can be repeated again on subsequent rows where the function is defined to produce more than one row. This function increases both the area of the input stream array passed downstream and the throughput required to sustain stream processing, but it is needed to match array sizes when different paths through the model have different sizes yet must be recombined.

圖51描繪了連接至外部電路系統之配置。輸入5101表示一組外部並行輸入連接，並且可處於並行緩衝器、SERDES或對於連接至其他電路可為方便的其他形式中之任一者中，且將該並行輸出連接組提供至輸出匯流排4406。可組態多工器5102自匯流排4406選擇一個並行輸入連接組，並且將其傳遞至可為任何輸出形式之外部輸出5103。預期外部輸入及輸出中之至少一些相互相容，使得它們可用於連接ASIC之多個單元以構成過大而不能適應所提供之塊的模型。 Figure 51 depicts an arrangement for connecting to external circuitry. Input 5101 represents a set of external parallel input connections, which may take the form of parallel buffers, SERDES, or whatever other form is convenient for connecting to other circuits, and presents that set as parallel output connections on output bus 4406. Configurable multiplexer 5102 selects one set of parallel input connections from bus 4406 and passes it to external output 5103, which may take any output form. It is expected that at least some of the external inputs and outputs are mutually compatible, so that they can be used to connect multiple units of the ASIC to build models too large to fit in the blocks provided.

外部電路系統之重要用途為提供未包括在輔助功能塊4408中之任意功能。此特徵有效地使ASIC「永不過時」，並且規避了預測模型設計中未來開發之結果的要求。 An important use of the external circuitry is to provide arbitrary functionality not included in the auxiliary function block 4408. This feature effectively makes the ASIC "future-proof" and circumvents the requirement to predict the results of future developments in the model design.

圖52描繪了用於運算5乘5卷積的四個3乘3核心處理塊4405之配置。並不參與此組態之所有元件自此圖中省略，並且藉由組態而停用。 Figure 52 depicts an arrangement of four 3 by 3 core processing blocks 4405 used to compute a 5 by 5 convolution. All elements that do not participate in this configuration are omitted from the figure and are disabled by configuration.

左上方位置中之核心處理塊經組態以對輸入串流之5乘5小塊之左上方3乘3個值運算3乘3卷積，而右上方核心處理塊經組態以對5乘5小塊之右上方2乘3個值運算2乘3卷積。 The core processing block in the upper-left position is configured to compute a 3 by 3 convolution over the upper-left 3 by 3 values of each 5 by 5 patch of the input stream, while the upper-right core processing block is configured to compute a 2 by 3 convolution over the upper-right 2 by 3 values of the 5 by 5 patch.

左下方核心處理塊經組態以對左下方3乘2個值運算3乘2卷積，並且右下方核心處理塊經組態以對5乘5小塊之右下方2乘2個值運算2乘2卷積。 The lower-left core processing block is configured to compute a 3 by 2 convolution over the lower-left 3 by 2 values, and the lower-right core processing block is configured to compute a 2 by 2 convolution over the lower-right 2 by 2 values of the 5 by 5 patch.

由於塊之左行將在對應於在塊之右行將產生最終結果之前的二個輸入串流位置之時間產生最終結果，因此二個FIFO 4706經組態用於雙位置延遲。 Because the left column of blocks produces its final result two input stream positions earlier than the right column of blocks does, the two FIFOs 4706 are configured for a two-position delay.

上部列加法器4707之輸出係選自輸出匯流排4406且作為主要輸入路由至FIFO 4902。由於塊之上部列將在對應於在塊之下部列將完成之前的二個陣列寬度之時間處完成針對5乘5卷積之上部5乘3區段的運算，因此FIFO 4902必須經組態以將部分和延遲輸入串流位置之二個陣列寬度。 The output of the upper-row adder 4707 is selected from output bus 4406 and routed as the primary input to FIFO 4902. Because the upper row of blocks completes its computation for the upper 5 by 3 section of the 5 by 5 convolution two array widths earlier than the lower row of blocks completes, FIFO 4902 must be configured to delay the partial sums by two array widths of input stream positions.

下部列加法器4707之輸出係選自輸出匯流排4406且作為輔助輸入路由至加法器4904，其中該輸出與來自上部列之經延遲部分和組合以產生5乘5卷積之最終結果。其隨後傳遞通過其中應用啟動函數之查找表4906。 The output of the lower-row adder 4707 is selected from output bus 4406 and routed as the auxiliary input to adder 4904, where it is combined with the delayed partial sum from the upper row to produce the final result of the 5 by 5 convolution. The result then passes through lookup table 4906, where the activation function is applied.
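That the four partial convolutions add up to the full 5 by 5 result can be confirmed with a small calculation. The sketch below models only the arithmetic for a single channel at one window position, not the FIFO timing; the function names `window_sum` and `conv5x5_from_tiles` are hypothetical.

```python
def window_sum(image, kernel, top, left, r0, r1, c0, c1):
    """Sum kernel[r][c] * image[top + r][left + c] over rows r0..r1-1
    and columns c0..c1-1 of the kernel."""
    return sum(kernel[r][c] * image[top + r][left + c]
               for r in range(r0, r1) for c in range(c0, c1))

def conv5x5_from_tiles(image, kernel, top, left):
    """Build one 5 by 5 result from four partial sums, one per block."""
    upper = (window_sum(image, kernel, top, left, 0, 3, 0, 3)     # upper left, 3 by 3
             + window_sum(image, kernel, top, left, 0, 3, 3, 5))  # upper right, 2 wide by 3 high
    lower = (window_sum(image, kernel, top, left, 3, 5, 0, 3)     # lower left, 3 wide by 2 high
             + window_sum(image, kernel, top, left, 3, 5, 3, 5))  # lower right, 2 by 2
    # In the hardware the upper sum is delayed so that both arrive together
    # at the auxiliary adder; here they are simply added.
    return upper + lower

image = [[(3 * r + 5 * c) % 11 for c in range(6)] for r in range(6)]
kernel = [[r - c for c in range(5)] for r in range(5)]
tiled = conv5x5_from_tiles(image, kernel, top=1, left=0)
direct = window_sum(image, kernel, 1, 0, 0, 5, 0, 5)
print(tiled == direct)
```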

圖53描繪了以類似方式配置以實施7乘7核心的九個3乘3核心處理塊之配置。由於核心處理塊之間的時序差與先前實例中之時序差相同，因此FIFO經組態用於相同延遲。左側上之輔助加法器4904組合來自前二列之部分結果，並且右側上之加法器4904將彼等值與來自下部列之部分結果組合且將其作為最終結果轉發至查找表。 Figure 53 depicts nine 3 by 3 core processing blocks arranged in a similar manner to implement a 7 by 7 kernel. Because the timing differences between core processing blocks are the same as in the previous example, the FIFOs are configured for the same delays. The auxiliary adder 4904 on the left combines the partial results from the first two rows, and the adder 4904 on the right combines that value with the partial results from the bottom row and forwards it to the lookup table as the final result.

熟習此項技術者應理解，任何任意卷積核心大小可使用部分停用之元素及適當延遲的類似配置由任何核心處理塊核心大小構成。此方法與圖45中所示之塊之配置的組合允許具有任何數目個輸入及輸出通道之任何核心之組態。 Those skilled in the art will appreciate that any arbitrary convolution kernel size can be constructed from any core processing block kernel size using a similar arrangement of partially disabled elements and appropriate delays. This approach, combined with the arrangements of blocks shown in Figure 45, allows configuration of any kernel with any number of input and output channels.

各列塊中包括多工器4404具體而言以提供可橫跨配置於ASIC內之多個塊群組的大卷積。雖然預期重複核心處理塊之規則列在某些情形下可為最佳的，但本發明之操作決不限於規則配置。當設計本發明之實施ASIC時，可選擇核心處理塊大小以及塊之分組的任何混合來反映特定模型之需要，或可選擇統計判定之混合來涵蓋單個ASIC晶粒內之最寬模型範圍。若模型要求比單個ASIC之塊可適應之資源更多的資源，則多個ASIC可鏈接在一起以進一步擴展覆蓋任意較大模型之能力。 Multiplexer 4404 is included in each row of blocks specifically to support large convolutions that span multiple groups of blocks arranged within the ASIC. Although regular rows of repeated core processing blocks are expected to be optimal in some situations, operation of the invention is in no way limited to regular arrangements. When designing an ASIC that implements the invention, any mix of core processing block sizes and block groupings may be chosen to reflect the needs of a particular model, or a statistically determined mix may be chosen to cover the widest range of models within a single ASIC die. If a model requires more resources than the blocks of a single ASIC can supply, multiple ASICs can be chained together to extend coverage to arbitrarily large models.

圖54繪示了功能上完整的卷積神經網路之抽象實例。輸入5401在吾人之示例性實施方式中將RGB像素資料提供至16個並行連接中之3個並行連接；不使用其他13個並行連接。 Figure 54 illustrates an abstract example of a functionally complete convolutional neural network. In our exemplary implementation, input 5401 provides RGB pixel data on 3 of the 16 parallel connections; the other 13 parallel connections are unused.

卷積5402為具有RELU啟動函數之3輸入、32輸出、7乘7卷積，其在此實例中藉由九個3乘3核心處理塊之二個不同組及二個輔助塊實施以產生總共32個輸出資料通道。RELU啟動函數係藉由最終輔助塊之查找表針對各組16個並行輸出連接實施。 Convolution 5402 is a 3-input, 32-output, 7x7 convolution with a RELU activation function, which in this example is implemented by two different groups of nine 3x3 core processing blocks and two auxiliary blocks to produce a total of 32 output data channels. The RELU activation function is implemented by a lookup table in the final auxiliary block for each group of 16 parallel output connections.

取樣5403為卷積輸出之2乘2子取樣，此為ResNet及YOLO模型變化中發現之典型配置。可在7乘7卷積內所包含之最終輔助塊中之各者中選擇取樣函數，因此無需額外塊。取樣函數之輸出以每位置四分之一處理速率產生四分之一面積之陣列串流。在下一值呈現於並行輸出連接上之前，各輸出值在四個串流處理循環內保持恆定。此為下游節點中之經時間多工資料之四個通道提供機會。 Sample 5403 is a 2 by 2 sub-sampling of the convolution output, a typical configuration found in ResNet and YOLO model variations. The sampling function can be selected in each of the final auxiliary blocks contained within the 7 by 7 convolution, so no additional blocks are required. The output of the sampling function produces a stream of arrays of one quarter area at one quarter the processing rate per position. Each output value remains constant for four stream processing cycles before the next value appears on the parallel output connection. This provides an opportunity for four channels of time-multiplexed data in downstream nodes.

卷積5404為具有RELU啟動函數之32輸入、64輸出、5乘5卷積。由於32個輸入通道經呈現為至多工器之二組16個並行輸入連接，因此二組四個3乘3核心處理塊成群在一起以將各個別輸出通道之權重應用於輸入。但由於輸入以四分之一串流處理速率呈現，因此可運算四組輸出通道值且將其置放於單組16個並行輸出連接上，各並行輸出連接攜載經時間多工資料。 Convolution 5404 is a 32-input, 64-output, 5x5 convolution with a RELU activation function. Since the 32 input channels are presented as two sets of 16 parallel input connections to the multiplexer, two sets of four 3x3 core processing blocks are grouped together to apply the weights of each individual output channel to the input. But since the input is presented at one-quarter the streaming rate, four sets of output channel values can be computed and placed on a single set of 16 parallel output connections, each carrying the time-multiplexed data.

MaxPool 5405為64輸入、64輸出、2乘2最大值函數，其經組態以在 16個並行輸入連接中之各者上接受並處理四個經時間多工值。輸出為在二位置寬及二位置高之小塊內發現的各個別通道之最大值。輸出經置放於單組16個並行輸出連接上，並且在下一值呈現之前，各值在四個串流處理循環內保持恆定。僅需要單個輔助塊以處理經時間多工資料之所有64個通道。此為下游節點中之經時間多工資料之十六個通道提供機會。 MaxPool 5405 is a 64-input, 64-output, 2-by-2 maximum function configured to accept and process four time-multiplexed values on each of the 16 parallel input connections. The output is the maximum value of each individual channel found in a block that is two places wide and two places high. The output is placed on a single set of 16 parallel output connections, and each value remains constant for four stream processing cycles before the next value is presented. Only a single auxiliary block is required to process all 64 channels of time-multiplexed data. This provides the opportunity for sixteen channels of time-multiplexed data in downstream nodes.

卷積5406為具有S型啟動函數之64輸入、128輸出、3乘3卷積，其在單組16個並行輸入連接上接受64個經時間多工輸入且在單組16個並行輸出連接上產生128個經時間多工輸出值。各輸出值經保持用於二個串流處理循環，而非傳入的四個串流處理循環。具有加載至該查找表中之適當值的輔助塊在資料通道上實施具有等效時序之S型啟動函數。 Convolution 5406 is a 64-input, 128-output, 3 by 3 convolution with a sigmoid activation function, which accepts 64 time-multiplexed inputs on a single set of 16 parallel input connections and produces 128 time-multiplexed output values on a single set of 16 parallel output connections. Each output value is held for two stream processing cycles rather than the four of the incoming values. An auxiliary block with appropriate values loaded into its lookup table implements the sigmoid activation function on the data channels with equivalent timing.

MaxPool 5407為128輸入、128輸出、2乘2最大值函數，其經組態以在16個並行輸入連接中之各者上接受並處理八個經時間多工值。輸出為在二位置寬及二位置高之小塊內發現的各個別通道之最大值。輸出經置放於單組16個並行輸出連接上，並且在下一值呈現之前，各值在八個串流處理循環內保持恆定。僅需要單個輔助塊以處理經時間多工資料之所有128個通道。此為下游節點中之經時間多工資料之六十四個通道提供機會。 MaxPool 5407 is a 128-input, 128-output, 2-by-2 maximum function configured to accept and process eight time-multiplexed values on each of the 16 parallel input connections. The output is the maximum value of each individual channel found in a block that is two places wide and two places high. The output is placed on a single set of 16 parallel output connections, and each value remains constant for eight stream processing cycles before the next value is presented. Only a single auxiliary block is required to process all 128 channels of time-multiplexed data. This provides the opportunity for sixty-four channels of time-multiplexed data in downstream nodes.

卷積5408為具有任意啟動函數之128輸入、256輸出、3乘3卷積，其在單組16個並行輸入連接上接受128個經時間多工輸入且在單組16個並行輸出連接上產生256個經時間多工輸出值。各輸出值經保持用於四個串流處理循環，而非傳入的八個串流處理循環。具有加載至該查找表中之適當值的輔助塊在資料通道上實施具有等效時序之任意啟動函數。 Convolution 5408 is a 128-input, 256-output, 3 by 3 convolution with an arbitrary activation function, which accepts 128 time-multiplexed inputs on a single set of 16 parallel input connections and produces 256 time-multiplexed output values on a single set of 16 parallel output connections. Each output value is held for four stream processing cycles rather than the eight of the incoming values. An auxiliary block with appropriate values loaded into its lookup table implements the arbitrary activation function on the data channels with equivalent timing.

Expansion 5409 is a 256-input, 256-output, 2-by-2 expansion that repeats each input value twice on each of two rows. The 256 time-multiplexed output channels are placed on a single set of 16 parallel output connections, but each repeated value is held constant for only one stream processing cycle, rather than the four cycles of the incoming values. This function is needed to make the output array size compatible with the output of convolution 5410, described below, so that the concatenation of multiple channels at equivalent positions can be implemented as the model defines. This reduces the opportunity for time-multiplexing in downstream nodes to sixteen channels of data.
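
A minimal sketch of the 2-by-2 expansion, assuming an `array[row][col][channel]` layout chosen for illustration:

```python
def expand_2x2(array):
    """Repeat every position twice along its row and emit each row twice.

    array is indexed as array[row][col][channel].
    """
    expanded = []
    for row in array:
        wide = [list(pos) for pos in row for _ in range(2)]
        expanded.append(wide)
        expanded.append([list(pos) for pos in wide])
    return expanded


small = [[[1], [2]],
         [[3], [4]]]
for row in expand_2x2(small):
    print(row)
# [[1], [1], [2], [2]]
# [[1], [1], [2], [2]]
# [[3], [3], [4], [4]]
# [[3], [3], [4], [4]]
```

The output has four times as many positions as the input, so each position receives one quarter of the cycles. With 256 channels on 16 connections (16 values per connection), this is consistent with the hold time dropping from four cycles to one.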

Convolution 5410 is a 128-input, 256-output, 3-by-3 convolution that accepts 128 time-multiplexed inputs on a single set of 16 parallel input connections and produces 256 time-multiplexed output values on a single set of 16 parallel output connections. Each output value is held for only one stream processing cycle, rather than the two cycles of the incoming values. An auxiliary tile with appropriate values loaded into its lookup table implements the activation function on the data channels with equivalent timing.

Concatenation 5411 is presented with two different streams of 256 input channels each, with compatible timing and array stream positions, so that concatenation into a single set of 512 output channels is feasible. Because each value on the two sets of 16 parallel connections is held for only one stream processing cycle, further multiplexing is neither needed nor feasible. The concatenation function is therefore represented in the circuit solely by the arrangement of the time-multiplexed data on the bus connections, and no auxiliary tile is needed in this case.

MaxPool 5412 is a 512-input, 512-output, 2-by-2 maximum function configured to accept and process 16 time-multiplexed values on each connection of two sets of 16 parallel input connections. Because the incoming stream is presented on two different sets of parallel input connections, the output is initially placed on two sets of 16 parallel output connections, and each value is held constant for four stream processing cycles before the next value is presented. These two sets of 16 parallel output connections are then routed into another auxiliary tile, where they pass through a time multiplexer section to produce 512 data channels carried by a single set of 16 parallel output connections, with each data value held for two stream processing cycles rather than four.
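
The effect of the time multiplexer section can be sketched for one connection from each set. The ordering used here (all values of the first set, then all values of the second) is an assumption; the paragraph does not say how the two sets are ordered on the merged connection. The function names are illustrative.

```python
def expand_schedule(values, hold):
    """Per-cycle signal on one connection: each value repeated hold times."""
    return [v for v in values for _ in range(hold)]


def merge_two_sets(values_a, values_b, hold_in):
    """Carry two connections' values on one connection at half the hold time.

    The ordering (all of set A, then all of set B) is an assumption; the
    text does not specify it.
    """
    if hold_in % 2:
        raise ValueError("incoming hold time must be even")
    return expand_schedule(list(values_a) + list(values_b), hold_in // 2)


# One connection from each set: 16 values held 4 cycles = 64 cycles per position.
a = [f"a{i}" for i in range(16)]
b = [f"b{i}" for i in range(16)]
merged = merge_two_sets(a, b, hold_in=4)

print(len(expand_schedule(a, 4)))  # 64
print(len(merged))                 # 64
print(merged[:4])                  # ['a0', 'a0', 'a1', 'a1']
```

The merged connection occupies the same 64 cycles per position but carries 32 values, each held two cycles, which matches the figures in the paragraph.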

Convolution 5413 is a 512-input, 1024-output, 1-by-1 convolution that accepts 512 time-multiplexed inputs on a single set of 16 parallel input connections and produces 1024 time-multiplexed output values on a single set of 16 parallel output connections. Each output value is held for only one stream processing cycle, rather than the two cycles of the incoming values. The 1-by-1 convolution is implemented by using only the lower-right output summing cell of the nine cells of a 3-by-3 kernel processing tile.
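
The equivalence between a 1-by-1 convolution and a 3-by-3 computation in which only one cell contributes can be checked in software. The sketch below models the unused cells as zero weights and uses a single channel for brevity; a multi-channel 1-by-1 convolution would additionally sum over input channels. Aligning the lower-right cell with the current position is an assumption about the streaming order, not something the text states.

```python
def conv3x3_trailing(image, kernel):
    """3-by-3 convolution over the window that ends at each position.

    image[row][col] is a single channel; positions outside the image are
    treated as zero. Aligning the lower-right kernel cell with the current
    position is an assumption about the streaming order.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for kr in range(3):
                for kc in range(3):
                    rr, cc = r - 2 + kr, c - 2 + kc
                    if rr >= 0 and cc >= 0:
                        total += kernel[kr][kc] * image[rr][cc]
            out[r][c] = total
    return out


def conv1x1(image, weight):
    """1-by-1 convolution of a single channel: scale every position."""
    return [[weight * v for v in row] for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]
only_lower_right = [[0, 0, 0],
                    [0, 0, 0],
                    [0, 0, 7]]
print(conv3x3_trailing(image, only_lower_right) == conv1x1(image, 7))  # True
```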

Average 5414 accepts 1024 channels of time-multiplexed data on a single set of 16 parallel input connections and is configured to compute the average of the input values over a patch size equal to the entire reduced array stream.
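
Averaging over a patch that covers the whole reduced array yields one value per channel. A minimal sketch, assuming an `array[row][col][channel]` layout chosen for illustration:

```python
def global_average(array):
    """Per-channel mean over every position of array[row][col][channel]."""
    positions = [pos for row in array for pos in row]
    count = len(positions)
    return [sum(vals) / count for vals in zip(*positions)]


reduced = [[[1.0, 10.0], [2.0, 20.0]],
           [[3.0, 30.0], [6.0, 40.0]]]
print(global_average(reduced))  # [3.0, 25.0]
```

In the model described here the result would contain 1024 values, one for each channel.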

The output 5415 of the model is delivered as 1024 time-multiplexed values on 16 external connections, representing the presence or absence of 1024 different classes.

This form of model is not intended as a robust example. Rather, it is presented to make use of many of the properties of the invention and to illustrate how the circuit works in practice.

Those skilled in the art will understand that the embodiments illustrated in the figures and described above are all exemplary and do not detail every form the invention may take. Many other forms may be realized within the scope of the invention.

The scope of the invention is limited only by the claims.

4401: Primary input
4402: Input bus
4403, 4404, 4407, 4409, 4412: Multiplexer
4405: Kernel processing tile
4406: Output bus
4408: Auxiliary function tile
4410: External function circuitry
4411: 16 parallel input connections
4413: Primary output circuitry

Claims

An application specific integrated circuit (ASIC) for computing a convolutional neural network (CNN), comprising:
a first input bus that receives an ordered stream of values from an array, each position in the array having one or more data channels;
a first ordered group of kernel processing tiles, the kernel processing tiles having a fixed number of parallel input connections and a fixed number of parallel output connections, each kernel processing tile of the first ordered group being coupled to the input bus through one of a first set of configurable multiplexers, the kernel processing tiles being adapted to compute a convolution of a common kernel size and to pass the computed value to a first output bus and to an adjacent downstream kernel processing tile of the first ordered group, the first output bus being connected back as an input to each configurable multiplexer in the first set of configurable multiplexers;
a second ordered group of kernel processing tiles, the kernel processing tiles having the fixed number of parallel input connections and the fixed number of parallel output connections, each kernel processing tile of the second ordered group being coupled to the first output bus through one of a second set of configurable multiplexers, the kernel processing tiles of the second ordered group being adapted to compute a convolution of the common kernel size and to pass the computed value to a second output bus and to an adjacent downstream kernel processing tile of the second ordered group, the second output bus being connected back as an input to each configurable multiplexer in the second set of configurable multiplexers; and
a third ordered group of kernel processing tiles, the kernel processing tiles having the fixed number of parallel input connections and the fixed number of parallel output connections, each kernel processing tile of the third ordered group being coupled to the second output bus through one of a third set of configurable multiplexers, the kernel processing tiles of the third ordered group being adapted to compute a convolution of the common kernel size and to pass the computed value to a third output bus and to an adjacent downstream kernel processing tile of the third ordered group, the third output bus being connected back as an input to each configurable multiplexer in the third set of configurable multiplexers, the third output bus also being connected to a primary output circuit through a single primary output multiplexer, the primary output circuit being adapted to perform primary output processing and provide final output.

The application specific integrated circuit of claim 1, further comprising: an additional configurable multiplexer coupled to the first output bus, which provides a selected value to the first kernel processing tile in the second ordered group; and an additional configurable multiplexer coupled to the second output bus, which provides a selected value to the first kernel processing tile in the third ordered group.

The application specific integrated circuit of claim 1, further comprising one or more auxiliary function tiles that provide functions other than the functions of the kernel processing tiles.

The application specific integrated circuit of claim 3, wherein the one or more auxiliary function tiles receive input from a dual multiplexer connected to the input bus and the first output bus and provide output to the first output bus.

The application specific integrated circuit of claim 3, wherein the one or more auxiliary function tiles receive input from a dual multiplexer connected to the first output bus and the second output bus and provide output to the second output bus.

The application specific integrated circuit of claim 3, wherein the one or more auxiliary function tiles receive input from a dual multiplexer connected to the second output bus and the third output bus and provide output to the third output bus.

The application specific integrated circuit of claim 1, further comprising external function circuitry that selects input from the first output bus through a configurable multiplexer and provides output to the first output bus.

The application specific integrated circuit of claim 7, further comprising external function circuitry that selects input from the second output bus through a configurable multiplexer and provides output to the second output bus.

The application specific integrated circuit of claim 8, further comprising external function circuitry that selects input from the third output bus through a configurable multiplexer and provides output to the third output bus.

The application specific integrated circuit of claim 1, wherein the common kernel size is a 3-by-3 kernel.

The application specific integrated circuit of claim 1, wherein the fixed number of parallel input connections is 16 and the fixed number of parallel output connections is 16.

The application specific integrated circuit of claim 1, wherein the ordered stream of values is provided by one of a direct camera output of RGB values, a DMA interface suitable for access from a CPU bus, or a video stream decompression circuit, thereby producing three parallel channels of red, green and blue (RGB) values for an image.

The application specific integrated circuit of claim 11, wherein the operations of two or more 3-by-3 kernel processing tiles having 16 parallel input and output connections are combined to configure a 3-by-3 kernel having more than 16 inputs, more than 16 outputs, or more than 16 inputs and outputs.

The application specific integrated circuit of claim 10, adapted by additional circuitry to compute a convolution having a kernel larger than 3 by 3 by combining the operations of a plurality of the 3-by-3 kernel processing tiles.

The application specific integrated circuit of claim 14, wherein the application specific integrated circuit is adapted to compute a 5-by-5 convolution, a 7-by-7 convolution or a 9-by-9 convolution.

The application specific integrated circuit of claim 1, wherein the kernel processing tiles present each input channel value to a mass multiplier that computes the full set of possible multiples of the input and provides those multiples, together with single-channel values from an auxiliary set of parallel connections, to individual convolution units, the resulting outputs from each convolution unit being grouped into a set of 16 parallel output connections and made available on the output bus to other kernel processing tiles.

The application specific integrated circuit of claim 1, wherein the input values of the kernel processing tiles are processed by a local dual-input fixed multiplier.

The application specific integrated circuit of claim 3, wherein each of the auxiliary function tiles receives parallel input connections from two independent buses through a dual multiplexer and outputs, as selected by an output multiplexer, one of a MaxPool function, an averaging function, a sampling function and an expansion function.

The application specific integrated circuit of claim 18, wherein the auxiliary function tile sums the values from the parallel input connections channel by channel and multiplexes the summed values into a lookup table, the lookup table being suitable for providing any activation function that can be represented in tabular form, including a ReLU, sigmoid or hyperbolic tangent activation function.

The application specific integrated circuit of claim 18, wherein the input channels received from independent parallel input connections are input to a first dedicated multiplexer, which concatenates the parallel input connections to effectively reroute the data channels to specific parallel output connections and provides the concatenated output to the output multiplexer as a candidate for selection as an output of the auxiliary function tile.

The application specific integrated circuit of claim 18, wherein the input channels received from independent parallel input connections are input to a second dedicated multiplexer, which merges the two parallel input connections into a single parallel output connection by alternating samples from each connection and provides the result to the output multiplexer as a candidate for selection as an output of the auxiliary function tile.