200923803

IX. Description of the Invention:

[Technical Field to Which the Invention Pertains]

The present invention relates to hardware neural-network recall architectures, and more particularly to a hardware neural-network architecture that provides both a recall (Recall) function and a learning (Learning) function.

[Prior Art]

Neural-network techniques now play an important role in many fields of artificial intelligence. In most applications, the neural network is implemented in software on a general-purpose computer; this approach is flexible, but the computation is time-consuming, and the learning process can only be performed off-line (Off-Line), which hinders the expansion of the range of applications. Neural networks require a large amount of mathematical computation, so systems implemented in software run smoothly only on high-speed computers and cannot be applied in low-end embedded systems. With the development of technology, attempts have been made to implement neural networks in hardware in order to raise speed and efficiency. Some hardware designs can only target a specific network architecture and set of parameters, so development time is long and portability is limited; other designs use a large number of logic elements, occupy a large chip area, and are costly.

Known patents relating to neural-network techniques are as follows:

1. U.S. Patent No. 5,087,826: the proposed architecture uses one multiplier for every neuron connection to compute the product of the input and the weight (i·w), forming a two-dimensional array. Computation is fast, but the design consumes a large amount of hardware and produces a large number of buses, which is unfavorable for design.

2. CNAPS (Dan Hammerstrom, "A VLSI architecture for high-performance, low-cost, on-chip learning," Proceedings of the International Joint Conference on Neural Networks, 1990, pp. 537-544; Dan Hammerstrom, "Digital VLSI for Neural Networks," in The Handbook of Brain Theory and Neural Networks, Second Edition, Michael Arbib (ed.), MIT Press, 2003): the advantage of CNAPS is that every processing node has a built-in adder and multiplier and its operation is controlled by instructions on an instruction bus. Each node amounts to a simple arithmetic unit (Arithmetic Unit), so the architecture is very flexible for computing the algorithms of different neural networks; and because all nodes share the same architecture, the number of nodes is easy to adjust when the architecture is applied to neural-network hardware. Unfortunately, the control instructions it uses are very complicated, so instructions must be compiled with supporting software; there is no hardware dedicated to the activation-function computation; and the general-purpose nature of the architecture sacrifices computation speed.
3. U.S. Patent No. 5,091,864: the proposed architecture is not as flexible as CNAPS, but it is leaner and easier to design and use, which not only raises computation speed but also lowers cost, while simplifying the control unit and shortening the development cycle. Its input data is passed serially: data first enters the first processing unit and only reaches the second processing unit two cycles later, so data arrives at the various processing units at different times, which adds to the difficulty of the control-unit design. Its more attractive feature is that it reduces the number of activation-function units. Taking the characteristics of the array into account, data is input and output one item per clock cycle, so the processing units never need to evaluate the activation function simultaneously; the activation function is therefore taken out of the neuron and placed independently on the return path of the array, where a single activation-function unit completes the computation without delaying it. The architecture also provides a group of shift registers that store the finished results and then pass them back one by one; while the results are being transferred, all processing units can immediately begin computing the next data, making full use of the time. The drawback of this architecture is that it has no learning mechanism; it can only perform the recall operation for a network whose training is already complete.
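To make the recall-only operation of such serial array architectures concrete, the following is a minimal behavioral sketch in Python. It is only an illustration of the scheme described above, not code from the cited patent: a sigmoid is assumed as the activation function, and all names are ours.

    import math

    def sigmoid(v):
        # The single shared activation unit on the array's return path.
        return 1.0 / (1.0 + math.exp(-v))

    def recall_layer(inputs, weights):
        # One layer of recall on an array of PEs, one PE per neuron.
        # weights[j][i] is the weight from input i to neuron j.
        acc = [0.0] * len(weights)            # one accumulator per PE
        for i, x in enumerate(inputs):        # one clock tick per input item
            for j in range(len(weights)):
                acc[j] += weights[j][i] * x   # all PEs multiply-accumulate
        # Results then shift out one per clock through the single activation
        # unit; meanwhile the PEs could already start on the next data.
        return [sigmoid(a) for a in acc]

    print(recall_layer([1.0, 0.5], [[0.2, 0.4], [0.1, 0.3]]))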
4. U.S. Patent No. 5,799,134: the proposed architecture is similar in concept to U.S. Patent No. 5,091,864, but the input data is connected in parallel, i.e., all processing units receive the same input signal at the same moment. By adding a subtractor and a multiplexer to the processing unit, the variety of available operations is increased. However, it likewise has no mechanism for the learning part and can only perform the recall function of a neural network.

It can be seen from the above that the conventional techniques still have many deficiencies and shortcomings; they are far from ideal designs and urgently need improvement. In particular, the conventional techniques provide only the recall function, while the learning function must still be completed through a host computer.

In view of the shortcomings and deficiencies derived from the above conventional techniques, the inventor of the present case sought to improve and innovate, and after years of dedicated research finally succeeded in developing the present hardware neural-network learning and recall architecture.

[Summary of the Invention]

The object of the present invention is to provide a hardware neural-network learning and recall architecture, namely a neural-network architecture that possesses both the recall function and the learning function.

The hardware neural-network learning and recall architecture that achieves the above object is composed of process units (Process Unit, PE), an activation function (Activation Function), a control bus (Control Bus),
an input data bus (Input Data Bus), a weight data bus (Weight Data Bus),
an address bus (Address Bus), a learning block (Learning Block), a control unit (Control Unit), and a multiplexer (Mux). Through a ring-connected serial multiple-data-bus architecture, it performs the operations of a back-propagation neural network, giving it the complete functions of both recall and learning. The user can adjust the number of process units in the array according to the back-propagation architecture at hand, without re-planning and redesigning the whole system. Through such a design and development, the application of neural networks can be extended into low-end embedded systems, enabling a new generation of applications. The present invention improves on previous neural-network hardware architectures, achieving better execution performance while using a smaller number of logic elements and retaining flexibility.
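Purely as an orientation to how the listed components relate to one another, a minimal structural sketch in Python follows. All class and attribute names are ours, not the patent's, and the buses are modeled as plain values rather than shared wires.

    class ProcessUnit:
        # PE: multiplier, accumulator, and two local weight memories.
        def __init__(self):
            self.weight_stack = []   # forward-pass weights (last in, first out)
            self.weight_fifo = []    # backward-pass weights (first in, first out)
            self.register = 0.0      # result register used for the shift-out

    class HardwareNNArchitecture:
        # Structural skeleton of the claimed composition (illustrative only).
        def __init__(self, num_pes):
            self.pe_array = [ProcessUnit() for _ in range(num_pes)]  # 1-D array
            self.control_bus = 0        # the same control word reaches every PE
            self.input_data_bus = 0.0   # layer inputs x; delta values backward
            self.weight_data_bus = 0.0  # weights only; independent of input bus
            self.address_bus = 0        # names the PE a weight is destined for
            self.learning_block = {}    # back-propagation learning computations
            self.control_unit = {}      # sequences the overall flow
            self.mux = None             # selects each unit's input source

    arch = HardwareNNArchitecture(num_pes=8)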
[Embodiments]

Please refer to FIG. 1, FIG. 2, and FIG. 3, which are respectively a schematic diagram of the implementation architecture of the hardware neural-network learning and recall architecture of the present invention, a schematic diagram of the node represented by a process unit, and a schematic diagram of the ring architecture. As the figures show, the hardware neural-network learning and recall architecture 1 of the present invention is composed of process units (Process Unit, PE) 11, an activation function (Activation Function) 12, a control bus (Control Bus) 13, an input data bus (Input Data Bus) 14, a weight data bus (Weight Data Bus) 15, an address bus (Address Bus) 16, a learning block (Learning Block) 17, a control unit (Control Unit) 18, and a multiplexer (Mux) 19. The present invention takes a ring-connected serial multiple-data-bus architecture (Single Instruction-Bus Multiple Data-Bus, SIMD) as its basis: all process units 11 are linked into a one-dimensional process-unit (PE) array 2, and all process units 11 are connected to the same control bus 13 and perform the same operation at the same moment. The input data bus 14 carries the input values x of each layer of the neural network as well as the δ values of the backward pass. Because the hardware is shared, the backward pass of the present invention also uses the process-unit array 2, and the input data bus 14 is still needed for computation while the corrected weight values are being produced; an independent weight data bus 15 is therefore used to store the weight values into the process-unit array 2 without delaying the computation. If the weight memories of all the process units 11 were connected directly to the control unit 18, the number of buses would grow as process units were added; with the single weight data bus 15 of the present invention, all the weight values can be stored into the process units 11 before computation begins. Because each weight value must be stored in the appropriate process unit 11 and the array shares this one weight data bus, the process units are numbered, and the address bus 16 gives the address of the process unit 11 for which the weight value currently on the weight data bus is intended. The address refers to a process unit as a whole, not to some segment of memory inside the process unit, so only a few address lines are required, which helps simplify the design and lower the cost.

Because the back-propagation neural network is a multilayer architecture in which every layer other than the input layer is computed after the preceding layer, and every layer follows the same computation pattern, the same process units (PE) can be used layer after layer. That is, the process unit that computes the first node of the hidden layer is also used to compute the first node of the next layer, with the network node represented by each process unit changing accordingly. Only as many process units as the largest single hidden layer are therefore needed to compute a multilayer back-propagation neural network, which greatly reduces the amount of hardware used.
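As a sketch of how the shared one-dimensional array computes a layer, consider the following Python fragment. It is illustrative only: each PE's local weight memory is modeled as a plain list (the stack and FIFO organization of that memory is described below), and all function names are ours.

    def load_weights(pe_weight_mem, weight_stream):
        # The address bus names a whole PE, not a location inside it; each
        # PE simply stores arriving weights in its local memory in order.
        for pe_address, w in weight_stream:
            pe_weight_mem[pe_address].append(w)

    def run_layer(pe_weight_mem, layer_inputs):
        # SIMD step: each clock, one input is broadcast on the input data
        # bus and every PE multiply-accumulates in lockstep.
        acc = [0.0] * len(pe_weight_mem)
        for x in layer_inputs:
            for j, mem in enumerate(pe_weight_mem):
                acc[j] += mem.pop(0) * x
        return acc

    pe_mem = [[] for _ in range(3)]   # three PEs: enough for the widest layer
    stream = [(j, 0.1 * (i + j)) for i in range(2) for j in range(3)]
    load_weights(pe_mem, stream)
    print(run_layer(pe_mem, [1.0, 0.5]))   # 2 inputs -> 3 node sums

The same three PEs would then be reused for the following layer, which is why only as many PEs as the widest layer are required.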
After the process-unit array finishes computing a single layer, the results are used in the computation of the next layer. The present invention therefore sends the computation results directly onto the input data bus for the next layer's computation, forming a ring architecture 3, as shown in FIG. 3. Because the data can be fed directly back into the computation without further processing by the control unit 18, the computation time is shortened.

Please refer to FIG. 4, a schematic diagram of the process-unit model of the present invention. Since the present invention must perform the operations of a back-propagation neural network, and based on considerations of speed and component count, the process unit 4 of FIG. 4 is adopted. With this design, the computation is completed simply by controlling the writes and reads of the stack-architecture (Stack) memory 411 and of the first-in first-out (FIFO) memory 412, together with the clearing of the accumulator's internal data. Moreover, because the buses among the stack memory 411, the FIFO memory 412, the multiplier (Multiplier) 42, and the accumulator (Accumulator) 43 are independent of one another, the unit is well suited to a pipelined organization: the computation is decomposed into several stages that can execute independently, and the data moves through the stages back to back, which greatly raises the computation throughput.

In the backward pass of the back-propagation neural network, the corrected weight values nearest the output layer are produced and computed first, whereas the forward pass uses the weight values nearest the input layer first. So that the corrected weight values computed by the learning process can be stored immediately into the corresponding process units in the correct order, the present invention places a stack-architecture memory 411 inside every process unit to hold the weight values, shortening the time spent waiting for weight updates. Since a stack requires only read/write control and no externally supplied memory address, the complexity of the control unit and the hardware cost are both reduced.

The backward pass likewise uses weight values in its computation, and the weight values nearer the output layer are used first. To make full use of the existing hardware, the process-unit array is shared to compute part of the formulas of the backward pass. Because the weights nearer the output layer are computed first during the backward pass, their order is completely different from that of the forward pass, and they belong to different process units, so the weight values held in the process-unit stacks cannot be used directly for this computation. A first-in first-out (FIFO) memory 412 is therefore also placed inside the process unit to hold the weight values for the backward pass.
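The opposite access orders of the two passes are the reason the process unit keeps two copies of its weights. The following minimal Python sketch (layer tags stand in for actual weight values; the names are ours) shows how a stack reverses the production order of the backward pass into the consumption order of the forward pass, while a FIFO preserves it:

    from collections import deque

    # The backward pass PRODUCES corrected weights from the output layer
    # back toward the input layer; the forward pass CONSUMES them from the
    # input layer toward the output layer.
    produced_backward = ["w_layer3", "w_layer2", "w_layer1"]

    stack = []         # stack memory 411: feeds the forward pass
    fifo = deque()     # FIFO memory 412: feeds the next backward pass
    for w in produced_backward:
        stack.append(w)
        fifo.append(w)

    forward_order = [stack.pop() for _ in range(3)]
    assert forward_order == ["w_layer1", "w_layer2", "w_layer3"]

    backward_order = [fifo.popleft() for _ in range(3)]
    assert backward_order == ["w_layer3", "w_layer2", "w_layer1"]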
Because the input data bus 44 can transfer only one datum at a time, the output of the previous layer need only be emitted one item per cycle. When the process units finish a layer's computation, each stores its result in its own register (Register) 45; the control unit then asserts the shift signal, whereupon the whole process-unit array behaves as one group of shift registers, passing the results forward one after another. The additional register 45 in the process unit 4 exists to separate the result of a computation from the computation in progress: while the results are passed out one by one over the bus, the process units can proceed with the next layer's computation unaffected, avoiding idle time.

Please refer to FIG. 5, a schematic diagram of the learning-block hardware architecture of the present invention. Because of the characteristics of the process-unit array, only one result is returned per cycle; apart from the shared portion, the remaining computations of the learning phase do not fit the SIMD architecture. The present invention therefore designs a hardware learning block 5, connected to the process-unit array, dedicated to the computations of the learning part of the back-propagation neural network. Because the design follows a dataflow concept, with data read out of and stored into the various stacks and queues one item at a time, a change in the number of neural-network nodes does not affect the internal architecture of the learning block 5, and the computation still proceeds smoothly. The computation flow inside the learning block 5 can be roughly divided into three parts: computing δ, computing Δw, and updating w. The circled numbers in the description below distinguish the order of the overall flow, and the values represented by the important signals are listed in Table 1; they are the quantities of the standard back-propagation computation, namely the back-propagated error sum Σ_k w_jk δ_k, the derivative term y_j (1 − y_j), the correction term η y_i δ_j, the momentum term α Δw_ji(t−1), and the weight update Δw_ji(t) = η y_i δ_j + α Δw_ji(t−1), with w_ji(t) = w_ji(t−1) + Δw_ji(t).
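As a worked numerical example of these signals for a single weight w_ji, assuming the standard sigmoid back-propagation update with a momentum term (η, α, and all values below are purely illustrative):

    eta, alpha = 0.5, 0.9      # learning rate and momentum factor

    y_i = 0.8                  # output of node i feeding node j
    y_j = 0.6                  # output of node j
    sum_w_delta = 0.25         # sum over k of w_jk * delta_k
    delta_j = y_j * (1 - y_j) * sum_w_delta   # hidden-layer delta = 0.06

    dw_prev = 0.02             # previous correction, held in the delta-w FIFO
    dw = eta * y_i * delta_j + alpha * dw_prev   # new correction = 0.042
    w_prev = 0.3
    w_new = w_prev + dw        # corrected weight = 0.342, written back over
                               # the weight data bus and into the w FIFO
    print(round(dw, 4), round(w_new, 4))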
During learning, the inputs and outputs of each layer of the neural network are stored into the layer input stacks (Layer Input Stack) 51 and the layer output stacks (Layer Output Stack) 52 inside the learning block 5 (flow ①); the input and output values of each layer are, respectively, the input values of the process-unit array and the output values computed by the activation function. When the forward pass is complete, the output values of the output layer are first read out of the layer output stacks and combined with the target values of the training sample to compute the δ values, which are stored into a queue; at the same time as each δ is produced, it is sent onto the input data bus so that the process-unit array computes Σ w·δ (flow ⑤). The results of that computation are passed into the learning block 5 for the subsequent computation of the next layer's δ (flow ⑥). Once the δ values have been stored into the queue, the input values of each layer are read out of the layer input stacks 51 and, at the appropriate moments of the computation, the previous corrections Δw are read out of the ΔwFIFO and combined with them, finally yielding the new Δw (flow ⑨). Combining the newly obtained Δw with the old weight value w gives the corrected weight value, which is stored into the wFIFO; as it is stored, it is also placed on the weight data bus, synchronously updating the corresponding weight value in the process-unit array.

In the forward pass, the inputs and outputs nearer the input layer are computed first, whereas in the backward pass the values nearer the output layer are used first. For this reason of ordering, the present invention uses two stacks to hold, respectively, the input and the output values of each layer of the forward pass for use during the backward pass. Furthermore, one δ can be computed per clock, but the subsequent operations need several clocks to consume each one; to keep the computation flow smooth, the δ computed in each clock is first stored into a queue, from which the subsequent operations read the values out one by one for their computation, reducing the complexity of the control unit.
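The flow just described can be summarized in the following Python sketch of one backward pass. It assumes sigmoid units and the standard back-propagation-with-momentum formulas; biases are omitted, the stacks and FIFOs are modeled as plain containers, and all names are ours rather than the patent's.

    from collections import deque

    def backward_pass(layer_inputs, layer_outputs, targets, w, dw_fifo,
                      eta=0.5, alpha=0.9):
        # layer_inputs[l]  : inputs to layer l   (the layer input stacks 51)
        # layer_outputs[l] : outputs of layer l  (the layer output stacks 52)
        # w[l][j][i]       : weight into node j of layer l from input i
        # dw_fifo          : previous corrections, oldest first (delta-w FIFO)
        y_out = layer_outputs[-1]
        deltas = [(t - y) * y * (1 - y) for y, t in zip(y_out, targets)]
        for l in reversed(range(len(w))):
            # The next layer's deltas use the still-uncorrected weights (the
            # copy the PE FIFOs hold), so form them before updating w[l].
            if l > 0:
                yp = layer_outputs[l - 1]
                next_deltas = [yp[i] * (1 - yp[i]) *
                               sum(w[l][j][i] * deltas[j]
                                   for j in range(len(deltas)))
                               for i in range(len(yp))]
            x = layer_inputs[l]
            for j, d in enumerate(deltas):
                for i, xi in enumerate(x):
                    dw = eta * xi * d + alpha * dw_fifo.popleft()
                    w[l][j][i] += dw          # corrected weight: also sent
                    dw_fifo.append(dw)        # over the weight data bus
            if l > 0:
                deltas = next_deltas
        return w

On the first pass the Δw FIFO is pre-loaded with zeros, one per weight, matching the initialization described below; because corrections are popped and re-appended in the same traversal order, the FIFO stays aligned across training iterations.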
During the computation, each time a layer is completed, all the values in the queue are read out and then stored back for the previous layer, so the queue size need only equal the maximum number of nodes in a single layer.

When the weight values are first initialized, randomly generated weight values are stored into the process-unit array, and the same values are also stored into the w first-in first-out memory (wFIFO) 53 of the learning block. During learning, as each corrected weight value is computed and stored into the wFIFO, it is also synchronously written over the weight data bus into the corresponding stack in the process-unit array. Thus, by the time the backward pass completes, the weight values held in the process-unit stacks have already been updated, and the forward pass can begin immediately; during the forward pass, the weight values are then read out of the wFIFO and written into the corresponding queues in the process-unit array for use by the backward pass.

The Δw first-in first-out memory (ΔwFIFO) 54 has all of its internal values cleared to 0 during system initialization, and in the subsequent computation it holds the correction amounts of the weight values.

To keep the control-unit design simple, data whose storage order matters is stored in stack or queue architectures. Some data, however, must be read repeatedly. To save time and simplify the control unit, so that the data need not be reloaded, some of these stacks and queues are given a save-address function: the current read address of the memory is retained, and when a restart signal arrives, the read pointer returns to the retained address and the data is read out again. The parts with this function include the stacks of the process units, the stacks in the learning block that hold the inputs of each layer, and the queues in the learning block that hold the weight values.

Please refer to FIG. 6, a schematic diagram of the overall flow of the control unit of the present invention. The purpose of the control unit is to control the entire flow, i.e., to send the control signals at the proper moments. Because all values are stored in queues and stacks, the control complexity is greatly simplified: the control unit only needs to control the access and clearing of the stacks and queues, the clearing of the accumulators, and the selection of each unit's input to complete the overall operation. When the control unit receives a start signal (one of three kinds: weight-value initialization, batch learning, or recall), it moves from the initial state into the next-level state and performs the action; when the specified action completes, it returns a done signal.

Since the forward pass uses only the weight values held in the process-unit stacks, and in order to shorten the time spent waiting for stores, the done signal is returned as soon as the stack update completes, and the forward pass can begin immediately. Because there is an independent weight data bus, the update of the PE queues can continue during the forward pass; before the backward pass begins, the control unit first checks whether the queue update has already completed, and waits for the update to finish before performing the backward pass.
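The save-address function described above amounts to a memory read out in queue order whose read pointer can be checkpointed and rolled back. A minimal sketch in Python (illustrative names, not taken from the patent):

    class ReReadableQueue:
        def __init__(self, data):
            self.mem = list(data)
            self.read_ptr = 0
            self.saved_ptr = 0

        def save_address(self):
            self.saved_ptr = self.read_ptr   # retain the current read address

        def restart(self):
            self.read_ptr = self.saved_ptr   # jump back and re-read from there

        def read(self):
            value = self.mem[self.read_ptr]
            self.read_ptr += 1
            return value

    q = ReReadableQueue([10, 20, 30])
    q.save_address()
    first = [q.read(), q.read()]    # [10, 20]
    q.restart()
    again = [q.read(), q.read()]    # [10, 20] again, without reloading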
The flow of the control unit can be roughly divided into the following stages:

(1) Initialization: executed only once, at the start of learning, to store the randomly generated initial weight values into the stacks and queues of the process-unit array and to perform the reset of each unit;

(2) Forward computation: once the weight values have been stored into the process-unit stacks, the process-unit array performs the forward pass of the back-propagation neural network, and the results are passed to the activation function to compute the outputs of each layer;

(3) Backward computation: when the forward pass is complete, the learning block performs the backward pass of the back-propagation neural network, whose ultimate purpose is to correct the weight values; it can be roughly divided into computing δ, computing Δw, and updating w, and the δ computation uses the process-unit array;

(4) Weight-value update: during the backward pass, the corrected weight values are stored into the stacks of the process-unit array as they are computed; once the stack update is done, the weight values in the queues are updated next;

(5) After the above stages, stages (2) through (4) are repeated with different training samples until learning is complete, at which point the training of the neural network ends.
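The sequencing of these stages can be sketched as a simple state machine in Python; the action names are illustrative stand-ins for the actual stack/queue access, clear, and multiplexer-select signals, and each sequence ends by returning the done signal described above.

    def control_unit(start_signal, num_samples=1):
        if start_signal == "init_weights":
            yield "load_random_weights_into_stacks_and_queues"   # stage (1)
            yield "reset_units"
        elif start_signal == "recall":
            yield "forward_pass"                                 # stage (2) only
        elif start_signal == "batch_learn":
            for _ in range(num_samples):                         # stages (2)-(4)
                yield "forward_pass"
                yield "wait_for_pe_fifo_update"    # guard before backward pass
                yield "backward_pass"
                yield "update_weight_stacks_then_queues"
        else:
            raise ValueError("unknown start signal")
        yield "done"                               # completion signal

    print(list(control_unit("batch_learn", num_samples=2)))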
The order of the flows is illustrated in FIG. 7. In the backward pass, once the first δ of the output layer has been computed, the δ values are passed one after another into the process-unit array to compute Σ w·δ; the δ of the preceding layer is then computed and passed into the process-unit array again, and so on until the δ of all the hidden layers have been computed.

Compared with other conventional techniques, the hardware neural-network learning and recall architecture of the present invention has the following advantages:

1. The hardware neural-network learning and recall architecture of the present invention possesses both the recall function and the learning function.

2. The hardware neural-network learning and recall architecture of the present invention lets the user adjust the number of process units in the array according to the back-propagation architecture at hand, without re-planning and redesigning the whole system; through such a design and development, the application of neural networks can be extended into low-end embedded systems, enabling a new generation of applications.
3. The hardware neural-network learning and recall architecture of the present invention improves on previous neural-network hardware architectures, achieving better execution performance while using a smaller number of logic elements and retaining flexibility.

4. The hardware neural-network learning and recall architecture of the present invention offers the advantages of simplified system complexity, wide applicability, low installation cost, and small size.

The above detailed description is a specific explanation of one feasible embodiment of the present invention; that embodiment is not intended to limit the patent scope of the present invention, and any equivalent implementation or modification that does not depart from the spirit of the art of the present invention shall be included within the patent scope of this case.

In summary, this case is not only innovative in its technical conception but also improves the above-mentioned functions over conventional articles, fully satisfying the statutory requirements of novelty and inventive step for an invention patent. The application is therefore filed in accordance with the law, and the Office is respectfully requested to grant this invention patent application.

[Brief Description of the Drawings]

FIG. 1 is a schematic diagram of the implementation architecture of the hardware neural-network learning and recall architecture of the present invention;
FIG. 2 is a schematic diagram of the node represented by a process unit of the present invention;
FIG. 3 is a schematic diagram of the ring architecture of the present invention;
FIG. 4 is a schematic diagram of the process-unit model of the present invention;
FIG. 5 is a schematic diagram of the learning-block hardware architecture of the present invention;
FIG. 6 is a schematic diagram of the overall flow of the control unit of the present invention; and
FIG. 7 is a schematic diagram of the hardware computation flow of the present invention.

[Description of Main Element Symbols]

1   hardware neural-network learning and recall architecture
11  process unit (Process Unit, PE)
12  activation function (Activation Function)
13  control bus (Control Bus)
14  input data bus (Input Data Bus)
15  weight data bus (Weight Data Bus)
16  address bus (Address Bus)
17  learning block (Learning Block)
18  control unit (Control Unit)
19  multiplexer (Mux)
2   process-unit array
3   ring architecture
4   process unit
411 stack (Stack) memory
412 first-in first-out (FIFO) memory
42  multiplier (Multiplier)
43  accumulator (Accumulator)
44  input data bus
45  register (Register)
5   learning block
51  layer input stack (Layer Input Stack)
52  layer output stack (Layer Output Stack)
53  w first-in first-out memory (wFIFO)
54  Δw first-in first-out memory (ΔwFIFO)