TWI902585B - Three-dimensional coarse-grained reconfigurable array architecture system and control method of three-dimensional coarse-grained reconfigurable array architecture system
- Publication number
- TWI902585B (application number TW113150647A)
- Authority
- TW
- Taiwan
- Prior art keywords
- computation
- target
- units
- task
- node
- Prior art date
Landscapes
- Multi Processors (AREA)
Abstract
Description
The present invention relates to a processor hardware architecture, and more particularly to a three-dimensional hardware architecture system based on a coarse-grained reconfigurable array (CGRA) processor and a control method thereof.
With the rapid development of artificial intelligence technology, especially in the fields of machine learning and deep learning, higher performance and flexibility requirements are being placed on processor hardware architectures. Traditional processor architectures, such as central processing units (CPUs) and graphics processing units (GPUs), may encounter performance bottlenecks and energy consumption problems when faced with large-scale parallel computing and frequent data access.
To address these challenges, researchers have proposed various domain-specific architectures (DSAs) and reconfigurable hardware architectures. Among them, coarse-grained reconfigurable architecture (CGRA) processors have attracted widespread attention due to their flexibility and energy efficiency advantages. A CGRA typically consists of a large number of processing elements (PEs) and an interconnection network, allowing the hardware configuration to be adjusted dynamically according to the needs of different applications and thus achieving efficient parallel computing.
However, most current CGRA architectures use a two-dimensional (2D) mesh topology, in which the interconnections between processing elements are relatively simple: data is mainly transferred between the four adjacent neighbors (up, down, left, and right). Such an architecture may suffer from inefficient data access and low utilization of processing elements when handling applications with complex data dependencies.
To address the aforementioned technical issues, the present disclosure proposes a three-dimensional coarse-grained reconfigurable array architecture system and a control method thereof. The disclosed architecture comprises alternately stacked computation arrays and storage arrays, and achieves more efficient neural network computation through vertical interconnections and a flexible resource allocation mechanism.
One or more embodiments of the present disclosure provide a three-dimensional coarse-grained reconfigurable array architecture system for implementing a neural network model. The three-dimensional coarse-grained reconfigurable array architecture system includes: a plurality of first arrays, wherein each first array includes: a plurality of first computation units for performing the neural network computation tasks of corresponding nodes of the neural network model; a plurality of first storage units for storing data of the corresponding neural network computation tasks, wherein in each first array the number of first computation units is greater than the number of first storage units; and a plurality of first switches for performing corresponding routing tasks, wherein the first switches are not directly connected to one another, and wherein the first computation units, the first storage units, and the first switches are distributed on the array plane of the first array; a plurality of second arrays, wherein each second array includes: a plurality of second computation units for performing the neural network computation tasks of corresponding nodes of the neural network model; a plurality of second storage units for storing data of the corresponding neural network computation tasks, wherein the number of second storage units is greater than the number of second computation units; and a plurality of second switches for performing corresponding routing tasks, wherein the second switches are not directly connected to one another, and wherein the second computation units, the second storage units, and the second switches are distributed on the array plane of the second array; an input/output interface for receiving input data and transmitting processing results; and a configuration controller electrically connected to the plurality of first arrays and the plurality of second arrays, wherein the first arrays and the second arrays are stacked alternately, such that the array vertically adjacent to each first array (in the direction perpendicular to its array plane) is a second array, and the array vertically adjacent to each second array is a first array.
The configuration controller is used to: monitor the working status of the plurality of first arrays and the plurality of second arrays; and dynamically manage the enabling and disabling of the plurality of first computation units, the plurality of first storage units, the plurality of first switches, the plurality of second storage units, the plurality of second computation units, and the plurality of second switches according to the computation graph information corresponding to the neural network model, so as to execute the plurality of neural network computation tasks of the neural network model.
One or more embodiments of the present disclosure provide a control method for a three-dimensional coarse-grained reconfigurable array architecture system, wherein the three-dimensional coarse-grained reconfigurable array architecture system is used to implement a neural network model. The method includes: monitoring, via a configuration controller, the working status of a plurality of first arrays and a plurality of second arrays of the three-dimensional coarse-grained reconfigurable array architecture system, wherein each first array includes: a plurality of first computation units for performing the neural network computation tasks of corresponding nodes of the neural network model; a plurality of first storage units for storing data of the corresponding neural network computation tasks, wherein in each first array the number of first computation units is greater than the number of first storage units; and a plurality of first switches for performing corresponding routing tasks, wherein the first switches are not directly connected to one another, and wherein the first computation units, the first storage units, and the first switches are distributed on the array plane of the first array; wherein each second array includes: a plurality of second computation units for performing the neural network computation tasks of corresponding nodes of the neural network model; a plurality of second storage units for storing data of the corresponding neural network computation tasks, wherein the number of second storage units is greater than the number of second computation units; and a plurality of second switches for performing corresponding routing tasks, wherein the second switches are not directly connected to one another, and wherein the second computation units, the second storage units, and the second switches are distributed on the array plane of the second array; wherein the configuration controller is electrically connected to the plurality of first arrays and the plurality of second arrays, and the first arrays and the second arrays are stacked alternately, such that the array vertically adjacent to each first array (in the direction perpendicular to its array plane) is a second array, and the array vertically adjacent to each second array is a first array; and dynamically managing, by the configuration controller and according to the computation graph information corresponding to the neural network model, the enabling and disabling of the first computation units, the first storage units, the first switches, the second storage units, the second computation units, and the second switches, so as to execute a plurality of neural network computation tasks of the neural network model.
Based on the above, the three-dimensional coarse-grained reconfigurable array architecture system and its control method provided by the present disclosure can achieve the following technical effects: (1) the three-dimensional stacked heterogeneous array architecture significantly shortens data transmission paths and improves data transmission efficiency; (2) the functional separation of computation arrays and storage arrays, together with the alternating stacking configuration, allows the system to allocate computing resources more efficiently according to the computational characteristics of the neural network model; and (3) the dynamic management mechanism of the configuration controller realizes an optimized configuration of computation units, storage units, and switches, effectively improving hardware resource utilization.
To make the above features and advantages of the present invention more apparent and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numerals are used in the drawings and the description to refer to the same or similar elements.
It should be understood that the terms "system" and "controller" used in this disclosure are often used interchangeably. The term "and/or" merely describes an association between related objects; for example, "A and/or B" may denote the cases A, B, A and B, or A or B. In addition, the character "/" in this disclosure generally indicates an "or" relationship between the associated objects.
Figure 1 is a block diagram of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure.
Referring to Figure 1, in one embodiment, the three-dimensional coarse-grained reconfigurable array architecture system 100 includes an input/output interface 130, a configuration controller 110, a plurality of first arrays 121, and a plurality of second arrays 122. The first arrays 121 and the second arrays 122 are stacked in an alternating manner to form a three-dimensional structure.
In one embodiment, the three-dimensional coarse-grained reconfigurable array architecture system 100 of the present invention can be implemented as the following specific semiconductor devices: a reconfigurable neural network processor, particularly suitable for edge computing devices, which can dynamically adjust its internal computing resources to handle neural network models of different sizes; an intelligent image processor for real-time image recognition and analysis scenarios, which can flexibly configure the use of its computation units and storage units according to the needs of the processing task; an AI accelerator optimized specifically for deep learning inference tasks, which greatly improves data access efficiency through the alternating stacking of heterogeneous arrays; a programmable tensor processor supporting various matrix and vector operations, suitable for both the training and inference phases of machine learning models; and an embedded system co-processor, serving as a computing auxiliary unit for a main processor, which can be custom-configured for specific application scenarios.
In one embodiment, the three-dimensional coarse-grained reconfigurable array architecture system 100 of the present invention can be integrated into the following electronic devices, for example: 1.) Smartphones and wearable devices. Because such devices must support diverse applications, such as games, photography, AI inference (image enhancement, voice assistants, etc.), and 5G communication, they can incorporate an accelerator chip based on the reconfigurable array architecture system to provide hardware acceleration for specific workloads (such as image processing and AI inference) while maintaining low power consumption. 2.) Unmanned aerial vehicles (UAVs). Unmanned vehicles must perform highly complex tasks, such as navigation control, object tracking, environmental perception, and image processing, and these computational requirements vary with the task type. The reconfigurable array architecture system can reconfigure its computing architecture according to task requirements, providing the necessary computing power while reducing energy consumption.
3.) Industrial automation controllers, used to process large amounts of sensor data in real time and execute complex control algorithms; 4.) Data center servers, especially computing servers for the training and inference of large-scale neural network models; 5.) Medical imaging diagnostic equipment, used for real-time analysis and diagnosis of high-resolution medical images; and 6.) Advanced driver-assistance systems (ADAS), integrated into in-vehicle computers for real-time data analysis and decision-making based on data from multiple sensors. It should be noted that the above electronic devices are merely examples and this disclosure is not limited thereto; the present disclosure is applicable to any electronic device in which the coarse-grained reconfigurable array architecture system 100 can be applied.
In this embodiment, the input/output interface 130 is used to receive external input data and output computation results. The input data may include parameters of the neural network model, computation instructions, data to be processed, or any data related to the neural network model. The computation results may include the output of a single node of the neural network computation or the final output of the entire neural network.
The configuration controller 110 is electrically connected to the input/output interface 130, the plurality of first arrays 121, and the plurality of second arrays 122. The configuration controller 110 is responsible for monitoring the working status of the first arrays 121 and the second arrays 122, and for dynamically managing the enabled and disabled states of the components in each first array 121 and each second array 122 based on the computation graph information of the corresponding neural network model.
In one embodiment (see Figures 4A-4D), each first array 121 includes a plurality of first computation units, a plurality of first storage units, and a plurality of first switches, which are distributed on the array plane of the first array. Specifically, the first computation units perform the neural network computation tasks of the corresponding nodes of the neural network model, and the first storage units store the data of the corresponding neural network computation tasks. It is worth noting that in this embodiment, the number of first computation units in each first array 121 is greater than the number of first storage units; this design is particularly suitable for handling computation-intensive tasks. In addition, the first switches perform the corresponding routing tasks, and the first switches are not directly connected to one another.
Accordingly, in this embodiment (see Figures 4A-4D), each second array 122 includes a plurality of second computation units, a plurality of second storage units, and a plurality of second switches, which are distributed on the array plane of the second array. Specifically, the second computation units perform the neural network computation tasks of the corresponding nodes of the neural network model, and the second storage units store the data of the corresponding neural network computation tasks. Unlike the first array 121, in the second array 122 the number of second storage units is greater than the number of second computation units; this design is particularly suitable for handling tasks that require large amounts of data storage. Similar to the first array 121, the second switches perform the corresponding routing tasks, and the second switches are not directly connected to one another, so as to optimize data transmission efficiency. In other words, in this embodiment, the first array may also be called a computation-enhanced array, with stronger overall node computation capability, and the second array may also be called a storage-enhanced array, with stronger overall data storage capability.
In one embodiment, a vertically interconnected data transmission mechanism is used between the first arrays 121 and the second arrays 122. Specifically, a first switch of each first array is vertically connected to a second computation unit or a second storage unit of the adjacent second array, while a second switch of each second array is vertically connected to a first computation unit or a first storage unit of the adjacent first array.
In this embodiment, within each first array 121, the first computation units and the first storage units are not directly connected to each other. Instead, each first computation unit and each first storage unit are connected through at least one intermediary first switch.
Similarly, within each second array 122, the second computation units and the second storage units are not directly connected to each other; each second computation unit and each second storage unit are likewise connected through at least one intermediary second switch. This design further strengthens the data transmission management capability within the array and makes the data transmission paths more diverse.
In the vertical direction, the present disclosure introduces a cross-layer vertical connection mechanism. A first switch of each first array 121 is vertically connected to a second computation unit or a second storage unit of the adjacent second array 122. This vertical connection allows first data of the first array 121 to be transmitted directly, via the first switch, to the second computation unit or second storage unit of the adjacent second array 122, achieving more efficient cross-layer transmission.
Similarly, a second switch of each second array 122 is vertically connected to a first storage unit or a first computation unit of the adjacent first array 121. Through this vertical connection, second data of the second array 122 can be transmitted directly, via the second switch, to the first computation unit or first storage unit of the adjacent first array 121, completing cross-layer data exchange.
It is worth mentioning that in this embodiment, during cross-layer data transmission the switch always acts as the initiator, actively sending the data to the computation unit or storage unit of the adjacent layer.
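To make the interconnection rules of the preceding paragraphs concrete, the following minimal Python sketch models which direct links are permitted; the Unit record and the coordinate model are illustrative assumptions of this rewrite, not the patent's actual circuit description:

```python
from dataclasses import dataclass

# Unit kinds used throughout the disclosure:
# P = computation unit, M = storage unit, S = switch.
P, M, S = "P", "M", "S"

@dataclass(frozen=True)
class Unit:
    kind: str   # "P", "M", or "S"
    layer: int  # index along the Z (stacking) axis

def may_link(a: Unit, b: Unit) -> bool:
    """Return True if a direct link between two units is allowed
    under the interconnection rules described above."""
    if a.layer == b.layer:
        # In-plane: P and M never connect directly, and switches do not
        # connect to each other; every in-plane link involves one switch.
        kinds = {a.kind, b.kind}
        return S in kinds and kinds != {S}
    if abs(a.layer - b.layer) == 1:
        # Cross-layer: a switch links vertically to a computation or
        # storage unit of the neighboring layer (the switch initiates).
        return (a.kind == S and b.kind in (P, M)) or (
            b.kind == S and a.kind in (P, M))
    return False  # no direct links between non-adjacent layers

assert not may_link(Unit(P, 0), Unit(M, 0))  # in-plane P-M needs a switch
assert may_link(Unit(S, 0), Unit(M, 1))      # vertical switch-to-storage link
```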
In one embodiment, the computation units (including the first computation units and the second computation units) can be implemented as reconfigurable processing units. Specifically, such a computation unit can dynamically adjust its computation mode according to different computational needs; for example, when performing matrix multiplication, the reconfigurable processing unit can be configured as a systolic array architecture, and when performing convolution operations, it can be reconfigured as a two-dimensional computation array architecture. Through this flexible configuration, the computation unit is particularly suitable for handling the different types of computation required in deep learning, including high-dimensional matrix multiplication, convolution operations, and vector operations.
In this embodiment, the storage units (including the first storage units and the second storage units) can be implemented as static random-access memory (SRAM). This type of memory has high-speed read and write characteristics, making it suitable for storing the data of neural network computation tasks, including weight parameters, intermediate computation results, and computation input data. Furthermore, a storage unit can be divided into multiple memory banks according to the data access pattern, and each memory bank can perform read and write operations independently; this design can effectively improve the parallelism of data access.
In another embodiment, the switches (including the first switches and the second switches) adopt a distributed routing architecture comprising multiple routing nodes. Each routing node has a data buffering function, which can temporarily store data packets to be forwarded and select the optimal transmission path based on destination information. Through this distributed routing architecture, the system can achieve more flexible data transmission scheduling and reduce data transmission conflicts. A switch can also be implemented as a multi-layer crossbar architecture, which can simultaneously support the transmission needs of multiple data streams. This architecture includes multiple crossbar layers, each responsible for data transmission in a specific direction. By properly configuring the connection states of the crossbars, the system can establish multiple independent data transmission channels, improving the parallelism of data transmission.
In one embodiment, a switch can handle data transmission according to the routing task set by the configuration controller 110. Specifically, the routing task includes: identification information of the destination computation unit or storage unit, priority information for the data transmission, size information of the data packets, and relay-point information for the data transmission path. Based on this routing information, the switch establishes the corresponding data transmission channel and ensures that data is delivered along the specified transmission path.
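As an illustration, the routing task delivered to a switch could be modeled as a small record such as the one below; the field names and types are assumptions of this sketch, not the encoding actually used by the patent:

```python
from dataclasses import dataclass, field

@dataclass
class RoutingTask:
    """Routing information a switch receives from configuration controller 110."""
    destination_id: str                 # target computation or storage unit
    priority: int                       # transmission priority level
    packet_size: int                    # data packet size, e.g. in bytes
    relay_points: list[str] = field(default_factory=list)  # intermediate hops

# The switch would use such a record to set up the corresponding channel:
task = RoutingTask(destination_id="P[2][1]", priority=2,
                   packet_size=4096, relay_points=["S[1][1]"])
```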
Figure 2 is a block diagram of the configuration controller of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure.
Referring to Figure 2, in this embodiment, the configuration controller 110 includes a workload configuration storage circuit unit 112 and a task control processor 111.
In this embodiment, the configuration controller 110 controls the transmission of data among the components through the following mechanism. First, the configuration controller 110 determines the data dependencies between nodes based on the computation graph information stored in the workload configuration storage circuit unit 112. For example, the task control processor 111 of the configuration controller 110 can build a data transmission plan that includes: the timing arrangement of data transmissions, the switch configuration sequence for each transmission path, and the access timing of each storage unit.
Specifically, when the computation task of a particular node needs to be executed, the configuration controller 110 first sends routing configuration commands to the corresponding switches to set their routing states. These routing states determine the forwarding directions of data packets in the switch network. At the same time, the configuration controller 110 sends read or write commands to the relevant storage units to control the data access timing. Once the data transmission path has been established, the configuration controller 110 immediately activates the corresponding computation unit so that it begins executing the specified computation task.
During data transmission, the configuration controller 110 continuously monitors the working status of each switch, including the current data transmission progress and whether data transmission conflicts have occurred. If a potential transmission conflict is detected, the configuration controller 110 can immediately adjust the data transmission priorities or replan the transmission path to ensure the reliability and efficiency of data transmission.
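The control flow of the three preceding paragraphs (plan lookup, route configuration, storage access timing, unit start-up, and conflict monitoring) might be summarized as in the following sketch; all interfaces here (send_route_config, issue_access, enable_unit, poll_switches, and the schedule entries) are hypothetical names, not the controller's actual firmware API:

```python
def execute_node(controller, node, schedule):
    """Run one node's computation task as described above: configure
    routes, schedule storage accesses, start the unit, then monitor."""
    entry = schedule[node]                      # from the data transmission plan
    for switch, state in entry.switch_configs:  # routing configuration commands
        controller.send_route_config(switch, state)
    for storage, op, t in entry.storage_ops:    # read/write commands with timing
        controller.issue_access(storage, op, at=t)
    controller.enable_unit(entry.compute_unit)  # start the computation task

    while not entry.done():                     # monitor switches during transfer
        status = controller.poll_switches(entry.switches)
        if status.conflict:                     # adjust priorities or reroute
            controller.reprioritize_or_reroute(entry)
```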
In one embodiment, the computation units and the storage units, in addition to performing their main computation and storage functions, are also configured with data forwarding capability. Specifically, when a computation unit or storage unit receives data that is not addressed to itself, the unit can autonomously determine the destination information of the data and transmit the data directly to the next target unit, without relaying it through a switch. For example, when performing data forwarding, a computation unit or storage unit first reads the destination tag contained in the data. If the destination tag indicates that the data is not addressed to the unit itself, the unit selects the most appropriate transmission direction according to the routing information pre-set by the configuration controller and transmits the data intact to the next node. In this transmission process, the computation unit or storage unit is responsible only for forwarding the data and does not modify or process the data content in any way. Through this direct forwarding mechanism of the computation units and storage units, the system can establish more direct data transmission paths.
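A minimal sketch of this direct-forwarding behavior, assuming hypothetical packet fields and a routing table pre-set by the configuration controller:

```python
def on_receive(unit, packet, routing_table):
    """Forwarding logic of a computation/storage unit: consume packets
    addressed to itself, otherwise pass them on unmodified."""
    if packet.destination == unit.id:
        unit.accept(packet)  # normal compute/store handling
    else:
        # Next hop comes from routing info pre-set by the configuration
        # controller; the payload is forwarded without modification.
        next_hop = routing_table[unit.id][packet.destination]
        unit.send(next_hop, packet)
```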
Through the above control mechanism, the configuration controller 110 can ensure that data transmission in the system both satisfies the dependency requirements of the computation tasks and achieves high component resource utilization. At the same time, the dynamic adjustment characteristics of this control mechanism enable the system to adapt to the computation needs of neural networks of different scales and types.
It is worth noting that although the computation units, storage units, and switches have the same functions in the first arrays 121 and the second arrays 122, their quantity configurations differ between the two types of arrays. In the first array 121, the number of computation units exceeds the number of storage units, while in the second array 122 the number of storage units exceeds the number of computation units. Through this differentiated quantity configuration, the system can achieve an optimal balance between computation performance and memory access efficiency when executing deep learning workloads, while optimizing resource utilization through the dynamic management of the configuration controller 110.
In this embodiment, the workload configuration storage circuit unit 112 is used to store the computation graph information of the neural network model.
Specifically, in one embodiment, the computation graph information includes: (1) dependency level information for each of the nodes, indicating the execution order of the nodes in the neural network model; the dependency level information allows nodes at the same execution level to be executed in parallel, while nodes at different execution levels are executed sequentially; (2) connection count information between the nodes, indicating the total number of adjacent nodes connected to each node; (3) data transfer volume information between the nodes, indicating the size of the data on each data transmission path; and (4) computation volume information for each node, indicating the amount of computation of the node operation performed by that node. In another embodiment, the computation graph information may further include information about the parent node(s) of each node. The details of how the computation graph information is obtained are explained below with reference to Figure 8.
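The four kinds of computation graph information enumerated above (plus the optional parent-node information) map naturally onto a per-node record, sketched below with assumed field names:

```python
from dataclasses import dataclass, field

@dataclass
class GraphNode:
    """Computation graph information for one neural network node."""
    node_id: int
    dependency_level: int  # (1) execution level; same level => may run in parallel
    neighbor_count: int    # (2) number of adjacent (connected) nodes
    transfer_bytes: dict[int, int] = field(default_factory=dict)  # (3) per-edge data size
    compute_ops: int = 0   # (4) amount of computation of this node's operation
    parents: list[int] = field(default_factory=list)  # optional parent-node info

def levels(graph: list[GraphNode]) -> dict[int, list[GraphNode]]:
    """Group nodes by dependency level: one batch of parallel nodes per level."""
    out: dict[int, list[GraphNode]] = {}
    for n in graph:
        out.setdefault(n.dependency_level, []).append(n)
    return out
```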
In one embodiment, the task control processor 111 dynamically manages, based on the computation graph information, the enabled and disabled states of the first computation units, the first storage units, the first switches, the second storage units, the second computation units, and the second switches, so as to execute the neural network computation tasks of the neural network model.
In one embodiment, the task control processor 111 first determines the execution order of the nodes based on the dependency level information; it then selects a suitable computation unit configuration based on the connection count information (so that the number of switches connected to a selected computation unit matches, to the extent possible, the connection count of the corresponding node); it plans the data transmission paths based on the data transfer volume information; and finally it allocates computing resources by considering the computation volume information and the computing capability of each computation unit, thereby achieving an optimized allocation of computing resources.
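Putting the steps together, the selection logic might look like the following sketch, building on the hypothetical GraphNode record above; the capability and switch_count attributes of a computation unit are likewise illustrative assumptions:

```python
def assign_resources(graph, idle_units):
    """Order nodes by dependency level, then pick for each node an idle
    computation unit whose switch fan-out best matches the node's
    connection count and whose capability covers its compute load."""
    plan = {}
    for node in sorted(graph, key=lambda n: n.dependency_level):
        candidates = [u for u in idle_units if u.capability >= node.compute_ops]
        if not candidates:
            continue  # defer this node until units are freed
        best = min(candidates,
                   key=lambda u: abs(u.switch_count - node.neighbor_count))
        plan[node.node_id] = best
        idle_units.remove(best)
    return plan
```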
In one embodiment, the configuration controller 110 can be implemented as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a programmable logic device (PLD). These implementations offer high performance and low latency, making them particularly suitable for real-time computing resource configuration scenarios.
In another embodiment, the workload configuration storage circuit unit 112 can be implemented as static random-access memory (SRAM), dynamic random-access memory (DRAM), or flash memory. When selecting a suitable memory type, factors such as access speed, power consumption, and cost need to be considered.
In yet another embodiment, the task control processor 111 can be implemented as a processor core with a reduced instruction set computing (RISC) architecture or a very long instruction word (VLIW) architecture. These processor architectures are characterized by high instruction execution efficiency and low power consumption, making them suitable for the real-time scheduling of computing resources.
In this embodiment, data transmission paths can be divided into two types: (1) coplanar paths, i.e., transmission paths used when the idle computation unit and the preceding computation unit are located on the same array plane; and (2) non-coplanar paths, i.e., transmission paths used when the idle computation unit and the preceding computation unit are located on different array planes.
The system (e.g., the task control processor 111) preferentially selects the transmission path containing the smaller total number of components. Specifically, when the shortest non-coplanar path contains fewer components in total than the shortest coplanar path, the system selects the specific idle computation unit corresponding to the shortest non-coplanar path as the target computation unit. In other words, the disclosed architecture provides non-planar transmission paths that can be better than traditional planar transmission paths, strengthening data transmission speed and thereby improving the efficiency of the overall neural network computation task.
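A minimal sketch of this path-selection rule, counting traversed components as the cost; the path-enumeration step that produces the candidate paths is assumed to happen elsewhere:

```python
def pick_target_unit(shortest_coplanar, shortest_cross_plane):
    """Each argument is a (path, idle_unit) pair or None, where a path is
    the ordered list of components between the preceding computation unit
    and the candidate idle unit. The path traversing fewer components wins;
    the cross-plane route is taken only when it is strictly shorter."""
    if shortest_cross_plane and (
            shortest_coplanar is None
            or len(shortest_cross_plane[0]) < len(shortest_coplanar[0])):
        return shortest_cross_plane[1]
    return shortest_coplanar[1] if shortest_coplanar else None
```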
In one embodiment, the task control processor 111 may, for example, perform the following task scheduling.
Task analysis phase: based on the dependency level information, select the task nodes to be configured in the order of the execution levels (dependency levels), so as to start executing the corresponding neural network computation tasks; ensure that the computation units assigned to the same task node can execute the corresponding node operations in parallel; and sort the neural network computation tasks according to the dependency level information.
Resource allocation phase: evaluate the computation volume information of each target task node and the computing capability of each idle computation unit; determine whether the idle computation units are sufficient to be allocated to the target task node; and select and enable suitable target computation units according to the computational requirements and the shortest transmission path.
Dynamic adjustment phase: monitor the working status of each computation unit; when a computation unit completes its node operation, select the storage unit with the shortest transmission path to store the computation result; reset the unit that has completed its computation as a new idle computation unit; and continue monitoring and adjusting until all task nodes have been configured. In one embodiment, when a computation unit changes from the busy state to an idle computation unit, the configuration controller 110 decides, based on the current workload of the system, which power management state the idle computation unit should enter. If it is predicted that the computation unit may be reconfigured and used again in the near future, it is put into a standby mode to ensure a quick switch back to the operating state; if it is predicted that it will not be used for a longer period, it can be put into a sleep mode or even a deep shutdown mode to reduce the overall power consumption of the system. This dynamic power management mechanism can effectively balance the system's computing performance and energy efficiency. In more detail, in one embodiment, the first state is the standby mode, in which the system continues to supply the clock signal and power and maintains the basic state information of the computation unit, enabling it to switch quickly back to the operating state at the cost of a moderate level of power consumption. The second state is the sleep mode, which shuts off the clock supply and lowers the core voltage, retaining only the essential configuration information; it has lower power consumption but requires a longer wake-up time. The third state is the deep shutdown mode, which completely cuts off the power supply and clears all state information; it achieves the lowest power consumption but requires the longest restart time, making it particularly suitable for computation units that are not expected to be used for a long time. When determining the power state of a computation unit, the configuration controller 110 considers multiple factors, including the current workload prediction, the power budget of the system, the physical location of the computation unit (taking heat dissipation into account), and the wake-up time required by each state, so as to achieve the best balance between performance and power consumption (a sketch of this power-state decision is given after the resource optimization phase below). "Enabling" a specific computation unit may refer to setting it to the busy/working state to perform the corresponding node operation; "disabling" a specific computation unit may refer to setting it from the busy/working state to a non-enabled state, or to the standby, sleep, or deep shutdown state, to save power to varying degrees.
Resource optimization phase: select the computation unit configuration according to the connection count information between nodes; ensure that target computation units with higher connection counts are connected to more switches; and dynamically adjust the data transmission paths to minimize transmission latency.
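The standby/sleep/deep-shutdown decision described in the dynamic adjustment phase could be sketched as follows; this is a minimal illustration, and the cycle thresholds and the idle-time predictor are assumptions, not values from the disclosure:

```python
from enum import Enum

class PowerState(Enum):
    STANDBY = "standby"         # clock and power kept; fastest wake-up, moderate power
    SLEEP = "sleep"             # clock gated, core voltage lowered; slower wake-up
    DEEP_SHUTDOWN = "shutdown"  # power cut, state cleared; lowest power, longest restart

def choose_power_state(predicted_idle_cycles, soon=1_000, long_term=1_000_000):
    """Map the predicted idle time of a freed computation unit to a power
    state; the cycle thresholds here are illustrative assumptions."""
    if predicted_idle_cycles < soon:
        return PowerState.STANDBY        # likely to be reconfigured again soon
    if predicted_idle_cycles < long_term:
        return PowerState.SLEEP
    return PowerState.DEEP_SHUTDOWN      # not expected to be used for a long time
```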
Figure 3A is a schematic diagram of a first alternating stacking arrangement of the three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 3B is a schematic diagram of a second alternating stacking arrangement of the three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure.
Referring to Figure 3A, in one embodiment, the three-dimensional coarse-grained reconfigurable array architecture system arranges its array structure using the first alternating stacking arrangement. In the three-dimensional coordinate system (X, Y, Z), the first array 121(1) is located at the top layer, the second array 122(1) is located directly below it, and the first array 121(2) is located below the second array 122(1). Each array is parallel to the XY plane, and the arrays are stacked along the Z axis.
Referring to Figure 3B, in another embodiment, the system uses the second alternating stacking arrangement, in which the second array 122(1) is located at the top layer, the first array 121(1) is located below it, and the second array 122(2) is located below the first array 121(1). This arrangement ensures that adjacent layers always maintain an alternating configuration of first arrays 121 and second arrays 122.
In this embodiment, data transmission paths between adjacent arrays are established through vertical interconnection structures. When the system needs to transmit data between arrays at different layers, the data transfer can be completed directly through a vertical path without multiple forwarding hops through arrays on the same layer, improving data transmission efficiency and computation timeliness.
The main difference between these two alternating stacking arrangements lies in the type of the top-layer array. The configuration controller 110 can select the appropriate stacking arrangement to configure the system architecture according to the requirements of the computation tasks.
Figure 4A is a schematic diagram of a first layout of the first array and the second array of the three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure.
Referring to Figure 4A, in one embodiment, the three-dimensional coarse-grained reconfigurable array architecture system includes a first array 121 and a second array 122, in which the units are arranged and interconnected in a specific manner.
Specifically, the first array 121 adopts a four-unit configuration, including two computation units (P), one storage unit (M), and one switch (S). The two computation units are located at the upper-left and lower-right corners of the array, the storage unit is located at the lower-left corner, and the switch is located at the upper-right corner. The switch (S) forms straight-line connections with the computation unit (P) to its left and the computation unit (P) below it, and forms a diagonal connection with the storage unit (M) at the lower-left corner.
The second array 122 likewise adopts a four-unit configuration, including one computation unit (P), two storage units (M), and one switch (S). The storage units are located at the upper-left and lower-right corners, the computation unit is located at the upper-right corner, and the switch is located at the lower-left corner. The switch (S) forms straight-line connections with the storage unit (M) to its right and the storage unit (M) above it, and forms a diagonal connection with the computation unit (P) at the upper-right corner.
This layout design allows each switch (S) to establish connections with three other units, forming a branching structure. The connections include two straight-line connections and one diagonal connection; this configuration gives the data transmission paths fixed and predictable characteristics.
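As a reading aid, the two four-unit layouts of Figure 4A can be written down as 2x2 grids (row-major, top row first); this encoding is an illustrative convention of this rewrite, not notation from the patent:

```python
# Figure 4A, first array 121 (computation-enhanced):
# P at upper-left and lower-right, S at upper-right, M at lower-left.
FIRST_ARRAY_4A = [["P", "S"],
                  ["M", "P"]]

# Figure 4A, second array 122 (storage-enhanced):
# M at upper-left and lower-right, P at upper-right, S at lower-left.
SECOND_ARRAY_4A = [["M", "P"],
                   ["S", "M"]]
```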
Figure 4B is a schematic diagram of a second layout of the first and second arrays of the three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure.
Referring to Figure 4B, in one embodiment, the first array 121 and the second array 122 of the three-dimensional coarse-grained reconfigurable array architecture system adopt a nine-unit extended configuration structure (a 3x3 matrix of units with 4 additional inserted units), which may also be called an odd array.
In the first array 121, four computation units (P) are located at the upper-left, upper-right, lower-left, and lower-right corners, respectively. Four storage units (M) are located at the top-center, left-center, right-center, and bottom-center positions, respectively. Four switches (S) are arranged in a cross pattern in the central region, with one computation unit (P) placed at the center. The four inserted units are switches (S), and each switch (S) is connected to its two adjacent computation units (P) and two adjacent storage units (M). This configuration ensures that each switch (S) is connected to four adjacent units, forming a mesh topology. The central computation unit (P) forms a cross-shaped connection with the four switches (S).
The component positions of the second array 122 use a similar configuration, but the component types at the respective positions differ from those in the first array 121. Specifically, four storage units (M) are located at the four corners of the array, and four computation units (P) surround the central storage unit (M). Four switches (S) are located at the top-center, left-center, right-center, and bottom-center positions, respectively. Each switch (S) is connected to its two adjacent computation units (P) and two adjacent storage units (M). The central storage unit (M) forms a cross-shaped connection with the four computation units (P).
In one embodiment, the three-dimensional coarse-grained reconfigurable array architecture system adopts the odd array configuration, in which two adjacent layers are a storage array (e.g., a second array) and a computation array (e.g., a first array), respectively. The switches (S), computation units (P), and storage units (M) in each array are interconnected according to specific rules.
In this embodiment, each switch (S) is connected to at least one computation unit (P) and one storage unit (M). Specifically, the computation units (P) and storage units (M) connected to a switch (S) are arranged in an interleaved manner, thereby forming a uniformly distributed connection structure.
Regarding the design characteristics of the storage array, the number of storage units (M) connected to each switch (S) is greater than the number of computation units (P). Furthermore, the layout of the storage array can be converted into a data array by the following rules: change each switch (S) into a storage unit (M), change each storage unit (M) into a computation unit (P), and change each computation unit (P) into a switch (S).
In contrast, in the computation array, the number of storage units (M) connected to each switch (S) does not exceed the number of computation units (P). This connection configuration makes the computation array particularly suitable for executing computation-intensive tasks, while the storage array is suitable for handling scenarios that require large amounts of data storage.
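The cyclic role-substitution rule quoted above (S to M, M to P, P to S) can be applied directly to the grid encoding introduced for Figure 4A; a minimal sketch:

```python
ROLE_SWAP = {"S": "M", "M": "P", "P": "S"}  # S -> M, M -> P, P -> S

def convert_layout(layout):
    """Apply the storage-array conversion rule cell by cell."""
    return [[ROLE_SWAP[cell] for cell in row] for row in layout]

# For example, convert_layout(SECOND_ARRAY_4A) yields [["P", "S"], ["M", "P"]],
# which coincides with FIRST_ARRAY_4A, the first-array layout of Figure 4A.
```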
Figure 4C is a schematic diagram of a third layout of the first and second arrays of the three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure.
Referring to Figure 4C, in one embodiment, the first array 121 and the second array 122 of the three-dimensional coarse-grained reconfigurable array architecture system adopt a center-extended architecture configuration. This configuration is based on a 4x4 matrix array structure, with an additional component added at the center.
The configuration of the first array 121 is as follows:
Matrix boundary configuration: four computation units (P) are located at the middle positions of the left and right sides, four storage units (M) are located at the four corners, and four switches (S) are located at the four center points of the top, bottom, left, and right sides.
Center configuration: an additional computation unit (P) is located at the exact center of the array.
Connection structure: each switch (S) is connected to two computation units (P) and one storage unit (M), while the central computation unit (P) forms a cross-shaped connection with the four switches (S).
第二陣列122的配置為:The configuration of the second array 122 is as follows:
矩陣邊界配置:四個儲存單元(M)分別位於上方及下方的中間位置,四個切換器(S)位於四個角落,四個運算單元(P)位於上、下、左、右四個中心點位置。Matrix boundary configuration: Four storage units (M) are located at the top and bottom center, four switches (S) are located at the four corners, and four operation units (P) are located at the top, bottom, left, and right center points.
中心配置:一個額外的切換器(S)位於陣列的正中心位置。Central configuration: An additional switch (S) is located at the very center of the array.
連接架構:每個角落的切換器(S)分別與一個運算單元(P)及兩個儲存單元(M)建立連接,而中心切換器(S)則與四個運算單元(P)形成十字型連接。Connection structure: Each corner switch (S) is connected to one operation unit (P) and two storage units (M) respectively, while the central switch (S) forms a cross-shaped connection with the four operation units (P).
此種配置將原有4x4矩陣架構的邊界連接與中心輻射狀連接相結合,在保持規則的連接結構的同時,提供了更多的資料傳輸路徑選擇。應注意的是,在一實施例中,由於第一陣列的中心運算單元所連接的切換器的數量最多,可將該中心運算單元的運算能力進行增強,例如N倍於其他運算單元的運算能力(N大於1),以強化處理神經網路運算任務的效率。This configuration combines the boundary connections of the original 4x4 matrix architecture with the central radial connections, providing more data transmission path options while maintaining a regular connection structure. It should be noted that in one embodiment, since the central computation unit of the first array is connected to the largest number of switches, its computing power can be enhanced, for example, to N times the computing power of the other computation units (N greater than 1), to improve the efficiency of handling neural network computation tasks.
此外,這個佈局的其他特徵例如為:In addition, other features of this layout include:
第一陣列的外圍位置的兩個運算單元會相鄰,但不互連;第二陣列的外圍位置的兩個儲存單元會相鄰,但不互連。The two operation units at the outermost position of the first array will be adjacent but not connected; the two storage units at the outermost position of the second array will be adjacent but not connected.
圖4D為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的第一及第二陣列的第四佈局的示意圖。Figure 4D is a schematic diagram of the fourth layout of the first and second arrays of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure.
參照圖4D,在一實施例中,三維粗粒度可重構陣列架構系統的第一陣列121及第二陣列122採用延伸式佈局配置,該配置係基於圖4A所示之基本單元佈局(第一佈局)進行延伸擴展。Referring to Figure 4D, in one embodiment, the first array 121 and the second array 122 of the three-dimensional coarse-grained reconfigurable array architecture system adopt an extended layout configuration, which is an extension of the basic unit layout (first layout) shown in Figure 4A.
第一陣列121的佈局配置包括:四組切換器(S)依矩陣方式排列,形成2x2的切換網路架構。每個切換器(S)作為其所屬基本單元的中央,連接8個周邊單元,包括4個運算單元(P)及4個儲存單元(M)。具體來說,運算單元(P)與切換器(S)之間採用直線連接,而儲存單元(M)與切換器(S)之間採用斜向連接。相鄰的切換器(S)之間通過運算單元(P)建立連接,形成水平及垂直方向的資料傳輸通道。The layout of the first array 121 includes four switches (S) arranged in a matrix to form a 2x2 switching network architecture. Each switch (S) serves as the center of its own basic unit and connects to eight peripheral units, including four computation units (P) and four storage units (M). Specifically, the computation units (P) are connected to the switches (S) by straight lines, while the storage units (M) are connected to the switches (S) diagonally. Adjacent switches (S) are connected through computation units (P), forming horizontal and vertical data transmission channels.
第二陣列122實作類似的延伸式佈局,其佈局配置包括:The second array 122 implements a similar extended layout, and its layout configuration includes:
中央配置一個儲存單元(M),周圍環繞四個切換器(S)。每個切換器(S)與3個運算單元(P)建立直線連接,同時與兩個儲存單元(M)建立斜向連接。此外,在四個角落也配置有切換器(S),以連接到鄰近的儲存單元(M)。這些切換器(S)透過中央儲存單元(M)及周邊運算單元(P)相互連接,構成完整的資料傳輸網路。A central storage unit (M) is configured, surrounded by four switches (S). Each switch (S) is linearly connected to three processing units (P) and diagonally connected to two storage units (M). Additionally, switches (S) are located at the four corners to connect to adjacent storage units (M). These switches (S) are interconnected through the central storage unit (M) and the peripheral processing units (P), forming a complete data transmission network.
此延伸式佈局保持了基本單元佈局的連接特性,同時通過增加單元數量擴展了系統的運算及儲存容量。This extended layout maintains the connectivity of the basic unit layout while expanding the system's computing and storage capacity by increasing the number of units.
圖5為根據本公開的一實施例所繪示的對應第一佈局的第一陣列及第二陣列的立體架構的示意圖。Figure 5 is a schematic diagram of the three-dimensional structure of the first and second arrays corresponding to the first layout, according to an embodiment of the present disclosure.
參照圖5,在一實施例中,三維粗粒度可重構陣列架構系統採用立體堆疊架構,該架構基於圖4A所示的第一佈局實現。系統包括相互交錯堆疊的第一陣列121及第二陣列122,並設有垂直連接架構B51、B52以實現跨層資料傳輸。Referring to Figure 5, in one embodiment, the three-dimensional coarse-grained reconfigurable array architecture system adopts a three-dimensional stacked architecture based on the first layout shown in Figure 4A. The system includes a first array 121 and a second array 122 that are stacked in an interleaved manner, and is provided with vertical connection structures B51 and B52 to realize cross-layer data transmission.
第一陣列121及第二陣列122在XY平面上各自維持其基本佈局特徵。在第一陣列121(1)中,一個切換器(S)位於右上方,與其相鄰的兩個運算單元(P)分別位於左側及下方,一個儲存單元(M)位於左下方。類似地,第一陣列121(2)保持相同的平面佈局。在第二陣列122(1)中,一個切換器(S)位於左下方,與其相鄰的兩個儲存單元(M)分別位於右側及上方,一個運算單元(P)位於右上方。The first array 121 and the second array 122 each maintain their basic layout features in the XY plane. In the first array 121(1), a switch (S) is located in the upper right, two adjacent operation units (P) are located on the left and below respectively, and a storage unit (M) is located in the lower left. Similarly, the first array 121(2) maintains the same planar layout. In the second array 122(1), a switch (S) is located in the lower left, two adjacent storage units (M) are located on the right and above respectively, and an operation unit (P) is located in the upper right.
在Z軸方向上,系統實現了兩種垂直連接架構,例如:In the Z-axis direction, the system implements two vertical connection architectures, for example:
(1) 垂直連接架構B51:包含一個上層儲存單元(M),通過垂直互連直接連接至下層切換器(S),再連接至另一個下層儲存單元(M)。也就是說,在多個堆疊的陣列中,在左下方的組件會呈現M-S-M-S…的垂直連接架構。(1) Vertical interconnection architecture B51: It includes an upper storage unit (M) that is directly connected to a lower switch (S) via vertical interconnection, and then connected to another lower storage unit (M). That is, in multiple stacked arrays, the components in the lower left corner will present a vertical interconnection architecture of M-S-M-S…
(2) 垂直連接架構B52:包含一個上層切換器(S),通過垂直互連直接連接至下層運算單元(P),再連接至另一個下層切換器(S)。也就是說,在多個堆疊的陣列中,對應位置的組件會呈現S-P-S-P…的垂直連接架構。(2) Vertical connection architecture B52: It includes an upper-layer switch (S) that is directly connected to a lower-layer computation unit (P) via vertical interconnection, and then connected to another lower-layer switch (S). That is to say, in multiple stacked arrays, the components at the corresponding position will present an S-P-S-P… vertical connection architecture.
這些垂直連接架構在相鄰陣列之間建立了直接的資料傳輸通道。具體來說,當第一陣列121(1)的切換器(S)需要與第二陣列122(1)的運算單元(P)通信時,資料可以直接通過垂直連接架構B52傳輸,無需經過水平方向的多次轉發。同樣地,當需要在不同層級的儲存單元(M)之間傳輸資料時,可以通過垂直連接架構B51實現直接傳輸。These vertical connection architectures establish direct data transmission channels between adjacent arrays. Specifically, when the switch (S) of the first array 121(1) needs to communicate with the computing unit (P) of the second array 122(1), the data can be transmitted directly through the vertical connection architecture B52 without multiple forwardings in the horizontal direction. Similarly, when data needs to be transmitted between storage units (M) at different levels, direct transmission can be achieved through the vertical connection architecture B51.
在整體架構中,相鄰的第一陣列121和第二陣列122形成功能互補的配對。通過垂直連接架構B51和B52,系統能夠在保持各陣列平面佈局特徵的同時,實現跨層資料傳輸。這種立體堆疊架構不僅維持了原有平面佈局的連接特性,還通過垂直連接架構提供了額外的資料傳輸路徑,從而強化了系統的資料傳輸效能。In the overall architecture, the adjacent first array 121 and second array 122 form a functionally complementary pair. Through the vertical connection structures B51 and B52, the system can achieve cross-layer data transmission while maintaining the planar layout characteristics of each array. This three-dimensional stacking architecture not only maintains the connection characteristics of the original planar layout, but also provides additional data transmission paths through the vertical connection structure, thereby enhancing the system's data transmission performance.
值得一提的是,在本實施例中,在一個陣列要進行跨層傳輸時,會先經過該陣列的切換器,來建立跨層傳輸路徑,如此即可藉由路由任務的設定,保證資料傳輸路徑的正確性。It is worth mentioning that in this embodiment, when an array needs to perform cross-layer transmission, the data first goes through the array's switch to establish the cross-layer transmission path, so that the correctness of the data transmission path can be ensured through the configuration of routing tasks.
圖6為根據本公開的一實施例所繪示的對應第三佈局的第一陣列及第二陣列的立體架構的示意圖。Figure 6 is a schematic diagram of the three-dimensional structure of the first and second arrays corresponding to the third layout, according to an embodiment of the present disclosure.
請參照圖6,在圖6所示的實施例中,垂直連接架構呈現兩種交錯排列的模式:Please refer to Figure 6. In the embodiment shown in Figure 6, the vertical connection structure presents two staggered arrangement patterns:
(1) 垂直連接架構B61:上層儲存單元(M)通過垂直互連連接至下層運算單元(P),再連接至另一個下層儲存單元(M)。在堆疊的陣列中,垂直連接架構B61呈現出M-P-M-P...的交錯排列。(1) Vertical connection architecture B61: The upper storage unit (M) is vertically interconnected to the lower computing unit (P), and then connected to another lower storage unit (M). In the stacked array, the vertical connection architecture B61 presents an alternating arrangement of M-P-M-P...
(2) 垂直連接架構B62:上層切換器(S)通過垂直互連連接至下層儲存單元(M),再連接至另一個下層切換器(S)。在堆疊的陣列中,垂直連接架構B62呈現出S-M-S-M...的交錯排列。(2) Vertical interconnection architecture B62: The upper-level switch (S) is vertically interconnected to the lower-level storage unit (M), and then to another lower-level switch (S). In the stacked array, the vertical interconnection architecture B62 presents an alternating arrangement of S-M-S-M...
圖6所示的堆疊結構,在垂直方向上實現了運算單元(P)、儲存單元(M)、切換器(S)的交錯排列。相鄰層之間通過垂直連接架構B61和B62直接相連,形成緊密的立體網路。The stacked structure shown in Figure 6 implements an alternating arrangement of computation units (P), storage units (M), and switches (S) in the vertical direction. Adjacent layers are directly connected through vertical connection structures B61 and B62, forming a tight three-dimensional network.
在資料傳輸過程中,當第一陣列121(1)、121(2)需要將資料傳輸至第二陣列122時,可以直接通過垂直連接架構B61實現跨層傳輸。同理,當第二陣列122(1)需要將資料傳輸至第一陣列121時,可以直接通過垂直連接架構B62實現跨層傳輸。During data transmission, when the first arrays 121(1) and 121(2) need to transmit data to the second array 122, cross-layer transmission can be achieved directly through the vertical connection architecture B61. Similarly, when the second array 122(1) needs to transmit data to the first array 121, cross-layer transmission can be achieved directly through the vertical connection architecture B62.
圖7為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的控制方法的流程圖。Figure 7 is a flowchart illustrating a control method for a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure.
參照圖7,本發明提供一種三維粗粒度可重構陣列(Three-dimensional Coarse-Grained Reconfigurable Array, 3D-CGRA)架構系統的控制方法。該方法包括以下步驟:Referring to Figure 7, this invention provides a control method for a three-dimensional coarse-grained reconfigurable array (3D-CGRA) architecture system. The method includes the following steps:
步驟S710,經由配置控制器,監視三維粗粒度可重構陣列架構系統的多個第一陣列及多個第二陣列各自的工作狀態。其中每個第一陣列包括多個第一運算單元、多個第一儲存單元以及多個第一切換器,而每個第二陣列則包括多個第二運算單元、多個第二儲存單元以及多個第二切換器。Step S710, via the configuration controller, monitors the operating status of multiple first arrays and multiple second arrays of the three-dimensional coarse-grained reconfigurable array architecture system. Each first array includes multiple first computation units, multiple first storage units, and multiple first switches, while each second array includes multiple second computation units, multiple second storage units, and multiple second switches.
具體來說,配置控制器110持續即時追蹤各陣列中的運算單元、儲存單元以及切換器的使用情況。在一實施例中,工作狀態包括:閒置、運算中、存取中、工作中、已完成節點運算。此外,儲存單元的工作狀態更可包括該儲存單元所記錄的資料/資訊。通過監視它們的工作狀態,配置控制器110能夠掌握系統內部資源的即時分配現況以及資料的儲存情況,為後續的動態配置管理提供了依據。Specifically, the configuration controller 110 continuously and in real-time tracks the usage status of computing units, storage units, and switches in each array. In one embodiment, the operating status includes: idle, computing, accessing, working, and node computing completed. Furthermore, the operating status of a storage unit can include the data/information recorded by that storage unit. By monitoring their operating status, the configuration controller 110 can grasp the real-time allocation status of internal system resources and the data storage status, providing a basis for subsequent dynamic configuration management.
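As an illustration only, the sketch below models the unit states listed above and the controller's bookkeeping in Python. The class and method names (`UnitState`, `StatusMonitor`, `idle_units`) are hypothetical stand-ins for hardware behavior, not part of this disclosure.

```python
from enum import Enum, auto

class UnitState(Enum):
    IDLE = auto()       # available for allocation
    COMPUTING = auto()  # executing a node computation
    ACCESSING = auto()  # reading or writing a storage unit
    WORKING = auto()    # generic busy state (e.g., routing)
    NODE_DONE = auto()  # node computation completed, result pending

class StatusMonitor:
    """Software model of the configuration controller's state tracking."""
    def __init__(self, unit_ids):
        self.states = {uid: UnitState.IDLE for uid in unit_ids}

    def update(self, unit_id, state):
        self.states[unit_id] = state

    def idle_units(self):
        # Units eligible to be enabled for a new task.
        return [u for u, s in self.states.items() if s is UnitState.IDLE]

monitor = StatusMonitor(["P1", "P2", "P3", "M1", "S1"])
monitor.update("P1", UnitState.COMPUTING)
print(monitor.idle_units())  # ['P2', 'P3', 'M1', 'S1']
```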
步驟S720,配置控制器110根據對應神經網路模型的運算圖(Computation Graph)資訊,動態管理多個第一運算單元、多個第一儲存單元、多個第一切換器、多個第二儲存單元、多個第二運算單元、多個第二切換器的啟用及禁用,以執行神經網路模型的多個神經網路運算任務。Step S720: Configure controller 110 to dynamically manage the enabling and disabling of multiple first computation units, multiple first storage units, multiple first switches, multiple second storage units, multiple second computation units, and multiple second switches based on the computation graph information of the corresponding neural network model, in order to execute multiple neural network computation tasks of the neural network model.
在此步驟中,配置控制器110依據神經網路模型本身的結構和運算順序,調度系統中的各類硬體資源,以基於當前的陣列堆疊架構來決定啟用或禁用特定的運算單元、儲存單元和切換器,以使神經網路運算任務可以依照原本的神經網路架構的順序來執行。此外,透過設定切換器的路由任務,可確保跨層的資料傳輸能夠順暢進行。In this step, the configuration controller 110 schedules various hardware resources in the system based on the structure and computational order of the neural network model itself. It determines whether to enable or disable specific computational units, storage units, and switches based on the current array stack architecture, so that neural network computational tasks can be executed in the original order of the neural network architecture. Furthermore, by configuring the routing tasks of the switches, smooth data transmission across layers can be ensured.
圖8為根據本公開的一實施例所繪示的獲取對應神經網路模型的運算圖資訊的示意圖。Figure 8 is a schematic diagram illustrating the acquisition of computational graph information corresponding to a neural network model according to an embodiment of the present disclosure.
參照圖8,在一實施例中,系統基於已知的神經網路模型的架構資料及相應的各種參數(這些資料可被記錄在工作負載配置儲存電路單元112),產生對應的資料傳輸路徑CT81及運算圖資訊TB81。資料傳輸路徑CT81以有向圖形式呈現,節點間的箭頭指示資料流動方向,用以表達節點間的資料相依關係。Referring to Figure 8, in one embodiment, the system generates corresponding data transmission paths CT81 and computational graph information TB81 based on the known neural network model architecture data and corresponding parameters (this data can be recorded in the workload configuration storage circuit unit 112). The data transmission path CT81 is presented in the form of a directed graph, with arrows between nodes indicating the direction of data flow to express the data dependencies between nodes.
在本實施例中,圖中節點A至I代表神經網路模型中的運算節點。每個箭頭代表一個資料傳輸路徑,指示運算結果的傳遞方向。例如,節點A的運算結果需傳送至節點B及節點C,表示節點B和節點C的運算需依賴節點A的運算結果。In this embodiment, nodes A to I in the diagram represent computational nodes in the neural network model. Each arrow represents a data transmission path, indicating the direction of transmission of computation results. For example, if the computation result of node A needs to be transmitted to nodes B and C, it means that the computations of nodes B and C depend on the computation result of node A.
系統依據CT81所示的資料傳輸路徑,產生運算圖資訊(如,表格TB81所示)。該運算圖資訊TB81包含以下資訊:The system generates computation graph information (as shown in Table TB81) based on the data transmission path shown in CT81. This computation graph information TB81 includes the following information:
(1) 相依性層級:指示節點在運算序列中的執行順序;(2) 節點:標記每個相依性層級內的運算節點;(3) 連接數量:記錄每個節點的相鄰節點總數;(4) 運算量:表示每個節點所處理的節點運算的運算複雜度;(5) 資料傳輸量:記錄節點間所輸出的資料大小;(6) 母節點:標示每個節點的資料來源節點。(1) Dependency level: indicates the execution order of nodes in the operation sequence; (2) Node: marks the operation node in each dependency level; (3) Number of connections: records the total number of adjacent nodes of each node; (4) Computational complexity: indicates the computational complexity of the node operation processed by each node; (5) Data transfer volume: records the data output between nodes; (6) Parent node: marks the data source node of each node.
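These six fields can be pictured as one record per node. The sketch below is only a data-structure illustration: the English field names are loose translations, and the sample values for node A are limited to what the surrounding figures state (level 1, two connections, no parents), with the symbolic labels OP1/DT1 kept as-is.

```python
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    """One row of the computation graph information (cf. table TB81)."""
    node: str                   # (2) the node's identifier
    dependency_level: int       # (1) execution order in the sequence
    connections: int            # (3) total number of adjacent nodes
    op_amount: str              # (4) computational complexity, e.g. "OP1"
    data_transfer: str          # (5) size of the node's output, e.g. "DT1"
    parents: list = field(default_factory=list)  # (6) data source nodes

row_a = NodeInfo(node="A", dependency_level=1, connections=2,
                 op_amount="OP1", data_transfer="DT1", parents=[])
print(row_a)
```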
如圖8的例子,系統(如,配置控制器110或任務控制處理器111)根據資料傳輸路徑分析出五個相依性層級:As shown in the example in Figure 8, the system (e.g., configuration controller 110 or task control processor 111) analyzes the data transmission path to identify five dependency levels:
第1層級:節點A,無前置相依節點;第2層級:節點B和C,依賴節點A的運算結果;第3層級:節點D、E和F,依賴節點B、C的運算結果;第4層級:節點G和H,依賴節點C、D、E的運算結果;第5層級:節點I,依賴節點F、G和H的運算結果。Level 1: Node A, with no preceding dependent nodes; Level 2: Nodes B and C, depending on the computation results of node A; Level 3: Nodes D, E, and F, depending on the computation results of nodes B and C; Level 4: Nodes G and H, depending on the computation results of nodes C, D, and E; Level 5: Node I, depending on the computation results of nodes F, G, and H.
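The five levels above follow mechanically from the parent relationships: a node with no parents sits at level 1, and every other node sits one level below its deepest parent. A minimal sketch of that derivation, using the edges of Figure 8 (hypothetical Python, for illustration only):

```python
from functools import lru_cache

# Parent lists transcribed from the data transmission paths of Figure 8.
parents = {
    "A": [], "B": ["A"], "C": ["A"],
    "D": ["B"], "E": ["B", "C"], "F": ["B"],
    "G": ["C", "D"], "H": ["E"], "I": ["F", "G", "H"],
}

@lru_cache(maxsize=None)
def dependency_level(node):
    # Level 1 for source nodes; otherwise one more than the deepest parent.
    ps = parents[node]
    return 1 if not ps else 1 + max(dependency_level(p) for p in ps)

print({n: dependency_level(n) for n in parents})
# {'A': 1, 'B': 2, 'C': 2, 'D': 3, 'E': 3, 'F': 3, 'G': 4, 'H': 4, 'I': 5}
```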
系統根據運算圖資訊,為每個節點分配適當的運算資源:The system allocates appropriate computing resources to each node based on the computation graph information:
(1) 連接數量資訊用於判斷節點所需的資料傳輸通道數量;(2) 運算量資訊(OP1-OP9)用於評估所需的運算單元數量;(3) 資料傳輸量資訊(DT1-DT9)用於規劃資料傳輸路徑或指派適合的儲存單元來暫存運算結果;(4) 母節點資訊用於確保資料相依性的正確性、用來判斷運算結果是否需要暫存、規劃資料存取路徑、優化儲存單元的資源分配。(1) Connection quantity information is used to determine the number of data transmission channels required by the node; (2) Computation quantity information (OP1-OP9) is used to evaluate the number of computing units required; (3) Data transmission quantity information (DT1-DT9) is used to plan the data transmission path or assign suitable storage units to temporarily store the computation results; (4) Parent node information is used to ensure the correctness of data dependencies, to determine whether the computation results need to be temporarily stored, to plan the data access path, and to optimize the resource allocation of storage units.
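For item (2), the worked examples later in this description all assume a uniform per-unit capacity, in which case the required number of computation units reduces to a ceiling division. A hedged sketch (the capacity of 100 mirrors those examples and is not a fixed property of the architecture):

```python
import math

UNIT_CAPACITY = 100  # per-unit computing power assumed in the later examples

def units_required(op_amount):
    """Minimum number of computation units needed for a node's workload."""
    return math.ceil(op_amount / UNIT_CAPACITY)

print(units_required(200))  # 2 units (cf. node A in Figure 10A)
print(units_required(300))  # 3 units (cf. node I in Figure 10C)
```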
在這個例子中,配置控制器110根據相依性層級資訊,安排節點的執行順序:In this example, the configuration controller 110 arranges the execution order of nodes based on dependency hierarchy information:
首先,配置無相依性的節點A。接著,待節點A完成節點運算後,可並行執行節點B和C(因為屬於同一個層級)。節點D需等待節點B的節點運算完成(因為節點D需要節點B的運算結果),節點E需等待節點B和節點C的節點運算完成,節點F需等待節點B的節點運算完成。節點G需等待節點C和節點D的節點運算完成,節點H需等待節點E的節點運算完成。最後執行節點I,需等待節點F、節點G和節點H的節點運算全部完成。First, node A, which has no dependencies, is configured. Next, after node A completes its node computation, nodes B and C can be executed in parallel (because they belong to the same level). Node D must wait for node B to complete its node computation (because node D needs the result of node B's computation), node E must wait for nodes B and C to complete their node computations, and node F must wait for node B to complete its node computation. Node G must wait for nodes C and D to complete their node computations, and node H must wait for node E to complete its node computation. Finally, node I is executed after nodes F, G, and H have all completed their node computations.
此運算圖資訊使系統能夠有效管理運算資源的分配及調度,確保神經網路運算的正確執行順序。This computation graph information enables the system to effectively manage the allocation and scheduling of computational resources, ensuring the correct execution order of neural network computations.
更詳細來說,在一實施例中,任務控制處理器111根據工作負載配置儲存電路單元112中儲存的運算圖資訊,執行運算資源的動態配置。具體來說,任務控制處理器111首先依據相依性層級資訊,決定對應多個節點的多個神經網路運算任務。接著,任務控制處理器111監控多個第一陣列121及多個第二陣列122的工作狀態,從中識別尚未被啟用的空閒運算單元。More specifically, in one embodiment, the task control processor 111 performs dynamic configuration of computing resources based on the computation graph information stored in the workload configuration storage circuit unit 112. Specifically, the task control processor 111 first determines multiple neural network computation tasks corresponding to multiple nodes based on dependency hierarchy information. Then, the task control processor 111 monitors the operating status of multiple first arrays 121 and multiple second arrays 122, identifying idle computing units that have not yet been activated.
當開始進行任務配置時,任務控制處理器111依據相依性層級資訊,從多個神經網路運算任務中依序選擇尚未處理的目標神經網路運算任務。對於每個目標神經網路運算任務,任務控制處理器111執行一系列配置步驟:首先,從空閒運算單元中選擇適合執行目標任務節點的目標運算單元;其次,從第一儲存單元及第二儲存單元中選擇對應的目標儲存單元;最後,從第一切換器及第二切換器中選擇合適的目標切換器,以建立目標運算單元與目標儲存單元之間的連接。When task configuration begins, the task control processor 111 sequentially selects unprocessed target neural network computation tasks from multiple neural network computation tasks based on dependency hierarchy information. For each target neural network computation task, the task control processor 111 performs a series of configuration steps: first, it selects a target computation unit suitable for executing the target task node from the available computation units; second, it selects a corresponding target storage unit from the first storage unit and the second storage unit; and finally, it selects a suitable target switch from the first switch and the second switch to establish a connection between the target computation unit and the target storage unit.
完成資源選擇後,任務控制處理器111依序啟用這些選定的硬體資源:啟用目標運算單元以執行目標節點運算、啟用目標儲存單元以儲存相關資料,並啟用目標切換器以設定目標路由任務,從而建立完整的資料傳輸路徑。After completing the resource selection, the task control processor 111 sequentially enables the selected hardware resources: enables the target computation unit to perform target node computation, enables the target storage unit to store relevant data, and enables the target switch to set the target routing task, thereby establishing a complete data transmission path.
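The select-then-enable sequence just described can be outlined as below. The picking policies here are deliberately naive placeholders (take the first free units); the actual selection criteria (path length, capacity, connection count) are discussed in the following paragraphs, so this is a sketch of the control flow only.

```python
import math

UNIT_CAPACITY = 100  # assumed uniform capacity, as in the later examples

def configure_task(task_nodes, idle_ps, idle_ms, idle_ss):
    """Hypothetical outline of the per-task flow: (1) pick target
    computation units, (2) pick target storage units, (3) pick target
    switches, then enable the resources in that order."""
    plan = []
    for node in task_nodes:
        need = math.ceil(node["op"] / UNIT_CAPACITY)
        target_p = [idle_ps.pop(0) for _ in range(need)]   # step 1
        target_m = [idle_ms.pop(0)]                        # step 2
        target_s = [idle_ss.pop(0)]                        # step 3
        # Enabling order: computation unit(s), storage unit, then the
        # switch whose routing task completes the data path.
        plan.append({"node": node["name"], "P": target_p,
                     "M": target_m, "S": target_s})
    return plan

print(configure_task([{"name": "B", "op": 200}, {"name": "C", "op": 100}],
                     idle_ps=["P3", "P4", "P1"],
                     idle_ms=["M7", "M8"], idle_ss=["S5", "S6"]))
```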
值得一提的是,就圖8的例子而言,一個神經網路運算任務可定義為:在相同相依性層級中,具有相同母節點的一組節點運算。例如:第一個神經網路運算任務:執行節點A的運算;第二個神經網路運算任務:同時執行節點B和C的運算(共同母節點為A);第三個神經網路運算任務:執行節點D、E和F的運算(母節點為B或C);第四個神經網路運算任務:執行節點G和H的運算;第五個神經網路運算任務:執行節點I的運算。It is worth mentioning that, in the example of Figure 8, a neural network computation task can be defined as: a set of node computations with the same parent node in the same dependency level. For example: the first neural network computation task: performing the computation of node A; the second neural network computation task: simultaneously performing the computations of nodes B and C (with A as the common parent node); the third neural network computation task: performing the computations of nodes D, E, and F (with B or C as the parent node); the fourth neural network computation task: performing the computations of nodes G and H; and the fifth neural network computation task: performing the computation of node I.
這些被劃分到同一運算任務的節點會被指定為該運算任務的任務節點。任務控制處理器111為這些任務節點配置對應數量的運算單元,使其能夠平行執行節點運算。所有神經網路運算任務則依據其相依性層級資訊進行排序,確保運算順序的正確性。這種任務設計能確保運算任務的執行順序符合資料相依性要求,同時支援可並行執行的節點運算。These nodes that are divided into the same computing task will be designated as the task nodes of the computing task. The task control processor 111 configures a corresponding number of computing units for these task nodes so that they can execute node operations in parallel. All neural network computing tasks are sorted according to their dependency level information to ensure the correctness of the order of operations. This task design ensures that the execution sequence of computing tasks meets data dependency requirements and supports node operations that can be executed in parallel.
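Grouping nodes by their dependency level then yields the ordered task list directly, matching the five tasks enumerated above. Continuing the illustrative Python (the `levels` mapping restates the Figure 8 result):

```python
from collections import defaultdict

# Dependency levels as derived earlier for the Figure 8 graph.
levels = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 3, "F": 3,
          "G": 4, "H": 4, "I": 5}

tasks = defaultdict(list)
for node, lvl in levels.items():
    tasks[lvl].append(node)          # nodes in one task may run in parallel

for lvl in sorted(tasks):            # tasks execute in ascending level order
    print(f"task {lvl}: {tasks[lvl]}")
# task 1: ['A']  ...  task 5: ['I']
```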
在另一實施例中,任務控制處理器111在選擇硬體資源時,會根據運算圖資訊中的資料傳輸量資訊及連接數量資訊進行評估。具體來說,任務控制處理器111首先依據目標運算單元各自的資料傳輸量資訊,選擇對應的目標儲存單元。此選擇機制確保具有較大資料傳輸需求的運算單元能夠配置到足夠的儲存資源(或是讓其鄰近的儲存單元具有足夠的儲存空間)。In another embodiment, when selecting hardware resources, the task control processor 111 evaluates them based on the data transfer volume information and the connection quantity information in the computation graph information. Specifically, the task control processor 111 first selects the corresponding target storage units based on the data transfer volume information of each target computation unit. This selection mechanism ensures that computation units with large data transfer requirements are allocated sufficient storage resources (or that their adjacent storage units have sufficient storage space).
其次,任務控制處理器111根據目標任務節點之間的連接數量資訊,從多個空閒運算單元中選擇適當的目標運算單元。舉例而言,若某目標任務節點具有較多的連接數量,則任務控制處理器111會為其選擇具有較多切換器連接的運算單元,以確保資料傳輸的效率。因此,當某特定目標運算單元需要處理較多連接數量的任務節點時,該運算單元會被配置更多的目標切換器連接。Secondly, the task control processor 111 selects an appropriate target computing unit from multiple available computing units based on the number of connections between target task nodes. For example, if a target task node has a large number of connections, the task control processor 111 will select a computing unit with more switch connections to ensure efficient data transmission. Therefore, when a particular target computing unit needs to handle a large number of task nodes, that computing unit will be configured with more target switch connections.
在執行目標節點運算時,任務控制處理器111需要處理多種類型的資料。在本實施例中,這些資料包括:運算參數,用於配置目標運算單元的運算模式;節點輸入資料,包含來自前置節點的運算結果或原始輸入資料;節點運算結果,即目標運算單元執行運算任務後產生的輸出資料。任務控制處理器111可視情況決定將節點輸入資料直接傳輸給對應後續節點的運算單元,或是暫存在儲存單元以等待對應後續節點的運算單元被配置。When performing calculations on the target node, the task control processor 111 needs to process various types of data. In this embodiment, this data includes: calculation parameters used to configure the calculation mode of the target calculation unit; node input data, including calculation results or raw input data from the preceding node; and node calculation results, i.e., the output data generated by the target calculation unit after performing the calculation task. The task control processor 111 may decide, as needed, to directly transmit the node input data to the calculation unit of the corresponding subsequent node, or to temporarily store it in a storage unit to wait for the calculation unit of the corresponding subsequent node to be configured.
以圖8所示的運算圖為例,當任務控制處理器111處理任務節點E的運算任務時,會考慮以下因素:Taking the computation graph shown in Figure 8 as an example, when the task control processor 111 processes the computation task of task node E, it will consider the following factors:
(1) 由於任務節點E具有3個連接(來自節點B和C的輸入連接,以及到節點H的輸出連接),相較於任務節點D(對應較少連接數量:2)的運算單元的配置,任務控制處理器111會為任務節點E選擇具有較多切換器連接的目標任務運算單元。(1) Since task node E has 3 connections (input connections from nodes B and C, and output connections to node H), compared to the configuration of the computing unit of task node D (corresponding to fewer connections: 2), the task control processor 111 will select a target task computing unit with more switch connections for task node E.
(2) 根據對應節點E的節點運算所需要處理的輸入資料及相應的運算參數的大小,任務控制處理器111可配置具有足夠儲存容量的目標儲存單元來儲存對應的輸入資料。(2) Based on the input data that needs to be processed for the node operation of the corresponding node E and the size of the corresponding operation parameters, the task control processor 111 can be configured with a target storage unit with sufficient storage capacity to store the corresponding input data.
(3) 任務控制處理器111確保所選擇的目標切換器能夠建立最短且完整的資料傳輸路徑,以接收來自對應節點B和C的節點運算結果,進而經由配置給節點E的目標任務運算單元來執行對應節點E的節點運算。(3) The task control processor 111 ensures that the selected target switches can establish the shortest complete data transmission path to receive the node computation results from the corresponding nodes B and C, and then perform the node computation of the corresponding node E through the target task computation unit configured for node E.
(4) 若對應節點E的目標任務運算單元的節點運算結果需要被暫時儲存,任務控制處理器111根據資料傳輸量資訊,決定在對應節點E的目標任務運算單元鄰近的具有足夠儲存空間的儲存單元,以暫存此節點運算結果。(4) If the node calculation result of the target task calculation unit of the corresponding node E needs to be temporarily stored, the task control processor 111 decides to temporarily store the node calculation result in a storage unit with sufficient storage space near the target task calculation unit of the corresponding node E based on the data transmission volume information.
在一實施例中,任務控制處理器111採用運算能力評估機制來分配運算資源。首先,任務控制處理器111會根據每個目標任務節點的運算量資訊及每個空閒運算單元[工作狀態為「空閒」(如,沒有執行運算)的運算單元]的運算能力,評估現有空閒運算單元是否足以滿足運算需求。In one embodiment, the task control processor 111 employs a computing power assessment mechanism to allocate computing resources. First, the task control processor 111 assesses whether the existing idle computation units are sufficient to meet the computing requirements, based on the computational load information of each target task node and the computing power of each idle computation unit [a computation unit whose working status is "idle" (e.g., not performing any computation)].
當確認空閒運算單元數量充足時,任務控制處理器111會考慮三個因素來選擇並啟用目標運算單元:When the number of idle computing units is confirmed to be sufficient, the task control processor 111 will consider three factors to select and enable the target computing unit:
(1) 目標任務節點的運算量資訊,用於評估所需的運算資源規模;(2) 空閒運算單元的運算能力,確保選擇的運算單元能夠有效處理指定任務;(3) 資料傳輸路徑中的中繼組件數量,用於最小化資料傳輸延遲。(1) The computational load information of the target task node, used to assess the required computational resource scale; (2) The computational capacity of the idle computing units, to ensure that the selected computing units can effectively handle the specified tasks; (3) The number of relay components in the data transmission path, used to minimize data transmission delay.
例如,當處理圖8中節點E的運算任務時,任務控制處理器111會:評估節點E的運算量OP5;檢查可用的空閒運算單元的運算能力;計算從節點B和C到各個候選運算單元的傳輸路徑長度。For example, when processing the computation task of node E in Figure 8, the task control processor 111 will: evaluate the computational load OP5 of node E; check the computational capacity of available idle computation units; and calculate the transmission path length from nodes B and C to each candidate computation unit.
基於這些資訊,任務控制處理器111選擇最佳的目標運算單元組合,其具有最短的傳輸路徑,以實現運算效能的最佳化。以下更利用圖9來說明。Based on this information, the task control processor 111 selects the optimal combination of target computational units with the shortest transmission path to optimize computational performance. This is further illustrated below using Figure 9.
圖9為根據本公開的一實施例所繪示的根據傳輸路徑選擇目標任務運算單元的示意圖。Figure 9 is a schematic diagram illustrating the selection of target task computation unit based on transmission path according to an embodiment of the present disclosure.
參照圖9,在一實施例中,任務控制處理器111根據傳輸路徑資訊選擇目標運算單元。圖9上半部展示系統架構的陣列配置,其中包含運算單元(P)、儲存單元(M)及切換器(S)的空間分布,以及它們之間的連接關係。圖9下半部則以表格TB91呈現傳輸路徑的相關資訊。在這個例子中,假設第二運算單元P1是剛完成節點運算的前置運算單元,第二運算單元P2、P3是正在進行節點運算的忙碌運算單元。也就是說,對於前置運算單元P1來說,可選的空閒運算單元有P4、P5、P6、P7、P8、P9。Referring to Figure 9, in one embodiment, the task control processor 111 selects the target computation unit based on the transmission path information. The upper part of Figure 9 shows the array configuration of the system architecture, including the spatial distribution of computation units (P), storage units (M), and switches (S), as well as their connections. The lower part of Figure 9 presents the relevant information of the transmission path in a table TB91. In this example, it is assumed that the second computation unit P1 is the preceding computation unit that has just completed node computation, and the second computation units P2 and P3 are busy computation units that are currently performing node computation. That is, for the preceding computation unit P1, the available idle computation units are P4, P5, P6, P7, P8, and P9.
在本實施例中,系統針對前置運算單元P1與各個空閒運算單元P4、P5、P6、P7、P8、P9之間的可能傳輸路徑進行分析。這些傳輸路徑可分為兩類:In this embodiment, the system analyzes the possible transmission paths between the front-end operation unit P1 and each of the idle operation units P4, P5, P6, P7, P8, and P9. These transmission paths can be divided into two categories:
(1) 同平面路徑:如傳輸路徑TP14,其中P1至P4的資料傳輸完全在同一陣列平面內進行;以及(1) Coplanar path: such as transmission path TP14, where data transmission from P1 to P4 occurs entirely within the same array plane; and
(2) 非同平面路徑:如傳輸路徑TP15至TP19,資料傳輸需跨越不同陣列平面。(2) Non-plane paths: such as transmission paths TP15 to TP19, data transmission needs to cross different array planes.
任務控制處理器111會計算每條傳輸路徑的長度。傳輸路徑的長度等於該傳輸路徑上的總組件個數減一。舉例來說,因為傳輸路徑TP14(P1→S1→M5→S2→P4)經過5個組件,傳輸路徑TP14的長度為4(5-1=4)。The task control processor 111 calculates the length of each transmission path. The length of a transmission path equals the total number of components on that path minus one. For example, since transmission path TP14 (P1→S1→M5→S2→P4) passes through 5 components, the length of transmission path TP14 is 4 (5-1=4).
又例如,因為傳輸路徑TP15(P1→S1→M6→P5)經過4個組件,傳輸路徑TP15的長度為3(4-1=3)。For example, since the transmission path TP15 (P1→S1→M6→P5) passes through 4 components, the length of the transmission path TP15 is 3 (4-1=3).
系統會優先選擇具有較短傳輸路徑長度的目標運算單元。以表格TB91所示,除了TP18路徑長度為5外,其他非同平面路徑(TP15、TP16、TP17、TP19)的長度均為3,短於同平面路徑TP14的長度4。這表明在大多數情況下,非同平面的傳輸路徑能提供更高效的資料傳輸(通過的組件的總數量越少,傳輸的速度越快)。The system prioritizes target computation units with shorter transmission path lengths. As shown in Table TB91, except for TP18 (path length 5), the lengths of other non-plane paths (TP15, TP16, TP17, TP19) are all 3, shorter than the length of the plane path TP14 (4). This indicates that in most cases, non-plane transmission paths provide more efficient data transmission (the fewer components passed through, the faster the transmission speed).
當任務控制處理器111判定非同平面路徑TP15的組件總個數少於同平面路徑TP14時,會優先選擇對應非同平面路徑TP15的運算單元P5作為目標運算單元(因為傳輸路徑的長度較短)。也就是說,反應於判定最短的所述非同平面路徑所包含的組件總個數少於最短的所述同平面路徑所包含的組件總個數,任務控制處理器111選擇對應最短的所述非同平面路徑的特定空閒運算單元作為所述一或多個目標運算單元的其中之一,以配置給目標節點。此機制充分利用三維架構的特性,通過選擇較短的垂直傳輸路徑,有效降低資料傳輸延遲。When the task control processor 111 determines that the total number of components on the non-plane path TP15 is less than that on the plane path TP14, it will preferentially select the operation unit P5 corresponding to the non-plane path TP15 as the target operation unit (because the transmission path is shorter). That is, in response to the determination that the total number of components contained in the shortest non-plane path is less than the total number of components contained in the shortest plane path, the task control processor 111 selects a specific free operation unit corresponding to the shortest non-plane path as one of the one or more target operation units to be assigned to the target node. This mechanism fully utilizes the characteristics of the three-dimensional architecture, effectively reducing data transmission latency by selecting the shorter vertical transmission path.
應注意的是,當多個可能的傳輸路徑具有相同長度時,任務控制處理器111會進一步考慮其他因素,如空閒運算單元的運算能力、當前負載情況、被考量的空閒運算單元其周遭的其他空閒運算單元的數量、周遭儲存單元的儲存能力,以做出最終的選擇。這種基於傳輸路徑長度的選擇機制,有助於最小化資料傳輸延遲,提升系統整體效能。It should be noted that when multiple possible transmission paths have the same length, the task control processor 111 will further consider other factors, such as the computing power of the idle computing unit, the current load, the number of other idle computing units around the considered idle computing unit, and the storage capacity of the surrounding storage units, in order to make a final selection. This selection mechanism based on transmission path length helps to minimize data transmission latency and improve the overall system performance.
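A compact sketch of the Table TB91 comparison: each candidate path is a component sequence, its length is the component count minus one, and the shortest candidate wins, with ties deferred to the secondary criteria named above. Only the two fully spelled-out paths (TP14, TP15) are transcribed; the idle-neighbor tie-break key is an assumption for illustration.

```python
# Candidate transmission paths from Figure 9, as component sequences.
candidates = {
    "P4": ["P1", "S1", "M5", "S2", "P4"],   # TP14, same-plane path
    "P5": ["P1", "S1", "M6", "P5"],         # TP15, cross-plane path
}
idle_neighbors = {"P4": 1, "P5": 2}  # illustrative secondary criterion

def path_length(path):
    return len(path) - 1  # total components on the path minus one

# Shortest path first; more idle neighbors breaks ties (assumed policy).
best = min(candidates,
           key=lambda t: (path_length(candidates[t]), -idle_neighbors[t]))
print({t: path_length(p) for t, p in candidates.items()})  # {'P4': 4, 'P5': 3}
print(best)  # 'P5' is selected as the target computation unit
```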
圖10A至圖10C為根據本公開的一實施例所繪示的根據運算圖資訊來配置運算單元以執行對應的神經網路運算任務的示意圖。Figures 10A to 10C are schematic diagrams illustrating, according to an embodiment of the present disclosure, configuring computation units based on computation graph information to perform corresponding neural network computation tasks.
參照圖10A,在一實施例中,任務控制處理器111根據運算圖資訊TB101配置運算資源,以執行神經網路運算任務。具體來說,對於相依性層級1的節點A,系統進行以下配置過程:Referring to Figure 10A, in one embodiment, the task control processor 111 configures computing resources according to the computation graph information TB101 to perform neural network computation tasks. Specifically, for node A at dependency level 1, the system performs the following configuration process:
首先,任務控制處理器111分析節點A的運算特徵:(1) 運算量為200,高於標準運算單元(P1-P8)的運算能力100。(2) 連接數量為2,表示需要將運算結果傳送至兩個後續節點。(3) 母節點為Null,表示可立即開始運算。First, the task control processor 111 analyzes the computational characteristics of node A: (1) The computational load is 200, which is higher than the computational capacity of the standard computational unit (P1-P8) of 100. (2) The number of connections is 2, indicating that the computational results need to be sent to two subsequent nodes. (3) The parent node is null, indicating that the computation can start immediately.
基於上述分析,如箭頭A101所示,任務控制處理器111在第一陣列121a中選擇兩個運算單元P1和P2,以配置給節點A來執行對應的節點運算。Based on the above analysis, as shown by arrow A101, the task control processor 111 selects two operation units P1 and P2 in the first array 121a to be assigned to node A to perform the corresponding node operation.
當配置完成後,運算單元P1和P2進入忙碌狀態(以網格圖樣表示),共同執行節點A的節點運算任務。任務控制處理器111確保這兩個運算單元能夠協同工作,共同處理節點A的節點運算需求,並將運算結果準備傳送給配置給後續節點B和C的其他運算單元。Once the configuration is complete, computation units P1 and P2 enter a busy state (represented by a grid pattern) and jointly execute the node computation tasks of node A. The task control processor 111 ensures that these two computation units can work together to process the node computation requirements of node A and prepares to send the computation results to other computation units configured for subsequent nodes B and C.
參照圖10B,接續圖10A的第一陣列121a。接著,在第一陣列121b中,運算單元P1及P2完成了節點運算,成為前置運算單元(以網點圖樣表示)。根據運算圖資訊TB101,任務控制處理器111判定後續的相依性層級2的節點為B、C。任務控制處理器111選擇運算單元P9來執行節點B的節點運算任務,選擇運算單元P3來執行節點C的節點運算任務(以粗框表示被選擇的目標任務運算單元)。這兩個運算單元和前置運算單元P1、P2之間的傳輸路徑最短。Referring to Figure 10B, continuing from the first array 121a of Figure 10A. Next, in the first array 121b, computation units P1 and P2 complete their node computations and become the preceding computation units (represented by a dotted pattern). Based on the computation graph information TB101, the task control processor 111 determines that the subsequent nodes of dependency level 2 are B and C. The task control processor 111 selects computation unit P9 to execute the node computation task of node B, and computation unit P3 to execute the node computation task of node C (the selected target task computation units are indicated by thick frames). These two computation units have the shortest transmission paths to the preceding computation units P1 and P2.
接著,如箭頭A102所示,系統進入第一陣列121c的狀態。在此狀態下,由於對應節點A的節點運算已完成,節點運算結果從運算單元P1、P2被整合且傳輸到運算單元P3、P9。運算單元P1和P2重置為空閒狀態(變回空心方塊)。運算單元P9及P3則轉為忙碌狀態(以網格圖樣表示),基於接收到的對應節點A的節點運算結果來分別執行對應節點B及節點C的節點運算任務。Next, as indicated by arrow A102, the system enters the state of the first array 121c. In this state, since the node operation for node A has been completed, the node operation result is integrated from operation units P1 and P2 and transmitted to operation units P3 and P9. Operation units P1 and P2 are reset to an idle state (becoming hollow blocks again). Operation units P9 and P3 then enter a busy state (represented by a grid pattern), and based on the received node operation result for node A, they execute the node operation tasks for nodes B and C respectively.
在一實施例中,當任務控制處理器111需要配置目標神經網路運算任務的資料傳輸路徑時,會執行系統化的資源配置流程。具體來說,任務控制處理器111首先監控多個第一陣列121及多個第二陣列122的當前工作狀態,並從中識別尚未被啟用的新的空閒運算單元。In one embodiment, when the task control processor 111 needs to configure the data transmission path for the target neural network computing task, it executes a systematic resource configuration process. Specifically, the task control processor 111 first monitors the current operating status of multiple first arrays 121 and multiple second arrays 122, and identifies new idle computing units that have not yet been activated.
當完成空閒運算單元的識別後,任務控制處理器111根據相依性層級資訊,確定目標神經網路運算任務的下一個運算任務。例如,參照圖10B的實施例,當系統處於第一陣列121c的狀態時,任務控制處理器111會將相依性層級3的節點D、E、F識別為新的目標神經網路運算任務。接著,任務控制處理器111評估新識別出的空閒運算單元及新的目標神經網路運算任務的特性,選擇適當的新目標運算單元。在上述例子中,任務控制處理器111選擇運算單元P7、P6、P1及P2作為新的目標運算單元,以執行節點D、E、F的節點運算任務。根據目前執行運算的目標運算單元(P9、P3)及新選擇的目標運算單元(P7、P6、P1、P2)的位置關係,任務控制處理器111規劃目標資料傳輸路徑。最後,任務控制處理器111設定對應該傳輸路徑上的目標切換器的路由任務,確保節點運算結果能夠正確地從當前的目標運算單元傳輸至新的目標運算單元,進而支援後續運算任務的執行。After identifying the idle computation units, the task control processor 111 determines the next computation task of the target neural network computation task based on the dependency level information. For example, referring to the embodiment in Figure 10B, when the system is in the first array 121c state, the task control processor 111 identifies nodes D, E, and F of dependency level 3 as new target neural network computation tasks. Then, the task control processor 111 evaluates the characteristics of the newly identified idle computation units and the new target neural network computation tasks, and selects an appropriate new target computation unit. In the example above, the task control processor 111 selects computation units P7, P6, P1, and P2 as new target computation units to execute the node computation tasks of nodes D, E, and F. Based on the positional relationship between the currently executing target computation units (P9, P3) and the newly selected target computation units (P7, P6, P1, P2), the task control processor 111 plans the target data transmission path. Finally, the task control processor 111 sets the routing task for the target switch corresponding to this transmission path, ensuring that the node computation results can be correctly transmitted from the current target computation unit to the new target computation unit, thereby supporting the execution of subsequent computation tasks.
如箭頭A103所示,系統進入第一陣列121d的狀態。此時,運算單元P9(執行節點B)及運算單元P3(執行節點C)的節點運算已完成(以網點圖樣表示),成為前置運算單元。根據運算圖資訊TB101,任務控制處理器111判定後續的相依性層級3的節點為D、E、F。基於運算能力和運算量,任務控制處理器111選擇運算單元P7執行節點D的節點運算任務,選擇運算單元P6執行節點F的節點運算任務,並選擇運算單元P1和P2執行節點E的節點運算任務。As indicated by arrow A103, the system enters the state of the first array 121d. At this time, the node operations of operation unit P9 (execution node B) and operation unit P3 (execution node C) have been completed (represented by a dotted pattern) and become the preceding operation units. According to the operation graph information TB101, the task control processor 111 determines that the nodes of the subsequent dependency level 3 are D, E, and F. Based on the computing power and computational load, the task control processor 111 selects operation unit P7 to execute the node operation task of node D, selects operation unit P6 to execute the node operation task of node F, and selects operation units P1 and P2 to execute the node operation task of node E.
接著,如箭頭A104所示,系統進入第一陣列121e的狀態。由於對應節點B、C的節點運算已完成,任務控制處理器111獲取對應的節點運算結果。由於任務控制處理器111判定節點D、E、F的節點運算不會使用到節點C的節點運算結果,且該運算結果會用在後續的節點G,因此任務控制處理器111將運算結果C暫存到儲存單元M2。此外,對應節點B的節點運算結果會被傳輸到運算單元P1、P2、P6、P7。在此狀態下,運算單元P1、P2、P6、P7轉為忙碌狀態(以網格圖樣表示),分別執行對應的節點運算任務。具體來說,運算單元P7執行對應節點D的節點運算、運算單元P6執行對應節點F的節點運算,而運算單元P1和P2則共同執行對應節點E的節點運算。接著,運算單元P9及P3也會被重置為空閒運算單元。Next, as indicated by arrow A104, the system enters the state of the first array 121e. Since the node computations for nodes B and C have been completed, the task control processor 111 obtains the corresponding node computation results. Because the task control processor 111 determines that the node computations of nodes D, E, and F will not use the node computation result of node C, and that this result will be used by the subsequent node G, the task control processor 111 temporarily stores the computation result C in storage unit M2. Furthermore, the node computation result of node B is transmitted to computation units P1, P2, P6, and P7. In this state, computation units P1, P2, P6, and P7 become busy (represented by a grid pattern) and execute their respective node computation tasks. Specifically, computation unit P7 executes the node computation for node D, computation unit P6 executes the node computation for node F, and computation units P1 and P2 jointly execute the node computation for node E. Then, computation units P9 and P3 are reset to idle computation units.
參照圖10C,接續圖10B的例子。在第一陣列121f中,運算單元P1、P2(執行節點E的節點運算)、運算單元P7(執行節點D的節點運算)及運算單元P6(執行節點F的節點運算)完成了對應的節點運算(以網點圖樣表示),成為了前置運算單元。根據運算圖資訊TB101,任務控制處理器111判定後續的相依性層級4的節點為G、H,且判定節點G需要節點C和D的節點運算結果,而節點H需要節點E的節點運算結果。此外,節點F的節點運算結果也會被用在後續的節點I。因此,任務控制處理器111將對應節點F的節點運算結果F暫存到儲存單元M4,因為該結果將用於最後的對應節點I的節點運算。接著,任務控制處理器111選擇運算單元P9執行節點G的節點運算任務,選擇運算單元P3執行節點H的節點運算任務(以粗框表示)。Referring to Figure 10C, continuing the example from Figure 10B. In the first array 121f, operation units P1, P2 (performing node operations for node E), operation unit P7 (performing node operations for node D), and operation unit P6 (performing node operations for node F) have completed their corresponding node operations (represented by a dotted pattern) and become the preceding operation units. Based on the operation graph information TB101, the task control processor 111 determines that the subsequent nodes of dependency level 4 are G and H, and determines that node G requires the node operation results of nodes C and D, while node H requires the node operation result of node E. In addition, the node operation result of node F will also be used for the subsequent node I. Therefore, the task control processor 111 temporarily stores the node operation result F corresponding to node F in storage unit M4, because this result will be used for the final node operation of the corresponding node I. Next, the task control processor 111 selects operation unit P9 to execute the node operation task of node G, and selects operation unit P3 to execute the node operation task of node H (indicated by bold boxes).
如箭頭A105所示,系統進入第一陣列121g的狀態。在此狀態下,節點D和C的節點運算結果被傳輸到運算單元P9,節點E的節點運算結果被傳輸到運算單元P3。運算單元P9和P3轉為忙碌狀態(以網格圖樣表示),分別執行節點G和H的節點運算任務。同時,由於運算單元P1、P2、P6、P7已完成資料傳輸,這些運算單元被重置為空閒運算單元(以空心方塊表示)。As indicated by arrow A105, the system enters state 121g of the first array. In this state, the node operation results of nodes D and C are transmitted to operation unit P9, and the node operation result of node E is transmitted to operation unit P3. Operation units P9 and P3 enter a busy state (represented by a grid pattern) and execute the node operation tasks of nodes G and H respectively. At the same time, since operation units P1, P2, P6, and P7 have completed data transmission, these operation units are reset to idle operation units (represented by hollow blocks).
接著,接續前述狀態。如箭頭A106所示,系統進入第一陣列121h的狀態。在此狀態下,運算單元P9(執行節點G的節點運算)及運算單元P3(執行節點H的節點運算)完成了對應的節點運算(以網點圖樣表示),成為前置運算單元。根據運算圖資訊TB101,任務控制處理器111判定後續的相依性層級5的節點為I,且判定該節點需要節點F、G、H的節點運算結果,其運算量為300。因此,任務控制處理器111選擇運算單元P4、P5、P6共三個運算單元來執行節點I的節點運算任務(以粗框表示)。Next, continuing from the aforementioned state, as indicated by arrow A106, the system enters the state of the first array 121h. In this state, computation unit P9 (performing node operations for node G) and computation unit P3 (performing node operations for node H) complete their corresponding node operations (represented by a dotted pattern) and become the preceding computation units. Based on the computation graph information TB101, the task control processor 111 determines that the node of the subsequent dependency level 5 is I, and determines that this node requires the node operation results of nodes F, G, and H, with a computational load of 300. Therefore, the task control processor 111 selects three computation units, P4, P5, and P6, to perform the node operation task of node I (represented by a thick box).
接著,如箭頭A107所示,系統進入第一陣列121i的狀態。在此狀態下,節點F的節點運算結果從儲存單元M4、節點G的節點運算結果從運算單元P9、節點H的節點運算結果從運算單元P3被整合且傳輸到運算單元P4、P5、P6。這三個運算單元轉為忙碌狀態(以網格圖樣表示),共同執行節點I的節點運算任務。由於每個運算單元具有100的節點運算能力,三個運算單元的組合足以滿足節點I所需的300運算量。此外,完成資料傳輸後,前置運算單元P9和P3被重置為空閒運算單元(以空心方塊表示)。Next, as indicated by arrow A107, the system enters the state of the first array 121i. In this state, the node operation results of node F are integrated from storage unit M4, the node operation results of node G from operation unit P9, and the node operation results of node H from operation unit P3 and transmitted to operation units P4, P5, and P6. These three operation units enter a busy state (represented by a grid pattern) and jointly execute the node operation tasks of node I. Since each operation unit has a node operation capacity of 100, the combination of the three operation units is sufficient to meet the 300 operation requirements of node I. In addition, after the data transmission is completed, the preceding operation units P9 and P3 are reset to idle operation units (represented by hollow blocks).
至此,系統完成了運算圖資訊TB101中所有節點的節點運算任務。在整個過程中,任務控制處理器111根據相依性層級資訊,依序配置適當的節點運算資源,並妥善管理節點運算結果的傳輸與暫存,確保神經網路運算任務的正確執行。At this point, the system has completed the node computation tasks for all nodes in the computation graph information TB101. Throughout the process, the task control processor 111 configures appropriate node computation resources sequentially according to the dependency hierarchy information, and properly manages the transmission and temporary storage of node computation results to ensure the correct execution of the neural network computation tasks.
應當理解的是,上述實施例中,為便於闡述本公開所提供的控制方法,僅示例性地使用同一陣列平面的第一陣列121來實現運算圖資訊的多個神經網路運算任務。然而,此示例並不構成對本公開範圍的限制。在其他實施例中,任務控制處理器111可如同圖9所示,根據實際運算需求及資源狀態,選擇不同陣列平面的運算單元來執行節點運算任務,從而實現更靈活的資源配置。It should be understood that in the above embodiments, for the purpose of illustrating the control method provided by this disclosure, the first array 121 of the same array plane is used only as an example to implement multiple neural network computation tasks of computation graph information. However, this example does not constitute a limitation on the scope of this disclosure. In other embodiments, the task control processor 111 may, as shown in FIG9, select computation units of different array planes to perform node computation tasks according to actual computation needs and resource status, thereby achieving more flexible resource configuration.
在進一步描述本公開的實施例之前,先說明當系統面臨運算資源不足的情況下的處理機制。具體來說,當任務控制處理器111判定空閒運算單元的數量不足以同時處理所有目標任務節點時,會採用序列化的資源配置策略。Before further describing the embodiments of this disclosure, let's explain the handling mechanism when the system faces insufficient computing resources. Specifically, when the task control processor 111 determines that the number of idle computing units is insufficient to process all target task nodes simultaneously, it will adopt a serialized resource allocation strategy.
參照圖11A至圖11D所示的實施例,將展示一種情境:系統需要執行具有200運算量的節點運算,但每個運算單元僅具有100的運算能力,且系統中可用的運算單元數量有限(目前只有一個第二陣列的運算單元可供使用)。在此情況下,任務控制處理器111需要:(1) 優先分配有限的運算資源給部分目標任務節點;(2) 持續監控運算單元的工作狀態;(3) 為完成運算的結果選擇最佳的暫存位置;(4) 動態調配釋放的運算資源。Referring to the embodiments shown in Figures 11A to 11D, a scenario will be illustrated: the system needs to perform a node computation with a computational load of 200, but each computation unit only has a computational capacity of 100, and the number of computation units available in the system is limited (currently only the computation units of a single second array are available). In this case, the task control processor 111 needs to: (1) prioritize allocating the limited computing resources to some of the target task nodes; (2) continuously monitor the working status of the computation units; (3) select the best temporary storage location for the results of completed computations; and (4) dynamically redeploy the released computing resources.
這種資源管理機制特別適用於處理大規模神經網路運算時,系統資源緊張的情況。以下將依序說明在此情境下,任務控制處理器111如何透過動態資源配置來確保所有運算任務的完成。This resource management mechanism is particularly suitable for situations where system resources are strained when handling large-scale neural network operations. The following sections will explain how the task control processor 111 ensures the completion of all computing tasks in this scenario through dynamic resource allocation.
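Before walking through the figures, the strategy can be compressed into a round-based sketch: in each round, start every node whose parents are finished and for which enough idle units remain, then free the units and stash results between rounds. This is a software caricature under the same assumed 100-capacity units, not the controller's literal circuit behavior.

```python
import math

CAPACITY = 100  # assumed uniform per-unit computing power

def serialized_schedule(nodes, total_units):
    """Round-based stand-in for constrained scheduling: each round starts
    every ready node that still fits into the remaining idle units."""
    done, rounds = set(), []
    pending = dict(nodes)
    while pending:
        free, batch = total_units, []
        for name, info in list(pending.items()):
            need = math.ceil(info["op"] / CAPACITY)
            if set(info["parents"]) <= done and need <= free:
                free -= need
                batch.append((name, need))
                del pending[name]
        if not batch:
            raise RuntimeError("a node exceeds the total available capacity")
        rounds.append(batch)
        done.update(name for name, _ in batch)  # results stashed, units freed
    return rounds

nodes = {"A": {"op": 200, "parents": []},
         "B": {"op": 200, "parents": ["A"]},
         "C": {"op": 100, "parents": ["A"]},
         "D": {"op": 100, "parents": ["B"]}}
print(serialized_schedule(nodes, total_units=4))
# [[('A', 2)], [('B', 2), ('C', 1)], [('D', 1)]]
```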
圖11A至圖11D為根據本公開的另一實施例所繪示的根據運算圖資訊來配置運算單元以執行對應的神經網路運算任務的示意圖。Figures 11A to 11D are schematic diagrams illustrating, according to another embodiment of the present disclosure, the configuration of computation units based on computation graph information to perform corresponding neural network computation tasks.
參照圖11A,在本實施例中,任務控制處理器111面臨運算資源受限的情境。根據運算圖資訊TB111,相依性層級1的節點A需要200的運算量,然而第二陣列122a中的每個運算單元(P1-P4)僅具有100的運算能力,使得單一運算單元無法獨立完成節點A的運算任務。Referring to Figure 11A, in this embodiment, the task control processor 111 faces a situation of limited computing resources. According to the computation graph information TB111, node A at dependency level 1 requires 200 computations, but each computation unit (P1-P4) in the second array 122a only has 100 computations, making it impossible for a single computation unit to independently complete the computation task of node A.
因應此運算資源限制,任務控制處理器111首先對第二陣列122a中的可用資源進行評估。經評估後,任務控制處理器111注意到運算單元P1和P2不僅可透過切換器S5建立直接的資料傳輸通道,其組合運算能力(200)亦正好符合節點A的運算需求。基於此評估結果,任務控制處理器111選擇運算單元P1和P2來共同執行節點A的運算任務。In response to this limitation of computing resources, the task control processor 111 first evaluates the available resources in the second array 122a. After evaluation, the task control processor 111 notices that computing units P1 and P2 can not only establish a direct data transmission channel through the switch S5, but their combined computing power (200) also perfectly meets the computing needs of node A. Based on this evaluation result, the task control processor 111 selects computing units P1 and P2 to jointly execute the computing tasks of node A.
如箭頭A111所示,在完成資源配置後,運算單元P1和P2轉入忙碌狀態(以網格圖樣表示)開始執行運算,而運算單元P3和P4則維持空閒狀態(以空心方塊表示)以備後續使用。同時,儲存單元M1至M8亦保持空閒狀態,準備接收後續的節點運算結果。此種資源配置方式展現了系統在運算能力受限的情況下,如何透過協同多個運算單元的方式來完成高運算量的任務需求。As shown by arrow A111, after resource configuration is complete, computation units P1 and P2 enter a busy state (represented by a grid pattern) and begin execution, while computation units P3 and P4 remain idle (represented by hollow squares) for later use. Simultaneously, storage units M1 to M8 also remain idle, ready to receive the results of subsequent node computations. This resource configuration demonstrates how the system, under limited computing power, can complete high-computational-load tasks by coordinating multiple computation units.
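The choice of P1 and P2 above can be read as a two-part test: enough combined capacity, plus a directly shared switch for exchanging partial results. A minimal sketch of that test follows; only the P1-S5-P2 link is stated in the text, so the other switch assignments are placeholders.

```python
from itertools import combinations
import math

CAPACITY = 100
# Switch attachment per idle unit; only P1/P2 sharing S5 is from the text.
switch_of = {"P1": "S5", "P2": "S5", "P3": "S6", "P4": "S7"}

def pick_unit_group(idle, op_amount):
    """Smallest group with enough combined capacity whose members all
    hang off one switch (assumed selection policy for illustration)."""
    need = math.ceil(op_amount / CAPACITY)
    for group in combinations(idle, need):
        if len({switch_of[u] for u in group}) == 1:
            return list(group)
    return None  # would fall back to multi-switch groups (not shown)

print(pick_unit_group(["P1", "P2", "P3", "P4"], 200))  # ['P1', 'P2']
```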
參照圖11B,接續前述情境。在第二陣列122b中,承接第二陣列122a的運算配置,運算單元P1和P2完成了對應節點A的節點運算(以網點圖樣表示)。根據運算圖資訊TB111,任務控制處理器111判定相依性層級2包含節點B和C,其中節點B需要200運算量,節點C需要100運算量。然而,由於系統中每個運算單元僅具有100的運算能力,任務控制處理器111需要為節點B配置多個運算單元。Referring to Figure 11B, continuing the aforementioned scenario, in the second array 122b, following the computational configuration of the second array 122a, computational units P1 and P2 complete the node computation (represented by a dotted pattern) for the corresponding node A. Based on the computation graph information TB111, the task control processor 111 determines that dependency level 2 includes nodes B and C, where node B requires 200 computational units and node C requires 100 computational units. However, since each computational unit in the system only has a computational capacity of 100, the task control processor 111 needs to configure multiple computational units for node B.
基於此運算需求,任務控制處理器111選擇運算單元P3和P4來執行節點B的節點運算任務,並且選擇儲存單元M8儲存節點A的節點運算結果(因為尚未執行的節點C的節點運算需要節點A的節點運算結果)。在此配置下,運算單元P3和P4轉為忙碌狀態(以網格圖樣表示),共同執行節點B的節點運算任務。Based on this computational requirement, the task control processor 111 selects computation units P3 and P4 to execute the node computation task of node B, and selects storage unit M8 to store the node computation result of node A (because the node computation of node C, which has not yet been executed, requires the node computation result of node A). Under this configuration, computation units P3 and P4 become busy (represented by a grid pattern) and jointly execute the node computation task of node B.
如箭頭A112所示,系統進入第二陣列122c的狀態。在此狀態下,運算單元P3和P4持續執行節點B的節點運算,而運算單元P1則被選擇執行運算量為100的節點C的節點運算任務。此時,儲存單元M8所儲存的節點A的節點運算結果會被傳輸到運算單元P1,而其他未被分配的運算單元及儲存單元則維持空閒狀態,以備後續運算需求。As indicated by arrow A112, the system enters the second array 122c state. In this state, computation units P3 and P4 continue to execute node computations for node B, while computation unit P1 is selected to execute node computation tasks for node C with a computational load of 100. At this time, the node computation results for node A stored in storage unit M8 are transmitted to computation unit P1, while other unassigned computation units and storage units remain idle, ready for subsequent computation needs.
接著,如箭頭A113所示,系統進入第二陣列122d的狀態。根據運算圖資訊TB111,節點C的母節點為A,且需要100運算量。因此,任務控制處理器111將儲存單元M8中的節點A運算結果傳輸至運算單元P1,以讓運算單元P1進入忙碌狀態(以網格圖樣表示)以執行對應節點C的節點運算。Next, as indicated by arrow A113, the system enters the state of the second array 122d. According to the computation graph information TB111, the parent node of node C is A, and it requires 100 computations. Therefore, the task control processor 111 transmits the computation result of node A in the storage unit M8 to the computation unit P1, so that the computation unit P1 enters a busy state (represented by a grid pattern) to execute the node computation of the corresponding node C.
由於還有一個空閒運算單元,根據運算圖資訊TB111,任務控制處理器111可判定下一個要處理的節點為D。由於節點D的母節點為B且需要100運算量,任務控制處理器111選擇運算單元P2來執行對應節點D的節點運算任務。運算單元P3、P4則持續處於忙碌狀態以完成對應節點B的節點運算。Since there is one idle computation unit, based on the computation graph information TB111, the task control processor 111 can determine that the next node to be processed is D. Since the parent node of node D is B and requires 100 computations, the task control processor 111 selects computation unit P2 to execute the node computation task corresponding to node D. Computation units P3 and P4 remain busy to complete the node computation corresponding to node B.
如箭頭A114所示,系統進入第二陣列122e的狀態。此時,運算單元P3、P4完成對應節點B的節點運算(以網點圖樣表示),成為了前置運算單元。根據運算圖資訊TB111,由於節點B的節點運算結果將用於節點D、E和F的節點運算且目前沒有空閒運算單元可被配置,任務控制處理器111先將其暫存於儲存單元M7(以網格圖樣表示)。同時,由於節點C的節點運算結果需要用於後續的節點G的節點運算,任務控制處理器111也將其暫存於儲存單元M8(以網格圖樣表示)。此外,運算單元P1此時也已完成對應節點C的節點運算(以網點圖樣表示),成為前置運算單元。至此,對應相依性層級2的神經網路運算任務會被判定已經完成。任務控制處理器111會選擇下一個神經網路運算任務來處理。根據運算圖資訊TB111,任務控制處理器111判定相依性層級3包含節點D、E、F,其中節點E的運算量為200且需要節點B、C的節點運算結果,節點D的運算量為100且需要節點B的節點運算結果,節點F的運算量為100且需要節點B的節點運算結果。As indicated by arrow A114, the system enters the second array 122e state. At this time, computation units P3 and P4 complete the node computation for the corresponding node B (represented by a grid pattern) and become the preceding computation units. According to the computation graph information TB111, since the node computation result of node B will be used for the node computation of nodes D, E, and F, and there are currently no free computation units available for configuration, the task control processor 111 first temporarily stores it in storage unit M7 (represented by a grid pattern). At the same time, since the node computation result of node C needs to be used for the subsequent node computation of node G, the task control processor 111 also temporarily stores it in storage unit M8 (represented by a grid pattern). Furthermore, computation unit P1 has now completed the node computation (represented by a dot pattern) for node C, becoming the preceding computation unit. At this point, the neural network computation task corresponding to dependency level 2 is considered complete. The task control processor 111 selects the next neural network computation task to process. Based on the computation graph information TB111, the task control processor 111 determines that dependency level 3 includes nodes D, E, and F. Node E has a computational load of 200 and requires the node computation results of nodes B and C; node D has a computational load of 100 and requires the node computation result of node B; and node F has a computational load of 100 and requires the node computation result of node B.
運算單元P2則正在執行對應節點D的節點運算(以網格圖樣表示),進入了忙碌狀態。Meanwhile, computation unit P2 is executing the node computation for node D (represented by a grid pattern) and has entered a busy state.
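For readers following the walkthrough, the dependency-level grouping used above can be reconstructed from the parent relations of computation graph TB111. The following Python sketch is illustrative only and not part of the disclosed embodiment; the PARENTS mapping transcribes the relations recited in this example, and the levels of nodes A and B are seeded by hand because their own parents are not listed in this excerpt.

```python
from collections import defaultdict

# Parent relations as recited in this walkthrough for computation graph
# TB111. The parents of nodes A and B are not listed in this excerpt,
# so their levels are seeded directly from the figure description.
PARENTS = {"C": ["A"], "D": ["B"], "E": ["B", "C"], "F": ["B"],
           "G": ["C", "D"], "H": ["E"], "I": ["F", "G", "H"]}
levels = {"A": 1, "B": 2}

def level_of(node):
    """A node's dependency level is one more than its deepest parent's."""
    if node not in levels:
        levels[node] = 1 + max(level_of(p) for p in PARENTS[node])
    return levels[node]

by_level = defaultdict(list)
for node in ["A", "B"] + list(PARENTS):
    by_level[level_of(node)].append(node)
print(dict(sorted(by_level.items())))
# {1: ['A'], 2: ['B', 'C'], 3: ['D', 'E', 'F'], 4: ['G', 'H'], 5: ['I']}
```

The printed grouping matches the walkthrough: level 2 holds B and C, level 3 holds D, E, and F, level 4 holds G and H, and level 5 holds I.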
參照圖11C,接續圖11B的例子。在第二陣列122f中,由於此時儲存單元M7暫存了節點B的節點運算結果,且儲存單元M8暫存了節點C的節點運算結果,任務控制處理器111可配置運算單元P3、P4來執行對應節點E的節點運算任務,配置運算單元P1來執行對應節點F的節點運算任務。運算單元P2繼續執行對應節點D的節點運算任務。Referring to Figure 11C, continuing the example from Figure 11B: in the second array 122f, since storage unit M7 holds the node computation result of node B and storage unit M8 holds the node computation result of node C, the task control processor 111 can configure computation units P3 and P4 to execute the node computation task for node E, and configure computation unit P1 to execute the node computation task for node F. Computation unit P2 continues to execute the node computation task for node D.
如箭頭A115所示,系統進入第二陣列122g的狀態。在此狀態下,節點B的節點運算結果及節點C的節點運算結果分別從儲存單元M7、M8傳輸至對應的運算單元。在節點B的節點運算結果傳輸到配置給節點F的運算單元P1以及配置給節點E的運算單元P3、P4之後,由於後續的其他節點也不需要節點B的節點運算結果,任務控制處理器111就可將節點B的節點運算結果從儲存單元M7中刪除。另一方面,因為後續的節點G需要節點C的節點運算結果,因此儲存單元M8中的節點C的節點運算結果會被保留而不被刪除。As indicated by arrow A115, the system enters the state of the second array 122g. In this state, the node computation results of nodes B and C are transmitted from storage units M7 and M8 to their respective computation units. Once the node computation result of node B has been transmitted to computation unit P1 (assigned to node F) and computation units P3 and P4 (assigned to node E), and since no subsequent node requires it, the task control processor 111 can delete the node computation result of node B from storage unit M7. On the other hand, because the subsequent node G requires the node computation result of node C, the node computation result of node C in storage unit M8 is retained rather than deleted.
運算單元P3、P4進入忙碌狀態(以網格圖樣表示)執行對應節點E的節點運算,運算單元P1進入忙碌狀態執行對應節點F的節點運算,同時運算單元P2則完成了對應節點D的節點運算(以網點圖樣表示),成為前置運算單元。由於節點D的節點運算結果需要用於後續的節點運算且目前沒有空閒運算單元,任務控制處理器111將其暫存於儲存單元M3(以網格圖樣表示)。Computation units P3 and P4 enter a busy state (represented by a grid pattern) to perform the node computation for node E, and computation unit P1 enters a busy state to perform the node computation for node F. Meanwhile, computation unit P2 completes the node computation for node D (represented by a dot pattern) and becomes a preceding computation unit. Since the node computation result of node D is needed for subsequent node computations and no computation unit is currently idle, the task control processor 111 temporarily stores it in storage unit M3 (represented by a grid pattern).
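The stash-and-release behavior of storage units M7, M8, and M3 in these states amounts to tracking, per stored result, which consumers have not yet been served. Below is a minimal sketch of that bookkeeping; the names (ResultStore, stash, transfer) are invented for illustration and are not recited in the disclosure.

```python
class ResultStore:
    """Sketch of per-result consumer tracking for stashed node results."""

    def __init__(self):
        self._slots = {}  # node -> (storage unit, consumers not yet served)

    def stash(self, node, storage_unit, consumers):
        """Park a finished node's result in a storage unit, remembering
        which downstream nodes still have to read it."""
        self._slots[node] = (storage_unit, set(consumers))

    def transfer(self, node, consumer):
        """Deliver the stored result to one consumer; free the slot once
        the last consumer has been served (as with node B and M7)."""
        storage_unit, pending = self._slots[node]
        pending.discard(consumer)
        if not pending:
            del self._slots[node]  # the storage unit becomes reusable
        return storage_unit

store = ResultStore()
store.stash("B", "M7", consumers=["D", "E", "F"])
store.stash("C", "M8", consumers=["E", "G"])
for consumer in ["D", "E", "F"]:
    store.transfer("B", consumer)  # after F is served, M7 is freed
store.transfer("C", "E")           # C stays in M8: node G is still pending
print(store._slots)                # {'C': ('M8', {'G'})}
```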
接著,如箭頭A116所示,系統進入第二陣列122h的狀態。此時,運算單元P3、P4完成了對應節點E的節點運算(以網點圖樣表示),而運算單元P1也完成了對應節點F的節點運算(以網點圖樣表示),這些運算單元都成為了前置運算單元,並且準備被重置為空閒運算單元。至此,相依性層級3完成,並且開始處理相依性層級4的神經網路運算任務(節點G、H)。Next, as indicated by arrow A116, the system enters the state of the second array 122h. At this time, computation units P3 and P4 have completed the node computation for node E (represented by a dot pattern), and computation unit P1 has also completed the node computation for node F (represented by a dot pattern); these computation units have become preceding computation units and are ready to be reset to idle. At this point, dependency level 3 is complete, and processing of the dependency level 4 neural network computation tasks (nodes G and H) begins.
因為後續的節點G需要節點D的節點運算結果,節點D的節點運算結果會被儲存到儲存單元M3。另一方面,因為後續的節點I需要節點F的節點運算結果,節點F的節點運算結果會被儲存到儲存單元M8。Because the subsequent node G needs the node computation result of node D, the node computation result of node D is stored in storage unit M3. On the other hand, because the subsequent node I needs the node computation result of node F, the node computation result of node F is stored in storage unit M8.
隨後,如箭頭A117所示,系統進入第二陣列122i的狀態。根據運算圖資訊TB111,任務控制處理器111判定相依性層級4的節點還包含尚未執行的節點G。因此,任務控制處理器111選擇運算單元P3、P4來執行對應節點G的節點運算任務。運算單元P2繼續執行對應節點H的節點運算任務。Subsequently, as indicated by arrow A117, the system enters the state of the second array 122i. Based on the computation graph information TB111, the task control processor 111 determines that dependency level 4 also includes node G, which has not yet been executed. Therefore, the task control processor 111 selects computation units P3 and P4 to execute the node computation task for node G. Computation unit P2 continues to execute the node computation task for node H.
參照圖11D,接續圖11C的例子。在第二陣列122j中,節點C、D的節點運算結果分別從儲存單元M8、M3傳輸至對應的運算單元。由於節點G需要節點C和節點D的節點運算結果,且其運算量為200,需要兩個運算單元共同執行,因此任務控制處理器111將運算單元P3、P4配置為執行對應節點G的節點運算任務(以網格圖樣表示)。同時,根據運算圖資訊TB111,任務控制處理器111判定節點H需要節點E的節點運算結果且運算量為100,因此選擇運算單元P2來執行對應節點H的節點運算任務。由於節點H的節點運算結果將用於後續的節點I的節點運算且目前沒有空閒運算單元,任務控制處理器111將其暫存於儲存單元M2(以網格圖樣表示)。此外,由於節點F的節點運算結果需要用於後續的節點I的節點運算,任務控制處理器111繼續將其暫存於儲存單元M8。Referring to Figure 11D, continuing the example from Figure 11C: in the second array 122j, the node computation results of nodes C and D are transmitted from storage units M8 and M3 to the corresponding computation units. Since node G requires the node computation results of nodes C and D, and its computational load of 200 calls for two computation units working together, the task control processor 111 configures computation units P3 and P4 to execute the node computation task for node G (represented by a grid pattern). At the same time, based on the computation graph information TB111, the task control processor 111 determines that node H requires the node computation result of node E and has a computational load of 100, and therefore selects computation unit P2 to execute the node computation task for node H. Since the node computation result of node H will be used for the subsequent node computation of node I and no computation unit is currently idle, the task control processor 111 temporarily stores it in storage unit M2 (represented by a grid pattern). In addition, since the node computation result of node F is needed for the subsequent node computation of node I, the task control processor 111 keeps it temporarily stored in storage unit M8.
接著,如箭頭A118所示,系統進入第二陣列122k的狀態。在此狀態下,運算單元P2已經被重置,運算單元P3、P4完成了對應節點G的節點運算(以網點圖樣表示),成為前置運算單元。由於節點G的節點運算結果將用於後續的節點I的節點運算且目前沒有足夠的空閒運算單元足以執行節點I的節點運算,任務控制處理器111將其暫存於儲存單元M7。同時,節點H的節點運算結果仍暫存於儲存單元M2中,因為該結果仍需用於後續的節點I的節點運算。至此,相依性層級4完成,並且準備開始處理相依性層級5的神經網路運算任務(節點I)。Next, as indicated by arrow A118, the system enters the state of the second array 122k. In this state, computation unit P2 has been reset, and computation units P3 and P4 have completed the node computation for node G (represented by a dot pattern), becoming preceding computation units. Since the node computation result of node G will be used for the subsequent node computation of node I and there are currently not enough idle computation units to execute the node computation of node I, the task control processor 111 temporarily stores it in storage unit M7. Meanwhile, the node computation result of node H remains temporarily stored in storage unit M2, because that result is still needed for the subsequent node computation of node I. At this point, dependency level 4 is complete, and the system is ready to begin processing the neural network computation task of dependency level 5 (node I).
接續第二陣列122k的狀態,如箭頭A119所示,系統進入第二陣列122l的狀態。在此狀態下,任務控制處理器111根據運算圖資訊TB111判定相依性層級5的節點I需要節點F、G、H的節點運算結果,且其運算量為300。由於每個運算單元的運算能力為100,因此需要三個運算單元共同執行節點I的節點運算任務。Continuing from the state of the second array 122k, as indicated by arrow A119, the system enters the state of the second array 122l. In this state, the task control processor 111 determines, based on the computation graph information TB111, that node I at dependency level 5 requires the node computation results of nodes F, G, and H, and that its computational load is 300. Since each computation unit has a computational capability of 100, three computation units are needed to jointly execute the node computation task for node I.
具體來說,節點F的節點運算結果從儲存單元M8、節點G的節點運算結果從儲存單元M7、節點H的節點運算結果從儲存單元M2傳輸至對應的運算單元。任務控制處理器111將運算單元P1、P2、P3配置為執行對應節點I的節點運算任務(以網格圖樣表示)。Specifically, the node computation result of node F is transmitted from storage unit M8, that of node G from storage unit M7, and that of node H from storage unit M2, each to the corresponding computation unit. The task control processor 111 configures computation units P1, P2, and P3 to execute the node computation task for node I (represented by a grid pattern).
在資料傳輸完成後,節點F的節點運算結果已傳輸至運算單元P1且後續沒有其他節點需要使用,可從儲存單元M8中刪除;節點G的節點運算結果已傳輸至運算單元P2且後續沒有其他節點需要使用,可從儲存單元M7中刪除;節點H的節點運算結果已傳輸至運算單元P3且後續沒有其他節點需要使用,可從儲存單元M2中刪除。After the data transfer is complete, the node computation result of node F has been delivered to computation unit P1 and is needed by no subsequent node, so it can be deleted from storage unit M8; likewise, the node computation result of node G has been delivered to computation unit P2 and can be deleted from storage unit M7; and the node computation result of node H has been delivered to computation unit P3 and can be deleted from storage unit M2.
至此,系統完成了運算圖資訊TB111中所有節點的節點運算任務配置。上述的例子描述了在運算資源受限的情況下,如何透過動態的資源配置及運算結果的暫存管理,來逐步完成複雜的神經網路運算任務。At this point, the system has completed the node computation task configuration for all nodes in the computation graph information TB111. The example above describes how, under limited computing resources, a complex neural network computation task can be completed step by step through dynamic resource configuration and temporary storage management of computation results.
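As a summary of the allocation rule seen throughout this example (node G's load of 200 needing two units, node I's load of 300 needing three), the number of computation units assigned to a node follows from the per-unit computational capability of 100. The sketch below is a hedged reconstruction with invented helper names, under the simplifying assumption that all units have equal capability.

```python
import math

UNIT_CAPABILITY = 100  # computational capability per computation unit,
                       # as stated for this embodiment

def units_needed(load):
    """How many computation units a node computation task requires."""
    return math.ceil(load / UNIT_CAPABILITY)

def try_assign(node, load, idle_units):
    """Claim idle units for a node, or return None so the scheduler keeps
    the node's inputs parked in storage units until units free up."""
    k = units_needed(load)
    if len(idle_units) < k:
        return None
    return [idle_units.pop() for _ in range(k)]

idle = ["P1", "P2", "P3"]
print(units_needed(200))           # 2 -- node G's case
print(try_assign("I", 300, idle))  # ['P3', 'P2', 'P1'] -- node I's case
```

When try_assign returns None, the walkthrough's fallback applies: the parent results stay stashed in storage units until enough computation units become idle.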
在一實施例中,本發明的三維粗粒度可重構陣列架構系統100還可以實現動態資源能力調整機制。具體而言,配置控制器110可根據運算需求,動態調整每個儲存單元與運算單元的性能參數。In one embodiment, the three-dimensional coarse-grained reconfigurable array architecture system 100 of the present invention can also implement a dynamic resource capability adjustment mechanism. Specifically, the configuration controller 110 can dynamically adjust the performance parameters of each storage unit and computing unit according to computing needs.
在儲存單元方面,配置控制器110可調整以下參數:儲存容量配置(Storage Capacity Allocation),透過動態分割或合併儲存空間來調整單一儲存單元的容量大小;存取頻寬(Access Bandwidth),藉由調整儲存單元的工作時脈頻率與資料匯流排寬度,來改變其資料存取速度;快取配置(Cache Configuration),動態調整快取的大小與組織方式,以最佳化特定運算任務的資料存取模式。Regarding storage units, the configuration controller 110 can adjust the following parameters: Storage Capacity Allocation, which adjusts the capacity of a single storage unit by dynamically partitioning or merging storage spaces; Access Bandwidth, which changes the data access speed by adjusting the clock frequency and data bus width of the storage unit; and Cache Configuration, which dynamically adjusts the size and organization of the cache to optimize the data access mode for a specific computing task.
在運算單元方面,配置控制器110可調整以下參數:運算精度(Computational Precision),例如在不需要高精度運算時,將浮點運算切換為定點運算,以提升處理效能;工作頻率(Operating Frequency),根據運算任務的複雜度動態調整運算單元的工作時脈;運算模式(Processing Mode),例如將單一運算單元重新配置為多個較小的運算單元,以提升平行處理能力。Regarding the computing units, the configuration controller 110 can adjust the following parameters: computational precision, for example, switching floating-point operations to fixed-point operations when high-precision operations are not required to improve processing performance; operating frequency, dynamically adjusting the operating clock of the computing unit according to the complexity of the computing task; and processing mode, for example, reconfiguring a single computing unit into multiple smaller computing units to improve parallel processing capabilities.
在實務實作上,這些動態調整可透過以下技術來實現:動態電壓與頻率調整(DVFS, Dynamic Voltage and Frequency Scaling),用於即時調整工作電壓與頻率;可重構運算陣列(Reconfigurable Processing Array),支援運算單元的動態分割與合併;適應性記憶體控制器(Adaptive Memory Controller),能夠動態調整記憶體存取模式與頻寬配置;以及動態資源分配引擎(Dynamic Resource Allocation Engine),負責根據工作負載特性來決定最佳的資源配置策略。也就是說,當原本的運算單元的運算能力不足以應付特定節點的運算量時,系統可動態地調高運算單元的運算能力,以使該運算單元可以被配置給特定節點使用。如此一來,並不需要用複數個運算單元來執行該特定節點的節點運算任務,降低了在資料整合上的複雜度,進而提升了整體的工作效率。In practical implementation, these dynamic adjustments can be achieved through the following technologies: Dynamic Voltage and Frequency Scaling (DVFS), used to adjust the operating voltage and frequency in real time; a Reconfigurable Processing Array, supporting the dynamic partitioning and merging of processing units; an Adaptive Memory Controller, capable of dynamically adjusting memory access modes and bandwidth configurations; and a Dynamic Resource Allocation Engine, responsible for determining the optimal resource allocation strategy based on workload characteristics. In other words, when the computing power of the original processing unit is insufficient to handle the computational load of a specific node, the system can dynamically increase the computing power of the processing unit so that it can be allocated to a specific node. In this way, it is not necessary to use multiple computation units to perform node computation tasks for a specific node, which reduces the complexity of data integration and thus improves overall work efficiency.
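The following sketch illustrates the decision just described, under invented numbers: it first looks for an operating point (for example, a DVFS voltage/capability pair) that lets a single computation unit cover the node's load, and only otherwise falls back to assigning multiple units. The operating-point table is hypothetical and not taken from the disclosure.

```python
# Invented operating points: (supply voltage, resulting capability).
OPERATING_POINTS = [(0.8, 100), (1.0, 130), (1.2, 160)]

def configure_unit(load):
    """Return the lowest operating point whose capability covers `load`,
    or None when even the highest point is insufficient, in which case
    the scheduler falls back to assigning multiple units."""
    for voltage, capability in OPERATING_POINTS:
        if capability >= load:
            return {"voltage": voltage, "capability": capability}
    return None

print(configure_unit(150))  # {'voltage': 1.2, 'capability': 160}
print(configure_unit(300))  # None -> gang several units instead
```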
基於上述,本公開的一或多個實施例所提供的三維粗粒度可重構陣列架構系統及其控制方法,可藉由以下技術特徵達成系統效能的提升:Based on the above, the three-dimensional coarse-grained reconfigurable array architecture system and its control method provided by one or more embodiments of this disclosure can achieve improved system performance through the following technical features:
1. 透過異質陣列的交錯堆疊配置,建立三維立體的資料傳輸架構,使得資料傳輸路徑得以最佳化。具體來說,當資料需要在不同功能性陣列之間傳輸時,系統可選擇最短的垂直傳輸路徑,避免了傳統二維架構中繞行多個組件的傳輸延遲(參見本列表後的簡例)。1. Through the staggered stacking of heterogeneous arrays, a three-dimensional data transmission architecture is established, optimizing the data transmission path. Specifically, when data needs to be transmitted between different functional arrays, the system can select the shortest vertical transmission path, avoiding the transmission delays of traditional two-dimensional architectures, where data must detour around multiple components (see the sketch following this list).
2. 藉由動態資源分配機制,系統能夠因應不同運算量需求靈活調配運算資源。舉例而言,當遇到需要較大運算量的節點運算時,系統可配置多個運算單元共同執行該運算任務,確保運算效能的最佳化。2. Through a dynamic resource allocation mechanism, the system can flexibly allocate computing resources according to different computing demands. For example, when encountering node operations requiring a large amount of computing power, the system can configure multiple computing units to jointly execute the computing task, ensuring optimal computing performance.
3. 透過運算結果的智慧暫存管理,系統能夠在運算資源受限的情況下,仍然維持高效的運算流程。當某節點的運算結果需要提供給多個後續節點使用,且當前沒有足夠的運算單元時,系統會將該結果暫存於儲存單元,直到所有相關節點完成運算後才釋放。3. Through intelligent temporary storage management of computation results, the system can maintain an efficient computation process even when computing resources are limited. When the computation result of a node needs to be provided to multiple subsequent nodes, and there are not enough computation units at present, the system will temporarily store the result in the storage unit until all related nodes have completed the computation before releasing it.
4. 運用相依性層級資訊進行任務調度,使得系統能夠在確保運算正確性的前提下,最大化運算資源的使用效率。系統會根據節點間的相依關係,適時將已完成運算的單元重置為空閒狀態,供後續運算任務使用。4. Utilizing dependency hierarchy information for task scheduling enables the system to maximize the efficiency of computing resources while ensuring computational accuracy. The system will reset completed units to an idle state in a timely manner based on the dependencies between nodes, making them available for subsequent computational tasks.
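As a rough illustration of the first feature above, the hop-count advantage of the vertical interconnect can be pictured with Manhattan routing on a grid. The coordinates below are invented, and the unit-cost routing model is a simplifying assumption rather than something the disclosure specifies.

```python
def hops(a, b):
    """Manhattan hop count between two grid coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Stacked case: a computation unit on a first array reaches the storage
# unit directly above it on the adjacent second array in one vertical hop.
pe_3d, mem_3d = (2, 3, 0), (2, 3, 1)
print(hops(pe_3d, mem_3d))  # 1

# Flattened 2D case: the same logical pairing placed apart on one plane
# must route around intervening components.
pe_2d, mem_2d = (2, 3), (6, 1)
print(hops(pe_2d, mem_2d))  # 6
```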
雖然本發明已以實施例揭露如上,然其並非用以限定本發明,任何所屬技術領域中具有通常知識者,在不脫離本發明的精神和範圍內,當可作些許的更動與潤飾,故本發明的保護範圍當視後附的申請專利範圍所界定者為準。Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary skill in the art may make some modifications and refinements without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be defined by the appended claims.
100:粗粒度可重構陣列架構系統 110:配置控制器 111:任務控制處理器 112:工作負載配置儲存電路單元 121、121(1)、121(2)、121a-121i:第一陣列 122、122(1)、122(2)、122a-122l:第二陣列 130:輸入/輸出介面 TB81、TB101、TB111:運算圖資訊 CT81:資料傳輸路徑 TB91:傳輸路徑資訊 P、P1-P9:運算單元 M、M1-M8:儲存單元 S、S1-S5:切換器 A、B、C、D、E、F、G、H、I:節點 A101-A107、A111-A119:箭頭 B51、B52、B61、B62:垂直連接架構 TP14-TP19:傳輸路徑 S710、S720:步驟 100: Coarse-grained reconfigurable array architecture system; 110: Configuration controller; 111: Task control processor; 112: Workload configuration storage circuit unit; 121, 121(1), 121(2), 121a-121i: First array; 122, 122(1), 122(2), 122a-122l: Second array; 130: Input/output interface; TB81, TB101, TB111: Computation graph information; CT81: Data transmission path; TB91: Transmission path information; P, P1-P9: Computation unit; M, M1-M8: Storage unit; S, S1-S5: Switch; A, B, C, D, E, F, G, H, I: Nodes; A101-A107, A111-A119: Arrows; B51, B52, B61, B62: Vertical connection structures; TP14-TP19: Transmission paths; S710, S720: Steps
圖1為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的方塊示意圖。 圖2為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的配置控制器的方塊示意圖。 圖3A為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的第一交錯堆疊方式的示意圖。 圖3B為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的第二交錯堆疊方式的示意圖。 圖4A為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的第一陣列及第二陣列的第一佈局的示意圖。 圖4B為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的第一及第二陣列的第二佈局的示意圖。 圖4C為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的第一及第二陣列的第三佈局的示意圖。 圖4D為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的第一及第二陣列的第四佈局的示意圖。 圖5為根據本公開的一實施例所繪示的對應第一佈局的第一陣列及第二陣列的立體架構的示意圖。 圖6為根據本公開的一實施例所繪示的對應第三佈局的第一陣列及第二陣列的立體架構的示意圖。 圖7為根據本公開的一實施例所繪示的三維粗粒度可重構陣列架構系統的控制方法的流程圖。 圖8為根據本公開的一實施例所繪示的獲取對應神經網路模型的運算圖資訊的示意圖。 圖9為根據本公開的一實施例所繪示的根據傳輸路徑選擇目標任務運算單元的示意圖。 圖10A至圖10C為根據本公開的一實施例所繪示的根據運算圖資訊來配置運算單元以執行對應的神經網路運算任務的示意圖。 圖11A至圖11D為根據本公開的另一實施例所繪示的根據運算圖資訊來配置運算單元以執行對應的神經網路運算任務的示意圖。Figure 1 is a block diagram of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 2 is a block diagram of a configuration controller for a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 3A is a schematic diagram of a first staggered stacking configuration of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 3B is a schematic diagram of a second staggered stacking configuration of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 4A is a schematic diagram of a first layout of the first and second arrays of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 4B is a schematic diagram of the second layout of the first and second arrays of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 4C is a schematic diagram of the third layout of the first and second arrays of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 4D is a schematic diagram of the fourth layout of the first and second arrays of a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 5 is a schematic diagram of the three-dimensional architecture of the first and second arrays corresponding to the first layout according to an embodiment of the present disclosure. Figure 6 is a schematic diagram of the three-dimensional architecture of the first and second arrays corresponding to the third layout according to an embodiment of the present disclosure. Figure 7 is a flowchart illustrating a control method for a three-dimensional coarse-grained reconfigurable array architecture system according to an embodiment of the present disclosure. Figure 8 is a schematic diagram illustrating the acquisition of computation graph information corresponding to a neural network model according to an embodiment of the present disclosure. Figure 9 is a schematic diagram illustrating the selection of target task computation units based on transmission paths according to an embodiment of the present disclosure. Figures 10A to 10C are schematic diagrams illustrating the configuration of computation units to perform corresponding neural network computation tasks based on computation graph information according to an embodiment of the present disclosure. 
Figures 11A to 11D are schematic diagrams illustrating the configuration of computation units to perform corresponding neural network computation tasks based on computation graph information according to another embodiment of the present disclosure.
100:粗粒度可重構陣列架構系統 100: Coarse-grained reconfigurable array architecture system
110:配置控制器 110: Configuration controller
121:第一陣列 121: First array
122:第二陣列 122: Second array
130:輸入/輸出介面 130: Input/Output Interface
Claims (24)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW113150647A TWI902585B (en) | 2024-12-25 | 2024-12-25 | Three-dimensional coarse-particle reconfigurable array construction system and control method of three-dimensional coarse-particle reconfigurable array construction system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW113150647A TWI902585B (en) | 2024-12-25 | 2024-12-25 | Three-dimensional coarse-particle reconfigurable array construction system and control method of three-dimensional coarse-particle reconfigurable array construction system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| TWI902585B true TWI902585B (en) | 2025-10-21 |
Family
ID=98264118
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW113150647A TWI902585B (en) | 2024-12-25 | 2024-12-25 | Three-dimensional coarse-particle reconfigurable array construction system and control method of three-dimensional coarse-particle reconfigurable array construction system |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI902585B (en) |
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230205293A1 (en) * | 2021-12-29 | 2023-06-29 | SambaNova Systems, Inc. | High-bandwidth power estimator for ai accelerator |
| CN114398308A (en) * | 2022-01-18 | 2022-04-26 | 上海交通大学 | Near memory computing system based on data-driven coarse-grained reconfigurable array |
| US20240370402A1 (en) * | 2022-02-09 | 2024-11-07 | SambaNova Systems, Inc. | Configuration Data Store in a Reconfigurable Data Processor Having Two Access Modes |
| US20240248863A1 (en) * | 2023-01-19 | 2024-07-25 | SambaNova Systems, Inc. | Method and apparatus for data transfer between accessible memories of multiple processors in a heterogeneous processing system using one memory to memory transfer operation |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12340101B2 (en) | Scaling out architecture for dram-based processing unit (DPU) | |
| CN112149816A (en) | Heterogeneous memory-computing fusion system and method supporting deep neural network inference acceleration | |
| CN112114942B (en) | Stream data processing method and computing device based on many-core processor | |
| CN110968543A (en) | Computing system and method in memory | |
| CN114492782A (en) | On-chip core compiling and mapping method and device of neural network based on reinforcement learning | |
| CN113407479B (en) | A multi-core architecture with embedded FPGA and data processing method thereof | |
| KR20200139829A (en) | Network on-chip data processing method and device | |
| US11645225B2 (en) | Partitionable networked computer | |
| CN112183015B (en) | Chip layout planning method for deep neural network | |
| Xue et al. | EdgeLD: Locally distributed deep learning inference on edge device clusters | |
| CN102681901A (en) | Segmental reconfigurable hardware task arranging method | |
| Bhatele et al. | Application-specific topology-aware mapping for three dimensional topologies | |
| CN119537294B (en) | A controller for accelerated computing and an accelerated computing system | |
| Rafie et al. | Performance evaluation of task migration in contiguous allocation for mesh interconnection topology | |
| CN113407238B (en) | A multi-core architecture with heterogeneous processors and data processing method thereof | |
| TWI902585B (en) | Three-dimensional coarse-particle reconfigurable array construction system and control method of three-dimensional coarse-particle reconfigurable array construction system | |
| Fiedler et al. | Improving task placement for applications with 2D, 3D, and 4D virtual Cartesian topologies on 3D torus networks with service nodes | |
| CN108304261B (en) | Job scheduling method and device based on 6D-Torus network | |
| US11372791B2 (en) | Embedding rings on a toroid computer network | |
| Zhu et al. | Core placement optimization of many-core brain-inspired near-storage systems for spiking neural network training | |
| CN119211106A (en) | System, method and apparatus for data routing between multiple computing nodes | |
| Yang et al. | Network Group Partition and Core Placement Optimization for Neuromorphic Multi-Core and Multi-Chip Systems | |
| CN113556242A (en) | A method and device for inter-node communication based on multi-processing nodes | |
| CN120075122B (en) | Communication scheduling method, electronic device, and medium for distributed large model training | |
| US20260003820A1 (en) | Versatile accelerator design for multiple deep neural network applications |