TWI869788B - Processing circuit and computation scheduling method of artificial intelligence model - Google Patents
Processing circuit and computation scheduling method of artificial intelligence model
- Publication number
- TWI869788B (application TW112108662A)
- Authority
- TW
- Taiwan
- Prior art keywords
- sub
- intermediate data
- operator
- tensor
- memory
- Prior art date
Abstract
Description
The present invention relates to artificial intelligence models, and more particularly to a processing circuit and a computation scheduling method for an artificial intelligence model.
In a system-on-chip (SoC), the total memory bandwidth is fixed and is shared by multiple modules. When one module occupies too much memory bandwidth, other modules are blocked from accessing the memory, and overall system performance degrades. As one module in a system-on-chip, an artificial intelligence (AI) model often needs to process large amounts of data and therefore demands substantial memory bandwidth; reducing the bandwidth requirement of AI models has thus become an important topic.
In view of the deficiencies of the prior art, one object of the present invention is to provide a processing circuit and a computation scheduling method for an artificial intelligence model to address those deficiencies.
An embodiment of the present invention provides a processing circuit for an artificial intelligence model. The processing circuit is coupled to an external memory and includes a memory, a memory management circuit, and a computation circuit. The memory management circuit reads a tensor from the external memory and stores the tensor into the memory. The computation circuit is configured to: perform a first type of operation on a first sub-tensor of the tensor to generate first intermediate data; perform the first type of operation on a second sub-tensor of the tensor to generate second intermediate data; perform a second type of operation on the first intermediate data and the second intermediate data to generate third intermediate data; perform the first type of operation on a third sub-tensor of the tensor to generate fourth intermediate data; and perform the second type of operation on the first intermediate data, the second intermediate data, and the fourth intermediate data to generate fifth intermediate data.
Another embodiment of the present invention provides a processing circuit for an artificial intelligence model. The processing circuit is coupled to an external memory and includes a memory. The processing circuit performs the following operations: reading a tensor and a plurality of kernel parameters from the external memory and storing the tensor and the kernel parameters into the memory, the tensor including a first sub-tensor and a second sub-tensor, and the kernel parameters including a vector kernel parameter; performing a first vector operation on the first sub-tensor with reference to a first part of the vector kernel parameter to generate first intermediate data; and performing a second vector operation on the second sub-tensor with reference to a second part of the vector kernel parameter to generate second intermediate data. The first part of the vector kernel parameter is not equal to the second part of the vector kernel parameter.
Another embodiment of the present invention provides a computation scheduling method for an artificial intelligence model. The artificial intelligence model includes a first operator and a second operator. The computation scheduling method includes: dividing a tensor into H sub-tensors, H being an integer greater than 1; dividing the first operator into H first sub-operators; dividing the second operator into H second sub-operators; determining a dependency relationship among the H first sub-operators and the H second sub-operators; ordering the H first sub-operators and the H second sub-operators according to the dependency relationship to obtain an operation order; and, according to the operation order, determining when a processing circuit that executes the artificial intelligence model deletes target data from a memory included in the processing circuit, the target data being output data of one of the H first sub-operators and the H second sub-operators.
The technical means embodied in the embodiments of the present invention can remedy at least one of the shortcomings of the prior art; compared with the prior art, the present invention can therefore reduce memory usage and/or the bandwidth demand on memory.
The features, implementations, and effects of the present invention are described in detail below by way of embodiments in conjunction with the drawings.
The technical terms in the following description are interpreted according to the customary usage in this technical field; where this specification explains or defines a term, the interpretation of that term is based on the explanation or definition given in this specification.
The disclosure of the present invention includes a processing circuit for an artificial intelligence model and a computation scheduling method. Because some of the components included in the processing circuit of the present invention may individually be known components, details of such known components are omitted below, provided that this does not affect the full disclosure and enablement of the apparatus invention.
FIG. 1 shows an example of an AI network, which can be regarded as a simple AI model or as part of a more complex AI model. The AI network 100 operates on input data Din to produce output data Dout. The AI network 100 of FIG. 1 contains three operators: a subtraction operator 110 ("SUB"), a convolution operator 120 ("CONV"), and an addition operator 130 ("ADD"). The subtraction operator 110 performs a subtraction operation on tensor TS1 (i.e., the input data Din) to produce tensor TS2. The convolution operator 120 performs a convolution operation on tensor TS2 to produce tensor TS3. The addition operator 130 performs an addition operation on tensor TS3 to produce tensor TS4 (i.e., the output data Dout). In the example of FIG. 1, the sizes (dimension information) of tensors TS1, TS2, TS3, and TS4 are all [1,3,224,224].
FIG. 2 is a flowchart of an embodiment of the computation scheduling method of the artificial intelligence model of the present invention. The flow of FIG. 2 is executed by a chip development tool (e.g., a computer) and includes the following steps.
Step S210: Split a tensor into H sub-tensors (also called tiles), where H may be any dimension of the tensor (H is an integer greater than 1). More specifically, this step determines the value of H based on one of the dimensions of the output tensor of the last operator of the AI network 100, and then divides the tensor into H sub-tensors. Taking the AI network 100 of FIG. 1 as an example, because the size of the output tensor (i.e., tensor TS4) of the last operator (the addition operator 130) is [1,3,224,224], H may be 3 or 224. The details of tensor splitting are described below in connection with FIG. 3.
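For illustration only, the tiling of step S210 can be sketched in a few lines of Python; the choice of NumPy, the axis index, and H = 3 below are assumptions made to match the [1,3,224,224] example and are not taken from the patented implementation.

```python
import numpy as np

# Hypothetical example: split a [1, 3, 224, 224] tensor into H = 3 tiles
# along its second dimension (axis 1), mirroring step S210.
TS1 = np.random.rand(1, 3, 224, 224).astype(np.float32)
H = 3        # chosen from a dimension of the last operator's output tensor
axis = 1     # the dimension along which the tensor is tiled

sub_tensors = np.split(TS1, H, axis=axis)   # TS1_i1, TS1_i2, TS1_i3
for i, tile in enumerate(sub_tensors, start=1):
    print(f"TS1_i{i} shape: {tile.shape}")  # each is (1, 1, 224, 224)
```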
Step S220: Split each operator into H sub-operators. This step is described below together with step S210 in connection with FIG. 3.
Step S230: Determine the dependency relationships among the sub-operators. This step is described below in connection with FIG. 5.
Step S240: Sort the sub-operators according to the dependency relationships among them to obtain an operation order. This step is described below in connection with FIG. 7.
Step S250: Determine, according to the operation order, when the electronic device that executes the artificial intelligence model (more specifically, the processing circuit of the electronic device) deletes target data from memory, the target data being the output data of one of the sub-operators (i.e., intermediate data of the AI network 100). This step is described below in connection with FIG. 10, FIG. 11, FIG. 12A, and FIG. 12B.
Please refer to FIG. 3, which shows the result of splitting the tensors and operators of FIG. 1. In the embodiment of FIG. 3, the dimension on which the tensor-splitting operation (i.e., step S210) is based is the second dimension of tensor TS4 (i.e., H = 3). Accordingly, the subtraction operator 110 is split into subtraction sub-operator 110_1 ("SUB1"), subtraction sub-operator 110_2 ("SUB2"), and subtraction sub-operator 110_3 ("SUB3"); the convolution operator 120 is split into convolution sub-operator 120_1 ("CONV1"), convolution sub-operator 120_2 ("CONV2"), and convolution sub-operator 120_3 ("CONV3"); and the addition operator 130 is split into addition sub-operator 130_1 ("ADD1"), addition sub-operator 130_2 ("ADD2"), and addition sub-operator 130_3 ("ADD3"). Tensor TS1 is split into sub-tensors TS1_i1, TS1_i2, and TS1_i3 (the input sub-tensors of subtraction sub-operators 110_1, 110_2, and 110_3, respectively, each of size [1,1,224,224] and corresponding to the same dimension of tensor TS1, e.g., the second dimension). Tensor TS4 is split into sub-tensors TS3_o1, TS3_o2, and TS3_o3 (the output sub-tensors of addition sub-operators 130_1, 130_2, and 130_3, respectively, each of size [1,1,224,224] and corresponding to the same dimension of tensor TS4). Tensor TS3 is split into sub-tensors TS3_i1, TS3_i2, and TS3_i3 (the input sub-tensors of addition sub-operators 130_1, 130_2, and 130_3, respectively, each of size [1,1,224,224] and corresponding to the same dimension of tensor TS3). The individual output sub-tensors of convolution sub-operators 120_1, 120_2, and 120_3 (i.e., sub-tensors TS2_o1, TS2_o2, and TS2_o3) are equal to sub-tensors TS3_i1, TS3_i2, and TS3_i3, respectively.
Note that, because of receptive-field (visual field) enlargement, the sub-tensor TS1_o1 (TS1_o2 or TS1_o3) output by subtraction sub-operator 110_1 (110_2 or 110_3) is not equal to the input sub-tensor TS2_i1 (TS2_i2 or TS2_i3) of convolution sub-operator 120_1 (120_2 or 120_3); more specifically, sub-tensors TS1_o1, TS1_o2, and TS1_o3 are all of size [1,1,224,224], whereas sub-tensors TS2_i1 and TS2_i3 are both of size [1,2,224,224] and sub-tensor TS2_i2 is of size [1,3,224,224].
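The enlarged input ranges follow from the convolution's receptive field. The following minimal sketch shows the bookkeeping; the 3×3 kernel, stride 1, and padding 1 are assumptions chosen because they reproduce the tile sizes of FIG. 3, not parameters disclosed by the patent.

```python
def conv_input_range(out_start, out_len, kernel=3, stride=1, pad=1, in_len=3):
    """Return the input-row range a convolution tile needs (its receptive field).

    out_start/out_len describe the output rows produced by one sub-operator;
    the returned (start, stop) is clipped to the valid input rows [0, in_len).
    """
    start = out_start * stride - pad
    stop = (out_start + out_len - 1) * stride - pad + kernel  # exclusive
    return max(start, 0), min(stop, in_len)

# Three output tiles of one row each (H = 3), as in FIG. 3.
for tile, out_row in enumerate(range(3), start=1):
    lo, hi = conv_input_range(out_row, 1)
    print(f"CONV{tile}: needs input rows [{lo}, {hi}) -> {hi - lo} row(s)")
# CONV1 needs 2 rows, CONV2 needs 3 rows, CONV3 needs 2 rows, matching the
# input sub-tensor sizes [1,2,224,224], [1,3,224,224], and [1,2,224,224].
```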
As can be seen from FIG. 3, because sub-tensors TS1_o1, TS1_o2, and TS1_o3 correspond to sub-tensors TS1_i1, TS1_i2, and TS1_i3, respectively, and sub-tensors TS1_i1, TS1_i2, and TS1_i3 correspond to the same dimension of tensor TS1 (e.g., the second dimension), sub-tensors TS1_o1, TS1_o2, and TS1_o3 also correspond to the same dimension of tensor TS1.
The flow of FIG. 2 can effectively manage the lifetime of target data in the memory of the electronic device, which helps reduce memory usage and/or the bandwidth demand on memory. Details are described below in connection with FIG. 9, FIG. 10, FIG. 11, FIG. 12A, and FIG. 12B.
Based on the overlapping relationships among the sub-tensors split from the same tensor (see FIG. 3), the topological graph of connections among sub-operators shown in FIG. 4 can be obtained. Specifically, the topological graph is obtained from the overlap between the input sub-tensor of each sub-operator split from an operator and the output sub-tensors of the sub-operators of the source operator on which it depends. As shown in the figure, the input sub-tensor TS2_i1 of convolution sub-operator 120_1 contains sub-tensors TS1_o1 and TS1_o2; in other words, convolution sub-operator 120_1 cannot start until both subtraction sub-operator 110_1 and subtraction sub-operator 110_2 have finished. Likewise, convolution sub-operator 120_2 cannot start until subtraction sub-operators 110_1, 110_2, and 110_3 have all finished, and convolution sub-operator 120_3 cannot start until subtraction sub-operators 110_2 and 110_3 have finished. Addition sub-operators 130_1, 130_2, and 130_3 cannot start until convolution sub-operators 120_1, 120_2, and 120_3, respectively, have finished.
Please refer to FIG. 5, which is a detailed flow of an embodiment of step S230 of FIG. 2 and includes the following steps. The details of FIG. 5 are described below in connection with FIG. 4.
Step S510: Determine a target sub-operator. For example, convolution sub-operator 120_1 is selected as the target sub-operator.
Step S520: Determine the source sub-operator(s) of the target sub-operator. Continuing the example above, because the sources of the input sub-tensor TS2_i1 of convolution sub-operator 120_1 include sub-tensors TS1_o1 and TS1_o2, the source sub-operators of convolution sub-operator 120_1 are subtraction sub-operators 110_1 and 110_2 (i.e., the output sub-tensor TS1_o1 of subtraction sub-operator 110_1 and the output sub-tensor TS1_o2 of subtraction sub-operator 110_2 form the input sub-tensor of convolution sub-operator 120_1). Similarly, the source sub-operators of convolution sub-operator 120_2 are subtraction sub-operators 110_1, 110_2, and 110_3; the source sub-operators of convolution sub-operator 120_3 are subtraction sub-operators 110_2 and 110_3; and the source sub-operator of addition sub-operator 130_1 is convolution sub-operator 120_1.
Step S530: Determine that the target sub-operator depends on the source sub-operator; that is, the source sub-operator is a dependency sub-operator of the target sub-operator (a sub-operator on which the target depends). For example, subtraction sub-operators 110_1, 110_2, and 110_3 are dependency sub-operators of convolution sub-operator 120_2.
By taking each sub-operator of FIG. 4 in turn as the target sub-operator and repeating the flow of FIG. 5, the dependency relationships among the sub-operators can be determined, as shown in FIG. 6. Convolution sub-operator 120_1 depends on subtraction sub-operators 110_1 and 110_2. Convolution sub-operator 120_2 depends on subtraction sub-operators 110_1, 110_2, and 110_3. Convolution sub-operator 120_3 depends on subtraction sub-operators 110_2 and 110_3. Addition sub-operators 130_1, 130_2, and 130_3 depend on convolution sub-operators 120_1, 120_2, and 120_3, respectively.
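For illustration, the dependency table of FIG. 6 can be derived from overlapping row ranges as in the following sketch; representing each sub-tensor by a half-open row range of the split dimension is an assumption made for the example.

```python
# Hypothetical sketch: each sub-operator is described by the row range of the
# split dimension that it consumes (input) or produces (output); a target
# sub-operator depends on every source sub-operator whose output rows overlap
# the target's input rows (steps S510-S530).
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]        # half-open ranges [start, stop)

sub_outputs = {"SUB1": (0, 1), "SUB2": (1, 2), "SUB3": (2, 3)}     # TS1_o1..TS1_o3
conv_inputs = {"CONV1": (0, 2), "CONV2": (0, 3), "CONV3": (1, 3)}  # TS2_i1..TS2_i3

dependencies = {
    conv: [sub for sub, out in sub_outputs.items() if overlaps(inp, out)]
    for conv, inp in conv_inputs.items()
}
print(dependencies)
# {'CONV1': ['SUB1', 'SUB2'], 'CONV2': ['SUB1', 'SUB2', 'SUB3'], 'CONV3': ['SUB2', 'SUB3']}
```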
Please refer to FIG. 7, which is a detailed flow of an embodiment of step S240 of FIG. 2 and includes the following steps. The flow of FIG. 7 is based on a depth-first search algorithm.
Step S710: Find a sub-operator whose indegree is 0, and mark that sub-operator as the target sub-operator and as visited. A sub-operator with indegree 0 is a sub-operator that no other sub-operator depends on. Taking FIG. 6 as an example, addition sub-operators 130_1, 130_2, and 130_3 are the sub-operators with indegree 0, i.e., the top-level sub-operators.
Step S720: Determine whether a sub-operator with indegree 0 was found. If so, perform step S730; if not, perform step S795. The following description uses addition sub-operator 130_1 as an example.
Step S730: Find an unvisited dependency sub-operator of the target sub-operator. As shown in FIG. 6, because convolution sub-operator 120_1 is a dependency sub-operator of addition sub-operator 130_1 (i.e., addition sub-operator 130_1 depends on convolution sub-operator 120_1), step S730 finds convolution sub-operator 120_1.
Step S740: Determine whether a dependency sub-operator was found. If so, perform step S750; if not, perform step S760.
Step S750: Mark the dependency sub-operator as the target sub-operator and as visited, then perform step S730. Continuing the example, in step S750 convolution sub-operator 120_1 is marked as the target sub-operator and as visited, and when step S730 is performed again, a dependency sub-operator of convolution sub-operator 120_1 is found (assume subtraction sub-operator 110_1 is found). Steps S730 and S740 are then performed again; this time, because subtraction sub-operator 110_1 does not depend on any sub-operator (i.e., it has no dependency sub-operator, so the result of step S740 is negative), the flow proceeds to step S760.
Step S760: Add the target sub-operator to queue 800. Continuing the example, subtraction sub-operator 110_1 is now added to queue 800. Please refer to FIG. 8A and FIG. 8B, which show how the contents of queue 800 change (FIG. 8B continues from FIG. 8A). As shown in the first row of FIG. 8A, queue 800 at this point contains only subtraction sub-operator 110_1 ("SUB1").
Step S770: Determine whether the target sub-operator is a top-level sub-operator (i.e., a sub-operator with indegree 0). If so, perform step S710; if not, perform step S780. Continuing the example, because subtraction sub-operator 110_1 is not a top-level sub-operator, the result of step S770 is negative.
Step S780: Determine an upper-level sub-operator that depends on the target sub-operator (i.e., go back up one level), and mark that upper-level sub-operator as the target sub-operator. Continuing the example, the flow now returns to convolution sub-operator 120_1.
Step S790: Determine whether there is any dependency sub-operator that has not yet been marked. Continuing the example, because among the dependency sub-operators of the target sub-operator (convolution sub-operator 120_1), namely subtraction sub-operators 110_1 and 110_2, there is still an unmarked sub-operator (i.e., subtraction sub-operator 110_2), the result of step S790 is affirmative; the flow then performs the following steps: step S730 (subtraction sub-operator 110_2 is found) → step S740 (result is affirmative) → step S750 (subtraction sub-operator 110_2 is marked as visited) → step S730 (no dependency sub-operator of subtraction sub-operator 110_2 is found) → step S740 (result is negative) → step S760 (subtraction sub-operator 110_2 is added to queue 800, as shown in the second row of FIG. 8A) → step S770 (result is negative) → step S780 (convolution sub-operator 120_1 is marked as the target sub-operator) → step S790. This time, because all dependency sub-operators of the target sub-operator (i.e., convolution sub-operator 120_1), namely subtraction sub-operators 110_1 and 110_2, have been visited, the result of step S790 is negative, and convolution sub-operator 120_1 is therefore added to queue 800 in the following step S760 (as shown in the third row of FIG. 8A). After steps S770, S780 (addition sub-operator 130_1 is marked as the target sub-operator), S790, and S760 (addition sub-operator 130_1 is added to queue 800) are performed, the result of step S770 is affirmative (because addition sub-operator 130_1 is a top-level sub-operator), and the flow returns to step S710 to select the next sub-operator with indegree 0 (e.g., addition sub-operator 130_2).
Steps S710 to S790 described above are repeated (the process of adding all sub-operators of FIG. 6 to queue 800 is shown in FIG. 8A and FIG. 8B and is not repeated here) until all sub-operators with indegree 0 have been visited (i.e., the result of step S720 is negative and the flow proceeds to step S795).
Step S795: Take all sub-operators out of queue 800 in order. Taking FIG. 8B as an example, the order in which the sub-operators are taken out of queue 800 (i.e., the operation order of the sub-operators) is: SUB1 → SUB2 → CONV1 → … → CONV3 → ADD3 (i.e., the reverse of the order in which the sub-operators were added to the queue).
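A compact way to view the ordering of FIG. 7 is a post-order depth-first traversal of the dependency table of FIG. 6. The recursive sketch below is an illustrative simplification of the iterative flow S710-S795; it reproduces the operation order quoted above.

```python
# Dependency table of FIG. 6: each sub-operator maps to the sub-operators it
# depends on (its dependency sub-operators).
deps = {
    "ADD1": ["CONV1"], "ADD2": ["CONV2"], "ADD3": ["CONV3"],
    "CONV1": ["SUB1", "SUB2"],
    "CONV2": ["SUB1", "SUB2", "SUB3"],
    "CONV3": ["SUB2", "SUB3"],
    "SUB1": [], "SUB2": [], "SUB3": [],
}

order, visited = [], set()

def visit(op):
    """Append op to the order only after all of its dependencies (post-order DFS)."""
    if op in visited:
        return
    visited.add(op)
    for dep in deps[op]:
        visit(dep)
    order.append(op)

# Start from the sub-operators with indegree 0 (ADD1, ADD2, ADD3).
for top in ("ADD1", "ADD2", "ADD3"):
    visit(top)

print(" -> ".join(order))
# SUB1 -> SUB2 -> CONV1 -> ADD1 -> SUB3 -> CONV2 -> ADD2 -> CONV3 -> ADD3
```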
FIG. 9 is a functional block diagram of an embodiment of an electronic device of the present invention. The electronic device 900 includes a chip 901 and an external memory 902 (e.g., a dynamic random access memory (DRAM)). The chip 901 and the external memory 902 are coupled or electrically connected to each other. The chip 901 includes a processing circuit 910 and a processor 920. The processing circuit 910 and the processor 920 are coupled or electrically connected to each other.
The processor 920 controls the processing circuit 910 to jointly realize the functions of the chip 901. The processor 920 may be a circuit or electronic component with program-execution capability, such as a central processing unit, a microprocessor, a micro processing unit, a digital signal processor, an application specific integrated circuit (ASIC), or an equivalent circuit.
The processing circuit 910 may be an intelligence processing unit (IPU) or a neural-network processing unit (NPU). The processing circuit 910 includes a computation circuit 912 (including, but not limited to, a convolution engine and a vector engine), a buffer circuit 914 (including, but not limited to, multiple registers), a memory management circuit 916 (e.g., a direct memory access (DMA) circuit), and a memory 918 (e.g., a static random access memory (SRAM)). The buffer circuit 914 stores the data that the computation circuit 912 needs when performing convolution or vector operations. The memory 918 can store the output sub-tensors of the sub-operators of FIG. 3.
The external memory 902 stores the input data Din, the kernel parameters Kp, and the output data Dout. The memory management circuit 916 reads the input data Din and the kernel parameters Kp from the external memory 902 and stores them into the memory 918, reads at least part of the input data Din and at least part of the kernel parameters Kp from the memory 918 and stores them into the buffer circuit 914, and stores the output data Dout generated by the computation circuit 912 into the external memory 902.
The details of step S250 of FIG. 2 are described below in connection with FIG. 9, FIG. 10, FIG. 11, FIG. 12A, and FIG. 12B. FIG. 10 shows a life-span list of the output sub-tensors of some of the sub-operators of FIG. 3. The horizontal axis corresponds to the operation order of the sub-operators described above (it does not necessarily correspond to actual lengths of time); more specifically, the output sub-tensor TS1_o1 of subtraction sub-operator 110_1 is generated at the point where the operation order is 0 and ends at the point where the operation order is 5 (i.e., it is no longer used by any other sub-operator). Note that sub-tensors TS3_o1, TS3_o2, and TS3_o3 start to be generated at the points where the operation order is 3, 6, and 8, respectively; however, because sub-tensors TS3_o1, TS3_o2, and TS3_o3 are not deleted from the memory 918 early (each being part of the output data Dout), the life-span list of FIG. 10 does not show these three sub-tensors.
The details of step S250 of FIG. 2 include allocating the memory 918 according to the life-span list of FIG. 10; the flow of allocating the memory 918 is shown in FIG. 11. FIG. 12A and FIG. 12B are schematic diagrams of an embodiment of the active list of the present invention. Please refer to FIG. 10, FIG. 11, FIG. 12A, and FIG. 12B for the following description. The active list shows the activity of the sub-tensors in the memory 918 (more specifically, the points at which a sub-tensor is stored into and deleted from the memory 918). FIG. 11 includes the following steps.
Step S1110: Build a life-span list. An example of a life-span list is shown in FIG. 10.
Step S1120: Search the life-span list to find the sub-tensors that are active in the current life cycle. For example, sub-tensor TS1_o1 becomes active at the point where the operation order is 0, and the active period of sub-tensor TS1_o1 is from the point where the operation order is 0 to the point where the operation order is 5.
Step S1130: Add the active sub-tensor to the active list. As shown in FIG. 12A, sub-tensor TS1_o1 is added to the active list when the life cycle is 0.
Step S1140: Allocate memory for the active sub-tensor, i.e., arrange a corresponding storage space in the memory 918. Continuing the example, as shown in FIG. 12A, part of the memory 918 is allocated to sub-tensor TS1_o1 when the life cycle is 0.
Step S1150: Delete from the active list the sub-tensors that are no longer active. For example, because in FIG. 10 sub-tensor TS2_o1 is no longer active after the point where the operation order is 3, sub-tensor TS2_o1 is deleted when the life cycle is 3 in FIG. 12A.
Step S1160: Release the memory corresponding to the sub-tensors that are no longer active. In response to the deletion of sub-tensors from the active list in the previous step, this step releases the corresponding storage space in the memory 918, so that the memory 918 can be used more promptly and flexibly.
Step S1170: Increase the life cycle by 1.
If no sub-tensor becomes inactive in the current life cycle, steps S1150 and S1160 are skipped and step S1170 is performed directly.
Step S1180: Determine whether the life cycle has ended (i.e., determine whether the operation order of FIG. 10 has ended). If so, the flow of FIG. 11 ends; if not, step S1120 is performed to continue finding active sub-tensors.
As described above, the active lists of FIG. 12A and FIG. 12B can be obtained from the life-span list of FIG. 10 and the flow of FIG. 11. The life cycles of FIG. 12A and FIG. 12B correspond to the operation order of FIG. 10. For example, in FIG. 10, sub-tensor TS1_o1 is generated at the point where the operation order is 0 and ends at the point where the operation order is 5; therefore, in FIG. 12A the life cycle of sub-tensor TS1_o1 is 0 to 4. Similarly, in FIG. 10 sub-tensor TS2_o2 exists between the point where the operation order is 5 and the point where the operation order is 6; therefore, in FIG. 12B sub-tensor TS2_o2 exists only in life cycle 5. In this way, a developer or designer of the chip 901 can design or manage the memory 918 according to the active lists of FIG. 12A and FIG. 12B; as a result, the bandwidth demand on the external memory 902 can be reduced without enlarging the memory 918 (to save cost, while improving the overall performance of the external memory 902), or the memory 918 can be further reduced without increasing the memory bandwidth of the external memory 902.
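For illustration, the life spans of FIG. 10 (and the points at which step S250 frees each intermediate sub-tensor) can be derived from the operation order and the consumer relationships of FIG. 6, as in the following sketch; the dictionary-based representation is an assumption made for the example.

```python
# Hypothetical sketch of step S250: each sub-operator's output stays in memory
# until the last sub-operator that consumes it has run, and is freed afterwards.
order = ["SUB1", "SUB2", "CONV1", "ADD1", "SUB3", "CONV2", "ADD2", "CONV3", "ADD3"]
consumers = {                      # who reads each sub-operator's output (from FIG. 6)
    "SUB1": ["CONV1", "CONV2"], "SUB2": ["CONV1", "CONV2", "CONV3"],
    "SUB3": ["CONV2", "CONV3"],
    "CONV1": ["ADD1"], "CONV2": ["ADD2"], "CONV3": ["ADD3"],
    "ADD1": [], "ADD2": [], "ADD3": [],   # network outputs: never freed early
}

step = {op: i for i, op in enumerate(order)}
for op in order:
    born = step[op]
    if consumers[op]:
        dies = max(step[c] for c in consumers[op])
        print(f"{op} output: produced at step {born}, freed after step {dies}")
    else:
        print(f"{op} output: produced at step {born}, kept (part of Dout)")
# e.g. the SUB1 output (TS1_o1) is produced at step 0 and freed after step 5,
# matching the life span of TS1_o1 in FIG. 10.
```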
By comparison, if the operators and tensors of FIG. 1 were not split, the chip developer would have to pre-allocate one storage block of the memory 918 to tensor TS2 (whose total amount of data equals the sum of the data amounts of sub-tensors TS1_o1, TS1_o2, and TS1_o3), and that storage block could not be released until the convolution operator 120 finished. In addition, if the sub-operations obtained by splitting the operators were not put into an operation order, the same sub-tensor (e.g., sub-tensor TS1_i1) would have to be recomputed multiple times, increasing the amount of data handled by the whole AI model and thus raising cost (because the demand on the memory 918 increases) or degrading performance (because the bandwidth demand on the external memory 902 increases). In practice, because the number of operators and the sizes of the tensors are both very large, the effect achieved by the present invention is quite significant.
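For a rough sense of the scale involved, the following sketch compares keeping the whole tensor TS2 resident with keeping one sub-tensor at a time; the 16-bit element width is an assumption, since the patent does not specify the data type.

```python
# Assumed 16-bit elements; adjust bytes_per_elem for the actual data type.
bytes_per_elem = 2
full_tensor = 1 * 3 * 224 * 224 * bytes_per_elem   # whole TS2 resident at once
one_tile = 1 * 1 * 224 * 224 * bytes_per_elem      # one sub-tensor at a time
print(full_tensor, one_tile)   # 301056 vs. 100352 bytes (~294 KB vs. ~98 KB)
```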
The present invention can be extended to AI models that contain more operators. Please refer to FIG. 13, which is a schematic diagram of another AI model. The left side of FIG. 13 shows that the AI model contains N operators (operator 1, operator 2, …, operator N), and the right side of FIG. 13 shows that one operator is split into H sub-operators. The arrows indicate the dependencies during execution of the sub-operators. For example, the first sub-operator of operator 2 cannot be executed until the third sub-operator of operator 1 has finished.
FIG. 14 is a schematic diagram of another AI model. The AI network 1400 contains an addition operator 1410 and a convolution operator 1420, and the tensor size is [1,56,56,224]. After splitting (as shown in the lower half of FIG. 14), the addition operator 1410 and the convolution operator 1420 are each split into 56 sub-operators ("ADD1" to "ADD56" and "CONV1" to "CONV56"). The size of the sub-tensor output by each addition sub-operator ("ADD1" to "ADD56") is [1,1,56,224]. However, not all of the convolution sub-operators ("CONV1" to "CONV56") have input sub-tensors of the same size. More specifically, the size of the input sub-tensor of the first convolution sub-operator ("CONV1") is [1,2,56,224], while the size of the input sub-tensor of the second convolution sub-operator ("CONV2") is [1,3,56,224] (i.e., the receptive-field enlargement discussed above).
FIG. 15A and FIG. 15B form a flowchart of an embodiment of a method of executing the artificial intelligence model of the present invention. FIG. 15A and FIG. 15B include the following steps.
Step S1505: The memory management circuit 916 reads a tensor (i.e., the input data Din) and a plurality of kernel parameters Kp from the external memory 902, and stores the tensor and the kernel parameters Kp into the memory 918.
Step S1510: The processing circuit 910 (more specifically, the computation circuit 912) performs a first type of operation on a first sub-tensor of the tensor to generate first intermediate data, and the memory management circuit 916 stores the first intermediate data into the memory 918. Taking FIG. 4 as an example, the first sub-tensor may be the input sub-tensor of subtraction sub-operator 110_1 (i.e., sub-tensor TS1_i1), the first type of operation may be a subtraction operation (a kind of vector operation), and the first intermediate data may be the output sub-tensor of subtraction sub-operator 110_1 (i.e., sub-tensor TS1_o1). Taking FIG. 14 as an example, the first sub-tensor may be the input sub-tensor of an addition sub-operator (e.g., "ADD1"), the first type of operation may be an addition operation (a kind of vector operation), and the first intermediate data may be the output sub-tensor of that addition sub-operator (e.g., "ADD1").
Step S1520: The processing circuit 910 (more specifically, the computation circuit 912) performs the first type of operation on a second sub-tensor of the tensor to generate second intermediate data, and the memory management circuit 916 stores the second intermediate data into the memory 918. Taking FIG. 4 as an example, the second sub-tensor may be the input sub-tensor of subtraction sub-operator 110_2 (i.e., sub-tensor TS1_i2), the first type of operation may be a subtraction operation, and the second intermediate data may be the output sub-tensor of subtraction sub-operator 110_2 (i.e., sub-tensor TS1_o2). Taking FIG. 14 as an example, the second sub-tensor may be the input sub-tensor of an addition sub-operator (e.g., "ADD2"), the first type of operation may be an addition operation, and the second intermediate data may be the output sub-tensor of that addition sub-operator (e.g., "ADD2").
Step S1530: The processing circuit 910 (more specifically, the computation circuit 912) performs a second type of operation on the first intermediate data and the second intermediate data to generate third intermediate data, and the memory management circuit 916 stores the third intermediate data into the memory 918. Taking FIG. 4 and FIG. 14 as examples, the second type of operation may be a convolution operation (e.g., convolution sub-operator 120_1 or "CONV1"), and the third intermediate data may be the result of that convolution operation (e.g., sub-tensor TS2_o1 of FIG. 4).
Step S1540: The memory management circuit 916 deletes the third intermediate data from the memory 918. As shown in FIG. 10, because the operations after the point where the operation order is 3 no longer use sub-tensor TS2_o1 (i.e., sub-tensor TS2_o1 becomes inactive starting from life cycle 3 of FIG. 12A), sub-tensor TS2_o1 can be deleted from the memory 918 to release part of the memory 918.
Step S1550: The processing circuit 910 (more specifically, the computation circuit 912) performs the first type of operation on a third sub-tensor of the tensor to generate fourth intermediate data, and the memory management circuit 916 stores the fourth intermediate data into the memory 918. Taking FIG. 4 as an example, the third sub-tensor may be the input sub-tensor of subtraction sub-operator 110_3 (i.e., sub-tensor TS1_i3), the first type of operation may be a subtraction operation, and the fourth intermediate data may be the output sub-tensor of subtraction sub-operator 110_3 (i.e., sub-tensor TS1_o3). Taking FIG. 14 as an example, the third sub-tensor may be the input sub-tensor of an addition sub-operator (e.g., "ADD3"), the first type of operation may be an addition operation, and the fourth intermediate data may be the output sub-tensor of that addition sub-operator (e.g., "ADD3").
Step S1560: The processing circuit 910 (more specifically, the computation circuit 912) performs the second type of operation on the first intermediate data, the second intermediate data, and the fourth intermediate data to generate fifth intermediate data, and the memory management circuit 916 stores the fifth intermediate data into the memory 918. Taking FIG. 4 and FIG. 14 as examples, the second type of operation may be a convolution operation (e.g., convolution sub-operator 120_2 or "CONV2"), and the fifth intermediate data may be the result of that convolution operation (e.g., sub-tensor TS2_o2 of FIG. 4).
Step S1570: The memory management circuit 916 deletes the first intermediate data from the memory 918. As shown in FIG. 10, because the operations after the point where the operation order is 5 no longer use sub-tensor TS1_o1 (i.e., sub-tensor TS1_o1 becomes inactive starting from life cycle 5 of FIG. 12B), sub-tensor TS1_o1 can be deleted from the memory 918 to release part of the memory 918.
Step S1580: The memory management circuit 916 deletes the fifth intermediate data from the memory 918. As shown in FIG. 10, because the operations after the point where the operation order is 6 no longer use sub-tensor TS2_o2 (i.e., sub-tensor TS2_o2 becomes inactive starting from life cycle 6 of FIG. 12B), sub-tensor TS2_o2 can be deleted from the memory 918 to release part of the memory 918.
As shown in the discussion of FIG. 15A and FIG. 15B, the developer of the chip 901 can arrange in advance, according to the usage state of the memory 918, the instructions executed by the processing circuit 910 (more specifically, the memory management circuit 916) (in some embodiments, these instructions are provided to the processing circuit 910 by the processor 920), so as to make full use of the memory 918.
Please refer to FIG. 16, which is a schematic diagram of an embodiment of the contents stored in the buffer circuit 914 and the memory 918 of the present invention. In the example of FIG. 16, the kernel parameters Kp stored in the memory 918 include a subtraction kernel parameter Kp_s (a kind of vector kernel parameter), a convolution kernel parameter Kp_c, and an addition kernel parameter Kp_a (a kind of vector kernel parameter).
The subtraction kernel parameter Kp_s includes sub-parameters Kp_s1, Kp_s2, and Kp_s3. The subtraction kernel parameter Kp_s is the parameter needed when the subtraction operator 110 performs a subtraction operation on tensor TS1, and sub-parameters Kp_s1, Kp_s2, and Kp_s3 may correspond to subtraction sub-operators 110_1, 110_2, and 110_3 of FIG. 4, respectively. That is, the computation circuit 912 refers to sub-parameter Kp_s1 (Kp_s2 or Kp_s3) when executing subtraction sub-operator 110_1 (110_2 or 110_3) to operate on sub-tensor TS1_i1 (TS1_i2 or TS1_i3). In some embodiments, sub-parameter Kp_s1 is equal to neither sub-parameter Kp_s2 nor sub-parameter Kp_s3, and sub-parameter Kp_s2 is not equal to sub-parameter Kp_s3.
The addition kernel parameter Kp_a includes sub-parameters Kp_a1, Kp_a2, and Kp_a3. The addition kernel parameter Kp_a is the parameter needed when the addition operator 130 performs an addition operation on tensor TS3, and sub-parameters Kp_a1, Kp_a2, and Kp_a3 may correspond to addition sub-operators 130_1, 130_2, and 130_3 of FIG. 4, respectively. That is, the computation circuit 912 refers to sub-parameter Kp_a1 (Kp_a2 or Kp_a3) when executing addition sub-operator 130_1 (130_2 or 130_3) to operate on sub-tensor TS3_i1 (TS3_i2 or TS3_i3). In some embodiments, sub-parameter Kp_a1 is equal to neither sub-parameter Kp_a2 nor sub-parameter Kp_a3, and sub-parameter Kp_a2 is not equal to sub-parameter Kp_a3.
The convolution kernel parameter Kp_c includes sub-parameters Kp_c1, Kp_c2, and Kp_c3. Unlike the subtraction and addition operations, although the operators and tensors have been split, convolution sub-operators 120_1, 120_2, and 120_3 still need to refer to the complete convolution kernel parameter Kp_c when performing their convolution operations.
Please refer to FIG. 17, which is a detailed flow of step S1510 or step S1520 of FIG. 15A and includes the following steps. Please also refer to FIG. 16 for the following description.
Step S1710: The memory management circuit 916 reads a target part of the vector kernel parameter from the memory 918 and stores the target part into the buffer circuit 914. More specifically, for step S1510 the target part may be sub-parameter Kp_s1, and for step S1520 the target part may be sub-parameter Kp_s2. As shown in FIG. 16, the buffer circuit 914 stores sub-parameter Kp_s1 and/or sub-parameter Kp_s2 and other data (e.g., the sub-tensor to be operated on). Because the first type of operation (e.g., a vector operation) of steps S1510 and S1520 needs only a part of the subtraction kernel parameter Kp_s (i.e., the target part), the buffer circuit 914 does not need to store the complete subtraction kernel parameter Kp_s at this time, which saves storage space. In some embodiments, the memory management circuit 916 stores sub-parameter Kp_s1 (Kp_s2) into the buffer circuit 914 before step S1510 (S1520), and deletes sub-parameter Kp_s1 (Kp_s2) from the buffer circuit 914 after step S1510 (S1520) is completed, to save storage space.
Step S1720: The computation circuit 912 performs a target vector operation on a target sub-tensor with reference to the target part of the vector kernel parameter to generate target intermediate data. More specifically, for step S1510 the target sub-tensor may be sub-tensor TS1_i1, the target vector operation may be subtraction sub-operator 110_1, and the target intermediate data may be sub-tensor TS1_o1. For step S1520, the target sub-tensor may be sub-tensor TS1_i2, the target vector operation may be subtraction sub-operator 110_2, and the target intermediate data may be sub-tensor TS1_o2. For FIG. 14, the target part may be sub-parameter Kp_a1, the target sub-tensor may be the input sub-tensor of an addition sub-operator (e.g., "ADD1"), the target vector operation may be that addition sub-operator, and the first intermediate data may be the output sub-tensor of that addition sub-operator.
As described above, because the original tensor has been split into multiple sub-tensors, a vector operation on a sub-tensor only needs to refer to part of the kernel parameters.
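The following sketch illustrates this point; it assumes the vector kernel parameter is a per-row value that can be sliced along the same dimension as the tensor, which is an assumption made for the example rather than the patent's actual parameter layout.

```python
import numpy as np

# Assumed layout: one parameter value per row of the split dimension (3 rows).
Kp_s = np.array([0.1, 0.2, 0.3], dtype=np.float32)   # full vector kernel parameter
TS1 = np.random.rand(1, 3, 224, 224).astype(np.float32)

for i in range(3):                                    # SUB1, SUB2, SUB3
    tile = TS1[:, i:i + 1, :, :]                      # TS1_i1 / TS1_i2 / TS1_i3
    kp_slice = Kp_s[i:i + 1]                          # only Kp_s1 / Kp_s2 / Kp_s3
    out_tile = tile - kp_slice.reshape(1, -1, 1, 1)   # vector op uses just its slice
    print(f"SUB{i + 1}: tile {tile.shape}, parameter slice {kp_slice.shape}")
```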
A person having ordinary skill in the art can understand the details of step S1550 of FIG. 15B from the description of FIG. 17, so they are not repeated here. For example, for step S1550 the target part may be sub-parameter Kp_s3, the target sub-tensor may be sub-tensor TS1_i3, the target vector operation may be subtraction sub-operator 110_3, and the target intermediate data may be sub-tensor TS1_o3.
Please refer to FIG. 18, which is a detailed flow of step S1530 of FIG. 15A and includes the following steps. Please also refer to FIG. 19 for the following description.
Step S1810: The memory management circuit 916 reads the convolution kernel parameter Kp_c from the memory 918 and stores the convolution kernel parameter Kp_c into the buffer circuit 914. As shown in FIG. 19, before the convolution operation starts, the buffer circuit 914 stores the convolution kernel parameter Kp_c and other data (e.g., the sub-tensors to be operated on). Because the second type of operation (e.g., a convolution operation) of step S1530 needs the complete convolution kernel parameter Kp_c, the buffer circuit 914 must store the complete convolution kernel parameter Kp_c at this time.
Step S1820: The computation circuit 912 performs the second type of operation on the first intermediate data and the second intermediate data with reference to the convolution kernel parameter Kp_c to generate the third intermediate data. Performing a convolution operation on a tensor with reference to the convolution kernel parameter Kp_c is well known to those having ordinary skill in the art and is not described here. In some embodiments, the memory management circuit 916 deletes the convolution kernel parameter Kp_c from the buffer circuit 914 after step S1820 ends.
Please refer to FIG. 20, which is a detailed flow of step S1560 of FIG. 15B and includes the following steps. Please also refer to FIG. 19 for the following description.
Step S2010: Step S2010 is similar to step S1810 and is not described again. Note that if the memory management circuit 916 did not delete the convolution kernel parameter Kp_c from the buffer circuit 914 after step S1530 ended, step S2010 can be skipped.
Step S2020: The computation circuit 912 performs the second type of operation on the first intermediate data, the second intermediate data, and the fourth intermediate data with reference to the convolution kernel parameter Kp_c to generate the fifth intermediate data. Step S2020 is similar to step S1820 and is not described again.
Please refer to FIG. 21, which is a functional block diagram of an embodiment of the memory management circuit 916 of the present invention. The memory management circuit 916 includes at least two channels (channel 916a and channel 916b); each channel can operate independently, and multiple channels can operate simultaneously. Based on this characteristic, the present invention further divides the sub-operators and sub-tensors into several smaller blocks, so that the processing circuit 910 can compute the AI model in a multistage-pipeline manner, thereby improving the performance of the chip 901.
Please refer to FIG. 22, which is a schematic diagram of the multistage pipeline of the present invention. FIG. 22 uses subtraction sub-operator 110_1, subtraction sub-operator 110_2, and convolution sub-operator 120_1 as examples. In the example of FIG. 22, subtraction sub-operator 110_1 is further divided into operation block 110_1a ("SUB1a") and operation block 110_1b ("SUB1b"), and sub-tensor TS1_i1 is further divided into data block TS1_i1a and data block TS1_i1b; subtraction sub-operator 110_2 is further divided into operation block 110_2a ("SUB2a") and operation block 110_2b ("SUB2b"), and sub-tensor TS1_i2 is further divided into data block TS1_i2a and data block TS1_i2b; convolution sub-operator 120_1 is further divided into operation block 120_1a ("CONV1a") and operation block 120_1b ("CONV1b"), and sub-tensor TS2_i1 is further divided into data block TS2_i1a and data block TS2_i1b. In this way, while channel 916a performs the operations related to operation block 110_1a (between time T0 and time T3), channel 916b can perform the operations related to operation block 110_1b substantially at the same time (between time T1 and time T4). Compared with a single-stage pipeline (i.e., not using multiple channels simultaneously to compute the AI model), using two channels simultaneously saves roughly half of the time. Similarly, using N channels simultaneously takes only about 1/N of the processing time of a single-stage pipeline.
Please refer to FIG. 23, which is a flowchart of an embodiment of the multistage-pipeline operation of the present invention and includes the following steps.
Step S2310: The memory management circuit 916 uses a first channel (e.g., channel 916a) to read a first data block (e.g., data block TS1_i1a) of the first sub-tensor (e.g., sub-tensor TS1_i1) from the memory 918, and stores the first data block into the buffer circuit 914. For example, step S2310 may correspond to the period between time T0 and time T1 of FIG. 22 (i.e., the "SUB1a load" operation).
Step S2320: The computation circuit 912 performs the first type of operation (e.g., a subtraction operation) on the first data block (e.g., data block TS1_i1a) to generate a first part of the first intermediate data (e.g., part of sub-tensor TS1_o1). For example, step S2320 may correspond to the period between time T1 and time T2 of FIG. 22 (i.e., the "SUB1a compute" operation).
Step S2330: The memory management circuit 916 uses a second channel (e.g., channel 916b) to read a second data block (e.g., data block TS1_i1b) of the first sub-tensor from the memory 918, and stores the second data block into the buffer circuit 914. For example, step S2330 may correspond to the period between time T1 and time T2 of FIG. 22 (i.e., the "SUB1b load" operation). In other words, step S2320 and step S2330 are performed at least partially simultaneously.
Step S2340: The memory management circuit 916 uses the first channel (e.g., channel 916a) to store the first part of the first intermediate data (e.g., part of sub-tensor TS1_o1) into the memory 918. For example, step S2340 may correspond to the period between time T2 and time T3 of FIG. 22 (i.e., the "SUB1a store" operation).
Step S2350: The computation circuit 912 performs the first type of operation (e.g., a subtraction operation) on the second data block (e.g., data block TS1_i1b) to generate a second part of the first intermediate data (e.g., part of sub-tensor TS1_o1). For example, step S2350 may correspond to the period between time T2 and time T3 of FIG. 22 (i.e., the "SUB1b compute" operation). In other words, step S2340 and step S2350 are performed at least partially simultaneously.
Step S2360: The memory management circuit 916 uses the second channel (e.g., channel 916b) to store the second part of the first intermediate data (e.g., part of sub-tensor TS1_o1) into the memory 918. For example, step S2360 may correspond to the period between time T3 and time T4 of FIG. 22 (i.e., the "SUB1b store" operation).
A person having ordinary skill in the art can understand, from the description of FIG. 23, the other operations after time T4 of FIG. 22, so they are not described here.
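For illustration, the two-channel load/compute/store overlap of FIG. 22 and FIG. 23 can be mimicked in software with two worker threads standing in for channels 916a and 916b; the sketch below is only an analogue of the hardware pipeline, and the timing constants are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load(block):                       # stand-in for a DMA "load" on one channel
    time.sleep(0.01)
    return f"{block} loaded"

def compute(data):                     # stand-in for the "compute" phase
    time.sleep(0.01)
    return data.replace("loaded", "computed")

def store(result):                     # stand-in for a DMA "store" on one channel
    time.sleep(0.01)
    print(result, "-> stored")

blocks = ["SUB1a", "SUB1b", "SUB2a", "SUB2b", "CONV1a", "CONV1b"]

# Two single-worker executors model the two independent channels; the compute
# of block i overlaps the load of block i+1, as between T1 and T2 in FIG. 22.
with ThreadPoolExecutor(max_workers=1) as ch_a, ThreadPoolExecutor(max_workers=1) as ch_b:
    channels = [ch_a, ch_b]
    pending = channels[0].submit(load, blocks[0])
    for i in range(len(blocks)):
        nxt = channels[(i + 1) % 2].submit(load, blocks[i + 1]) if i + 1 < len(blocks) else None
        result = compute(pending.result())      # compute block i while block i+1 loads
        channels[i % 2].submit(store, result)   # store block i on its own channel
        pending = nxt
```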
Although embodiments of the present invention are described above, these embodiments are not intended to limit the present invention. A person having ordinary skill in the art may apply variations to the technical features of the present invention according to the explicit or implicit contents of the present invention, and all such variations may fall within the scope of patent protection sought by the present invention. In other words, the scope of patent protection of the present invention shall be defined by the claims of this specification.
100, 1400: AI network
Din: input data
Dout: output data
110, SUB: subtraction operator
120, 1420, CONV: convolution operator
130, 1410, ADD: addition operator
TS1, TS2, TS3, TS4: tensor
110_1, 110_2, 110_3, SUB1, SUB2, SUB3: subtraction sub-operator
120_1, 120_2, 120_3, CONV1, CONV2, CONV3, CONV56: convolution sub-operator
130_1, 130_2, 130_3, ADD1, ADD2, ADD3, ADD56: addition sub-operator
TS1_i1, TS1_i2, TS1_i3, TS3_o1, TS3_o2, TS3_o3, TS3_i1, TS3_i2, TS3_i3, TS2_o1, TS2_o2, TS2_o3, TS1_o1, TS1_o2, TS1_o3, TS2_i1, TS2_i2, TS2_i3: sub-tensor
800: queue
900: electronic device
901: chip
902: external memory
910: processing circuit
920: processor
912: computation circuit
914: buffer circuit
916: memory management circuit
918: memory
Kp: kernel parameters
Kp_s: subtraction kernel parameter
Kp_c: convolution kernel parameter
Kp_a: addition kernel parameter
Kp_s1, Kp_s2, Kp_s3, Kp_a1, Kp_a2, Kp_a3, Kp_c1, Kp_c2, Kp_c3: sub-parameter
916a, 916b: channel
110_1a, 110_1b, 110_2a, 110_2b, 120_1a, 120_1b, SUB1a, SUB1b, SUB2a, SUB2b, CONV1a, CONV1b: operation block
TS1_i1a, TS1_i1b, TS1_i2a, TS1_i2b, TS2_i1a, TS2_i1b: data block
T0, T1, T2, T3, T4: time point
S210, S220, S230, S240, S250, S510, S520, S530, S710, S720, S730, S740, S750, S760, S770, S780, S790, S795, S1110, S1120, S1130, S1140, S1150, S1160, S1170, S1180, S1505, S1510, S1520, S1530, S1540, S1550, S1560, S1570, S1580, S1710, S1720, S1810, S1820, S2010, S2020, S2310, S2320, S2330, S2340, S2350, S2360: step
FIG. 1 shows an example of an AI network;
FIG. 2 is a flowchart of an embodiment of the computation scheduling method of the artificial intelligence model of the present invention;
FIG. 3 shows the result of splitting the tensors and operators of FIG. 1;
FIG. 4 shows a topological graph of the connections between sub-operators;
FIG. 5 is a detailed flow of an embodiment of step S230 of FIG. 2;
FIG. 6 shows the dependency relationships among multiple sub-operators;
FIG. 7 is a detailed flow of an embodiment of step S240 of FIG. 2;
FIG. 8A and FIG. 8B are schematic diagrams of an embodiment of the queue of the present invention;
FIG. 9 is a functional block diagram of an embodiment of the electronic device of the present invention;
FIG. 10 is a schematic diagram of an embodiment of the life-span list of the present invention;
FIG. 11 is a flowchart of allocating memory;
FIG. 12A and FIG. 12B are schematic diagrams of an embodiment of the active list of the present invention;
FIG. 13 is a schematic diagram of another AI model;
FIG. 14 is a schematic diagram of another AI model;
FIG. 15A and FIG. 15B form a flowchart of an embodiment of a method of executing the artificial intelligence model of the present invention;
FIG. 16 is a schematic diagram of an embodiment of the contents stored in the buffer circuit and the memory of the present invention;
FIG. 17 is a detailed flow of step S1510 or step S1520 of FIG. 15A;
FIG. 18 is a detailed flow of step S1530 of FIG. 15A;
FIG. 19 is a schematic diagram of another embodiment of the contents stored in the buffer circuit and the memory of the present invention;
FIG. 20 is a detailed flow of step S1560 of FIG. 15B;
FIG. 21 is a functional block diagram of an embodiment of the memory management circuit of the present invention;
FIG. 22 is a schematic diagram of the multistage pipeline of the present invention; and
FIG. 23 is a flowchart of an embodiment of the multistage-pipeline operation of the present invention.
900: electronic device
901: chip
902: external memory
910: processing circuit
912: computation circuit
914: buffer circuit
916: memory management circuit
918: memory
920: processor
Kp: kernel parameters
Din: input data
Dout: output data
Claims (16)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112108662A TWI869788B (en) | 2023-03-09 | 2023-03-09 | Processing circuit and computation scheduling method of artificial intelligence model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202437087A TW202437087A (en) | 2024-09-16 |
| TWI869788B (en) | 2025-01-11 |
Family
ID=93609508
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW112108662A TWI869788B (en) | 2023-03-09 | 2023-03-09 | Processing circuit and computation scheduling method of artificial intelligence model |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI869788B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110674936A (en) * | 2019-09-24 | 2020-01-10 | 上海寒武纪信息科技有限公司 | Neural network processing method and device, computer equipment and storage medium |
| EP3888012A1 (en) * | 2018-12-31 | 2021-10-06 | Microsoft Technology Licensing, LLC | Adjusting precision and topology parameters for neural network training based on a performance metric |
| US20220391665A1 (en) * | 2019-09-24 | 2022-12-08 | Anhui Cambricon Information Technology Co., Ltd. | Method for splitting neural network model by using multi-core processor, and related product |
| TW202301109A (en) * | 2021-06-17 | 2023-01-01 | 美商萬國商業機器公司 | Single function to perform combined matrix multiplication and bias add operations |
Also Published As
| Publication number | Publication date |
|---|---|
| TW202437087A (en) | 2024-09-16 |