
TWI775170B - Method for cpu to execute artificial intelligent related processes - Google Patents

Method for cpu to execute artificial intelligent related processes Download PDF

Info

Publication number
TWI775170B
TWI775170B TW109134194A
Authority
TW
Taiwan
Prior art keywords
artificial intelligence
cpu
weighted
program
processing unit
Prior art date
Application number
TW109134194A
Other languages
Chinese (zh)
Other versions
TW202215305A (en
Inventor
曾建維
江錦陵
陳柏旭
Original Assignee
新漢股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 新漢股份有限公司 filed Critical 新漢股份有限公司
Priority to TW109134194A priority Critical patent/TWI775170B/en
Publication of TW202215305A publication Critical patent/TW202215305A/en
Application granted granted Critical
Publication of TWI775170B publication Critical patent/TWI775170B/en

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Saccharide Compounds (AREA)

Abstract

A method for a CPU to execute artificial-intelligence-related processes is disclosed, comprising the following steps: when TensorFlow runs on an electronic device, calling a corresponding AI module of TensorFlow according to the content of the program code; determining and obtaining one or more sparse matrices used by the AI module while it runs its calculations; performing a matrix simplification procedure on the one or more sparse matrices; performing an instruction transformation procedure on the instruction set applied by the AI module; the AI module applying the transformed instruction set to issue instructions to a weighted CPU of the electronic device; and, after receiving the instructions, the weighted CPU evenly distributing the multiple processes indicated by the AI module to run on its multiple threads according to the weight values of those processes.

Description

CPU應用於人工智慧相關程序時的執行方法 Execution method for a CPU executing artificial-intelligence-related programs

本發明涉及中央處理單元，尤其涉及中央處理單元在執行與人工智慧相關程序時的執行方法。 The present invention relates to a central processing unit, and in particular to an execution method of a central processing unit when it executes programs related to artificial intelligence.

近年來人工智慧的技術蓬勃發展,各大企業紛紛投入人工智慧的相關產品的研發動作。 In recent years, artificial intelligence technology has developed vigorously, and major companies have invested in the research and development of artificial intelligence-related products.

一般來說，軟體工程師經常使用谷歌公司(Google)所開發的TensorFlow來開發人工智慧的相關程序。具體地，TensorFlow是一種應用於機器學習的開源軟體庫，並且TensorFlow提供了多種人工智慧相關模型，有助於軟體工程師直接用來進行人工智慧的開發。 Generally speaking, software engineers often use TensorFlow, developed by Google, to develop artificial-intelligence-related programs. Specifically, TensorFlow is an open-source software library for machine learning, and it provides a variety of AI-related models that software engineers can use directly for AI development.

然而，於現有技術中一般並不會對所述人工智慧相關模型用來進行運算的矩陣進行優化，因此這些模型在進行運算時所使用的計算參數較多。也因為這樣，目前若要執行人工智慧的相關運算，通常需要使用較高階的硬體才能完成。例如，需要使用高階的中央處理單元(Central Processing Unit,CPU)，或是需要使用特定的圖像處理單元(Graphics Processing Unit,GPU)。然而，上述硬體的使用實大幅提高了人工智慧的開發、使用成本，並且也造成了硬體散熱不易的問題。 However, the prior art generally does not optimize the matrices that the AI-related models use for their calculations, so these models use many computing parameters when performing operations. As a result, executing AI-related operations currently tends to require higher-end hardware, for example a high-end central processing unit (CPU) or a dedicated graphics processing unit (GPU). The use of such hardware greatly increases the cost of developing and using artificial intelligence, and also makes hardware heat dissipation difficult.

另外，有鑑於目前CPU、GPU的強大計算能力，CPU在執行上並不會也不需要對各項程序進行最佳化排序。具體地，CPU通常僅會單純地按照順序來執行所接收的多個指令，因此常會發生CPU的一個執行緒(Thread)無法算完一個程序，而需要交給下一個執行緒繼續計算的現象。當於上述現象發生時，CPU即增加了執行SWAP的時間成本，導致整體速度變慢，而需要更換更高階的CPU，或是必須額外設置GPU。於此情況下，無疑是阻擋了利用低階的硬體，例如X86架構的工業電腦(Industrial Personal Computer,IPC)來開發、使用人工智慧的可能性，實相當可惜。 In addition, given the powerful computing capability of current CPUs and GPUs, a CPU neither does nor needs to optimally order the processes it executes. Specifically, a CPU usually simply executes the received instructions in order, so it often happens that one thread of the CPU cannot finish computing a process and must hand it over to the next thread to continue the calculation. When this occurs, the CPU incurs the extra time cost of the swap, slowing overall execution, so a higher-end CPU must be substituted or an additional GPU must be installed. This effectively blocks the possibility of developing and using artificial intelligence on low-end hardware such as X86-architecture industrial personal computers (IPCs), which is quite a pity.

本發明的主要目的，在於提供一種CPU應用於人工智慧相關程序時的執行方法，係對人工智慧模型在運算時採用的矩陣進行簡化，同時最佳化CPU的執行時間，藉此令低階的硬體設備也可以被用於開發、使用人工智慧。 The main purpose of the present invention is to provide an execution method for a CPU applied to AI-related processes, which simplifies the matrices used by the AI model in its calculations and optimizes the CPU's execution time, so that low-end hardware devices can also be used to develop and use artificial intelligence.

為了達成上述的目的，本發明的執行方法主要包括下列步驟：於一電子設備上執行TensorFlow時，依據程式內容呼叫對應的一或多個人工智慧模型；判斷並取出該些人工智慧模型於運算中使用的一或多個稀疏矩陣；對該些稀疏矩陣進行一矩陣簡化程序；對該些人工智慧模型所採用的一指令集進行指令轉換；該些人工智慧模型通過轉換後的該指令集對一加權值中央處理單元進行指令發佈；及，該加權值中央處理單元接收指令後，依據權重值平均分配由多個執行緒來分別執行人工智慧模型指示的多個程序。 To achieve the above purpose, the execution method of the present invention mainly includes the following steps: when TensorFlow is executed on an electronic device, calling one or more corresponding AI models according to the program content; determining and extracting one or more sparse matrices used by these AI models in their calculations; performing a matrix simplification procedure on the sparse matrices; performing instruction transformation on an instruction set adopted by the AI models; the AI models issuing instructions to a weighted CPU through the transformed instruction set; and, after the weighted CPU receives the instructions, evenly distributing the multiple processes indicated by the AI models among multiple threads for execution according to their weight values.

相較於現有技術，本發明先通過矩陣簡化程序來壓縮人工智慧模型於運算中使用的稀疏矩陣，以令運算中的計算參數產生縮減。並且，本發明先藉由轉換後的指令對人工智慧模型所指示的多個程序進行權重值的設定後，再下指令給加權值中央處理單元，藉此加權值中央處理單元可以將這些程序平均分配給多個執行緒來執行，以最佳化CPU執行時間。藉此，達到令低階的硬體設備也可以被用來開發、使用人工智慧的主要目的。 Compared with the prior art, the present invention first compresses the sparse matrices used by the AI model through the matrix simplification procedure, reducing the number of calculation parameters in the computation. In addition, the present invention sets weight values for the multiple processes indicated by the AI model through the transformed instructions before issuing the instructions to the weighted CPU, so that the weighted CPU can evenly distribute these processes among multiple threads for execution, optimizing CPU execution time. This achieves the main purpose of enabling low-end hardware devices to be used to develop and use artificial intelligence.

1:TensorFlow 1: TensorFlow

2:預測程序 2: Prediction procedure

3:矩陣簡化程序 3: Matrix simplification procedure

31:稀疏矩陣 31: Sparse Matrix

32:合併矩陣 32: Merge Matrix

4:加權值中央處理單元 4: Weighted value central processing unit

41:加權CPU時間 41: Weighted CPU time

42:加權程序時間 42: Weighted Program Time

43:執行頻率 43: Execution frequency

44:CPU時間 44: CPU time

45:週期時間 45: Cycle time

46:週期計數 46: Cycle count

47:停頓週期 47: Pause Cycle

5:執行緒 5: Thread

S10~S24:執行步驟 S10~S24: Execution steps

圖1為本發明的第一具體實施例的系統架構圖。 FIG. 1 is a system architecture diagram of a first specific embodiment of the present invention.

圖2為本發明的第一具體實施例的執行方法流程圖。 FIG. 2 is a flow chart of the execution method of the first specific embodiment of the present invention.

圖3為本發明的第一具體實施例的矩陣簡化示意圖。 FIG. 3 is a simplified schematic diagram of a matrix according to the first specific embodiment of the present invention.

圖4為本發明的第一具體實施例的加權值中央處理單元的示意圖。 FIG. 4 is a schematic diagram of a weighted value central processing unit according to the first specific embodiment of the present invention.

茲就本發明之一較佳實施例,配合圖式,詳細說明如後。 Hereinafter, a preferred embodiment of the present invention will be described in detail in conjunction with the drawings.

本發明揭露了一種CPU應用於人工智慧相關程序時的執行方法(下面將於說明書中簡稱為執行方法)，所述執行方法主要可被應用在使用加權值中央處理單元(Weighted Central Processing Unit,Weighted CPU)的電子設備上。更具體地，本發明中所指的電子設備，主要是指採用了較低階的CPU，並且沒有將圖像處理單元(Graphics Processing Unit,GPU)視為必要配備的電子設備，例如個人電腦(Personal Computer,PC)、工業電腦(Industrial PC,IPC)、工業伺服器等，但並不以此為限。通過本發明的執行方法，可以協助使用者藉由相對低階的硬體設備來開發、使用人工智慧。 The present invention discloses an execution method for a CPU applied to AI-related processes (hereinafter referred to as the execution method), which is mainly applied to electronic devices that use a weighted central processing unit (Weighted CPU). More specifically, the electronic devices referred to in the present invention mainly use lower-end CPUs and do not treat a graphics processing unit (GPU) as necessary equipment, for example personal computers (PCs), industrial PCs (IPCs), and industrial servers, but are not limited thereto. The execution method of the present invention can help users develop and use artificial intelligence with relatively low-end hardware.

首請參閱圖1，為本發明的第一具體實施例的系統架構圖。如上所述，本發明的執行方法主要用於協助使用者使用低階的硬體設備來開發、使用一般需要耗費大量運算資源的人工智慧，並且本技術領域中的技術人員皆知，於人工智慧的開發中，經常會使用谷歌公司(Google)所開發的TensorFlow來進行各項運算。 First, please refer to FIG. 1, a system architecture diagram of the first specific embodiment of the present invention. As mentioned above, the execution method of the present invention is mainly used to help users develop and use, on low-end hardware, artificial intelligence that generally consumes a great deal of computing resources. As is well known to those skilled in the art, TensorFlow, developed by Google, is often used to perform the various computations in AI development.

TensorFlow 1是一種應用於機器學習的開源軟體庫，並且TensorFlow提供了多種主要被用來開發人工智慧相關程序的模型(下面於說明書中稱為人工智慧模型)。軟體工程師在撰寫程式時，可以直接引用TensorFlow提供的多個人工智慧模型，以直接、快速地對資料進行分析與運算，進而實現如機器學習(Machine Learning)、預測(Inference)等人工智慧相關的程序。 TensorFlow 1 is an open-source software library for machine learning and provides a variety of models mainly used to develop AI-related programs (hereinafter referred to as AI models). When writing programs, software engineers can directly reference the AI models provided by TensorFlow to analyze and compute data directly and quickly, thereby implementing AI-related processes such as machine learning and inference.

如圖1所示，使用者於電子設備上撰寫程序時，可以執行上述TensorFlow 1，藉此呼叫TensorFlow 1的一或多個人工智慧模型，並藉由這些人工智慧模型來實現人工智慧中的預測程序(Inference Process)2。於一實施例中，使用者主要可以於程式碼中通過例如import TensorFlow的指令來令電子設備執行TensorFlow，以便呼叫TensorFlow中的一或多個人工智慧模型。於另一實施例中，所述人工智慧模型可例如為imageNet模型、inception模型等與影像處理相關的模型，但不以此為限。 As shown in FIG. 1, when writing a program on an electronic device, the user can execute the above-mentioned TensorFlow 1 to call one or more AI models of TensorFlow 1, and use these AI models to implement the inference process 2 of artificial intelligence. In one embodiment, the user can make the electronic device execute TensorFlow through an instruction such as import TensorFlow in the code, so as to call one or more AI models in TensorFlow. In another embodiment, the AI model may be, for example, an imageNet model, an inception model, or another model related to image processing, but is not limited thereto.
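The calling flow described above — program content names a model, and the framework dispatches to the matching AI model — can be sketched as follows. This is a minimal illustrative sketch: the registry, the decorator, and the model functions are assumptions for illustration, not TensorFlow's actual API.

```python
# Hypothetical dispatcher: map model names to callables and pick the
# model that the program code refers to.
MODEL_REGISTRY = {}

def register(name):
    """Register a model function under a name (illustrative helper)."""
    def wrap(fn):
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap

@register("imagenet")
def imagenet_model(data):
    return f"imagenet inference on {data}"

@register("inception")
def inception_model(data):
    return f"inception inference on {data}"

def call_model(program_code, data):
    """Dispatch to whichever registered model the program code mentions."""
    for name, model in MODEL_REGISTRY.items():
        if name in program_code:
            return model(data)
    raise LookupError("no matching AI model in program code")

print(call_model("run inception on the camera frame", "frame-001"))
# inception inference on frame-001
```

In the actual embodiment the role of `call_model` is played by TensorFlow itself once the code executes `import TensorFlow` and invokes a model such as imageNet or inception.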

本技術領域中的技術人員皆知，目前絕大多數的人工智慧模型(例如上述的imageNet模型、inception模型)都是利用矩陣(matrix)來進行資料運算，並且藉由指令集的使用，指示CPU執行(run)運算相關的程序。有鑑於目前CPU的規格不斷提昇，並且部分高階硬體設備中還配置有專門用來處理影像的GPU，因此目前在進行人工智慧的運算時，一般並不會對所述人工智慧模型進行優化，並且也不會特別對CPU所要執行的程序進行分配與排序。因此，人工智慧的運算較難在低階的硬體設備(例如採用低階的CPU，並且不具有GPU的工業電腦)上被實現。 It is well known to those skilled in the art that the vast majority of current AI models (such as the imageNet and inception models mentioned above) use matrices to perform data operations and, through the use of an instruction set, instruct the CPU to run the computation-related processes. Given that CPU specifications keep improving, and that some high-end hardware is also equipped with GPUs dedicated to image processing, the AI models are generally not optimized when AI computations are performed, and the processes to be executed by the CPU are not specially allocated or ordered. Therefore, AI computations are difficult to implement on low-end hardware (for example, industrial computers that use low-end CPUs and have no GPU).

本發明的其中一項主要技術特徵在於，當一支應用程式執行了TensorFlow 1，並且呼叫上述人工智慧模型進行運算以期實現預測程序2時，先對這些人工智慧模型在運算中所要使用的矩陣進行矩陣簡化程序3，以縮減這些人工智慧模型於運算中的計算參數的數量。 One of the main technical features of the present invention is that, when an application executes TensorFlow 1 and calls the above AI models to perform computations for the inference process 2, the matrix simplification procedure 3 is first performed on the matrices these AI models will use, to reduce the number of calculation parameters in their computations.

除了縮減運算中的計算參數的數量之外，本發明還可預先查詢這些人工智慧模型所需採用的指令集，並且對指令集中的指令進行指令轉換程序。藉此，本發明中的加權值中央處理單元4在收到人工智慧模型所發佈的指令後，可以依據所述運算所包含的多個程序的權重值(w)來對這些程序進行排序，以將這些程序平均分配給其下的多個執行緒(thread)5來分別執行，進而實現所述應用程式所要實現的預測程序2。如此一來，本發明可以更好地反映人工智慧所需的各個程序對於CPU的平均資源利用率，進而可以最佳化CPU的執行時間。 In addition to reducing the number of calculation parameters, the present invention can also query in advance the instruction sets required by these AI models and perform an instruction transformation procedure on the instructions in those sets. Thereby, after receiving the instructions issued by the AI model, the weighted CPU 4 of the present invention can sort the processes contained in the computation according to their weight values (w) and evenly distribute them among its multiple threads 5 for execution, thereby implementing the inference process 2 that the application is to carry out. In this way, the present invention better reflects the average CPU resource utilization of each process required by the artificial intelligence, and can thus optimize the CPU's execution time.

本發明的執行方法結合了矩陣簡化程序3以及加權值中央處理單元4的執行緒分配技術，大幅降低了人工智慧運算所需佔用的資源，使得一般低階的硬體設備(例如X86架構的工業電腦)也可以執行人工智慧下的訓練、辨識任務。藉此，使用者可以有效地節省開發與使用人工智慧時的成本，並且可提高電子設備在實現所述預測程序2時的運算速度。 The execution method of the present invention combines the matrix simplification procedure 3 with the thread-allocation technique of the weighted CPU 4, greatly reducing the resources occupied by AI computation, so that ordinary low-end hardware (for example, X86-architecture industrial computers) can also perform AI training and recognition tasks. In this way, users can effectively save the cost of developing and using artificial intelligence, and the computing speed of the electronic device when implementing the inference process 2 can be improved.

請同時參閱圖2，為本發明的第一具體實施例的執行方法流程圖。本發明的執行方法主要是應用在使用加權值中央處理單元4的電子設備上，於一實施例中，本發明的執行方法主要是運用在低階的電子設備(例如採用X86架構的工業電腦)，並且所述電子設備採用Intel架構的加權值中央處理單元4，但不加以限定。 Please also refer to FIG. 2, a flowchart of the execution method of the first specific embodiment of the present invention. The execution method of the present invention is mainly applied to electronic devices that use the weighted CPU 4. In one embodiment, the execution method is mainly used on low-end electronic devices (for example, industrial computers with the X86 architecture), and the electronic device uses a weighted CPU 4 of the Intel architecture, but this is not a limitation.

如圖2所示，所述電子設備在執行程式碼時，係先依據程式碼內容執行TensorFlow 1(步驟S10)，並且依據程式碼內容呼叫TensorFlow中的至少一個人工智慧模型(步驟S12)。於一實施例中，所述程式碼係通過例如import TensorFlow的指令來執行TensorFlow 1，並且呼叫例如imageNet模型、inception模型等主要用來執行影像處理的人工智慧模型，但不加以限定。 As shown in FIG. 2, when executing the program code, the electronic device first executes TensorFlow 1 according to the content of the code (step S10), and calls at least one AI model in TensorFlow according to the content of the code (step S12). In one embodiment, the code executes TensorFlow 1 through an instruction such as import TensorFlow, and calls AI models mainly used for image processing, such as the imageNet model and the inception model, but this is not a limitation.

所述imageNet模型、inception模型係為人工智慧相關程序經常使用的模型，為本技術領域中的技術人員的常用手段，於此不再贅述。 The imageNet and inception models are frequently used in AI-related processes and are familiar tools to those skilled in the art, so they are not described further here.

本發明的執行方法主要係以驅動程式(driver)或中介軟體(middleware)的形式存在於TensorFlow與所述加權值中央處理單元4之間，用以縮減TensorFlow的人工智慧模型於運算時的計算參數的數量，並且令加權值中央處理單元4可以通過多個執行緒5來平均處理這個運算內容，藉此最佳化這個加權值中央處理單元4的執行時間。 The execution method of the present invention mainly exists between TensorFlow and the weighted CPU 4 in the form of a driver or middleware, so as to reduce the number of calculation parameters used by TensorFlow's AI models during computation, and to let the weighted CPU 4 process the computation evenly across its multiple threads 5, thereby optimizing the execution time of the weighted CPU 4.

如圖2所述，本發明的執行方法進一步判斷並取得所述人工智慧模型於運算中使用的一或多個稀疏矩陣(Sparse Matrix)(步驟S14)，並且對一或多個稀疏矩陣執行矩陣簡化程序3(步驟S16)，藉此縮減所述一或多個人工智慧模型在運算中的計算參數的數量。 As shown in FIG. 2, the execution method of the present invention further determines and obtains one or more sparse matrices used by the AI model in its computation (step S14), and performs the matrix simplification procedure 3 on the one or more sparse matrices (step S16), thereby reducing the number of calculation parameters of the one or more AI models in the computation.

請同時參閱圖3，為本發明的第一具體實施例的矩陣簡化示意圖。如圖3所示，稀疏矩陣31指的是內部的元素大部分為零的矩陣，一般在求解線性模型時經常會使用到大型的稀疏矩陣31，也因此，與人工智慧相關的模型經常會利用稀疏矩陣31來進行運算。 Please also refer to FIG. 3, a schematic diagram of matrix simplification in the first specific embodiment of the present invention. As shown in FIG. 3, a sparse matrix 31 is a matrix most of whose elements are zero. Large sparse matrices 31 are often used when solving linear models, and therefore AI-related models often use sparse matrices 31 in their computations.

為了縮減計算參數的數量，所述矩陣簡化程序3主要僅儲存稀疏矩陣31中非零的元素，以壓縮稀疏矩陣31。並且，所述矩陣簡化程序3進一步將壓縮後的稀疏矩陣31分解成一或多個合併矩陣(Consolidate Matrix)32，以完成簡化矩陣的動作。本發明中，被呼叫的人工智慧模型在進行運算時，主要會利用如圖3中所示的合併矩陣32來進行運算。 To reduce the number of calculation parameters, the matrix simplification procedure 3 mainly stores only the non-zero elements of the sparse matrix 31, thereby compressing it. Furthermore, the matrix simplification procedure 3 decomposes the compressed sparse matrix 31 into one or more consolidated matrices 32 to complete the simplification. In the present invention, the called AI model mainly uses the consolidated matrices 32 shown in FIG. 3 for its computations.
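The "store only the non-zero elements" step can be sketched with the standard compressed sparse row (CSR) representation. This is a minimal illustrative sketch of one common compression scheme; the patent does not specify the exact format, and its further decomposition into consolidated matrices 32 is not detailed, so only the compression and a multiply over the compressed form are shown.

```python
def to_csr(dense):
    """Compress a dense matrix by keeping only its non-zero elements
    (CSR form: values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(csr, x):
    """Multiply a CSR matrix by a vector, touching only non-zero entries."""
    values, col_idx, row_ptr = csr
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

sparse = [
    [0, 0, 3],
    [0, 5, 0],
    [0, 0, 0],
]
csr = to_csr(sparse)
print(csr)                          # ([3, 5], [2, 1], [0, 1, 2, 2])
print(csr_matvec(csr, [1, 1, 1]))   # [3, 5, 0]
```

Only two of the nine entries are stored, which is the parameter reduction the matrix simplification procedure aims at.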

上述稀疏矩陣31的壓縮，以及合併矩陣32的產生，都是本技術領域中的公知技術，於此不再贅述。並且，上述將稀疏矩陣31分解為合併矩陣32的方式，僅為本發明的矩陣簡化程序3的其中一種具體實施範例，本技術領域中亦存在其他可以對矩陣進行簡化的類似程序，故於此不再贅述。 The compression of the sparse matrix 31 and the generation of the consolidated matrices 32 are well-known techniques in this field and are not described further here. Moreover, decomposing the sparse matrix 31 into consolidated matrices 32 is only one specific implementation example of the matrix simplification procedure 3 of the present invention; other similar matrix-simplification procedures also exist in this field and are likewise not elaborated here.

回到圖2。於步驟S16後，本發明的執行方法進一步對所述人工智慧模型所採用的指令集進行查詢與分析，並且對指令集中的指令進行指令轉換程序，以產生轉換後指令集(步驟S18)。於一實施例中，所述加權值中央處理單元4為Intel架構的中央處理單元，而所述指令集則為Intel架構的指令集，但並不以此為限。 Returning to FIG. 2: after step S16, the execution method of the present invention further queries and analyzes the instruction set used by the AI model and performs an instruction transformation procedure on the instructions in the set to produce a transformed instruction set (step S18). In one embodiment, the weighted CPU 4 is an Intel-architecture central processing unit and the instruction set is an Intel-architecture instruction set, but this is not a limitation.

如前文所述，現有技術在進行人工智慧的運算時並不會對CPU要執行的程序進行有效的排序與分配，意即，CPU僅會單純地按照指令的接收順序來依序執行多個指令，而不會對其下的多個執行緒進行最佳化的分配，以進行平衡負載(loading)。 As mentioned above, the prior art does not effectively order and allocate the processes that the CPU is to execute during AI computation; that is, the CPU simply executes the received instructions in the order they arrive, without optimally allocating them among its multiple threads to balance the load.

本發明中，所述指令轉換程序係對人工智慧模型所採用的指令集進行修改，使得人工智慧模型在將指令發佈至加權值中央處理單元4之前，可以先判斷、設定所需執行的多個程序的權重值，進而能夠對這些程序進行排序與分配。如此一來，當加權值中央處理單元4接收了人工智慧模型所發佈的指令後，即可依據權重值來將這些程序平均分配給其下的各個執行緒5，藉此可以有效地利用加權值中央處理單元4的有限資源，進而提高這些程序的執行速度。 In the present invention, the instruction transformation procedure modifies the instruction set used by the AI model, so that before issuing instructions to the weighted CPU 4, the AI model can first determine and set the weight values of the multiple processes to be executed, and can therefore order and allocate these processes. In this way, after receiving the instructions issued by the AI model, the weighted CPU 4 can evenly distribute these processes among its threads 5 according to their weight values, effectively utilizing the limited resources of the weighted CPU 4 and increasing the execution speed of these processes.

步驟S18後，所述人工智慧模型即可通過轉換後指令集來對加權值中央處理單元4進行指令發佈(步驟S20)，而加權值中央處理單元4在接收指令後，即可依據人工智慧模型所指示的多個程序的權重值，來將多個程序平均分配給其下的多個執行緒5來分別執行(步驟S22)。其中，所述多個程序中即包含了對矩陣簡化程序3後的一或多個合併矩陣32的運算，但並不以此為限。 After step S18, the AI model can issue instructions to the weighted CPU 4 through the transformed instruction set (step S20), and after receiving the instructions, the weighted CPU 4 can evenly distribute the multiple processes among its threads 5 for execution according to the weight values indicated by the AI model (step S22). These processes include, but are not limited to, the operations on the one or more consolidated matrices 32 produced by the matrix simplification procedure 3.

值得一提的是，所述人工智慧模型在進行運算時，係可執行呼叫資源庫(call library)的動作，進而可以執行(run)CPU的指令集。本發明係對指令集(較佳為Intel架構的指令集)中的指令進行修改，使得人工智慧模型在下達指令給加權值中央處理單元4之前，可以先判斷、設定所需執行的多個程序的權重值，進而可以將這些程序平均分配給加權值中央處理單元4下的多個執行緒5。 It is worth mentioning that, during computation, the AI model can call a library and thereby run the CPU's instruction set. The present invention modifies the instructions in the instruction set (preferably an Intel-architecture instruction set), so that before issuing instructions to the weighted CPU 4, the AI model can first determine and set the weight values of the multiple processes to be executed, and these processes can then be evenly distributed among the threads 5 of the weighted CPU 4.

所述呼叫資源庫的動作為本技術領域中的常用技術手段，於此不再贅述。 Calling a library is a common technique in this field and is not described further here.

參閱圖4，為本發明的第一具體實施例的加權值中央處理單元的示意圖。如圖4所示，一個加權值中央處理單元4在運作時，係基於本身的執行週期而包括複數加權CPU時間41，圖4中係以第一加權CPU時間至第n加權CPU時間為例。其中，加權值中央處理單元4在每一個加權CPU時間41下皆可控制其下的多個執行緒5(圖4中以第一執行緒至第n執行緒為例)同時運作，並且所述加權CPU時間41為所有執行緒5的各自的加權程序時間42的總和。具體地，所述加權程序時間42指的是單一個執行緒5執行一個程序的時間。 Referring to FIG. 4, a schematic diagram of the weighted CPU of the first specific embodiment of the present invention: as shown in FIG. 4, a weighted CPU 4 in operation comprises, based on its own execution cycle, a plurality of weighted CPU times 41 (FIG. 4 takes the first through nth weighted CPU times as an example). Within each weighted CPU time 41, the weighted CPU 4 can control its multiple threads 5 (the first through nth threads in FIG. 4) to operate simultaneously, and the weighted CPU time 41 is the sum of the respective weighted program times 42 of all threads 5. Specifically, a weighted program time 42 is the time for a single thread 5 to execute one process.

更具體地，所述加權程序時間42為此執行緒5執行被分配的程序的執行頻率(Program Frequency)與所佔據的CPU時間44的乘積，而所述CPU時間44則為此執行緒5的週期時間(Cycle time)45與此執行緒5的週期計數(Cycle count)46和停頓週期(Idle Cycles)47的和之乘積。 More specifically, the weighted program time 42 is the product of the program frequency at which the thread 5 executes its assigned process and the CPU time 44 it occupies, and the CPU time 44 is the product of the thread's cycle time 45 and the sum of its cycle count 46 and idle cycles 47.
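The two relations above can be worked through numerically: CPU time = cycle time × (cycle count + idle cycles), weighted program time = program frequency × CPU time, and the weighted CPU time 41 is the sum over all threads. The numbers below are purely illustrative assumptions, not values from the patent.

```python
def cpu_time(cycle_time, cycle_count, idle_cycles):
    # CPU time 44 = cycle time 45 x (cycle count 46 + idle cycles 47)
    return cycle_time * (cycle_count + idle_cycles)

def weighted_program_time(frequency, cycle_time, cycle_count, idle_cycles):
    # weighted program time 42 = program frequency x CPU time 44
    return frequency * cpu_time(cycle_time, cycle_count, idle_cycles)

# Two threads with illustrative (frequency, cycle time, cycle count,
# idle cycles) values; the weighted CPU time 41 is the sum of the
# threads' weighted program times.
threads = [(2, 0.5, 100, 10), (1, 0.5, 200, 0)]
weighted_cpu_time = sum(weighted_program_time(*t) for t in threads)
print(weighted_cpu_time)  # 2*0.5*(100+10) + 1*0.5*(200+0) = 210.0
```

Reducing any thread's idle cycles directly shrinks its CPU time and therefore the weighted CPU time, which is why balanced allocation improves utilization.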

由上述說明可看出，加權值中央處理單元4在每一個加權CPU時間41中都可以藉由多個執行緒5來同步執行多個程序，因此，若各個執行緒5在每一個加權程序時間42中所能執行的程序都被有效分配，則加權值中央處理單元4整體的資源利用率就可以被大幅的提升。 As can be seen from the above, the weighted CPU 4 can use multiple threads 5 to execute multiple processes concurrently within each weighted CPU time 41. Therefore, if the processes that each thread 5 can execute within each weighted program time 42 are effectively allocated, the overall resource utilization of the weighted CPU 4 can be greatly improved.

回到圖2。於步驟S22後，本發明的執行方法(即，所述驅動程式或中介軟體)判斷程式碼是否執行完成(步驟S24)，並且於程式碼尚未執行完成前，重覆執行步驟S12至步驟S22，以令電子設備可以依據本發明的執行方法，通過加權值中央處理單元4來實現人工智慧的相關程序(例如所述預測程序2)。 Returning to FIG. 2: after step S22, the execution method of the present invention (that is, the driver or middleware) determines whether execution of the program code is complete (step S24), and repeats steps S12 to S22 until it is, so that the electronic device can implement the AI-related processes (for example, the inference process 2) through the weighted CPU 4 according to the execution method of the present invention.

本發明的執行方法結合了矩陣簡化技術以及加權值中央處理單元的執行緒分配技術，先縮減運算中的計算參數的數量，再對加權值中央處理單元下的多個執行緒進行最佳化的程序管理，藉此達到令低階的硬體設備也可以被用來開發、使用人工智慧的主要目的。相對於現有技術，本發明的執行方法有效降低了人工智慧的開發、使用成本與門檻限制。 The execution method of the present invention combines the matrix simplification technique with the thread-allocation technique of the weighted CPU: it first reduces the number of calculation parameters in the computation, and then performs optimized process management over the multiple threads of the weighted CPU, thereby achieving the main purpose of enabling low-end hardware to be used to develop and use artificial intelligence. Compared with the prior art, the execution method of the present invention effectively lowers the cost of, and barriers to, developing and using artificial intelligence.

以上所述僅為本發明之較佳具體實例，非因此即侷限本發明之專利範圍，故舉凡運用本發明內容所為之等效變化，均同理皆包含於本發明之範圍內，合予陳明。 The above is only a preferred specific example of the present invention and does not thereby limit its patent scope; all equivalent changes made using the content of the present invention are likewise included within its scope.


Claims (5)

1. 一種CPU應用於人工智慧相關程序時的執行方法，運用於至少具有一加權值中央處理單元(Weighted Central Processing Unit,Weighted CPU)的一電子設備，並且包括下列步驟：a)於該電子設備上執行TensorFlow；b)依據程式碼內容呼叫TensorFlow中的至少一人工智慧模型；c)判斷並取出該人工智慧模型於運算中使用的一或多個稀疏矩陣(Sparse Matrix)；d)對該一或多個稀疏矩陣執行一矩陣簡化程序；e)對該人工智慧模型所採用的一指令集進行一指令轉換程序，以產生一轉換後指令集；f)該人工智慧模型通過該轉換後指令集對該加權值中央處理單元進行指令發佈；及g)該加權值中央處理單元接收指令後，依據該人工智慧模型所指示的多個程序的權重值，將該多個程序平均分配由其下的多個執行緒(thread)來分別執行，其中該加權值中央處理單元的一加權CPU時間為該多個執行緒的一加權程序時間之和，各該執行緒的該加權程序時間分別是執行對應的該程序的一CPU時間與一執行頻率(Program Frequency)的乘積，該CPU時間為該執行緒的一週期時間(Cycle time)與一週期計數(Cycle count)及一停頓週期(Idle cycles)的和之乘積。 An execution method for a CPU applied to artificial-intelligence-related processes, applied to an electronic device having at least a weighted central processing unit (Weighted CPU), the method comprising the following steps: a) executing TensorFlow on the electronic device; b) calling at least one AI model in TensorFlow according to the content of the program code; c) determining and extracting one or more sparse matrices used by the AI model in its computation; d) performing a matrix simplification procedure on the one or more sparse matrices; e) performing an instruction transformation procedure on an instruction set adopted by the AI model to produce a transformed instruction set; f) the AI model issuing instructions to the weighted CPU through the transformed instruction set; and g) after the weighted CPU receives the instructions, evenly distributing the multiple processes among its multiple threads for execution according to the weight values of the processes indicated by the AI model, wherein a weighted CPU time of the weighted CPU is the sum of the weighted program times of the multiple threads, the weighted program time of each thread is the product of a CPU time for executing its corresponding process and a program frequency, and the CPU time is the product of the thread's cycle time and the sum of its cycle count and idle cycles.

2. 如請求項1所述的CPU應用於人工智慧相關程序時的執行方法，其中該步驟a)是於程式碼中通過import TensorFlow的指令來執行TensorFlow。 The execution method of claim 1, wherein step a) executes TensorFlow through an import TensorFlow instruction in the program code.

3. 如請求項1所述的CPU應用於人工智慧相關程序時的執行方法，其中該人工智慧模型至少包括一imageNet模型及一inception模型。 The execution method of claim 1, wherein the AI model at least includes an imageNet model and an inception model.

4. 如請求項1所述的CPU應用於人工智慧相關程序時的執行方法，其中該矩陣簡化程序包括將該一或多個稀疏矩陣分解成一或多個合併矩陣(Consolidate Matrix)，其中該多個程序至少包括對該一或多個合併矩陣的運算。 The execution method of claim 1, wherein the matrix simplification procedure includes decomposing the one or more sparse matrices into one or more consolidated matrices, and the multiple processes at least include operations on the one or more consolidated matrices.

5. 如請求項1所述的CPU應用於人工智慧相關程序時的執行方法，其中該加權值中央處理單元為Intel架構的中央處理單元，該人工智慧模型採用的該指令集為Intel架構的指令集。 The execution method of claim 1, wherein the weighted CPU is an Intel-architecture central processing unit, and the instruction set adopted by the AI model is an Intel-architecture instruction set.
TW109134194A 2020-09-30 2020-09-30 Method for cpu to execute artificial intelligent related processes TWI775170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW109134194A TWI775170B (en) 2020-09-30 2020-09-30 Method for cpu to execute artificial intelligent related processes


Publications (2)

Publication Number Publication Date
TW202215305A TW202215305A (en) 2022-04-16
TWI775170B 2022-08-21

Family

ID=82197238

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109134194A TWI775170B (en) 2020-09-30 2020-09-30 Method for cpu to execute artificial intelligent related processes

Country Status (1)

Country Link
TW (1) TWI775170B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI579694B (en) * 2015-10-08 2017-04-21 上海兆芯集成電路有限公司 Neural network unit that performs concurrent lstm cell calculations
CN107871160A (en) * 2016-09-26 2018-04-03 谷歌公司 Communication Efficient Joint Learning
WO2019013711A1 (en) * 2017-07-12 2019-01-17 Mastercard Asia/Pacific Pte. Ltd. Mobile device platform for automated visual retail product recognition
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study


Also Published As

Publication number Publication date
TW202215305A (en) 2022-04-16

Similar Documents

Publication Publication Date Title
CN112465129B (en) On-chip heterogeneous artificial intelligent processor
Yu et al. Gillis: Serving large neural networks in serverless functions with automatic model partitioning
CN120066806B (en) Task processing method of artificial intelligent processor, storage medium and electronic equipment
Cheong et al. SCARL: Attentive reinforcement learning-based scheduling in a multi-resource heterogeneous cluster
CN112162854A (en) A computing task scheduling method, system and medium between CPU and GPU
Zhou et al. Training and serving system of foundation models: A comprehensive survey
CN114217688A (en) A system and method for NPU power consumption optimization based on neural network structure
CN118446321A (en) Method, device and system for fast inference of large language model for smart phones
CN111401560A (en) Processing method, device and storage medium for reasoning task
CN118796471A (en) Reasoning resource optimization method, device and electronic equipment
CN118536565A (en) AI algorithm acceleration method, device, equipment and readable storage medium
Fowers et al. Inside Project Brainwave's Cloud-Scale, Real-Time AI Processor
CN117130760B (en) Intelligent core particle selection scheduling method and system
CN115470901B (en) Mixed-precision training method and device supporting mobile heterogeneous processor load sharing
TWI775170B (en) Method for cpu to execute artificial intelligent related processes
CN120011093B (en) Accelerator-oriented multitasking method and related device
CN111124691A (en) Multi-process sharing GPU scheduling method, system and electronic device
CN119597414A (en) Operator scheduling method, device, equipment, storage medium and program product
US11847504B2 (en) Method for CPU to execute artificial intelligence related processes
Bi et al. AISAW: An adaptive interference-aware scheduling algorithm for acceleration of deep learning workloads training on distributed heterogeneous systems
US20240062104A1 (en) Inference Performance Using Divide-and-Conquer Techniques
CN114443260A (en) Execution method of CPU applied to artificial intelligence related program
CN117056066A (en) Heterogeneous intensive computing optimization method and system based on dynamic pipeline technology
CN120066739B (en) A multi-process task scheduling method and device on a real-time operating system
US20250371288A1 (en) Techniques for implementing multiple large language models on a single physical computing device

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent