US20230351145A1 - Pipelining and parallelizing graph execution method for neural network model computation and apparatus thereof - Google Patents
- Publication number
- US20230351145A1 (U.S. application Ser. No. 17/838,342)
- Authority
- US
- United States
- Prior art keywords
- executive
- memory block
- batch
- subdata
- kernel function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
- G06F12/0253—Garbage collection, i.e. reclamation of unreferenced memory
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5022—Mechanisms to release resources
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5017—Task decomposition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- a constructed physical computation graph is composed of forward operator x → forward operator y → forward operator z and backward operator Z → backward operator Y → backward operator X; executives for running their own kernel functions are respectively created according to all the operators to correspondingly form an execution computation graph of executive a → executive b → executive c → executive C → executive B → executive A; and execution of the executives is initiated to run the entire computation graph in parallel.
- a first batch of data is input, and executive a inputs the data: executive a runs a kernel function of forward operator x and writes an output tensor of the running result into free memory block r11.
- For a second batch of data, executive a inputs the data: executive a may also check whether there is a writable free block in executive a; if so, at time T2, executive a also executes the second batch of data and writes an execution result into free memory block r12.
- For a third batch of data, executive a inputs the data: executive a may also check whether there is a writable free block in executive a; if so, executive a also executes the third batch of data and writes an execution result into free memory block r13.
- current executive b sends a message to downstream executive c
- downstream executive c prepares tensor data to be consumed
- current executive b sends a message to upstream executive a
- upstream executive a reclaims the tensor data that has been consumed:
- executive b produces memory block r21, whereupon it sends a message to downstream consumer executive c to inform executive c to read memory block r21 produced by executive b;
- executive c receives memory block r21 and finds that there is free memory block r31 in executive c, whereupon executive c starts execution to read memory block r21 and writes a result into memory block r31.
- executive b sends a message to upstream producer executive a to inform executive a that executive b has finished using memory block r11 of executive a; executive a receives memory block r11 that is returned by executive b after use and checks whether all consumers have finished using memory block r11, and then reclaims memory block r11 and marks memory block r11 as a free block.
- executive a inputs the data: executive a may also simultaneously check whether there is a writable free memory block in executive a and whether executive A has completed the execution; if not, executive a waits and does not enter the pipeline.
- current executive c sends a message to downstream executive C
- downstream executive C prepares tensor data to be consumed
- current executive c sends a message to upstream executive b
- upstream executive b reclaims the tensor data that has been consumed:
- executive c produces memory block r31, whereupon it sends a message to downstream consumer executive C to inform executive C to read memory block r31 produced by executive c;
- executive C receives memory block r31 and finds that there is free memory block r11 in executive C, whereupon executive C starts execution to read memory block r31 and writes a result into memory block r11.
- executive c sends a message to upstream producer executive b to inform executive b that executive c has finished using memory block r21 of executive b; executive b receives memory block r21 that is returned by executive c after use and checks whether all consumers have finished using memory block r21, and then reclaims memory block r21 and marks memory block r21 as a free block.
- current executive b sends a message to downstream executive c
- downstream executive c prepares tensor data to be consumed
- current executive b sends a message to upstream executive a
- upstream executive a reclaims the tensor data that has been consumed:
- executive b produces memory block r22, whereupon it sends a message to downstream consumer executive c to inform executive c to read memory block r22 produced by executive b;
- executive c receives memory block r22 and finds that there is free memory block r32 in executive c, whereupon executive c starts execution to read memory block r22 and writes a result into memory block r32.
- executive b sends a message to upstream producer executive a to inform executive a that executive b has finished using memory block r12 of executive a; executive a receives memory block r12 that is returned by executive b after use and checks whether all consumers have finished using memory block r12, and then reclaims memory block r12 and marks memory block r12 as a free block.
- executive a inputs the data: executive a may also simultaneously check whether there is a writable free memory block in executive a and whether executive A has completed the execution; if not, executive a waits and does not enter the pipeline.
- current executive C sends a message to downstream executive B, downstream executive B prepares tensor data to be consumed, current executive C sends a message to upstream executive c, and upstream executive c reclaims the tensor data that has been consumed: executive C produces memory block r11, whereupon it sends a message to downstream consumer executive B to inform executive B to read memory block r11 produced by executive C; executive B receives memory block r11 and finds that there is free memory block r21 in executive B, whereupon executive B starts execution to read memory block r11 and writes a result into memory block r21.
- executive C sends a message to upstream producer executive c to inform executive c that executive C has finished using memory block r31 of executive c; executive c receives memory block r31 that is returned by executive C after use and checks whether all consumers have finished using memory block r31, and then reclaims memory block r31 and marks memory block r31 as a free block.
- current executive c sends a message to downstream executive C
- downstream executive C prepares tensor data to be consumed
- current executive c sends a message to upstream executive b
- upstream executive b reclaims the tensor data that has been consumed:
- executive c produces memory block r32, whereupon it sends a message to downstream consumer executive C to inform executive C to read memory block r32 produced by executive c;
- executive C receives memory block r32 and finds that there is free memory block r12 in executive C, whereupon executive C starts execution to read memory block r32 and writes a result into memory block r12.
- executive c sends a message to upstream producer executive b to inform executive b that executive c has finished using memory block r22 of executive b; executive b receives memory block r22 that is returned by executive c after use and checks whether all consumers have finished using memory block r22, and then reclaims memory block r22 and marks memory block r22 as a free block.
- current executive b sends a message to downstream executive c
- downstream executive c prepares tensor data to be consumed
- current executive b sends a message to upstream executive a
- upstream executive a reclaims the tensor data that has been consumed:
- executive b produces memory block r23, whereupon it sends a message to downstream consumer executive c to inform executive c to read memory block r23 produced by executive b;
- executive c receives memory block r23 and finds that there is free memory block r33 in executive c, whereupon executive c starts execution to read memory block r23 and writes a result into memory block r33.
- executive b sends a message to upstream producer executive a to inform executive a that executive b has finished using memory block r13 of executive a; executive a receives memory block r13 that is returned by executive b after use and checks whether all consumers have finished using memory block r13, and then reclaims memory block r13 and marks memory block r13 as a free block.
- executive a inputs the data: executive a may also simultaneously check whether there is a writable free memory block in executive a and whether executive A has completed the execution; if not, executive a waits and does not enter the pipeline.
- current executive B sends a message to downstream executive A, and downstream executive A prepares tensor data to be consumed, then downstream executive A reclaims computation data on its own, current executive B sends a message to upstream executive C, and upstream executive C reclaims the tensor data that has been consumed:
- executive B produces memory block r21, whereupon it sends a message to downstream consumer executive A to inform executive A to read memory block r21 produced by executive B;
- executive A receives memory block r21 and finds that there is free memory block r31 in executive A, whereupon executive A starts execution to read memory block r21 and writes a result into memory block r31; and executive A immediately reclaims memory block r31 on its own after completing the execution.
- executive B sends a message to upstream producer executive C to inform executive C that executive B has finished using memory block r11 of executive C;
- executive C receives memory block r11 that is returned by executive B after use and checks whether all consumers have finished using memory block r11, and then reclaims memory block r11 and marks memory block r11 as a free block.
- current executive C sends a message to downstream executive B, downstream executive B prepares tensor data to be consumed, current executive C sends a message to upstream executive c, and upstream executive c reclaims the tensor data that has been consumed:
- executive C produces memory block r12, whereupon it sends a message to downstream consumer executive B to inform executive B to read memory block r12 produced by executive C;
- executive B receives memory block r12 and finds that there is free memory block r22 in executive B, whereupon executive B starts execution to read memory block r12 and writes a result into memory block r22.
- executive C sends a message to upstream producer executive c to inform executive c that executive C has finished using memory block r32 of executive c; executive c receives memory block r32 that is returned by executive C after use and checks whether all consumers have finished using memory block r32, and then reclaims memory block r32 and marks memory block r32 as a free block.
- current executive c sends a message to downstream executive C
- downstream executive C prepares tensor data to be consumed
- current executive c sends a message to upstream executive b
- upstream executive b reclaims the tensor data that has been consumed:
- executive c produces memory block r33, whereupon it sends a message to downstream consumer executive C to inform executive C to read memory block r33 produced by executive c;
- executive C receives memory block r33 and finds that there is free memory block r13 in executive C, whereupon executive C starts execution to read memory block r33 and writes a result into memory block r13.
- executive c sends a message to upstream producer executive b to inform executive b that executive c has finished using memory block r23 of executive b; executive b receives memory block r23 that is returned by executive c after use and checks whether all consumers have finished using memory block r23, and then reclaims memory block r23 and marks memory block r23 as a free block.
- For a fourth batch of data, executive a inputs the data: executive a may also simultaneously check whether there is a writable free block in executive a and whether executive A has completed the execution; if so, executive a also executes the fourth batch of data and writes an execution result into free memory block r11.
- current executive B sends a message to downstream executive A, downstream executive A prepares tensor data to be consumed, current executive B sends a message to upstream executive C, and upstream executive C reclaims the tensor data that has been consumed:
- executive B produces memory block r22, whereupon it sends a message to downstream consumer executive A to inform executive A to read memory block r22 produced by executive B;
- executive A receives memory block r22 and finds that there is free memory block r32 in executive A, whereupon executive A starts execution to read memory block r22 and writes a result into memory block r32; and executive A immediately reclaims memory block r32 on its own after completing the execution.
- executive B sends a message to upstream producer executive C to inform executive C that executive B has finished using memory block r12 of executive C;
- executive C receives memory block r12 that is returned by executive B after use and checks whether all consumers have finished using memory block r12, and then reclaims memory block r12 and marks memory block r12 as a free block.
- current executive C sends a message to downstream executive B, downstream executive B prepares tensor data to be consumed, current executive C sends a message to upstream executive c, and upstream executive c reclaims the tensor data that has been consumed: executive C produces memory block r13, whereupon it sends a message to downstream consumer executive B to inform executive B to read memory block r13 produced by executive C; executive B receives memory block r13 and finds that there is free memory block r23 in executive B, whereupon executive B starts execution to read memory block r13 and writes a result into memory block r23.
- executive C sends a message to upstream producer executive c to inform executive c that executive C has finished using memory block r33 of executive c;
- executive c receives memory block r33 that is returned by executive C after use and checks whether all consumers have finished using memory block r33, and then reclaims memory block r33 and marks memory block r33 as a free block.
- executive a, executive b, executive c and executive C start to work in parallel.
- Executives B and A are still on standby since there is no readable data.
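- The following Python sketch replays a simplified version of the schedule above. It is an illustration rather than code from the disclosure: the class name, the block labels, and the collapsing of "read an upstream block, then acknowledge it" into a single step are assumptions, and the additional gating of a new round on the last executive's completion is omitted.

```python
from collections import deque

class Executive:
    """Toy executive for the example above: it owns a fixed pool of memory
    blocks and a mailbox of messages from its upstream producer."""

    def __init__(self, name, num_blocks):
        self.name = name
        self.free = deque(f"{name}:r{k}" for k in range(1, num_blocks + 1))
        self.inbox = deque()      # messages: (batch, upstream_block, upstream_executive)
        self.downstream = None    # next executive in the chain; None for the last one

    def try_fire(self):
        """Run one kernel step if there is both an input message and a free block."""
        if not self.inbox or not self.free:
            return None
        batch, up_block, producer = self.inbox.popleft()
        if producer is not None:
            producer.free.append(up_block)        # ack upstream: its block becomes free
        block = self.free.popleft()               # write this executive's result here
        if self.downstream is not None:
            self.downstream.inbox.append((batch, block, self))
        else:
            self.free.append(block)               # the last executive reclaims on its own
        return f"{self.name}: batch {batch} -> {block}"

# chain a -> b -> c -> C -> B -> A, three memory blocks per executive
pipeline = [Executive(n, 3) for n in ["a", "b", "c", "C", "B", "A"]]
for up, down in zip(pipeline, pipeline[1:]):
    up.downstream = down
for batch in (1, 2, 3, 4):                        # four batches of subdata
    pipeline[0].inbox.append((batch, None, None))

step = 0
while any(e.inbox for e in pipeline):
    step += 1
    # sweep downstream-to-upstream so a message is consumed no earlier than the next step
    events = [ev for e in reversed(pipeline) if (ev := e.try_fire())]
    print(f"time T{step}: " + "; ".join(reversed(events)))
```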
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Neurology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Complex Calculations (AREA)
Abstract
Description
- The present disclosure claims the benefit of priority to Chinese patent application No. 202210447287.7, filed on Apr. 27, 2022 to China National Intellectual Property Administration and titled “Pipelining and parallelizing graph execution method for neural network model computation and Apparatus thereof”, which is incorporated herein by reference in its entirety.
- The present disclosure relates to the technical field of deep learning, in particular to a pipelining and parallelizing graph execution method for neural network model computation and apparatus.
- With the rapid development of industrial applications of artificial intelligence, the demand for large models in practical application scenarios has become increasingly urgent, and the structures of machine learning workloads tend toward ever larger and more complex models, resulting in extremely high execution cost for the graphs used in large-model computation. Most existing graph execution methods for neural network model computation are based on synchronous execution, resulting in a low resource utilization rate for the entire graph execution system, which limits the speedup ratio and throughput rate of a distributed system.
- In order to solve the above problems, in a pipelining and parallelizing graph execution method for neural network model computation provided by the present disclosure, various batches of training data and different subgraphs are isolated, and each batch of training data flows through a forward computation graph and a backward computation graph sequentially in a 1F1B forward-backward manner. In the present disclosure, there is one batch of data being processed on each device process to keep all device processes busy without pipeline pause, and the entire pipeline is relatively balanced. At the same time, it can be ensured that parameter updates on each subgraph are performed at a fixed cycle, which also helps prevent too many mini-batches from being processed at the same time and ensures model convergence.
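- The 1F1B pattern referred to here can be illustrated with a short sketch. The snippet below is a generic illustration of a one-forward-one-backward schedule, not the scheduler of the present disclosure; the function name and the choice of four stages and six micro-batches are arbitrary assumptions.

```python
def one_f_one_b(num_stages: int, num_microbatches: int, stage: int):
    """Order of forward (F) and backward (B) micro-batches executed by one
    pipeline stage under a 1F1B schedule: a short warm-up of forwards, then
    strict one-forward-one-backward alternation, then the remaining backwards."""
    warmup = min(num_stages - stage - 1, num_microbatches)  # deeper stages warm up less
    order = [("F", i) for i in range(warmup)]
    fwd = warmup
    for bwd in range(num_microbatches):
        if fwd < num_microbatches:
            order.append(("F", fwd))
            fwd += 1
        order.append(("B", bwd))
    return order

for s in range(4):
    print(f"stage {s}:", one_f_one_b(num_stages=4, num_microbatches=6, stage=s))
```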
- The present disclosure aims to provide a pipelining and parallelizing graph execution method for neural network model computation and apparatus, so as to overcome the shortcomings in the prior art.
- In order to achieve the above purposes, the present disclosure provides the following technical solution:
- The present application discloses a pipelining and parallelizing graph execution method for neural network model computation. Several executives are set in a neural network model; a total of 2*N executives are provided, and N is a positive integer; and several memory blocks are set in each executive. The method specifically includes the following steps:
- S1, dividing training data into several batches of subdata;
- S2, inputting the several batches of subdata into a neural network model in sequence; after an ith batch of subdata is input, executing, by an nth executive, self-kernel function computation on the ith batch of subdata, and writing an execution result into an idle memory block of the nth executive; then inputting an (i+1)th batch of subdata, wherein i and n are both positive integers;
- S3, after the (i+1)th batch of subdata is input, executing, by the nth executive, the operation in S2 on the (i+1)th batch of subdata, and sending an address of the memory block where the ith batch is located to an (n+1)th executive; parsing, by the (n+1)th executive, the memory block where the ith batch is located to obtain an execution result of the nth executive on the ith batch of subdata, executing the self-kernel function computation by taking the execution result of the nth executive as input data of the (n+1)th executive, and writing the execution result into an idle memory block of the (n+1)th executive; then inputting an (i+2)th batch of subdata;
- S4, after the (i+2)th batch of subdata is input, executing, by the nth executive, the operation in S2 on the (i+2)th batch of subdata, and executing, by the nth executive and the (n+1)th executive, the operation in S3 on the (i+1)th batch of subdata; at the same time, sending, by the (n+1)th executive, the address of the memory block where the ith batch is located to an (n+2)th executive; parsing, by the (n+2)th executive, the memory block where the ith batch is located to obtain an execution result of the (n+1)th executive on the ith batch of subdata, executing the self-kernel function computation by taking the execution result of the (n+1)th executive as input data of the (n+2)th executive, and writing the execution result into an idle memory block of the (n+2)th executive;
- S5, reclaiming, by the nth executive, the memory block sent to the (n+1)th executive;
- S6, executing, by the last executive, the self-kernel function computation; writing the execution result to a memory block of the last executive; and immediately reclaiming the memory block on its own at the end of the execution.
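- As an illustration of steps S2 through S6, the following single-threaded Python sketch shows one executive consuming an upstream memory block, running its kernel, and handing its own block's address downstream. Class, method, and field names are invented for illustration; real executives run concurrently and exchange asynchronous messages rather than making direct calls.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Message:
    batch: int
    block_addr: int                 # index of the producer's memory block (the "address" of S3)
    producer: "PipelinedExecutive"  # so the consumer knows whom to acknowledge

class PipelinedExecutive:
    """Single-executive sketch of steps S2-S6; the kernel function is stood in
    for by a plain Python callable and the downstream hand-off is a direct call."""

    def __init__(self, kernel, num_blocks, downstream=None):
        self.kernel = kernel
        self.blocks = [None] * num_blocks       # storage backing the output tensors
        self.idle = deque(range(num_blocks))    # indices of currently idle memory blocks
        self.downstream = downstream

    def on_input(self, batch, payload):
        if not self.idle:                       # S2: no idle block, so this batch must wait
            return False
        if isinstance(payload, Message):        # S3: parse the upstream block to get the input
            data = payload.producer.blocks[payload.block_addr]
        else:
            data = payload                      # the first executive consumes raw batch data
        addr = self.idle.popleft()
        self.blocks[addr] = self.kernel(data)   # S2: run own kernel, write into the idle block
        if isinstance(payload, Message):
            payload.producer.reclaim(payload.block_addr)  # S5: tell the producer to reclaim
        if self.downstream is not None:         # S3: hand this block's address downstream
            self.downstream.on_input(batch, Message(batch, addr, self))
        else:
            self.reclaim(addr)                  # S6: the last executive reclaims on its own
        return True

    def reclaim(self, addr):                    # S5: the block becomes idle again
        self.idle.append(addr)

# build the chain x -> y -> z from the tail backwards
z = PipelinedExecutive(lambda t: t + 1, num_blocks=3)
y = PipelinedExecutive(lambda t: t * 2, num_blocks=3, downstream=z)
x = PipelinedExecutive(lambda t: t - 3, num_blocks=3, downstream=y)
x.on_input(batch=0, payload=10.0)               # one batch flows through x -> y -> z
print(z.blocks)                                 # [15.0, None, None]
```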
- Preferably, before executing the self-kernel function computation, an executive may check whether there is an idle memory block in the executive, execute the self-kernel function computation on the ith batch of subdata if there is an idle memory block, and otherwise, instruct the ith batch to wait for an idle memory block.
- Preferably, for an (N*n+1)th batch of subdata, before executing the self-kernel function computation, the executive may check whether the executive where an (N*n−1)th batch of subdata is located completes execution, wherein n is a positive integer.
- Preferably, the step S5 specifically includes the following operations:
- S51, informing, by the (n+1)th executive, the nth executive that the memory block sent to the (n+1)th executive has been consumed;
- S52, reclaiming, by the nth executive, the memory block sent to the (n+1)th executive, and marking the memory block as being free.
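- A minimal sketch of the reclaiming rule in S51 and S52, generalized to several consumers: a produced memory block is marked free only after every consumer has acknowledged it. The class name, the address value, and the consumer labels are illustrative assumptions.

```python
class SharedBlock:
    """A produced memory block is marked free again only after every
    registered consumer executive has acknowledged consuming it."""

    def __init__(self, addr, consumers):
        self.addr = addr
        self.pending = set(consumers)      # consumer executives that still have to ack
        self.free = False

    def ack(self, consumer):               # S51: a consumer reports the block as consumed
        self.pending.discard(consumer)
        if not self.pending:               # S52: all consumers done -> reclaim, mark free
            self.free = True
        return self.free

blk = SharedBlock(addr=0x42, consumers={"executive n+1", "executive n+2"})
print(blk.ack("executive n+1"))   # False: one consumer still outstanding
print(blk.ack("executive n+2"))   # True: block reclaimed and marked free
```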
- Preferably, the method further includes constructing an executive, and the constructing an executive specifically includes the following substeps:
- S01, creating an operator kernel function task queue: adding a current operator kernel function computation task into a current kernel function task queue in sequence;
- S02, creating a thread of an executive: acquiring, by the thread of the executive, a current task to be processed in sequence from the kernel function task queue, and submitting the current task to be processed to a thread pool;
- S03, creating an executive of a kernel function: creating an executive used for operator kernel function computation according to a current kernel function task and context information of a current thread, and using the executive to run the kernel function task in the task queue;
- S04, creating an event recall queue: adding tasks that have been processed by a task executive into an event recall queue;
- S05, creating a thread of the event recall queue: taking out and returning, by the thread of the event recall queue, the tasks that have been processed in the event recall queue.
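- The construction substeps S01 through S05 can be sketched with Python's standard queue and thread-pool facilities, as below. The worker count and the dummy kernel tasks are assumptions; an actual implementation would wrap device kernel launches rather than Python callables.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

kernel_tasks = queue.Queue()               # S01: operator kernel-function task queue
event_recall = queue.Queue()               # S04: event recall queue for finished tasks
pool = ThreadPoolExecutor(max_workers=4)   # thread pool the tasks are submitted to

def run_kernel(task):
    """S03: body of a kernel-function executive; here it just calls a Python callable."""
    name, fn, args = task
    event_recall.put((name, fn(*args)))    # S04: report the finished task

def executive_thread():
    """S02: take tasks from the kernel task queue in order and submit them to the pool."""
    while True:
        task = kernel_tasks.get()
        if task is None:                   # sentinel: no more tasks
            break
        pool.submit(run_kernel, task)

def recall_thread():
    """S05: take finished tasks out of the event recall queue and hand them back."""
    while True:
        event = event_recall.get()
        if event is None:
            break
        print("completed:", event)

workers = [threading.Thread(target=executive_thread), threading.Thread(target=recall_thread)]
for w in workers:
    w.start()
for i in range(3):                         # enqueue a few dummy kernel tasks
    kernel_tasks.put((f"op{i}", lambda x: x * x, (i,)))
kernel_tasks.put(None)
workers[0].join()
pool.shutdown(wait=True)                   # wait for all submitted kernels to finish
event_recall.put(None)
workers[1].join()
```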
- The present disclosure further discloses a neural network model computation-oriented graph execution apparatus, including an executive construction module and an executive pipelining and parallelizing working module; the executive construction module is configured to construct an executive; and the executive pipelining and parallelizing working module is configured to implement the above-mentioned pipelining and parallelizing graph execution method for neural network model computation.
- The present disclosure further discloses a neural network model computation-oriented graph execution apparatus, including a memory and one or more processors. The memory stores an executable code. The one or more processors, when executing the executable code, implement the above-mentioned pipelining and parallelizing graph execution method for neural network model computation.
- The present disclosure further provides a computer-readable storage medium on which a program is stored. The program, when executed by a processor, implements the above-mentioned pipelining and parallelizing graph execution method for neural network model computation.
- The present disclosure has the following beneficial effects:
- According to the pipelining and parallelizing graph execution method for neural network model computation and the apparatus thereof, a graph executive on a native machine is created according to a physical computation graph compiled and generated by a deep learning framework. By designing a solution for allocating a plurality of idle memory blocks to each graph executive, the entire computation graph simultaneously participates in deep learning training tasks of different batches of data in a pipelining and parallelizing manner. The parallel execution method of graph executives based on a plurality of free tensor storage blocks disclosed herein can achieve distributed training of large models more easily than existing methods. In a distributed application scenario of a large-scale deep neural network, the present disclosure has a low threshold for users, and enables a model to learn the intrinsic correlation of a large amount of data flowing into a neural network in batches, so as to obtain the "intelligent" sensation and judgment ability in the corresponding scenario. The present disclosure provides a set of simple and easy-to-use neural network model operation apparatus for algorithm engineers engaged in deep learning, so that deep learning models can be conveniently trained.
- The features and advantages of the present disclosure will be described in detail in combination with the embodiments and accompanying drawings.
- FIG. 1 is an architecture diagram of a pipelining and parallelizing graph execution method for neural network model computation;
- FIG. 2 is a flowchart of creating and managing a task executive thread module;
- FIG. 3 is a basic action of a pipelining and parallelizing working module of a task executive;
- FIG. 4 is a pipelining and parallelizing execution process of executives;
- FIG. 5 is a structural schematic diagram of a neural network model computation-oriented pipelining and parallelizing graph execution apparatus.
- In order to make the objectives, technical solutions and advantages of the present disclosure clearer, the present disclosure will be further described below in detail with reference to accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely to explain the present disclosure, and not intended to limit the scope of the present disclosure. In addition, in the following descriptions, the descriptions of known structures and known art are omitted to avoid unnecessary confusion of the concept of the present disclosure.
- As shown in FIG. 1, an architecture diagram of a pipelining and parallelizing graph execution method for neural network model computation is illustrated. As shown in the figure, training data is fed into a neural network model in batches; a graph executive on the native machine is created according to a physical computation graph compiled and generated by a deep learning framework; and a plurality of idle memory blocks are allocated for each graph executive, so that the entire computation graph simultaneously participates in a deep learning training task in a pipelining and parallelizing manner. Specific operations are as follows:
- S1, dividing training data into several batches of subdata;
- S2, inputting the several batches of subdata into the neural network model in sequence; after an ith batch of subdata is input, executing, by an nth executive, self-kernel function computation on the ith batch of subdata, and writing an execution result into an idle memory block of the nth executive; then inputting an (i+1)th batch of subdata, wherein i and n are both positive integers;
- S3, after the (i+1)th batch of subdata is input, executing, by the nth executive, the operation in S2 on the (i+1)th batch of subdata, and sending an address of the memory block where the ith batch is located to an (n+1)th executive; parsing, by the (n+1)th executive, the memory block where the ith batch is located to obtain an execution result of the nth executive on the ith batch of subdata, executing the self-kernel function computation by taking the execution result of the nth executive as input data of the (n+1)th executive, and writing the execution result into an idle memory block of the (n+1)th executive; then inputting an (i+2)th batch of subdata;
- S4, after the (i+2)th batch of subdata is input, executing, by the nth executive, the operation in S2 on the (i+2)th batch of subdata, and executing, by the nth executive and the (n+1)th executive, the operation in S3 on the (i+1)th batch of subdata; at the same time, sending, by the (n+1)th executive, the address of the memory block where the ith batch is located to an (n+2)th executive; parsing, by the (n+2)th executive, the memory block where the ith batch is located to obtain an execution result of the (n+1)th executive on the ith batch of subdata, executing the self-kernel function computation by taking the execution result of the (n+1)th executive as input data of the (n+2)th executive, and writing the execution result into an idle memory block of the (n+2)th executive;
- S5, reclaiming, by the nth executive, the memory block sent to the (n+1)th executive;
- S6, executing, by the last executive, the self-kernel function computation; writing the execution result to a memory block of the last executive; and immediately reclaiming the memory block on its own at the end of the execution.
- In one feasible embodiment, before executing the self-kernel function computation, an executive may check whether there is an idle memory block in the executive, execute the self-kernel function computation on the ith batch of subdata if there is an idle memory block, and otherwise, instruct the ith batch to wait for an idle memory block.
- In one feasible embodiment, for an (N*n+1)th batch of subdata, before executing the self-kernel function computation, the executive may check whether the executive where an (N*n−1)th batch of subdata is located completes execution, wherein n is a positive integer.
- In one feasible embodiment, the step S5 specifically includes the following operations:
- S51, informing, by the (n+1)th executive, the nth executive that the memory block sent to the (n+1)th executive has been consumed;
- S52, reclaiming, by the nth executive, the memory block sent to the (n+1)th executive, and marking the memory block as being free.
- In one feasible embodiment, the method further includes constructing an executive, and the constructing an executive specifically includes the following substeps:
- S01, creating an operator kernel function task queue: adding a current operator kernel function computation task into a current kernel function task queue in sequence;
- S02, creating a thread of an executive: acquiring, by the thread of the executive, a current task to be processed in sequence from the kernel function task queue, and submitting the current task to be processed to a thread pool;
- S03, creating an executive of a kernel function: creating an executive used for operator kernel function computation according to a current kernel function task and context information of a current thread, and using the executive to run the kernel function task in the task queue;
- S04, creating an event recall queue: adding tasks that have been processed by a task executive into an event recall queue;
- S05, creating a thread of the event recall queue: taking out and returning, by the thread of the event recall queue, the tasks that have been processed in the event recall queue.
- A neural network model computation-oriented graph execution apparatus includes an executive construction module and an executive pipelining and parallelizing working module.
- Referring to FIG. 2, the executive construction module includes the following basic actions:
- creating an operator kernel function task queue: adding a current operator kernel function computation task into a current kernel function task queue in sequence;
- creating a thread of a task executive: creating a thread of a task executive, wherein the thread of the task executive is in charge of acquiring the current task to be processed in sequence from the task queue, submitting it to a thread pool when the server receives a request, and then continuing to wait for other requests; if there is an available thread in the pool, that thread is awakened and the request is served immediately; if there is no available thread in the pool, the task is queued until a thread becomes free; once a thread completes its service, it returns to the pool and waits for more jobs; the thread pool works well when the tasks submitted to it can be executed asynchronously;
- creating a task executive of a kernel function: creating a task executive used for operator kernel function computation according to a current kernel function task and context information of a current thread, and using the task executive to run the kernel function task in the task queue;
- creating an event recall queue: when all the task executives in the task queue have been processed, creating an event recall queue, and adding the tasks that have been processed by the task executive into the event recall queue in sequence;
- creating a thread of the event recall queue: creating a thread of the event recall queue, wherein the thread of the event recall queue is in charge of taking out and returning the tasks that have been processed in the event recall queue.
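- The thread-pool behaviour described above (an idle thread serves a request immediately, otherwise the task queues until a thread frees) can be demonstrated with a short sketch; the worker count and the per-request delay are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

start = time.time()

def serve(request_id):
    # each request holds its worker thread for half a second
    print(f"request {request_id} started at t={time.time() - start:.1f}s")
    time.sleep(0.5)
    return request_id

# two worker threads: the first two requests are served immediately,
# the remaining ones queue until a worker returns to the pool
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(serve, i) for i in range(4)]
    print("served:", [f.result() for f in futures])
```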
- Referring to FIG. 3, the executive pipelining and parallelizing working module includes the following basic actions: an executive inputs data; a current executive sends a message to a downstream executive; the downstream executive prepares tensor data to be consumed; the current executive sends a message to an upstream executive; the upstream executive reclaims tensor data that has been consumed; and the last executive reclaims computation data on its own.
- An executive inputs data: At time t, for an ith batch of data, the executive inputs the ith batch of data, loads an operator kernel function computation task inside, executes kernel function computation, generates output tensor data of the kernel function computation task, and writes an execution result into an idle memory block.
- A current executive sends a message to a downstream executive: At time t, for the ith batch of data, tensor data generated by the current executive is stored to an empty storage unit, and an address of the storage unit and an identity identification number of the downstream executive corresponding to the current executive are packaged into a message; the message is then sent to a target executive; and the target executive is the downstream executive corresponding to the current executive.
- The downstream executive prepares tensor data to be consumed: At time t, for the ith batch of data, the downstream executive receives the message, and parses the tensor data generated by the current executive from the message, and the tensor data will be used as an input tensor when the downstream executive operates its operator kernel function; the downstream executive checks whether there is an available free memory block among memory blocks produced by the downstream executive; if it is found that there is an available free memory block, the downstream executive executes a corresponding operator kernel function computation task and reads the free memory block; and the downstream executive writes an output tensor result generated by execution into the memory block.
- The current executive sends a message to an upstream executive: At time t, for the ith batch of data, the executive sends a message to an upstream producer executive to inform the upstream producer executive that the executive has consumed the memory block of the upstream producer executive; and the upstream executive may reclaim its storage unit for output tensor data.
- The upstream executive reclaims data that has been consumed: At time t, for the ith batch of data, once the upstream executive receives a reclaiming message sent by the downstream executive, the upstream executive starts to check whether the memory blocks have been consumed by all the consumer executives, reclaims the memory blocks if the memory blocks have been consumed by all the consumer executives, and marks the memory blocks as free blocks.
- The last executive reclaims computation data on its own: At time t, for the ith batch of data, the last executive executes a corresponding operator kernel function computation task and writes the execution result into its own free memory block; and the last executive immediately reclaims the memory block after completing the execution.
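Taken together, these basic actions form an actor-style protocol between neighbouring executives: each executive owns a small pool of memory blocks, writes its kernel output into a free block, tells its downstream consumers which block to read, acknowledges its upstream producer after consuming, and a block returns to the free pool only once every consumer has acknowledged it. The single-threaded Python sketch below illustrates that protocol under assumed names (MemoryBlock, Executive, feed, try_step, publish, on_consumed); it is a simplified illustration, not the disclosed implementation.

```python
# Illustrative sketch (hypothetical names) of the basic actions: produce into a
# free memory block, message the downstream consumer, acknowledge the upstream
# producer, and reclaim a block once all of its consumers have finished with it.
from collections import deque


class MemoryBlock:
    def __init__(self, name):
        self.name = name
        self.data = None
        self.pending_consumers = 0  # consumers that still have to acknowledge this block


class Executive:
    def __init__(self, name, kernel_fn, num_blocks=3):
        self.name = name
        self.kernel_fn = kernel_fn          # operator kernel function of this executive
        self.free_blocks = deque(
            MemoryBlock(f"r{name}{i}") for i in range(1, num_blocks + 1)
        )
        self.inbox = deque()                # messages: (produced block, producing executive)
        self.downstreams = []               # consumer executives of this executive

    def feed(self, batch):
        """Source executive: write a new batch of input data into a free block, if any."""
        if not self.free_blocks:
            return False
        block = self.free_blocks.popleft()
        block.data = self.kernel_fn(batch)
        self.publish(block)
        return True

    def try_step(self):
        """Run one kernel computation if an input message and a writable free block are available."""
        if not self.inbox or not self.free_blocks:
            return False
        in_block, producer = self.inbox.popleft()
        out_block = self.free_blocks.popleft()
        out_block.data = self.kernel_fn(in_block.data)  # execute the kernel function
        producer.on_consumed(in_block)  # tell the upstream producer its block has been consumed
        self.publish(out_block)
        return True

    def publish(self, block):
        """Send the produced block to every downstream consumer, or reclaim it if there is none."""
        if not self.downstreams:
            self.reclaim(block)             # the last executive reclaims its block on its own
            return
        block.pending_consumers = len(self.downstreams)
        for consumer in self.downstreams:
            consumer.inbox.append((block, self))  # message: block reference + producer identity

    def on_consumed(self, block):
        """Reclaim the block once all consumer executives have finished using it."""
        block.pending_consumers -= 1
        if block.pending_consumers == 0:
            self.reclaim(block)

    def reclaim(self, block):
        block.data = None
        self.free_blocks.append(block)      # mark the block as a free block again


if __name__ == "__main__":
    # Hypothetical wiring of the example chain a -> b -> c -> C -> B -> A with
    # stand-in kernel functions; feeding batches and stepping the executives
    # from the tail of the chain advances each batch by one executive per step.
    chain = [Executive(n, lambda x: x) for n in ["a", "b", "c", "C", "B", "A"]]
    for upstream, downstream in zip(chain, chain[1:]):
        upstream.downstreams.append(downstream)
    for batch in ["batch1", "batch2", "batch3"]:
        chain[0].feed(batch)                # the source executive admits a new batch
        for executive in reversed(chain[1:]):
            executive.try_step()            # downstream executives consume ready inputs
```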
- The embodiment of the neural network model computation-oriented graph execution apparatus of the present disclosure can be applied to any device with data processing capability, such as a computer. The apparatus embodiment may be implemented by software, by hardware, or by a combination of software and hardware. Taking implementation by software as an example, an apparatus in a logical sense is formed when the processor of the device with data processing capability where the apparatus is located reads corresponding computer program instructions from a nonvolatile memory into an internal memory. In terms of hardware, FIG. 5 illustrates a hardware structure diagram of a device with data processing capability where the neural network model computation-oriented graph execution apparatus of the present disclosure is located. In addition to the processor, internal memory, network interface, and non-volatile memory shown in FIG. 5, the device with data processing capability where the apparatus in the embodiment is located may also include other hardware according to its actual functions, and repeated descriptions are omitted here. For details of the implementation process of the functions and effects of the units in the above apparatus, reference is made to the implementation processes of the corresponding steps in the above method, and repeated descriptions are omitted here.
- For the apparatus embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for related parts. The apparatus embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the present disclosure. Those of ordinary skill in the art can understand and implement this without creative effort.
- An embodiment of the present disclosure further provides a computer-readable storage medium on which a program is stored. The program, when executed by a processor, implements the neural network model computation-oriented graph execution apparatus in the above embodiment.
- The computer-readable storage medium may be an internal storage unit of any device with the data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of any device with the data processing capability, such as a plug-in hard disk, a smart media card (SMC), an SD card, and a flash card. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with the data processing capability. The computer-readable storage medium is used for storing the computer program and other programs and data required by any device with the data processing capability, and can also be used for temporarily storing data that has been output or will be output.
- Referring to FIG. 4, a constructed physical computation graph is composed of forward operator x→forward operator y→forward operator z and backward operator Z→backward operator Y→backward operator X; executives for running their own kernel functions are respectively created according to these operators, correspondingly forming an execution computation graph of executive a→executive b→executive c→executive C→executive B→executive A; and execution of the executives is initiated to run the entire computation graph in parallel.
- At time T1:
- A first batch of data is input, and executive a inputs the data: executive a runs a kernel function of forward operator x and writes an output tensor of a running result into free memory block r11.
- Executive b, executive c, executive C, executive B and executive A are in a standby state since there is no readable input tensor data.
- At time T2:
- For a second batch of data, executive a inputs the data: executive a may also check whether there is a writable free block in executive a; if any, at time T2, executive a also executes the second batch of data and writes an execution result into free memory block r12.
- At the same time, for the first batch of data, current executive a sends a message to downstream executive b, and downstream executive b prepares tensor data to be consumed: executive a sends a message to executive b to inform executive b of reading memory block r11 produced by executive a; executive b receives the message and checks whether there is an available free memory block among memory blocks b produced by executive b; if available free memory block r21 is found, at time T2, executive b executes a kernel function computation task of forward operator b and reads memory block r11; and executive b writes an output tensor result generated by the execution into memory block r21.
- Whereupon executive a and executive b start to work in parallel. Executives c, C, B and A are still standby since there is no readable data.
- At time T3:
- For a third batch of data, executive a inputs the data: executive a may also check whether there is a writable free block in executive a; if any, executive a also executes the third batch of data and writes an execution result into free memory block r13.
- At the same time, for the first batch of data, current executive b sends a message to downstream executive c, downstream executive c prepares tensor data to be consumed, current executive b sends a message to upstream executive a, and upstream executive a reclaims the tensor data that has been consumed: executive b produces memory block r21, and whereupon sends a message to downstream consumer executive c to inform executive c of reading memory block r21 produced by executive b; executive c receives memory block r21 and finds that there is free memory block r31 in executive c, whereupon executive c starts execution to read memory block r21 and writes a result into memory block r31. At the same time, executive b sends a message to upstream producer executive a to inform executive a that executive b has finished using memory block r11 of executive a; executive a receives memory block r11 that is returned by executive b after use and checks whether all consumers have finished using memory block r11, and then reclaims memory block r11 and marks memory block r11 as a free block.
- At the same time, for the second batch of data, current executive a sends a message to downstream executive b, and downstream executive b prepares tensor data to be consumed: executive a sends a message to executive b to inform executive b of reading memory block r12 produced by executive a; executive b receives the message and checks whether there is an available free memory block among memory blocks b produced by executive b; if available free memory block r22 is found, executive b executes a kernel function computation task of forward operator b and reads memory block r12; and executive b writes an output tensor result generated by the execution into memory block r22.
- Then executive a, executive b and executive c start to work in parallel.
- Executives C, B and A are still standby since there is no readable data.
- At time T4:
- For a fourth batch of data, executive a inputs the data: executive a may also simultaneously check whether there is a writable free memory block in executive a and whether executive A has completed the execution; and if not, executive a waits and does not enter the pipeline.
- At the same time, for the first batch of data, current executive c sends a message to downstream executive C, downstream executive C prepares tensor data to be consumed, current executive c sends a message to upstream executive b, and upstream executive b reclaims the tensor data that has been consumed: executive c produces memory block r31, and whereupon sends a message to downstream consumer executive C to inform executive C of reading memory block r31 produced by executive c; executive C receives memory block r31 and finds that there is free memory block r11 in executive C, whereupon executive C starts execution to read memory block r31 and writes a result into memory block r11. At the same time, executive c sends a message to upstream producer executive b to inform executive b that executive c has finished using memory block r21 of executive b; executive b receives memory block r21 that is returned by executive c after use and checks whether all consumers have finished using memory block r21, and then reclaims memory block r21 and marks memory block r21 as a free block.
- At the same time, for the second batch of data, current executive b sends a message to downstream executive c, downstream executive c prepares tensor data to be consumed, current executive b sends a message to upstream executive a, and upstream executive a reclaims the tensor data that has been consumed: executive b produces memory block r22, and whereupon sends a message to downstream consumer executive c to inform executive c of reading memory block r22 produced by executive b; executive c receives memory block r22 and finds that there is free memory block r32 in executive c, whereupon executive c starts execution to read memory block r22 and writes a result into memory block r32. At the same time, executive b sends a message to upstream producer executive a to inform executive a that executive b has finished using memory block r12 of executive a; executive a receives memory block r12 that is returned by executive b after use and checks whether all consumers have finished using memory block r12, and then reclaims memory block r12 and marks memory block r12 as a free block.
- At the same time, for the third batch of data, current executive a sends a message to downstream executive b, and downstream executive b prepares tensor data to be consumed: executive a sends a message to executive b to inform executive b of reading memory block r13 produced by executive a; executive b receives the message and checks whether there is an available free memory block among the memory blocks produced by executive b; if available free memory block r23 is found, executive b executes a kernel function computation task of forward operator b and reads memory block r13; and executive b writes an output tensor result generated by the execution into memory block r23.
- Then executive a, executive b, executive c and executive C start to work in parallel. Executives B and A are still standby since there is no readable data.
- At time T5:
- For the fourth batch of data, executive a inputs the data: executive a may also simultaneously check whether there is a writable free memory block in executive a and whether executive A has completed the execution; and if not, executive a waits and does not enter the pipeline.
- At the same time, for the first batch of data, current executive C sends a message to downstream executive B, downstream executive B prepares tensor data to be consumed, current executive C sends a message to upstream executive c, and upstream executive c reclaims the tensor data that has been consumed: executive C produces memory block r11, and whereupon sends a message to downstream consumer executive B to inform executive B of reading memory block r11 produced by executive C; executive B receives memory block r11 and finds that there is free memory block r21 in executive B, whereupon executive B starts execution to read memory block r11 and writes a result into memory block r21. At the same time, executive C sends a message to upstream producer executive c to inform executive c that executive C has finished using memory block r31 of executive c; executive c receives memory block r31 that is returned by executive C after use and checks whether all consumers have finished using memory block r31, and then reclaims memory block r31 and marks memory block r31 as a free block.
- At the same time, for the second batch of data, current executive c sends a message to downstream executive C, downstream executive C prepares tensor data to be consumed, current executive c sends a message to upstream executive b, and upstream executive b reclaims the tensor data that has been consumed: executive c produces memory block r32, and whereupon sends a message to downstream consumer executive C to inform executive C of reading memory block r32 produced by executive c; executive C receives memory block r32 and finds that there is free memory block r12 in executive C, whereupon executive C starts execution to read memory block r32 and writes a result into memory block r12. At the same time, executive c sends a message to upstream producer executive b to inform executive b that executive c has finished using memory block r22 of executive b; executive b receives memory block r22 that is returned by executive c after use and checks whether all consumers have finished using memory block r22, and then reclaims memory block r22 and marks memory block r22 as a free block.
- At the same time, for the third batch of data, current executive b sends a message to downstream executive c, downstream executive c prepares tensor data to be consumed, current executive b sends a message to upstream executive a, and upstream executive a reclaims the tensor data that has been consumed: executive b produces memory block r23, and whereupon sends a message to downstream consumer executive c to inform executive c of reading memory block r23 produced by executive b; executive c receives memory block r23 and finds that there is free memory block r33 in executive c, whereupon executive c starts execution to read memory block r23 and writes a result into memory block r33. At the same time, executive b sends a message to upstream producer executive a to inform executive a that executive b has finished using memory block r13 of executive a; executive a receives memory block r13 that is returned by executive b after use and checks whether all consumers have finished using memory block r13, and then reclaims memory block r13 and marks memory block r13 as a free block.
- Then executive a, executive b, executive c, executive C and executive B start to work in parallel. Executive A is still standby since there is no readable data.
- At time T6:
- For the fourth batch of data, executive a inputs the data: executive a may also simultaneously check whether there is a writable free memory block in executive a and whether executive A has completed the execution; and if not, executive a waits and does not enter the pipeline.
- At the same time, for the first batch of data, current executive B sends a message to downstream executive A, and downstream executive A prepares tensor data to be consumed, then downstream executive A reclaims computation data on its own, current executive B sends a message to upstream executive C, and upstream executive C reclaims the tensor data that has been consumed: executive B produces memory block r21, and whereupon sends a message to downstream consumer executive A to inform executive A of reading memory block r21 produced by executive B; executive A receives memory block r21 and finds that there is free memory block r31 in executive A, whereupon executive A starts execution to read memory block r21 and writes a result into memory block r31; and executive A immediately reclaims memory block r31 on its own after completing the execution. At the same time, executive B sends a message to upstream producer executive C to inform executive C that executive B has finished using memory block r11 of executive C; executive C receives memory block r11 that is returned by executive B after use and checks whether all consumers have finished using memory block r11, and then reclaims memory block r11 and marks memory block r11 as a free block.
- At the same time, for the second batch of data, current executive C sends a message to downstream executive B, downstream executive B prepares tensor data to be consumed, current executive C sends a message to upstream executive c, and upstream executive c reclaims the tensor data that has been consumed: executive C produces memory block r12, and whereupon sends a message to downstream consumer executive B to inform executive B of reading memory block r12 produced by executive C; executive B receives memory block r12 and finds that there is free memory block r22 in executive B, whereupon executive B starts execution to read memory block r12 and writes a result into memory block r22. At the same time, executive C sends a message to upstream producer executive c to inform executive c that executive C has finished using memory block r32 of executive c; executive c receives memory block r32 that is returned by executive C after use and checks whether all consumers have finished using memory block r32, and then reclaims memory block r32 and marks memory block r32 as a free block.
- At the same time, for the third batch of data, current executive c sends a message to downstream executive C, downstream executive C prepares tensor data to be consumed, current executive c sends a message to upstream executive b, and upstream executive b reclaims the tensor data that has been consumed: executive c produces memory block r33, and whereupon sends a message to downstream consumer executive C to inform executive C of reading memory block r33 produced by executive c; executive C receives memory block r33 and finds that there is free memory block r13 in executive C, whereupon executive C starts execution to read memory block r33 and writes a result into memory block r13. At the same time, executive c sends a message to upstream producer executive b to inform executive b that executive c has finished using memory block r23 of executive b; executive b receives memory block r23 that is returned by executive c after use and checks whether all consumers have finished using memory block r23, and then reclaims memory block r23 and marks memory block r23 as a free block.
- Then executive a, executive b, executive c, executive C, executive B and executive A all start to work in parallel.
- At time T7:
- For a fourth batch of data, executive a inputs the data: executive a may also simultaneously check whether there is a writable free block in executive a and whether executive A has completed the execution; if so, executive a also executes the fourth batch of data and writes an execution result into free memory block r11.
- At the same time, for the first batch of data, all the executives complete the execution.
- At the same time, for the second batch of data, current executive B sends a message to downstream executive A, downstream executive A prepares tensor data to be consumed, current executive B sends a message to upstream executive C, and upstream executive C reclaims the tensor data that has been consumed: executive B produces memory block r22, and whereupon sends a message to downstream consumer executive A to inform executive A of reading memory block r22 produced by executive B; executive A receives memory block r22 and finds that there is free memory block r32 in executive A, whereupon executive A starts execution to read memory block r22 and writes a result into memory block r32; and executive A immediately reclaims memory block r32 on its own after completing the execution. At the same time, executive B sends a message to upstream producer executive C to inform executive C that executive B has finished using memory block r12 of executive C; executive C receives memory block r12 that is returned by executive B after use and checks whether all consumers have finished using memory block r12, and then reclaims memory block r12 and marks memory block r12 as a free block.
- At the same time, for the third batch of data, current executive C sends a message to downstream executive B, downstream executive B prepares tensor data to be consumed, current executive C sends a message to upstream executive c, and upstream executive c reclaims the tensor data that has been consumed: executive C produces memory block r13, and whereupon sends a message to downstream consumer executive B to inform executive B of reading memory block r13 produced by executive C; executive B receives memory block r13 and finds that there is free memory block r23 in executive B, whereupon executive B starts execution to read memory block r13 and writes a result into memory block r23. At the same time, executive C sends a message to upstream producer executive c to inform executive c that executive C has finished using memory block r33 of executive c; executive c receives memory block r33 that is returned by executive C after use and checks whether all consumers have finished using memory block r33, and then reclaims memory block r33 and marks memory block r33 as a free block. Then executive a, executive B and executive A work in parallel.
- At time T8:
- Executives a, b, c, C, B and A all work. From this time on, each time the executives complete the execution for one batch of data, a next batch of data is input. By means of the design of a plurality of idle memory blocks, the executives achieve pipelining and parallelizing work, as illustrated by the simulation sketch below.
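Under the assumptions of this example (six executives a→b→c→C→B→A, three memory blocks each, four batches of data), the per-batch schedule traced above can be reproduced by a small self-contained simulation. The driver below models the check "whether executive A has completed the execution" as a flow-control rule that keeps at most three batches in flight, and it reclaims an upstream block in the same time step in which it is consumed; both are simplifying assumptions for illustration, not the disclosed implementation.

```python
# Standalone illustrative simulation (simplifying assumptions, hypothetical names)
# of the schedule in this example: six executives a -> b -> c -> C -> B -> A,
# three memory blocks per executive, four batches of data.
from collections import deque

NAMES = ["a", "b", "c", "C", "B", "A"]
NUM_BLOCKS = 3      # free memory blocks owned by each executive
NUM_BATCHES = 4     # batches of data in the example

free = {n: NUM_BLOCKS for n in NAMES}     # free memory blocks per executive
inbox = {n: deque() for n in NAMES}       # batches whose input block is ready to read
admitted = 0                              # batches that have entered executive a
completed = 0                             # batches fully processed by the last executive A

for tick in range(1, 9):                  # times T1 .. T8
    active = []                           # (executive, batch) pairs executing at this time
    produced = []                         # (producer index, batch) published at this time

    # Executive a admits a new batch only if it has a writable free block and the
    # number of batches in flight stays within its memory block pool; this models
    # the example's check on whether executive A has completed the execution.
    if admitted < NUM_BATCHES and free["a"] > 0 and admitted - completed < NUM_BLOCKS:
        admitted += 1
        free["a"] -= 1
        produced.append((0, admitted))
        active.append(("a", admitted))

    # Every downstream executive consumes one ready input if it has a free block.
    for i, name in enumerate(NAMES[1:], start=1):
        if inbox[name] and free[name] > 0:
            batch = inbox[name].popleft()
            free[name] -= 1
            free[NAMES[i - 1]] += 1       # the upstream executive reclaims the consumed block
            if name == "A":
                free[name] += 1           # the last executive reclaims its own block
                completed += 1
            else:
                produced.append((i, batch))
            active.append((name, batch))

    # Deliver the blocks produced at this time to the downstream consumers.
    for i, batch in produced:
        inbox[NAMES[i + 1]].append(batch)

    print(f"T{tick}:", ", ".join(f"{n}(batch {b})" for n, b in active) or "idle")
```

Running this prints, for each time T1 to T8, which executives execute which batch: the first batch advances by one executive per time step and completes at executive A at time T6, and the fourth batch enters the pipeline at time T7 once executive A has completed the first batch, matching the reclamation steps traced above.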
- The above embodiments are only the preferred embodiments of the present disclosure, and are not intended to limit the present disclosure. Any modifications, equivalent replacements or improvements that are made within the spirit and principle of the present disclosure shall fall within the protection scope of the present disclosure.
Claims (8)
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210447287.7A CN114548383A (en) | 2022-04-27 | 2022-04-27 | Graph execution pipeline parallel method and device for neural network model calculation |
| CN202210447287.7 | 2022-04-27 | | |
| PCT/CN2022/092481 WO2023082575A1 (en) | 2022-04-27 | 2022-05-12 | Graph execution pipeline parallelism method and apparatus for neural network model computation |
Related Parent Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/092481 Continuation WO2023082575A1 (en) | 2022-04-27 | 2022-05-12 | Graph execution pipeline parallelism method and apparatus for neural network model computation |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20230351145A1 (en) | 2023-11-02 |
| US12468921B2 (en) | 2025-11-11 |
Family
ID=81667147
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/838,342 Active 2044-08-11 US12468921B2 (en) | 2022-04-27 | 2022-06-13 | Pipelining and parallelizing graph execution method for neural network model computation and apparatus thereof |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US12468921B2 (en) |
| CN (1) | CN114548383A (en) |
| WO (1) | WO2023082575A1 (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114548383A (en) | 2022-04-27 | 2022-05-27 | 之江实验室 | Graph execution pipeline parallel method and device for neural network model calculation |
| CN115408157A (en) * | 2022-08-31 | 2022-11-29 | 北京中科睿信科技有限公司 | Model parallelization data simulation method based on thread pool |
| CN115688893B (en) * | 2022-10-19 | 2024-09-03 | 北京百度网讯科技有限公司 | Memory scheduling method and device, electronic device and storage medium |
| CN117032954B (en) * | 2023-07-17 | 2024-04-26 | 北京泛睿科技合伙企业(有限合伙) | Memory optimization method, system, equipment and medium for terminal training model |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190362227A1 (en) * | 2018-05-23 | 2019-11-28 | Microsoft Technology Licensing, Llc | Highly performant pipeline parallel deep neural network training |
| US20220156469A1 (en) * | 2020-11-16 | 2022-05-19 | Lightmatter, Inc. | Parallelization and pipelining strategies for an efficient analog neural network accelerator |
| US20220318614A1 (en) * | 2021-04-02 | 2022-10-06 | Tenstorrent Inc. | Graph execution using access request response dynamic batch assembly |
| US20230004871A1 (en) * | 2021-06-30 | 2023-01-05 | Advanced Micro Devices, Inc. | Machine learning cluster pipeline fusion |
| US20230083345A1 (en) * | 2021-09-07 | 2023-03-16 | Nvidia Corporation | Multi-architecture execution graphs |
| US20230084951A1 (en) * | 2021-09-16 | 2023-03-16 | Nvidia Corporation | Synchronizing graph execution |
| US20230169408A1 (en) * | 2021-11-30 | 2023-06-01 | International Business Machines Corporation | Annotation of a Machine Learning Pipeline with Operational Semantics |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112884086B (en) * | 2021-04-06 | 2022-08-30 | 北京百度网讯科技有限公司 | Model training method, device, equipment, storage medium and program product |
| CN114139702B (en) * | 2021-11-25 | 2025-06-20 | 广东浪潮智慧计算技术有限公司 | A deep neural network training method, system, device, equipment and medium |
| CN114186687B (en) * | 2022-02-17 | 2022-05-17 | 之江实验室 | An intermediate representation method and device for neural network model calculation |
| CN114237918B (en) * | 2022-02-28 | 2022-05-27 | 之江实验室 | A graph execution method and device for neural network model calculation |
| CN114548383A (en) * | 2022-04-27 | 2022-05-27 | 之江实验室 | Graph execution pipeline parallel method and device for neural network model calculation |
- 2022-04-27 CN CN202210447287.7A patent/CN114548383A/en active Pending
- 2022-05-12 WO PCT/CN2022/092481 patent/WO2023082575A1/en not_active Ceased
- 2022-06-13 US US17/838,342 patent/US12468921B2/en active Active
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190362227A1 (en) * | 2018-05-23 | 2019-11-28 | Microsoft Technology Licensing, Llc | Highly performant pipeline parallel deep neural network training |
| US20220156469A1 (en) * | 2020-11-16 | 2022-05-19 | Lightmatter, Inc. | Parallelization and pipelining strategies for an efficient analog neural network accelerator |
| US20220318614A1 (en) * | 2021-04-02 | 2022-10-06 | Tenstorrent Inc. | Graph execution using access request response dynamic batch assembly |
| US20230004871A1 (en) * | 2021-06-30 | 2023-01-05 | Advanced Micro Devices, Inc. | Machine learning cluster pipeline fusion |
| US20230083345A1 (en) * | 2021-09-07 | 2023-03-16 | Nvidia Corporation | Multi-architecture execution graphs |
| US20230084951A1 (en) * | 2021-09-16 | 2023-03-16 | Nvidia Corporation | Synchronizing graph execution |
| US20230169408A1 (en) * | 2021-11-30 | 2023-06-01 | International Business Machines Corporation | Annotation of a Machine Learning Pipeline with Operational Semantics |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2023082575A1 (en) | 2023-05-19 |
| US12468921B2 (en) | 2025-11-11 |
| CN114548383A (en) | 2022-05-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12468921B2 (en) | Pipelining and parallelizing graph execution method for neural network model computation and apparatus thereof | |
| US11941514B2 (en) | Method for execution of computational graph in neural network model and apparatus thereof | |
| CN107526645B (en) | A kind of communication optimization method and system | |
| CN108376221A (en) | A software system security verification and evaluation method based on AADL model extension | |
| CN103778015A (en) | Managing computing resources in graph-based computations | |
| CN103078941A (en) | Task scheduling method and system for distributed computing system | |
| CN113157710B (en) | Block chain data parallel writing method and device, computer equipment and storage medium | |
| CN109885452A (en) | Method for monitoring performance, device and terminal device | |
| CN110290166B (en) | Cross-cluster data interaction method, system and device and readable storage medium | |
| CN108710536A (en) | A kind of multi-level fine-grained virtualization GPU method for optimizing scheduling | |
| CN108205440A (en) | A Implementation Method of Task Flow Framework Supporting Rollback | |
| CN114371939A (en) | Task processing method, device, electronic device, storage medium and program product | |
| CN106293947A (en) | GPU CPU mixing resource allocation system and method under virtualization cloud environment | |
| CN111985634A (en) | Neural network computing method, device, computer equipment and storage medium | |
| CN115586988A (en) | Method and device for recovering data, electronic equipment and storage medium | |
| CN115858667A (en) | Method, apparatus, device and storage medium for synchronizing data | |
| CN112130849B (en) | Code automatic generation method and device | |
| CN111274667B (en) | Cross-scale material computing software integrated computing system and method | |
| US20240104395A1 (en) | Memory optimization method and device oriented to neural network computing | |
| CN114020476B (en) | An operation processing method, equipment and medium | |
| CN111553379B (en) | Asynchronous training-based image data processing method and system | |
| CN114816680A (en) | Business process model discovery method based on CPU-GPU architecture | |
| US20250138890A1 (en) | Processing a user task | |
| US20250258712A1 (en) | Method, device, and storage medium for data processing | |
| CN111538714B (en) | Instruction execution method and device, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: ZHEJIANG LAB, CHINA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: WANG, HONGSHENG; TAN, BOWEN; BAO, HUJUN; AND OTHERS; SIGNING DATES FROM 20220525 TO 20220603; REEL/FRAME: 060176/0289. Owner name: ZHEJIANG LAB, CHINA. Free format text: ASSIGNMENT OF ASSIGNOR'S INTEREST; ASSIGNORS: WANG, HONGSHENG; TAN, BOWEN; BAO, HUJUN; AND OTHERS; SIGNING DATES FROM 20220525 TO 20220603; REEL/FRAME: 060176/0289 |
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: ALLOWED -- NOTICE OF ALLOWANCE NOT YET MAILED. Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |