
CN120803677A - Model distributed training optimization method, electronic equipment and storage medium - Google Patents

Model distributed training optimization method, electronic equipment and storage medium

Info

Publication number
CN120803677A
CN120803677A CN202511308976.XA CN202511308976A CN120803677A CN 120803677 A CN120803677 A CN 120803677A CN 202511308976 A CN202511308976 A CN 202511308976A CN 120803677 A CN120803677 A CN 120803677A
Authority
CN
China
Prior art keywords
communication
calculation
micro
task
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511308976.XA
Other languages
Chinese (zh)
Other versions
CN120803677B (en)
Inventor
Name withheld at the applicant's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202511308976.XA priority Critical patent/CN120803677B/en
Publication of CN120803677A publication Critical patent/CN120803677A/en
Application granted granted Critical
Publication of CN120803677B publication Critical patent/CN120803677B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5072 Grid computing
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/048 Activation functions
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multi Processors (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides a model distributed training optimization method, electronic equipment, and a storage medium. The method includes constructing a calculation flow and a communication flow based on the calculation operations and communication operations in a model, and performing staggered pipeline scheduling for the tasks of at least two micro-batches in a training iteration of the model. The staggered pipeline scheduling includes dispatching the calculation tasks of a first micro-batch to the calculation flow and the communication tasks of a second micro-batch to the communication flow, or dispatching the communication tasks of the first micro-batch to the communication flow and the calculation tasks of the second micro-batch to the calculation flow, so that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel. By staggering the pipeline scheduling of tasks from different micro-batches, the invention processes calculation tasks and communication tasks in parallel, improving hardware utilization and model training efficiency.

Description

Model distributed training optimization method, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a model distributed training optimization method, electronic equipment and a storage medium.
Background
In recent years, large-scale language models based on Transformers have achieved significant success in natural language processing and other fields. To further improve model capacity and performance while controlling computational cost, the Mixture of Experts (MoE) model was proposed. The MoE model dynamically distributes calculation tasks to a plurality of parallel expert sub-networks through a routing (Router) module, effectively improving model performance at a controllable computational cost. In a multi-device distributed training scenario, data needs to be exchanged globally (All-to-All) between devices so that each token is correctly sent to its designated expert sub-network for calculation; this All-to-All exchange is a major performance bottleneck.
Currently, training optimization of the MoE model mainly relies on two schemes. One is the synchronous execution mode, in which communication and calculation tasks are executed strictly in series. In this mode the calculation units of the device are idle during communication and the communication links are idle during calculation, resulting in serious waste of computing resources and inefficient training. The other scheme improves load balancing among expert sub-networks through algorithm optimization (such as introducing an auxiliary loss function), but it does not solve the problem of serial calculation and communication, its optimization effect lags, it cannot immediately improve the efficiency of a single iteration, and it introduces additional calculation overhead.
Disclosure of Invention
The invention provides a model distributed training optimization method, electronic equipment, and a storage medium, aiming to overcome defects such as low hardware utilization and low training efficiency caused by the serial execution of calculation and communication in the distributed training of large-scale models.
The invention provides a model distributed training optimization method, which comprises the following steps:
Based on the calculation operation and the communication operation in the model, respectively constructing a calculation flow and a communication flow, wherein the calculation flow is used for executing calculation tasks, and the communication flow is used for executing communication tasks;
In a training iteration of the model, performing staggered pipeline scheduling for the tasks of at least two micro-batches;
The staggered pipeline scheduling includes dispatching the calculation tasks of a first micro-batch to the calculation flow for execution while dispatching the communication tasks of a second micro-batch to the communication flow for execution, or dispatching the communication tasks of the first micro-batch to the communication flow for execution while dispatching the calculation tasks of the second micro-batch to the calculation flow for execution, so that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the first micro-batch and the second micro-batch being any two different micro-batches among the at least two micro-batches.
According to the model distributed training optimization method provided by the invention, the calculation flow and the communication flow are respectively constructed based on the calculation operation and the communication operation in the model, and the method comprises the following steps:
reconstructing computing operations in the model as at least one computing node and reconstructing communication operations in the model as at least one communication node;
And respectively constructing the computing flow and the communication flow based on the at least one computing node and the at least one communication node, wherein the computing flow is used for executing the computing task corresponding to the at least one computing node, and the communication flow is used for executing the communication task corresponding to the at least one communication node.
According to the method for optimizing the distributed training of the model, which is provided by the invention, the computing operation in the model is reconstructed into at least one computing node, and the method comprises the following steps:
Fusing the attention computation, residual computation, route computation and rearrangement operations in the model into a first computation node;
and fusing full-connection layer calculation and inverse rearrangement operation in the model into a second calculation node.
According to the method for optimizing the distributed training of the model, which is provided by the invention, the communication operation in the model is reconstructed into at least one communication node, and the method comprises the following steps:
reconstructing a distribution operation in the model as a first communication node and reconstructing a merging operation in the model as a second communication node, wherein both the first communication node and the second communication node perform global exchange (All-to-All) communication.
The invention provides a model distributed training optimization method, which further comprises the following steps:
acquiring an asynchronous communication handle based on the communication flow;
Before executing a target calculation task on the calculation flow, inserting a waiting operation, where the waiting operation waits for the target communication task associated with the asynchronous communication handle to complete execution, the input of the target calculation task depending on the output of the target communication task.
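The asynchronous-handle-plus-wait pattern described above can be sketched in plain Python (a minimal illustration using `concurrent.futures`, not the actual device-side implementation; the function names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def all_to_all(tokens):
    # Stand-in for an asynchronous All-to-All communication operation.
    return [t * 2 for t in tokens]

def expert_compute(exchanged):
    # Calculation task whose input depends on the communication result.
    return sum(exchanged)

with ThreadPoolExecutor(max_workers=1) as comm_stream:
    # Launching the communication returns a handle immediately, so the
    # caller is free to schedule other (independent) work in the meantime.
    handle = comm_stream.submit(all_to_all, [1, 2, 3])

    # ... independent calculation tasks of another micro-batch run here ...

    # Waiting operation: block only right before the dependent task.
    exchanged = handle.result()
    out = expert_compute(exchanged)
```

The key point the sketch shows is that the wait is inserted as late as possible, immediately before the task that consumes the communication output, rather than right after the communication is launched.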
According to the model distributed training optimization method provided by the invention, when the target communication task is a distribution communication task in the forward phase, the target calculation task is the calculation task corresponding to the second calculation node in the forward phase;
when the target communication task is a merging communication task in the backward phase, the target calculation task is the calculation task corresponding to the second calculation node in the backward phase;
and when the target communication task is a distribution communication task in the backward phase, the target calculation task is the calculation task corresponding to the first calculation node in the backward phase.
The invention provides a model distributed training optimization method, which further comprises the following steps:
Pre-allocating separate communication buffers for communication tasks associated with the communication flows prior to performing the interleaved pipeline schedule;
and if the target communication task associated with the asynchronous communication handle does not complete execution within a preset duration of the waiting operation, adjusting the scheduling of the tasks of one or more subsequent micro-batches.
According to the model distributed training optimization method provided by the invention, the parallel processing comprises at least one of the following steps:
processing the reverse communication task of the first micro batch and the forward calculation task of the second micro batch in parallel;
and processing the reverse calculation task of the first micro batch and the forward communication task of the second micro batch in parallel.
The invention also provides a model distributed training optimization device, which comprises:
a construction unit configured to construct a computation flow and a communication flow based on a computation operation and a communication operation in a model, respectively, the computation flow being executed asynchronously and in parallel with the communication flow;
the execution unit is used for executing staggered pipeline dispatching aiming at least two micro-batch tasks in training iteration of the model;
The staggered pipeline scheduling includes dispatching the calculation tasks of a first micro-batch to the calculation flow for execution while dispatching the communication tasks of a second micro-batch to the communication flow for execution, or dispatching the communication tasks of the first micro-batch to the communication flow for execution while dispatching the calculation tasks of the second micro-batch to the calculation flow for execution, so that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the first micro-batch and the second micro-batch being any two different micro-batches among the at least two micro-batches.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, the processor implementing the model distributed training optimization method as described above when executing the computer program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a model distributed training optimization method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a model distributed training optimization method as described in any one of the above.
According to the model distributed training optimization method, the electronic equipment, and the storage medium, independent calculation and communication flows are constructed based on the calculation operations and communication operations in the model, creating the fundamental conditions for parallel task processing. On this basis, by performing staggered pipeline scheduling on the tasks of different micro-batches, calculation tasks and communication tasks that would otherwise wait on each other serially are overlapped, effectively hiding communication time within calculation time and breaking the serial execution bottleneck of calculation and communication. Because calculation tasks and communication tasks execute in parallel, the calculation units of the device and the communication links responsible for data transmission can be busy at the same time, greatly reducing device idle time, improving hardware utilization, and effectively improving model training efficiency. In addition, compared with algorithm optimization schemes relying on an auxiliary loss function, this method is an optimization at the execution scheduling layer: its performance improvement takes effect immediately once training iterations begin, without waiting for multiple iterations as algorithm optimization does, solving the problem of insufficient immediacy. More importantly, efficiency is improved purely by optimizing the task scheduling flow; no new computation, such as calculating an auxiliary loss, is introduced, avoiding extra calculation overhead.
Drawings
To more clearly illustrate the technical solutions of the invention and of the related art, the drawings used in the description of the embodiments or the related art are briefly introduced below. The drawings described below illustrate some embodiments of the invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a computing device provided by the present invention;
FIG. 2 is a schematic flow chart of the model distributed training optimization method provided by the invention;
FIG. 3 is a schematic diagram of node reconstruction provided by the present invention;
FIG. 4 is a schematic diagram of the parallel processing of tasks of a first micro-batch and tasks of a second micro-batch provided by the present invention;
FIG. 5 is a schematic diagram of a model distributed training optimization device provided by the invention;
Fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In recent years, large-scale language models based on transformers have achieved significant success in natural language processing and other fields. In order to further increase the model capacity and performance, while controlling the computational cost, moE models are proposed. The MoE model introduces a plurality of parallel expert sub-networks (hereinafter referred to as "experts") and dynamically distributes input data (such as token) to a part of the experts for processing by a routing module, so that the total parameter number of the model can be greatly expanded without significantly increasing the single forward computation amount. It should be understood that in a deep learning model, token refers to the smallest unit of model processing information, which is essentially the process of converting raw data (e.g., text, image, audio) into a sequence of discrete symbols by segmentation. For example, in a text processing scenario, each word, punctuation or subword can be considered a token, in an image processing scenario, the image is segmented into fixed-size blocks of pixels, each block being a token, and in an audio processing scenario, the speech signal is segmented into short-time frames (e.g., 25ms audio clips), each frame being a token.
In the distributed training of MoE models, as the model is sliced and deployed across multiple computing devices, data exchange between devices becomes critical. Here, a computing device may be a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), a TPU (Tensor Processing Unit), or the like. Specifically, the inter-device data exchange involves two critical global exchange (All-to-All) communications. First, after the routing module selects experts for a token, an All-to-All communication operation sends the token to the devices where the selected experts reside for calculation. Second, after each expert completes its calculation, another All-to-All communication operation returns the output results to the original devices for subsequent calculation. It should be understood that All-to-All is a multi-device full-exchange communication mode, in which every sender transmits data to every receiver and every receiver returns results to every sender.
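The two All-to-All phases described above can be illustrated with a pure-Python simulation (a hedged sketch; the routing and expert functions are placeholder assumptions, not part of the patent):

```python
def all_to_all_dispatch(per_device_tokens, route):
    """First All-to-All: group each device's tokens by destination device."""
    n = len(per_device_tokens)
    inbox = [[] for _ in range(n)]
    for src, tokens in enumerate(per_device_tokens):
        for tok in tokens:
            dst = route(tok)
            inbox[dst].append((src, tok))   # remember the origin device
    return inbox

def all_to_all_combine(inbox, expert, n_devices):
    """Run each device's expert, then return outputs to origin devices
    (second All-to-All)."""
    outbox = [[] for _ in range(n_devices)]
    for dev, items in enumerate(inbox):
        for src, tok in items:
            outbox[src].append(expert(dev, tok))
    return outbox

# Two devices; a toy router sends token t to device t % 2, and the
# expert on device d adds d * 10 so the result reveals where it ran.
tokens = [[1, 2], [3, 4]]
inbox = all_to_all_dispatch(tokens, route=lambda t: t % 2)
outputs = all_to_all_combine(inbox, expert=lambda dev, t: t + dev * 10,
                             n_devices=2)
```

The sketch only models the data movement pattern (dispatch, compute, combine); on real hardware each phase is a collective communication over the device interconnect.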
At present, the training optimization of the MoE model mainly depends on the following two schemes:
One is the synchronous execution mode. In this mode, the training process strictly follows a serial sequence: all devices first perform All-to-All communication in parallel to distribute data to the experts, and only after all communication operations complete do all devices uniformly execute the calculation tasks (such as matrix multiply-accumulate). Although simple to implement, this mode has significant drawbacks. Since calculation and communication are carried out entirely in series, the calculation units sit idle while the device performs All-to-All communication, and conversely the communication bandwidth goes unused while the device is computing. This serial dependence leaves computing resources such as GPUs idle for long stretches during training, keeps overall utilization low, and fails to exploit the multi-stream concurrent processing capability of modern hardware (such as GPUs).
The other is the algorithm optimization mode. This approach attempts to improve load balancing at the algorithmic level, typically by introducing an auxiliary loss function (e.g., a balance loss). The loss function penalizes unbalanced load distribution among experts, and by continuously optimizing this loss term during training, the routing module is guided to learn a more balanced token distribution strategy. However, this approach has limitations. First, it does not solve the fundamental problem of serial calculation and communication within a single iteration, so device idleness remains. Second, its optimization effect appears only after accumulating over many iterations, so its response to transient load imbalance is delayed and training efficiency cannot be improved immediately. Finally, the auxiliary loss function itself introduces additional calculation overhead, typically increasing the total duration of training.
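For illustration, a balance loss of the kind mentioned above could be computed as follows. This is a sketch in the style of the Switch Transformer auxiliary loss, `n_experts * sum_i(f_i * P_i)`; the concrete formula is an assumption, since the patent does not specify one:

```python
def balance_loss(router_probs, assignments, n_experts):
    """Auxiliary load-balancing loss: n_experts * sum_i f_i * P_i, where
    f_i is the fraction of tokens routed to expert i and P_i is the mean
    router probability for expert i. Minimized by uniform routing."""
    n_tokens = len(assignments)
    f = [assignments.count(e) / n_tokens for e in range(n_experts)]
    P = [sum(p[e] for p in router_probs) / n_tokens
         for e in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Perfectly balanced routing over 2 experts yields the minimum value 1.0.
probs = [[0.5, 0.5], [0.5, 0.5]]   # per-token router probabilities
loss = balance_loss(probs, assignments=[0, 1], n_experts=2)
```

Note the overhead this illustrates: the loss is an extra computation on every iteration, which is exactly the cost the scheduling-level approach of the invention avoids.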
Aiming at the problems, the invention provides a model distributed training optimization method, which is used for effectively separating a calculation flow and a communication flow, executing staggered pipeline scheduling across micro batches, realizing parallel processing of calculation tasks and communication tasks, improving hardware utilization rate and model training efficiency, immediately improving single iteration efficiency, and avoiding additional calculation expenditure, thereby overcoming the defects.
Fig. 1 is a schematic structural diagram of a computing device according to the present invention. As shown in fig. 1, the execution body of the model distributed training optimization method according to the present invention may be a computing device, where the computing device 100 includes a plurality of processing modules 101 and a memory 102, and the processing modules 101 may be stream processor clusters (Streaming Processor Cluster, SPC). Each processing module 101 includes a plurality of computing units 103, and each computing unit 103 includes at least an on-chip cache 104 and a register 105.
The memory 102 may be a high bandwidth memory (High Bandwidth Memory, HBM) or another type of memory. The on-chip cache 104 is a temporary memory with a smaller capacity than the memory 102 but a faster data exchange rate. The on-chip cache 104 may be a general matrix main buffer (Gemm Main Buffer, GMB). The register 105 has a smaller capacity than the on-chip cache 104 but a faster data exchange speed. The register 105 may specifically be a thread local register (Thread Local Register, TLR).
It should be noted that, the computing device 100 in the embodiment of the present invention may include other structures besides the above-described structures, which is not limited in particular by the present invention.
Based on the architecture diagram of the computing device shown in fig. 1, the embodiment of the invention provides a model distributed training optimization method, which can be applied to various scenes, such as text processing, image processing, audio processing and the like, and model training data (i.e. data corresponding to different micro-batches) have different physical meanings under different application scenes. For example, in a text processing scenario, the model training data may be text data related to text generation, text recognition, and the like. For another example, in an image processing scenario, the model training data may be image data related to image preprocessing, image segmentation, object detection, and the like. Also for example, in an audio processing scenario, the model training data may be audio data related to speech recognition, speech synthesis, or the like. The specific flow of the model distributed training optimization method provided by the invention is described below.
Fig. 2 is a schematic flow chart of a model distributed training optimization method provided by the invention, and as shown in fig. 2, the method includes:
Step S10, based on the calculation operation and the communication operation in the model, respectively constructing a calculation flow and a communication flow, wherein the calculation flow is used for executing the calculation task, and the communication flow is used for executing the communication task.
It should be noted that the model in the embodiment of the present invention may be a large-scale hybrid expert (MoE) model, such as Switch Transformer. Of course, the model in the embodiments of the present invention is not limited to the MoE model, and the core concept can be applied to other models that require large-scale distributed training and have significant computation and communication phases, such as the traditional Dense (Dense) model.
Specifically, during model training, operations can be divided into two broad categories, namely computing operations and communication operations. Wherein, a computing operation refers to an operation that primarily consumes a computing unit of a computing device (e.g., GPU). In the MoE model, the computation operations may include attention computation, residual computation, routing computation, full-join layer computation (typically large-scale matrix multiplication computation), and reorder and inverse reorder operations for data alignment, among others.
Communication operations refer to operations that primarily occupy the bandwidth of the inter-device communication link for transferring data between different computing devices in a distributed training environment. In the distributed training of the MoE model, the most core communication operation is global exchange communication (All-to-All), which includes a distributing operation of distributing data to a specified expert and a merging operation of collecting expert calculation results back.
Specifically, to implement parallel execution of computation and communication, operations in the model training process first need to be decoupled, that is, independent computation flows and communication flows are respectively constructed according to the computation operations and the communication operations in the model. Herein, building a computational stream and a communication stream refers to creating two or more separate, parallel-executable instruction queues on a computing device (e.g., GPU). Wherein the computational flow is an execution queue dedicated to scheduling and executing computational tasks. The communication flow is an execution queue dedicated to scheduling and executing communication tasks. For example, the communication flow may be a communication queue created using a specialized communication library that is asynchronous with the computing flow.
It will be appreciated that computing tasks are constituted when computing operations in the model (e.g., attention calculations, full connection layer calculations, etc.) are scheduled for execution. Communication tasks are constituted when communication operations (e.g., distribution operations, merging operations, etc.) in the model are scheduled to be performed. By binding the calculation task and the communication task to independent streams respectively, decoupling of the calculation task and the communication task is realized from the hardware and software scheduling layers, and conditions are created for subsequent parallel execution.
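Binding calculation tasks and communication tasks to independent, asynchronously drained queues can be mimicked with the Python standard library. This is only an analogy to device streams, not real CUDA stream code; the `Stream` class is hypothetical:

```python
import queue
import threading

class Stream:
    """A minimal stand-in for a device stream: an ordered task queue
    drained asynchronously by a dedicated worker thread."""
    def __init__(self, name):
        self.name = name
        self.tasks = queue.Queue()
        self.log = []
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            fn = self.tasks.get()
            if fn is None:          # sentinel: drain complete
                break
            self.log.append(fn())

    def launch(self, fn):
        self.tasks.put(fn)          # enqueue and return immediately

    def synchronize(self):
        self.tasks.put(None)
        self._worker.join()

compute_stream = Stream("compute")
comm_stream = Stream("comm")

# Calculation and communication tasks are bound to independent streams,
# so each queue can make progress without blocking the other.
compute_stream.launch(lambda: "attention+mlp")
comm_stream.launch(lambda: "all_to_all")
compute_stream.synchronize()
comm_stream.synchronize()
```

The design point the sketch captures is that `launch` is non-blocking: ordering is guaranteed only within a stream, which is precisely what makes cross-stream parallelism possible.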
Step S20, in a training iteration of the model, performing staggered pipeline scheduling for the tasks of at least two micro-batches;
The staggered pipeline scheduling includes dispatching the calculation tasks of a first micro-batch to the calculation flow for execution while dispatching the communication tasks of a second micro-batch to the communication flow for execution, or dispatching the communication tasks of the first micro-batch to the communication flow for execution while dispatching the calculation tasks of the second micro-batch to the calculation flow for execution, so that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the first micro-batch and the second micro-batch being any two different micro-batches among the at least two micro-batches.
Specifically, after the calculation flow and the communication flow are constructed, staggered pipeline scheduling is performed for the tasks of at least two micro-batches in each training iteration of the model, alternately dispatching the tasks of different micro-batches to the calculation flow and the communication flow so that calculation tasks and communication tasks execute in parallel.
Here, training iteration of the model refers to the complete process of forward computation, back propagation and gradient update of the model. For efficient training in a distributed environment, a large training batch (batch) is typically split into multiple smaller micro-batches (micro-batch).
Tasks of at least two micro-batches means that, in one training iteration, two or more micro-batches of data are being processed on the pipeline, e.g., micro-batch A (or first micro-batch) and micro-batch B (or second micro-batch). Each micro-batch contains a series of computing tasks and communication tasks.
It will be appreciated that staggered pipeline scheduling is an advanced scheduling strategy distinct from conventional serial execution. In the conventional scheme, only after all tasks (computation and communication) of one micro-batch A have been executed can the tasks of the next micro-batch B be started. Staggered pipeline scheduling breaks this barrier: using the independent computation flow and communication flow that have been constructed, tasks of different micro-batches are executed in an overlapping manner. The aim is to process computing tasks and communication tasks in parallel, so that computation time that would otherwise sit idle waiting for communication to complete is put to use, and vice versa, thereby maximizing hardware utilization.
Specifically, when performing staggered pipeline scheduling for any two of the at least two micro-batches, the computing tasks of the first micro-batch may be dispatched to the computation flow for execution while the communication tasks of the second micro-batch are dispatched to the communication flow for execution. This means that while the computation flow is processing a computing task of the first micro-batch, the communication flow can synchronously process a communication task of the second micro-batch, so that the computation of the first micro-batch and the communication of the second micro-batch achieve parallel processing (i.e., overlapping execution).
Alternatively, when performing staggered pipeline scheduling for any two of the at least two micro-batches, the communication tasks of the first micro-batch may be dispatched to the communication flow for execution while the computing tasks of the second micro-batch are dispatched to the computation flow for execution. This means that while the computation flow is processing a computing task of the second micro-batch, the communication flow can synchronously process a communication task of the first micro-batch, the two again achieving parallel processing. This complements the situation described above, ensuring that overlap is produced as much as possible at different stages of the pipeline.
It should be noted that the first micro-batch and the second micro-batch are relative concepts, and refer to any two different micro-batches of at least two micro-batches, rather than a fixed two micro-batches. In general, they may refer to any two micro-batches that are adjacent in time or at different processing stages, with the reference relationship being dynamically changing with execution of the pipeline.
Illustratively, assume that in one training iteration, micro-batch A, micro-batch B, and micro-batch C enter the pipeline in sequence. At some point, micro-batch A has completed its forward-phase tasks and begins to enter the reverse phase, while micro-batch B begins to execute its forward-phase tasks. In this scenario, micro-batch A may be regarded as the first micro-batch and micro-batch B as the second micro-batch: when micro-batch A executes a reverse-phase communication task, micro-batch B executes a forward-phase computing task, and when micro-batch A executes a reverse-phase computing task, micro-batch B executes a forward-phase communication task, the two achieving parallel processing in both cases.
Along with the advance of the pipeline, the micro batch B starts to execute the tasks of the reverse stage after finishing the processing of the forward stage, and meanwhile, a new micro batch C enters the pipeline to start to execute the tasks of the forward stage. In this scenario, micro lot B may be considered a first micro lot, micro lot C may be considered a second micro lot, again enabling parallelism of computation and communication.
Thus, the designations of the first and second micro-batches are fluid: in a pipeline filled with micro-batches, the scheduler may dynamically pair any micro-batch that is executing a computing task with another micro-batch that is executing a communication task, without being limited to two fixed micro-batches. This mechanism ensures that computation and communication can continue to overlap as long as the pipeline is full (i.e., there are at least two micro-batches in flight), thereby minimizing hardware idle time throughout the training process and enabling stable and efficient pipelining.
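Illustratively, this rolling pairing may be sketched as a schedule generator (micro-batch labels such as mb0 and the particular forward/backward task pairs are illustrative; the pairing pattern follows the overlap of one micro-batch's reverse phase with the next micro-batch's forward phase described above):

```python
def interleaved_schedule(num_micro_batches):
    """Pair each micro-batch's backward-phase tasks with the next
    micro-batch's forward-phase tasks, so that at every step the
    compute stream and the communication stream are busy together."""
    steps = []
    for i in range(1, num_micro_batches):
        first, second = i - 1, i   # "first"/"second" are rolling labels
        steps.append({"comm":    f"mb{first}.combine.backward",
                      "compute": f"mb{second}.attention.forward"})
        steps.append({"comm":    f"mb{second}.dispatch.forward",
                      "compute": f"mb{first}.mlp.backward"})
    return steps

sched = interleaved_schedule(3)
for step in sched:
    print(step["compute"], "||", step["comm"])
```

Note that the labels roll forward: mb1 plays the "second" role when paired with mb0, then the "first" role when paired with mb2, matching the fluid designation described above.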
According to the method provided by the embodiment of the invention, independent computation flows and communication flows are respectively constructed based on the computing operations and communication operations in the model, fundamentally creating the conditions for parallel processing of tasks. On this basis, by performing staggered pipeline scheduling on tasks of different micro-batches, computing tasks and communication tasks that would otherwise wait on each other serially are executed in an overlapping manner, and communication time is effectively hidden within computation time, breaking the serial execution bottleneck of computation and communication. Because computing tasks and communication tasks can be executed in parallel, the computing units of the computing device and the communication links responsible for data transmission can be busy at the same time, greatly reducing device idle time, improving hardware utilization, and effectively improving model training efficiency. In addition, compared with algorithm-level optimization schemes relying on an auxiliary loss function, this method is an optimization at the execution scheduling layer: its performance improvement takes effect immediately once training iteration starts, rather than only after multiple iterations as algorithm-level optimization does, solving the problem of insufficient immediacy. More importantly, efficiency is improved by optimizing the task scheduling flow; no new computation such as evaluating an auxiliary loss is introduced, avoiding extra computational overhead.
Based on any of the above embodiments, step S10 specifically includes:
step S11, reconstructing the computing operation in the model into at least one computing node, and reconstructing the communication operation in the model into at least one communication node;
And step S12, respectively constructing the computing flow and the communication flow based on the at least one computing node and the at least one communication node, wherein the computing flow is used for executing the computing task corresponding to the at least one computing node, and the communication flow is used for executing the communication task corresponding to the at least one communication node.
It should be noted that the above embodiments set forth the basic idea of separating the computation and communication tasks into different streams. However, in order to make this separation and subsequent pipeline scheduling more efficient, embodiments of the present invention introduce key steps for node reconstruction.
Node reconstruction refers to the logical and implementation aggregation of multiple fine-grained operations in a model's original computational graph, forming larger, more regular execution units, i.e., nodes. The method aims to reduce the starting overhead of kernel functions (such as kernel on GPU) on the computing equipment, improve locality through operator fusion and reduce unnecessary memory reading and writing. In addition, by node reconstruction, a calculation block and a communication block with thicker granularity and clearer boundaries can be formed, so that the scheduling logic is simplified, and the pipeline overlapping is more stable and efficient.
Specifically, the node reconstruction in the embodiment of the present invention is divided into the reconstruction of the computing node and the reconstruction of the communication node. Here, a compute node is a logical unit that is a fusion of one or more computing operations, whose execution of computing tasks consumes primarily computing unit resources. A communication node is a logical unit made up of one or more communication operations that perform communication tasks that primarily occupy communication link resources.
After reconstruction is complete, the execution flows may be constructed based on these well-defined nodes. Specifically, the computation flow is used for executing the computing tasks corresponding to the at least one computing node, and the communication flow is used for executing the communication tasks corresponding to the at least one communication node. This means that the scheduling system submits the execution instructions of all computing nodes to the computation flow and the execution instructions of all communication nodes to the communication flow, thereby separating the two at the source of scheduling.
Based on any of the foregoing embodiments, in step S11, the reconstructing the computing operation in the model into at least one computing node includes:
Fusing the attention computation, residual computation, route computation and rearrangement operations in the model into a first computation node;
and fusing full-connection layer calculation and inverse rearrangement operation in the model into a second calculation node.
Specifically, when reconstructing computing nodes from the computing operations in the model, in order to maximize the granularity of the computing tasks, embodiments of the present invention fuse the attention computation, residual computation, route computation, and rearrangement operations in the Transformer layer of the model into a first computing node.
Here, Attention computation is the core of the Transformer, used for computing the correlation weights between different parts of the input sequence. Residual computation is a connection that adds a module's inputs to its outputs, used for stabilizing deep network training. Routing (Router) computation is an operation specific to the MoE model, used for deciding which expert each incoming token is assigned to based on its features. The rearrangement (MoE Permute) operation rearranges the order of the tokens in memory according to the routing result, so that all tokens addressed to the same expert are physically contiguous, which is a precondition for efficient execution of the subsequent All-to-All communication.
Fig. 3 is a schematic diagram of node reconstruction provided by the present invention. As shown in Fig. 3, the attention computation, residual computation, route computation, and rearrangement operations in the Transformer layer are fused into a single, larger computing node, namely the first computing node (denoted as Attention). The node reconstruction mechanism provided by the embodiment of the invention integrates the route computation and the attention computation in the MoE model for the first time, and can significantly reduce GPU launch overhead and the memory traffic of intermediate data, thereby reducing the overhead of stream synchronization.
Further, as shown in Fig. 3, the fully-connected-layer computation and the inverse rearrangement operation in the Transformer layer of the model are fused into a second computing node. Here, the fully-connected-layer computation may be an MLP (Multi-Layer Perceptron) layer computation, which is the main body of the expert network itself and is generally composed of two large-scale general matrix multiplications (General Matrix Multiplication, GEMM) and a nonlinear activation function; it is the most computation-intensive part of model training. The inverse rearrangement operation (i.e., MoE Unpermute) is used, after the expert computation is completed and the data is returned through the communication node, to restore the expert-processed output to its original order in the sequence so that subsequent layers can use it correctly. These two operations are fused to form a second computing node (denoted MLP), which represents the main expert computation load in the MoE model.
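Illustratively, the two fused nodes may be represented as ordered groups of operations executed within a single launch unit (the operation names follow the text; the string-composition "execution" below is merely an illustrative stand-in for launching one fused kernel instead of several separate ones):

```python
# Fused compute nodes as described: each groups several fine-grained ops.
FIRST_COMPUTE_NODE = ("attention", "residual", "router", "permute")
SECOND_COMPUTE_NODE = ("mlp", "unpermute")

def run_fused_node(ops, x):
    """Apply the fused ops in order within a single launch unit."""
    for op in ops:
        x = f"{op}({x})"   # stand-in for invoking the op on the data
    return x

out = run_fused_node(FIRST_COMPUTE_NODE, "tokens")
print(out)
```

Grouping the four operations into one unit is what reduces the per-operation launch overhead: the scheduler submits one node rather than four separate kernels.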
Based on any of the above embodiments, in step S11, the reconstructing the communication operation in the model into at least one communication node includes:
reconstructing a distribution operation in the model as a first communication node and reconstructing a merging operation in the model as a second communication node, both the first communication node and the second communication node performing global exchange communication.
In particular, the communication operations in the model, while of a single type, are reconfigured as independent communication nodes to help clarify their roles and dependencies in the pipeline. In particular, the distributing operation in the model may be reconfigured as a first communication node and the merging operation in the model may be reconfigured as a second communication node.
Here, the Dispatch operation refers to sending the continuous token data block on each device to the target device with the corresponding expert through global exchange communication (All-to-All) after the computing task corresponding to the first computing node is executed. This operation is reconfigured as the first communication node (denoted Dispatch).
The merging (Combine) operation refers to that after each expert completes the calculation (i.e. after the calculation task corresponding to the second calculation node is executed), the operation returns the expert output results distributed on different devices to the original devices of the respective token through another global exchange communication (All-to-All). This operation is reconfigured as a second communication node (denoted Combine).
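Taken together, the reconstruction yields a per-layer sequence of two fused compute nodes alternating with two All-to-All communication nodes; a small sketch of that sequence (the node names follow the text, the containers are illustrative):

```python
# Reconstructed per-layer node order implied by the text.
FORWARD_ORDER = ["Attention", "Dispatch", "MLP", "Combine"]
NODE_KIND = {"Attention": "compute", "MLP": "compute",
             "Dispatch": "comm", "Combine": "comm"}

def backward_order(forward):
    """Backpropagation visits the nodes in reverse of the forward order."""
    return list(reversed(forward))

print(backward_order(FORWARD_ORDER))
```

This coarse, regular alternation of compute and communication nodes is what makes the later scheduling logic simple: every other node in the sequence belongs to the other stream.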
The method provided by the embodiment of the invention not only separates calculation and communication, but also aggregates the original fine-grained operation into coarse-grained and optimized calculation nodes and communication nodes by a node reconstruction mode. The reconstruction lays a solid foundation for subsequent efficient staggered pipeline scheduling. On one hand, operator fusion reduces the overhead of calculation per se, and on the other hand, clear and regular node division enables a scheduler to more easily identify massive tasks which can be parallel, so that more full and stable calculation and communication overlapping between different micro batches is realized.
Based on any of the above embodiments, the method further comprises:
acquiring an asynchronous communication handle based on the communication flow;
Before executing a target computing task on the computing stream, inserting a waiting operation, wherein the waiting operation is used for waiting for the execution completion of the target communication task associated with the asynchronous communication handle, and the input of the target computing task depends on the output of the target communication task.
It should be noted that, the embodiments of the present invention further introduce a key technology for ensuring correctness of data dependency relationships under the framework of parallel execution after separation of the computation flow and the communication flow, that is, a dynamic synchronization control mechanism. In the foregoing embodiment, the present invention constructs independent computation flows and communication flows, and reconstructs operations into regular computation nodes and communication nodes, thereby implementing parallel scheduling of tasks.
However, these parallel tasks are not completely independent of each other, and there is a strict data dependency relationship. For example, the input of one computing task may be just the output of another communication task. If not controlled, the computing task may begin to execute when the data is not ready, resulting in a computing error. In this regard, embodiments of the present invention provide a dynamic synchronization control mechanism that aims to accurately manage such dependencies with minimal performance costs.
Specifically, the core of the dynamic synchronous control mechanism is to realize accurate synchronization of the cross-stream by using an asynchronous communication handle and inserting a waiting operation at a key point. Asynchronous mode initiation may be employed when a communication task (e.g., an All-to-All operation performed by the first communication node or the second communication node) is submitted to the communication flow. In this mode, the function that calls the communication operation does not block and wait for it to complete, but returns an asynchronous communication handle immediately.
Here, an asynchronous communication handle is a lightweight object or identifier that represents an unfinished communication task being performed in the background. It does not itself contain the communication result, but the task state can be queried, or its final completion waited on, through it. Specifically, to obtain an asynchronous communication handle, an asynchronous operation may be started by passing a specific parameter (e.g., async_op=True) when calling a communication library function (e.g., combine.backward, i.e., the merging operation in the back-propagation stage); the returned handle is the asynchronous communication handle and may be denoted as combine_bwd_handle.
It will be appreciated that the purpose of obtaining an asynchronous communication handle is to decouple the initiation and completion of tasks. The initiator (e.g., master dispatch thread) may not have to wait in place after submitting the communication task to the communication stream, but may immediately continue to execute to dispatch other unrelated tasks (e.g., submit another micro-batch of computing tasks to the computing stream) to maximize parallelism.
After the asynchronous communication handle is obtained, a waiting operation may be inserted before the computation flow executes the target computing task. The waiting operation is a synchronization instruction (e.g., combine_bwd_handle.wait()); when the computation flow encounters this instruction, it pauses execution of subsequent instructions until the event it waits for has occurred, the event here being the completion of the communication task represented by its associated asynchronous communication handle.
It should be appreciated that the target communication task and the target computing task form a "producer-consumer" pair with a direct data dependency. The target communication task executes on the communication flow, and its output is the input of the target computing task. The target computing task executes on the computation flow, and its input depends on the output of the target communication task, meaning that the target computing task can only obtain all the input data it requires after the target communication task is completed; otherwise it cannot execute correctly. For example, as shown in Fig. 3, the inputs of the expert network's computing task (i.e., the fully-connected-layer computation) are the tokens distributed to the current device, so it must wait for the distribution communication task to complete.
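Illustratively, the handle-and-wait pattern may be sketched in pure Python with a threading.Event standing in for the library's work handle (in a real system the analogous object would be the handle returned by an asynchronous collective such as one launched with async_op=True; every class and name below is an illustrative stand-in, not the library API):

```python
import threading
import time

class AsyncCommHandle:
    """Stand-in for an asynchronous communication handle: it carries no
    data itself, only lets callers wait for the background transfer."""
    def __init__(self):
        self._done = threading.Event()

    def complete(self):
        self._done.set()

    def wait(self, timeout=None):
        return self._done.wait(timeout)

def launch_comm_async(payload, results):
    """Start a 'communication' in the background and return immediately
    with a handle, mimicking async_op=True semantics."""
    handle = AsyncCommHandle()
    def transfer():
        time.sleep(0.01)             # stand-in for All-to-All latency
        results["tokens"] = payload  # producer writes its output
        handle.complete()
    threading.Thread(target=transfer, daemon=True).start()
    return handle

results = {}
dispatch_fwd_handle = launch_comm_async("routed tokens", results)
# ...the scheduler is free to submit unrelated work here...
dispatch_fwd_handle.wait()           # inserted waiting operation
print(results["tokens"])             # consumer's input is now guaranteed ready
```

The launch returns at once, so between the launch and the wait the scheduler can submit other micro-batches' tasks; the wait is placed only immediately before the dependent compute task.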
Based on any one of the above embodiments, when the target communication task is a distributed communication task in the forward stage, the target computing task is a computing task corresponding to the second computing node in the forward stage;
When the target communication task is a combined communication task in a reverse phase, the target calculation task is a calculation task corresponding to the second calculation node in the reverse phase;
And when the target communication task is a distributed communication task in a reverse stage, the target calculation task is a calculation task corresponding to the first calculation node in the reverse stage.
It should be noted that, in order to achieve accurate synchronization, the embodiment of the present invention inserts a waiting operation in a specific stage of training. Model training mainly includes a forward (forward) phase and a backward (backward) phase. The forward phase refers to the process of calculating model predicted values by data flowing from an input end to an output end, and the reverse phase refers to the process of calculating and updating model parameters by loss gradients flowing from the output end to the input end.
Specifically, when the target communication task is the distribution communication task in the forward stage (denoted dispatch.forward), the target computing task is the computing task corresponding to the second computing node in the forward stage (denoted mlp.forward). In the forward computation, the first communication node (Dispatch) is responsible for distributing tokens to the experts. Only after these tokens have been transmitted over the network and correctly received by the current device can the second computing node (MLP), responsible for the expert computation, begin its forward computation. Thus, a waiting operation on dispatch_fwd_handle (the asynchronous communication handle representing the forward distribution communication task dispatch.forward) must be inserted before mlp.forward is executed on the computation flow.
When the target communication task is the merging communication task in the reverse phase (denoted combine.backward), the target computing task is the computing task corresponding to the second computing node in the reverse phase (denoted mlp.backward). In back propagation, the gradients are computed in the reverse of the forward order. Gradients from subsequent layers first need to be transferred back to the experts from the original locations of the tokens through the second communication node (Combine). Only after receiving these gradients can the expert (corresponding to the MLP layer) compute its own parameter gradients and the gradients passed to the preceding module. Thus, combine_bwd_handle (the asynchronous communication handle representing the reverse merging communication task combine.backward) must be waited on before mlp.backward is executed on the computation flow.
When the target communication task is the distribution communication task in the reverse phase (denoted dispatch.backward), the target computing task is the computing task corresponding to the first computing node in the reverse phase (denoted attention.backward). After the expert (the corresponding MLP) completes its reverse computation, the resulting gradients need to be returned to the original locations of the tokens through the reverse operation of the first communication node (Dispatch). Only after the first computing node (Attention) receives these returned gradients can it proceed with its reverse computation. Therefore, before executing attention.backward on the computation flow, it is necessary to wait for dispatch_bwd_handle (the asynchronous communication handle representing the reverse distribution communication task dispatch.backward) to complete.
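The three producer-consumer pairs above may be summarized as a lookup table that a scheduler could consult before submitting a compute task (handle names mirror those used in the text; the table itself is an illustrative sketch):

```python
# Which asynchronous communication handle each compute task must wait on.
WAIT_BEFORE = {
    "mlp.forward":        "dispatch_fwd_handle",  # forward dispatch -> MLP fwd
    "mlp.backward":       "combine_bwd_handle",   # reverse combine -> MLP bwd
    "attention.backward": "dispatch_bwd_handle",  # reverse dispatch -> Attn bwd
}

def handle_to_wait_for(compute_task):
    """Return the handle a compute task must wait on, or None if it has
    no cross-stream dependency and can be submitted immediately."""
    return WAIT_BEFORE.get(compute_task)

print(handle_to_wait_for("mlp.backward"))
```

Tasks absent from the table (e.g., attention.forward) have no communication producer and need no inserted wait.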
The method provided by the embodiment of the invention establishes an efficient and accurate dynamic synchronization mechanism. By means of asynchronous communication handles and delayed waiting, it performs the necessary synchronous wait only at the last moment before the data is actually needed, rather than blocking the entire flow at the start of the communication task. This just-in-time waiting strategy solves the data-dependency problem in parallel execution while preserving the overlap window between computation and communication to the greatest extent; it is the key technical guarantee for realizing an efficient staggered pipeline, thereby maximizing training performance on the premise of ensuring computational correctness.
Based on any of the above embodiments, the method further comprises:
Pre-allocating separate communication buffers for communication tasks associated with the communication flows prior to performing the interleaved pipeline schedule;
and adjusting the scheduling of the tasks of the subsequent micro-batch or micro-batches if the target communication task associated with the asynchronous communication handle is not executed to complete within the preset duration of the waiting operation.
It should be noted that, in order to further improve stability and robustness of the distributed training process, the embodiment of the present invention further provides a fault tolerance mechanism. In complex distributed computing environments, communication delays or failures are common problems that can lead to a stalling or crashing of the entire training process. The embodiment of the invention can effectively cope with such abnormal situations by introducing a timeout detection and buffer pre-allocation mechanism.
Specifically, the embodiment of the invention pre-allocates an independent communication buffer for each communication task before performing the staggered pipeline scheduling. Before training begins or during the initialization phase of each iteration, the system may pre-allocate dedicated buffers in the memory of the computing device (e.g., GPU) for each expert communication operation (e.g., All-to-All Dispatch and Combine). The advantages are, first, that the runtime overhead and latency caused by dynamic memory allocation during training are avoided, and second, that by providing independent memory space for each communication task, execution conflicts or blocking possibly caused by memory resource contention are eliminated, guaranteeing smooth parallel execution of the computation flow and the communication flow.
In addition, the embodiment of the invention also introduces a timeout detection mechanism into the dynamic synchronization control. To ensure the correctness of data dependencies, the computation flow synchronizes with the communication flow via waiting operations before executing tasks that depend on communication results. However, an indefinite wait would cause the system to fall into deadlock when communication is abnormal. Thus, the waiting operation in the embodiments of the present invention is configured with timeout monitoring of a preset duration.
Specifically, the system sets a reasonable timeout threshold for each waiting operation. During training, when the computational flow performs a wait operation, a timer is started. If the target communication task associated with the waiting asynchronous communication handle has not been completed (e.g., due to an All-to-All communication timeout caused by network congestion or node failure) for a preset period of time, the system does not continue to wait, but triggers a preset fault tolerance process.
The core of the fault tolerant process is to adjust the scheduling of subsequent tasks. For example, the system may mark the current micro-lot experiencing the communication failure as failed and discard it from the current training pipeline while continuing to schedule subsequent other normal micro-lots to avoid blocking the entire training process by a single point of failure. In other implementations, the system may also attempt to reschedule failed communication tasks to the communication flow or report anomalies to a higher-level training management module, which decides whether expert routing allocation needs to be rescheduled or training parameters adjusted. In this way, the embodiment of the invention not only realizes high-efficiency calculation communication overlapping, but also ensures the robustness of the training process in the face of underlying hardware or network fluctuation, and remarkably improves the availability and reliability of large-scale model training.
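Illustratively, the timeout-and-skip branch of the fault-tolerance process may be sketched as follows (the function, the micro-batch labels, and the 0.05-second threshold are illustrative stand-ins; a real system would use the communication library's own timed wait on the work handle):

```python
import threading

def wait_with_timeout(handle_event, micro_batch, pending, timeout=0.05):
    """Wait for a communication to complete; on timeout, mark the
    micro-batch as failed and drop it from the pipeline rather than
    blocking the whole training process."""
    if handle_event.wait(timeout):
        return "ok"                    # communication finished in time
    pending.discard(micro_batch)       # fault tolerance: skip this micro-batch
    return "skipped"

pending = {"mb0", "mb1"}
hung = threading.Event()               # never set: simulates a hung All-to-All
status = wait_with_timeout(hung, "mb0", pending)
print(status, sorted(pending))
```

On timeout, scheduling simply continues with the remaining micro-batches; reporting the anomaly upward or retrying the communication, as described above, would be alternative branches of the same check.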
Based on any of the above embodiments, the parallel processing includes at least one of:
processing the reverse communication task of the first micro batch and the forward calculation task of the second micro batch in parallel;
and processing the reverse calculation task of the first micro batch and the forward communication task of the second micro batch in parallel.
Specifically, fig. 4 is a schematic diagram of parallel processing of tasks of a first micro-lot and tasks of a second micro-lot, as shown in fig. 4, which illustrates a case where two consecutive micro-lots (e.g., the first micro-lot and the second micro-lot) perform cross-pipeline dispatch on a computation flow and a communication flow after a training iteration enters a steady state. In this steady state, the pipeline is "filled" i.e., the first micro-batch (whose forward propagation has been completed) is performing its backward propagation task while the second micro-batch is performing its forward propagation task. The embodiment of the invention utilizes the time window to realize the overlapping of the calculation task and the communication task.
In an embodiment, as shown in fig. 4, when the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the reverse communication task of the first micro-batch and the forward computing task of the second micro-batch may be processed in parallel. This parallel mode utilizes the gap while one micro-batch is computing to perform the reverse communication of another micro-batch.
Reference may be made in particular to the first-row overlap of fig. 4. When the second micro-batch is executing its forward computing task on the computation flow, i.e., the computing task corresponding to the first computing node (attention.forward), the computing units of the hardware are occupied. At the same time, the first micro-batch is executing its reverse communication task, i.e., the reverse operation of the second communication node (combine.backward), which is scheduled for execution on the separate communication flow. Since the computation flow and the communication flow can work in parallel, the computation process of attention.forward and the communication process of combine.backward achieve a temporal overlap.
Similarly, reference may be made to the third-row overlap of fig. 4. When the second micro-batch is executing its other forward computing task on the computation flow, i.e., the computation corresponding to the second computing node (mlp.forward), the other reverse communication task of the first micro-batch, i.e., the reverse operation of the first communication node (dispatch.backward), is being executed on the communication flow. The two also achieve parallel processing.
In another embodiment, as shown in fig. 4, when the tasks of the first micro batch and the tasks of the second micro batch are processed in parallel, the reverse calculation tasks of the first micro batch and the forward communication tasks of the second micro batch may be processed in parallel. The parallel mode is complementary to the above situation, and ensures that hardware resources can be fully utilized in each stage of the pipeline.
Reference may be made in particular to the second-row overlap of fig. 4. When the second micro-batch executes its forward communication task, i.e., the forward operation of the first communication node (dispatch.forward), the task runs on the communication flow, occupying communication bandwidth. At the same time, the first micro-batch is executing its reverse computing task on the computation flow, i.e., the reverse operation of the second computing node (mlp.backward). The communication process of dispatch.forward and the computation process of mlp.backward achieve a temporal overlap.
Similarly, reference is made to the fourth-row overlap portion of fig. 4. When the second micro-batch performs its forward communication task, i.e. the forward operation (combine.forward) of the second communication node, the first micro-batch is performing its reverse calculation task on the computation stream, i.e. the reverse operation (attention.backward) of the first computing node; the two again achieve parallelism.
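The four overlap rows described above can be sketched in plain Python, with two worker threads standing in for the hardware's computation stream and communication stream. This is a simplified illustration only: thread scheduling replaces real accelerator stream semantics, and all task names, durations, and the micro-batch pairing are assumptions for demonstration.

```python
import threading
import time

timeline = []          # (stream, task) records, appended as each task starts
lock = threading.Lock()

def run_tasks(stream_name, tasks):
    # Each stream executes its queued tasks in order, like a hardware stream.
    for task, duration in tasks:
        with lock:
            timeline.append((stream_name, task))
        time.sleep(duration)   # stand-in for kernel / transfer time

# Micro-batch 2's forward compute overlaps micro-batch 1's reverse comm,
# mirroring the first- and third-row overlaps of fig. 4.
compute_stream = threading.Thread(
    target=run_tasks,
    args=("compute", [("mb2.attention.forward", 0.05), ("mb2.mlp.forward", 0.05)]),
)
comm_stream = threading.Thread(
    target=run_tasks,
    args=("comm", [("mb1.combine.backward", 0.05), ("mb1.dispatch.backward", 0.05)]),
)

start = time.time()
compute_stream.start()
comm_stream.start()
compute_stream.join()
comm_stream.join()
elapsed = time.time() - start

# With perfect overlap, wall time approaches the max of the two streams
# (~0.10 s here) rather than the 0.20 s sum of all four tasks.
print(f"elapsed ~ {elapsed:.2f}s")
```

Because the two threads run concurrently, the measured wall time is close to the longer of the two streams, which is the overlap benefit the method exploits.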
According to the method provided by the embodiment of the invention, the idle waiting time of computing devices such as GPUs is greatly reduced by overlapping the forward-phase tasks of one micro-batch with the reverse-phase tasks of another micro-batch, that is, overlapping a computation task with a communication task or a communication task with a computation task. This strategy of hiding the execution time of one task within the execution time of another is the root cause of breaking through the bottleneck of the prior art and significantly improving hardware utilization and end-to-end training efficiency.
The model distributed training optimization apparatus provided by the invention is described below; the apparatus described below and the model distributed training optimization method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of a model distributed training optimization apparatus provided by the present invention, and as shown in fig. 5, the apparatus includes:
A constructing unit 510 for respectively constructing a computation flow for performing computation tasks and a communication flow for performing communication tasks based on computation operations and communication operations in the model;
An execution unit 520, configured to perform, in a training iteration of the model, interleaved pipeline scheduling for tasks of at least two micro-batches;
The interleaved pipeline scheduling comprises scheduling the calculation task of a first micro-batch to the calculation flow for execution while scheduling the communication task of a second micro-batch to the communication flow for execution, or scheduling the communication task of the first micro-batch to the communication flow for execution while scheduling the calculation task of the second micro-batch to the calculation flow for execution, so that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the first micro-batch and the second micro-batch being any two different micro-batches of the at least two micro-batches.
The device provided by the embodiment of the invention constructs an independent calculation flow and communication flow respectively, based on the calculation operations and communication operations in the model, thereby fundamentally creating the conditions for parallel processing of tasks. On this basis, by performing interleaved pipeline scheduling on tasks of different micro-batches, calculation tasks and communication tasks that would otherwise wait serially are overlapped, and communication time is effectively hidden within calculation time, breaking the serial execution bottleneck of calculation and communication. Because calculation tasks and communication tasks can be executed in parallel, the calculation unit of the computing device and the communication link responsible for data transmission can be busy at the same time, greatly reducing device idle time, improving hardware utilization, and effectively improving model training efficiency. In addition, compared with algorithm-level optimization schemes that rely on an auxiliary loss function, this method is an optimization at the execution-scheduling layer: its performance improvement takes effect immediately once a training iteration starts, rather than only after waiting for multiple iterations as with algorithm-level optimization, which solves the problem of insufficient timeliness. More importantly, because efficiency is improved by optimizing the task scheduling flow, no new computation (such as computing an auxiliary loss) is introduced, and extra computational overhead is avoided.
Based on any of the above embodiments, the constructing unit 510 includes:
a node reconstruction subunit configured to reconstruct a computing operation in the model into at least one computing node, and reconstruct a communication operation in the model into at least one communication node;
And the execution flow construction subunit is used for respectively constructing the calculation flow and the communication flow based on the at least one calculation node and the at least one communication node, wherein the calculation flow is used for executing the calculation task corresponding to the at least one calculation node, and the communication flow is used for executing the communication task corresponding to the at least one communication node.
Based on any of the above embodiments, the node reconstruction subunit is specifically configured to:
Fusing the attention computation, residual computation, route computation and rearrangement operations in the model into a first computation node;
and fusing full-connection layer calculation and inverse rearrangement operation in the model into a second calculation node.
Based on any of the above embodiments, the node reconstruction subunit is specifically configured to:
reconstructing a distribution operation in the model as a first communication node and reconstructing a merging operation in the model as a second communication node, both the first communication node and the second communication node performing global exchange communication.
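The "global exchange communication" performed by the dispatch and combine nodes is an all-to-all pattern: each rank sends one distinct slice to every peer and receives one slice from each. A minimal pure-Python model of the pattern is given below; the rank count and payloads are illustrative, and a real system would call the runtime's collective (an all-to-all primitive) rather than this in-process stand-in.

```python
def all_to_all(send):
    """All-to-all exchange: send[i][j] is the slice rank i sends to rank j.

    Returns recv with recv[j][i] == send[i][j], i.e. every rank ends up
    holding one slice from each peer.
    """
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]

# dispatch: each of 3 ranks scatters its tokens toward the ranks
# that own the relevant experts.
send = [["a0", "a1", "a2"],
        ["b0", "b1", "b2"],
        ["c0", "c1", "c2"]]
received = all_to_all(send)      # rank j now holds [a_j, b_j, c_j]

# combine: the inverse all-to-all routes results back to the source ranks,
# so a dispatch followed by a combine is a round trip.
roundtrip = all_to_all(received)
print(received[0], roundtrip == send)
```

The round trip returning the original layout reflects why dispatch and combine are modeled as a matched pair of communication nodes.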
Based on any of the above embodiments, the apparatus further includes a synchronization unit, where the synchronization unit is configured to:
acquiring an asynchronous communication handle based on the communication flow;
Before executing a target computing task on the computing stream, inserting a waiting operation, wherein the waiting operation is used for waiting for the execution completion of the target communication task associated with the asynchronous communication handle, and the input of the target computing task depends on the output of the target communication task.
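The handle-and-wait mechanism above can be sketched with a Python future standing in for the asynchronous communication handle. This is an illustration under stated assumptions: a real implementation would use the accelerator runtime's stream/event wait primitives, and the task bodies here are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import time

executor = ThreadPoolExecutor(max_workers=1)  # stands in for the communication stream

def dispatch_comm(tokens):
    # Asynchronous exchange of tokens (simulated with a sleep).
    time.sleep(0.05)
    return [t * 2 for t in tokens]   # illustrative "exchanged" payload

def mlp_forward(exchanged):
    # Target compute task whose input depends on the communication output.
    return sum(exchanged)

# Launch the communication task; the returned future is the async handle.
handle = executor.submit(dispatch_comm, [1, 2, 3])

# ... independent compute tasks could proceed here on the compute stream ...

# Inserted wait operation: block only at the point of true data dependence.
exchanged = handle.result()    # waits for the target communication task
out = mlp_forward(exchanged)   # safe: its input is now available
print(out)
executor.shutdown()
```

The key property is that the wait sits immediately before the dependent compute task, so everything scheduled between launch and wait still overlaps with the communication.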
Based on any one of the above embodiments, when the target communication task is a distributed communication task in the forward stage, the target computing task is a computing task corresponding to the second computing node in the forward stage;
When the target communication task is a combined communication task in a reverse phase, the target calculation task is a calculation task corresponding to the second calculation node in the reverse phase;
And when the target communication task is a distributed communication task in a reverse stage, the target calculation task is a calculation task corresponding to the first calculation node in the reverse stage.
Based on any of the above embodiments, the apparatus further comprises:
a buffer allocation unit for pre-allocating an independent communication buffer for a communication task associated with the communication flow before performing the interleaved pipeline scheduling;
and the scheduling adjustment unit is used for adjusting the scheduling of the tasks of the subsequent one or more micro-batches if the target communication task associated with the asynchronous communication handle is not executed to be completed within the preset duration of the waiting operation.
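A minimal sketch of the pre-allocated communication buffer and the timeout fallback, using pure-Python stand-ins; the buffer size, the preset duration, and the "adjust scheduling" action are illustrative assumptions, not the patented implementation.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

# Pre-allocate an independent communication buffer per in-flight micro-batch,
# so concurrent communication tasks never contend for the same memory.
comm_buffers = {mb: bytearray(1024) for mb in ("mb1", "mb2")}

executor = ThreadPoolExecutor(max_workers=1)

def comm_task(buf, delay):
    time.sleep(delay)    # simulated slow network transfer
    buf[0] = 1           # mark the buffer as filled
    return buf

handle = executor.submit(comm_task, comm_buffers["mb1"], delay=0.2)

fallback_triggered = False
try:
    # Wait operation with a preset duration.
    handle.result(timeout=0.05)
except FutureTimeout:
    # Communication is late: adjust the scheduling of subsequent micro-batches
    # (only recorded here; a real scheduler would reorder the pipeline).
    fallback_triggered = True

handle.result()          # the late transfer still completes eventually
executor.shutdown()
print("fallback:", fallback_triggered)
```

Because the buffer was allocated before scheduling began, the late completion writes into memory that no other task is using, and the scheduler only had to react to the timeout.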
Based on any of the above embodiments, the parallel processing includes at least one of:
processing the reverse communication task of the first micro batch and the forward calculation task of the second micro batch in parallel;
and processing the reverse calculation task of the first micro batch and the forward communication task of the second micro batch in parallel.
Fig. 6 illustrates a physical schematic diagram of an electronic device, which may include a processor 610, a communication interface 620, a memory 630, and a communication bus 640, where the processor 610, the communication interface 620, and the memory 630 communicate with each other via the communication bus 640, as shown in fig. 6. The processor 610 may invoke logic instructions in the memory 630 to perform a model distributed training optimization method comprising: constructing, based on computation operations and communication operations in the model, a computation stream for executing computation tasks and a communication stream for executing communication tasks; and, in a training iteration of the model, performing interleaved pipeline scheduling for tasks of at least two micro-batches, wherein the interleaved pipeline scheduling comprises scheduling the computation task of a first micro-batch to the computation stream for execution while scheduling the communication task of a second micro-batch to the communication stream for execution, or scheduling the communication task of the first micro-batch to the communication stream for execution while scheduling the computation task of the second micro-batch to the computation stream for execution, such that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the first micro-batch and the second micro-batch being any two different micro-batches of the at least two micro-batches.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied, in essence or in the part contributing to the related art, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing the model distributed training optimization method provided by the methods described above, the method comprising: constructing, based on computation operations and communication operations in the model, a computation flow for executing computation tasks and a communication flow for executing communication tasks; and, in a training iteration of the model, performing interleaved pipeline scheduling for tasks of at least two micro-batches, wherein the interleaved pipeline scheduling comprises scheduling the computation task of a first micro-batch to the computation flow for execution while scheduling the communication task of a second micro-batch to the communication flow for execution, or scheduling the communication task of the first micro-batch to the communication flow for execution while scheduling the computation task of the second micro-batch to the computation flow for execution, such that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the first micro-batch and the second micro-batch being any two different micro-batches of the at least two micro-batches.
In yet another aspect, the invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the model distributed training optimization method provided by the above methods, the method comprising: constructing, based on computation operations and communication operations in the model, a computation flow for executing computation tasks and a communication flow for executing communication tasks; and, in a training iteration of the model, performing interleaved pipeline scheduling for tasks of at least two micro-batches, wherein the interleaved pipeline scheduling comprises scheduling the computation task of a first micro-batch to the computation flow for execution while scheduling the communication task of a second micro-batch to the communication flow for execution, or scheduling the communication task of the first micro-batch to the communication flow for execution while scheduling the computation task of the second micro-batch to the computation flow for execution, such that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the first micro-batch and the second micro-batch being any two different micro-batches of the at least two micro-batches.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.

Claims (10)

1. A model distributed training optimization method, comprising:
Based on the calculation operation and the communication operation in the model, respectively constructing a calculation flow and a communication flow, wherein the calculation flow is used for executing calculation tasks, and the communication flow is used for executing communication tasks;
In a training iteration of the model, performing interleaved pipeline scheduling for tasks of at least two micro-batches;
The interleaved pipeline scheduling comprises scheduling the calculation task of a first micro-batch to the calculation flow for execution while scheduling the communication task of a second micro-batch to the communication flow for execution, or scheduling the communication task of the first micro-batch to the communication flow for execution while scheduling the calculation task of the second micro-batch to the calculation flow for execution, so that the tasks of the first micro-batch and the tasks of the second micro-batch are processed in parallel, the first micro-batch and the second micro-batch being any two different micro-batches of the at least two micro-batches.
2. The model distributed training optimization method of claim 1, wherein the constructing a computation flow and a communication flow based on the computation operation and the communication operation in the model, respectively, comprises:
reconstructing computing operations in the model as at least one computing node and reconstructing communication operations in the model as at least one communication node;
And respectively constructing the computing flow and the communication flow based on the at least one computing node and the at least one communication node, wherein the computing flow is used for executing the computing task corresponding to the at least one computing node, and the communication flow is used for executing the communication task corresponding to the at least one communication node.
3. The model distributed training optimization method of claim 2, wherein the reconstructing computing operations in the model into at least one computing node comprises:
Fusing the attention computation, residual computation, route computation and rearrangement operations in the model into a first computation node;
and fusing full-connection layer calculation and inverse rearrangement operation in the model into a second calculation node.
4. The model distributed training optimization method of claim 2, wherein the reconstructing the communication operations in the model into at least one communication node comprises:
reconstructing a distribution operation in the model as a first communication node and reconstructing a merging operation in the model as a second communication node, both the first communication node and the second communication node performing global exchange communication.
5. The model distributed training optimization method of claim 3, further comprising:
acquiring an asynchronous communication handle based on the communication flow;
Before executing a target computing task on the computing stream, inserting a waiting operation, wherein the waiting operation is used for waiting for the execution completion of the target communication task associated with the asynchronous communication handle, and the input of the target computing task depends on the output of the target communication task.
6. The model distributed training optimization method according to claim 5, wherein when the target communication task is a distributed communication task in a forward stage, the target calculation task is a calculation task corresponding to the second calculation node in the forward stage;
When the target communication task is a combined communication task in a reverse phase, the target calculation task is a calculation task corresponding to the second calculation node in the reverse phase;
And when the target communication task is a distributed communication task in a reverse stage, the target calculation task is a calculation task corresponding to the first calculation node in the reverse stage.
7. The model distributed training optimization method of claim 5, further comprising:
Pre-allocating separate communication buffers for communication tasks associated with the communication flows prior to performing the interleaved pipeline schedule;
and adjusting the scheduling of the tasks of the subsequent micro-batch or micro-batches if the target communication task associated with the asynchronous communication handle is not executed to complete within the preset duration of the waiting operation.
8. The model distributed training optimization method of any one of claims 1 to 7, wherein the parallel processing includes at least one of:
processing the reverse communication task of the first micro batch and the forward calculation task of the second micro batch in parallel;
and processing the reverse calculation task of the first micro batch and the forward communication task of the second micro batch in parallel.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, characterized in that the processor implements the model distributed training optimization method of any of claims 1 to 8 when executing the computer program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the model distributed training optimization method of any of claims 1 to 8.
CN202511308976.XA 2025-09-15 2025-09-15 Model distributed training optimization method, electronic equipment and storage medium Active CN120803677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511308976.XA CN120803677B (en) 2025-09-15 2025-09-15 Model distributed training optimization method, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN120803677A true CN120803677A (en) 2025-10-17
CN120803677B CN120803677B (en) 2025-11-14

Family

ID=97317975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511308976.XA Active CN120803677B (en) 2025-09-15 2025-09-15 Model distributed training optimization method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN120803677B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN121188333A (en) * 2025-11-24 2025-12-23 上海壁仞科技股份有限公司 Multiprocessor systems, data processing methods, electronic devices, storage media

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327399A (en) * 2021-11-25 2022-04-12 腾讯科技(深圳)有限公司 Distributed training method, apparatus, computer device, storage medium and product
CN115994567A (en) * 2022-12-28 2023-04-21 兰州交通大学 A deep neural network model parallel computing task asynchronous scheduling method
CN117057411A (en) * 2023-10-11 2023-11-14 北京燧原智能科技有限公司 Large language model training method, device, equipment and storage medium
CN120218190A (en) * 2025-05-23 2025-06-27 上海壁仞科技股份有限公司 Distributed training method and system, electronic device and storage medium
CN120562507A (en) * 2025-05-16 2025-08-29 字节跳动技术有限公司 Training method, data processing method, electronic device and computer-readable storage medium



Also Published As

Publication number Publication date
CN120803677B (en) 2025-11-14

Similar Documents

Publication Publication Date Title
Sun et al. Optimizing network performance for distributed DNN training on GPU clusters: ImageNet/AlexNet training in 1.5 minutes
CN120803677B (en) Model distributed training optimization method, electronic equipment and storage medium
EP2189903B1 (en) Barrier synchronization apparatus, barrier synchronization system, and barrier synchronization method
US20210357760A1 (en) Distributed Deep Learning System and Data Transfer Method
US20250383914A1 (en) Distributed pipeline-parallel llm fine-tuning method for heterogeneous gpu
Wang et al. Bml: A high-performance, low-cost gradient synchronization algorithm for dml training
CN115454655A (en) Dynamic layer migration method in asynchronous pipeline parallel training process
CN119759554B (en) Distributed training methods, devices, and computer program products across data centers
CN120218190B (en) Distributed training method and system, electronic device and storage medium
CN117633527A (en) A large model hybrid parallel training method and system for heterogeneous environments
CN119990307A (en) Reasoning method and system, electronic device, and storage medium
CN116644803B (en) Distributed cooperative training control method, system, device, equipment and storage medium
CN119623585B (en) Pipeline check point operation method and operation system thereof
CN120163267B (en) An efficient hybrid parallel training method for multimodal models
CN121010011A (en) Model training methods, devices, equipment and storage media
CN115550173A (en) Dynamic calculation communication scheduling method based on WFBP and link characteristics
CN120560840A (en) Computing resource scheduling method, device, non-volatile storage medium and electronic device
CN116069495A (en) Elastic deep learning job scheduling method, system and computer equipment
CN114816742B (en) Request processing method, request processing device, electronic equipment and storage medium
CN117376284A (en) Distributed machine learning gradient synchronization method and system based on in-network computing
Viswanathan et al. Network-accelerated distributed machine learning using mlfabric
CN120234286B (en) Data processing method
CN120196451B (en) A batch-based parallel split federated learning method
CN119011653A (en) Mixed expert model communication optimization method, apparatus, device, medium and program
CN113132141B (en) Storage and service network separated distributed training efficient communication network and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant