
CN114565007B - A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems - Google Patents

Info

Publication number
CN114565007B
CN114565007B (application CN202210023028.1A)
Authority
CN
China
Prior art keywords
communication
model
communication interval
iterations
iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210023028.1A
Other languages
Chinese (zh)
Other versions
CN114565007A (en)
Inventor
徐辰
毕倪飞
陈梓浩
周傲英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202210023028.1A
Publication of CN114565007A
Application granted
Publication of CN114565007B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system. A runtime data collector gathers the communication time t_cm and computation time t_cp needed for the adaptive communication interval, and an adaptive communication interval selector automatically adjusts the communication interval τ from the collected time data. The model is then trained iteratively: a correction technique is applied in every iteration, and the global model is updated only once every τ iterations using a skip-communication strategy. Because the system selects τ adaptively from the collected data and performs only one network communication per τ iterations to update the global model, network communication overhead is reduced and the total training time is ultimately shortened.

Description

Method for training model with low communication overhead and high statistical efficiency in distributed deep learning system
Technical Field
The invention belongs to the field of distributed deep learning and relates to a method for training a model with low communication overhead and high statistical efficiency under synchronous data parallelism in a distributed deep learning system.
Background
In synchronous data-parallel model training, a distributed deep learning system starts several training processes in a cluster, each holding a complete replica of the model. The system partitions the whole dataset into shards and assigns the shards to the training processes. Training consists of a series of iterations; in each iteration, every training process selects a batch of data from its assigned shard and computes a parameter update on the model. To synchronize the parameter updates across all shards, the updates of all training processes are typically aggregated through a Parameter Server or an AllReduce communication architecture. Each training process then applies the aggregated update to its model parameters. The exact update procedure depends on the training method, and different training methods use different update procedures. At present there are four main methods for synchronous data-parallel training in distributed deep learning systems: SSGD, SkipSSGD, LocalSGD, and SMA.
SSGD is the training method most widely adopted in distributed deep learning systems. In each iteration, SSGD aggregates over the network the gradients that each training process has computed on its batch of data, and updates the model with the aggregated gradient. Figure 1 shows three successive iterations of training with SSGD. In iteration 10, two training processes compute gradients g_10^(1) and g_10^(2) from model w_10, then aggregate g_10^(1) and g_10^(2) through network communication and compute the average gradient. SSGD updates the model from w_10 to w_11 with the average gradient and, once the update is complete, moves on to the next iteration. Since every iteration involves network communication, SSGD typically suffers a communication bottleneck in a distributed environment.
Building on SSGD, SkipSSGD updates the model with a skip-communication strategy: instead of aggregating gradients over the network in every iteration, it communicates only once every τ iterations, greatly reducing communication overhead. To retain the gradient information gathered while communication is skipped, SkipSSGD keeps a gradient accumulator in each training process. Every τ iterations, SkipSSGD aggregates the gradients in the accumulators with a single network communication and updates the model with the aggregated gradient. Figure 2 shows three successive iterations of SkipSSGD with τ = 3. Starting from model w_10, the two training processes compute gradients in iterations 10, 11, and 12 and accumulate them into their respective gradient accumulators. After three consecutive iterations of accumulation, SkipSSGD aggregates the accumulated gradients with one network communication and updates the model from w_10 to w_11. Although SkipSSGD reduces communication overhead, accumulating gradients effectively enlarges the batch size, which in turn degrades statistical efficiency.
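The accumulate-then-skip behavior described above can be sketched in a few lines. The following is a simplified in-process simulation (function and variable names are ours, not from the patent; real systems would replace the averaging step with an AllReduce):

```python
import numpy as np

def skip_ssgd(w, grad_fns, tau, lr, steps):
    """Simulate SkipSSGD: each 'process' accumulates gradients locally,
    and the accumulators are aggregated only once every tau iterations."""
    accumulators = [np.zeros_like(w) for _ in grad_fns]
    for k in range(1, steps + 1):
        for acc, grad in zip(accumulators, grad_fns):
            acc += grad(w)                      # local accumulation, no communication
        if k % tau == 0:                        # one communication every tau iterations
            avg = sum(accumulators) / len(accumulators)
            w = w - lr * avg                    # update the shared model
            for acc in accumulators:
                acc[:] = 0                      # reset accumulators after aggregation
    return w
```

On a toy quadratic loss whose gradient equals w, with τ = 3 each process accumulates three gradients before a single averaged update is applied, illustrating why the effective batch grows with τ.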
LocalSGD also uses a skip-communication strategy to reduce communication overhead. Unlike SkipSSGD, which skips communication by accumulating gradients, LocalSGD maintains several local models and one global model: it updates each local model directly with its gradient in every iteration and aggregates the parameters of the local models through network communication every τ iterations. Figure 3 shows three successive iterations of LocalSGD with τ = 3. In iteration 10, one training process computes gradient g_10^(1) and uses it to update its local model to w_11^(1); in iterations 11 and 12 it computes g_11^(1) and g_12^(1) and updates the local model to w_12^(1) and w_13^(1). Likewise, the other training process updates its local model to w_11^(2), w_12^(2), and w_13^(2) in iterations 10, 11, and 12. After three consecutive local updates, LocalSGD aggregates the local models with one network communication and updates the global model from u_10 to u_13. Although LocalSGD reduces communication overhead while keeping a small batch size, the local models diverge more and more as τ grows, which degrades statistical efficiency.
To reduce the divergence between local models, SMA applies a correction technique that corrects every local-model update with the global model. Figure 4 shows three successive iterations of SMA. In iteration 10, one training process updates its local model from w_10^(1) to w_11^(1) using gradient g_10^(1) and correction c_10^(1); the other training process updates its local model from w_10^(2) to w_11^(2) using gradient g_10^(2) and correction c_10^(2). Meanwhile, to keep the global model up to date, SMA updates it in every iteration: in iteration 10, SMA aggregates the corrections computed by the two training processes (i.e., c_10^(1) and c_10^(2)) through network communication and updates the global model from u_10 to u_11. Although SMA keeps a small batch size and reduces the divergence between local models, every iteration involves network communication to update the global model, which often creates a communication bottleneck in a distributed environment.
In summary, none of the existing methods for synchronous data-parallel training ensures low communication overhead and high statistical efficiency at the same time.
Disclosure of Invention
To remedy these deficiencies of the prior art, the present invention provides a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system. Low communication overhead is ensured by a skip-communication strategy based on an adaptive communication interval; high statistical efficiency is ensured by a correction technique that reduces the divergence between the local models while keeping a small batch size. Together these shorten the total training time.
Existing training methods that skip communication (e.g., SkipSSGD and LocalSGD) require the user to specify the communication interval τ manually. However, the communication overhead incurred by the same τ differs across operating environments (GPU computing power, network bandwidth, and model size): a τ suited to one environment may be unsuitable for another. To select a suitable τ automatically in different environments, the invention therefore also proposes a method for adaptively adjusting the communication interval during training. It is realized by a runtime data collector and an adaptive communication interval selector: the collector gathers communication and computation time data in the first iteration, and the selector adjusts τ from these data so that the communication time and computation time within each training epoch are comparable, preventing communication overhead from becoming a bottleneck for the whole training.
The specific technical scheme for realizing the aim of the invention is as follows:
a method of training a model in a distributed deep learning system with low communication overhead and high statistical efficiency, the method comprising the steps of:
Step A, a data collector collects, at runtime, the communication time t_cm and computation time t_cp needed for the adaptive communication interval, and an adaptive communication interval selector automatically adjusts the communication interval τ from the collected time data;
Step B, the model is trained iteratively, and the local model is updated with a correction technique in every iteration;
Step C, the global model is updated with the skip-communication strategy every τ iterations.
Step A, in which the data collector collects the communication time t_cm and computation time t_cp at runtime and the adaptive communication interval selector adjusts the communication interval τ from the collected data, further comprises the following steps:
Step A1, the system initializes the communication interval to τ = 1 and sets an adaptive-communication-interval flag bit indicating whether the adaptive communication interval selector is enabled: flag = true means the selector is enabled, i.e., the system adaptively selects a suitable communication interval τ; flag = false means the selector is disabled, i.e., the user must specify τ;
Step A2, at runtime, the data collector measures the time t_cm spent on communication and the time t_cp spent on computation in the first iteration;
Step A3, the adaptive communication interval selector adjusts the communication interval from the t_cm and t_cp collected in the first iteration: τ = ⌈t_cm / t_cp⌉, i.e., the communication interval is set to the quotient of t_cm and t_cp rounded up.
After the communication interval τ has been adjusted in the first iteration, it is used for all subsequent iterations; neither the collection of time data nor the adjustment of τ is repeated.
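Once t_cm and t_cp have been measured, steps A1 to A3 reduce to a one-line rule. A minimal sketch (function name and the guard against τ < 1 are ours):

```python
import math

def select_interval(t_cm, t_cp):
    """Adaptive communication interval (Step A3): the quotient of the
    measured communication time over computation time, rounded up."""
    return max(1, math.ceil(t_cm / t_cp))
```

For example, hypothetical first-iteration timings t_cm = 0.3 s and t_cp = 0.02 s would yield τ = 15; when communication is cheaper than computation, the rule falls back to τ = 1, i.e., plain per-iteration communication.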
The step of iteratively training the model, updating the local model with the correction technique in every iteration, comprises the following steps:
Step B1, each training process computes a gradient from its local model: g_k^(i) = (1/b^(i)) Σ_{x∈B_k^(i)} ∇_w L(x, w_k^(i)), where i is the training-process number, k the iteration number, g_k^(i) the gradient computed by training process i in iteration k, b^(i) the batch size on training process i, B_k^(i) the batch it samples in iteration k, w_k^(i) the local model parameters of training process i in iteration k, and L(x, w) the loss of sample x under model parameters w;
Step B2, each training process computes a correction: c_k^(i) = w_k^(i) - u_k, where u_k denotes the global model parameters in iteration k; the correction is the difference between the local model parameters and the global model parameters;
Step B3, each training process updates its local model with the computed gradient and correction: w_{k+1}^(i) = w_k^(i) - μ·m_k^(i) - γ·g_k^(i) - α·c_k^(i), where w_k^(i) and w_{k+1}^(i) are the local model parameters of training process i in iterations k and k+1, m_k^(i), g_k^(i), and c_k^(i) are respectively the local model momentum, gradient, and correction of training process i in iteration k, and μ, γ, and α are the momentum coefficient, learning rate, and correction coefficient, respectively.
The local model momentum is initially zero and is also updated from the gradient in every iteration: m_{k+1}^(i) = μ·m_k^(i) + γ·g_k^(i), where m_k^(i) and m_{k+1}^(i) are the local model momentum of training process i in iterations k and k+1, g_k^(i) is its gradient in iteration k, and μ and γ are the momentum coefficient and learning rate, respectively.
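Read together, steps B1 to B3 and the momentum rule give the following per-process update. This is a sketch of our reading of the formulas, with numpy standing in for the training framework (names are ours):

```python
import numpy as np

def local_update(w, m, g, u, mu, gamma, alpha):
    """One local-model step: the correction c = w - u (Step B2) pulls the
    local model w toward the global model u; the momentum m is updated
    from the gradient g. Returns (w_next, m_next)."""
    c = w - u                                      # Step B2: correction
    w_next = w - mu * m - gamma * g - alpha * c    # Step B3: local update
    m_next = mu * m + gamma * g                    # momentum update
    return w_next, m_next
```

Note that with α = 0 this degenerates to plain momentum SGD on the local model, which is exactly the LocalSGD behavior the correction term is meant to improve on.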
The step of updating the global model with the skip-communication strategy every τ iterations comprises the following steps:
Step C1, take the current iteration number modulo the adaptive communication interval τ; if the remainder is 0, the current iteration must update the global model, so proceed to step C2;
Step C2, aggregate the corrections computed by all training processes with one network communication to obtain the aggregated correction C_k = (1/n) Σ_{i=1}^{n} c_k^(i), where n is the total number of training processes, k the iteration number, and c_k^(i) the correction computed by training process i in iteration k; then update the global model with the aggregated correction: u_{k+1} = u_k + μ·M_k + α·C_k, where u_k and u_{k+1} are the global model parameters in iterations k and k+1, M_k is the global model momentum in iteration k, μ the momentum coefficient, and α the correction coefficient. The global model momentum is initially zero and is updated once from the aggregated correction every τ iterations: M_{k+1} = μ·M_k + α·C_k, where M_k and M_{k+1} are the global model momentum in iterations k and k+1 and C_k is the aggregated correction.
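Step C2 can be sketched the same way; the one-shot aggregation over n processes is simulated here with a plain Python sum (our names, not the patent's, and a stand-in for the real network communication):

```python
import numpy as np

def global_update(u, m_glob, corrections, mu, alpha):
    """Global-model step, executed every tau iterations: average the
    per-process corrections, then apply a momentum update to the global
    model u. Returns (u_next, m_glob_next)."""
    c_agg = sum(corrections) / len(corrections)  # one aggregation (simulated)
    m_glob_next = mu * m_glob + alpha * c_agg    # global momentum update
    u_next = u + m_glob_next                     # u + mu*M + alpha*C
    return u_next, m_glob_next
```

Because the corrections are differences w - u, adding their (momentum-smoothed) average moves the global model toward the centroid of the local models.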
The communication overhead is the ratio of communication time to the total time of one epoch.
The communication interval refers to the number of iterations spaced between two adjacent iterations in which global model updates occur.
The local model is the model that each training process updates independently with its locally computed gradients and corrections.
The global model is the model updated jointly by all training processes with the corrections aggregated over the network; equivalently, it is the model used to correct the local-model updates in each iteration.
The invention also provides a system for realizing the method, which comprises a runtime data collector, an adaptive communication interval selector and a model updating module.
The runtime data collector collects the communication time t_cm and computation time t_cp of the system in the first iteration;
The adaptive communication interval selector automatically adjusts the communication interval from the collected t_cm and t_cp as τ = ⌈t_cm / t_cp⌉ and uses this τ for all subsequent iterations;
The model updating module iteratively updates the local and global models according to the communication interval τ obtained by the adaptive communication interval selector: the local model is updated with the correction technique in every iteration, and the global model is updated with the skip-communication strategy every τ iterations.
The beneficial effects of the invention include:
The invention updates the local models with a correction technique in every iteration, which reduces the divergence between local models and ensures high statistical efficiency. Meanwhile, communication and computation time data are collected at runtime and the communication interval τ is selected adaptively from them; the system then updates the global model through network communication only once every τ iterations, reducing network communication overhead and ultimately shortening the training time. The experimental results show that SkipSMA shortens the total training time by 86.76%, 64.83%, 16.37%, and 79.26% relative to SSGD, SkipSSGD, LocalSGD, and SMA, respectively. While maintaining statistical efficiency similar to these four methods, SkipSMA reduces communication overhead by 93.3%, 80%, and 33.3% relative to SSGD, SkipSSGD, and LocalSGD, respectively, and reduces the computation and communication time in each epoch by 79.9% relative to SMA.
Drawings
FIG. 1 is a schematic diagram of an embodiment of the prior art that involves training a model using the SSGD method;
FIG. 2 is a schematic diagram of an embodiment of the prior art that involves training a model using the SkipSSGD method;
FIG. 3 is a schematic diagram of an embodiment of the prior art that involves training a model using LocalSGD methods;
FIG. 4 is a schematic diagram of an embodiment of the prior art involving training a model using an SMA method;
FIG. 5 is a schematic diagram of a SkipSMA method training model of an embodiment of the present invention;
FIG. 6 is a graph comparing total training time of SkipSMA and prior training methods in the results of the practice of the present invention;
FIG. 7 is a graph comparing communication overhead of SkipSMA and the prior training method in the result of the present invention;
FIG. 8 is a graph comparing statistical efficiencies of SkipSMA and prior training methods in the results of the practice of the present invention;
Fig. 9 is a flow chart of the present invention.
Detailed Description
The invention will be described in further detail with reference to the following specific examples and drawings. Except where specifically noted below, the procedures, conditions, and experimental methods for carrying out the invention are common general knowledge in the art, and the invention is not particularly limited thereto.
To reduce the communication overhead of training models in a distributed environment, SkipSMA updates the global model with a skip-communication strategy, i.e., it performs network communication only once every several iterations. If, like SMA, the global model were updated through network communication in every iteration, a high communication overhead would be incurred in a distributed environment, leading to long training times. SkipSMA therefore updates the global model every τ iterations, reducing the frequency of network communication. The communication interval τ is the critical factor governing communication overhead, yet the same τ incurs different overhead in different environments (GPU computing power, network bandwidth, model size, etc.), so τ must be selected with an adaptive strategy. SkipSMA adapts τ by collecting the required communication and computation time data at runtime: at initialization it sets τ = 1; at runtime it measures the communication time t_cm and computation time t_cp in the first iteration; from these, the adaptive communication interval selector sets τ = ⌈t_cm / t_cp⌉, which makes the communication time and computation time in each epoch comparable. In other words, the adaptive communication interval keeps communication overhead from being a bottleneck for the whole training process.
Using the communication interval τ computed by the adaptive communication interval selector in the first iteration, SkipSMA updates the local models with the correction technique in every subsequent iteration and updates the global model with the skip-communication strategy every τ iterations. Specifically, in iterations whose number modulo τ is nonzero, SkipSMA only updates the local models with the correction technique and performs no network communication; in iterations whose number modulo τ is zero, SkipSMA additionally updates the global model through network communication.
In each iteration, every training process computes a gradient and a correction from its local model and the global model, and updates its local model with both. As shown in Figure 5, in iteration 10 one training process computes gradient g_10^(1) from its local model w_10^(1) and correction c_10^(1) from the global model u_10, then subtracts the gradient and correction to update the local model from w_10^(1) to w_11^(1); the correction here acts as a penalty when the local model deviates from the global model. Likewise, the other training process computes gradient g_10^(2) from its local model w_10^(2) and correction c_10^(2) from u_10, then subtracts them to update its local model from w_10^(2) to w_11^(2). The local models are updated in the same way in subsequent iterations.
Every τ iterations, the global model is updated through network communication between the training processes. Figure 5 illustrates, for a communication interval τ = 3, how SkipSMA updates the global model with the skip-communication strategy. In iterations 10 and 11, SkipSMA does not update the global model, i.e., it skips communication. In iteration 12, since 12 mod 3 = 0, the global model must be updated through network communication: SkipSMA aggregates the corrections computed by all training processes (i.e., c_12^(1) and c_12^(2)) and updates the global model from u_12 to u_13 according to the aggregated correction.
The above is a specific implementation of SkipSMA, a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system; the method can be implemented by the pseudocode of Method 1.
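As a minimal, runnable sketch of the training loop under our reading of the steps above (all names are ours, not the patent's listing; the cross-process communication is simulated in-process):

```python
import math
import numpy as np

def skip_sma(w0, grad_fns, lr, mu, alpha, steps, t_cm, t_cp):
    """SkipSMA sketch: adaptive interval tau = ceil(t_cm / t_cp), local
    updates with correction in every iteration, global update every tau.
    grad_fns holds one gradient function per simulated training process."""
    tau = max(1, math.ceil(t_cm / t_cp))          # Step A: adaptive interval
    n = len(grad_fns)
    u = w0.copy()                                 # global model
    ws = [w0.copy() for _ in range(n)]            # local models
    ms = [np.zeros_like(w0) for _ in range(n)]    # local momenta
    m_glob = np.zeros_like(w0)                    # global momentum
    for k in range(1, steps + 1):
        cs = []
        for i in range(n):                        # Step B: local updates
            g = grad_fns[i](ws[i])
            c = ws[i] - u                         # correction toward global model
            cs.append(c)
            ws[i] = ws[i] - mu * ms[i] - lr * g - alpha * c
            ms[i] = mu * ms[i] + lr * g
        if k % tau == 0:                          # Step C: skip-communication
            c_agg = sum(cs) / n                   # one simulated aggregation
            m_glob = mu * m_glob + alpha * c_agg
            u = u + m_glob
    return u, tau
```

In a real deployment, the `sum(cs) / n` line would be a single AllReduce (or Parameter Server round) per τ iterations, which is where the communication saving comes from.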
examples
A cluster for training models is built from 4 machines running CentOS 7, connected by a 1000 Mbps network, each with two Tesla V100 GPUs. On this cluster, a ResNet model is trained with each of five methods (SSGD, SkipSSGD, LocalSGD, SMA, and SkipSMA), and the methods are compared in terms of total training time, communication overhead, and statistical efficiency when training the ResNet model to 68% accuracy. Notably, SSGD, SkipSSGD, LocalSGD, and SkipSMA all run on the four machines of the cluster, while SMA mainly targets the single-machine multi-GPU scenario and therefore runs on only one machine. Furthermore, the communication interval τ of SkipSMA is determined at runtime by the adaptive communication interval selector, whereas for SkipSSGD and LocalSGD the interval τ is set to 1, 2, 3, 4, 5, 10, 20, 30, or 40 before running, and the value yielding the shortest training time is chosen for comparison with SkipSMA.
As shown in Fig. 6, SkipSMA shortens the total training time by 86.76% compared with SSGD, mainly because SkipSMA has much lower communication overhead: SSGD communicates over the network in every iteration, whereas SkipSMA communicates only once every τ iterations. As shown in Fig. 7, this lets SkipSMA reduce communication overhead by 93.3% relative to SSGD.
As shown in Fig. 6, SkipSMA shortens the total training time by 64.83% compared with SkipSSGD. Constrained by its effective batch size, SkipSSGD cannot train the model to 68% accuracy within 50 hours when τ > 3, and achieves its shortest total training time at τ = 3. Even at τ = 3, however, its communication overhead remains considerable: as shown in Fig. 7, SkipSMA with its adaptive communication interval reduces communication overhead by 80% relative to SkipSSGD at τ = 3.
As shown in Fig. 6, SkipSMA shortens the total training time by 16.37% compared with LocalSGD, whose total training time is shortest at τ = 10. Although LocalSGD could further cut communication overhead with τ > 10, the local models would diverge more, reducing statistical efficiency. SkipSMA, thanks to its correction technique, effectively mitigates the divergence between local models and thus retains high statistical efficiency: as shown in Fig. 8, SkipSMA with the adaptive communication interval matches the statistical efficiency of LocalSGD at τ = 10. At the same time, SkipSMA adaptively sets τ to 15, giving it low communication overhead: as shown in Fig. 7, it reduces communication overhead by 33.3% relative to LocalSGD at τ = 10.
As shown in fig. 6, SkipSMA shortens the total training time by 79.26% compared to SMA. Although the communication time per iteration of SMA is comparable to its computation time (i.e., communication is not the bottleneck), SMA exploits the computing power of only a single node. SkipSMA not only keeps communication overhead low through its skip-communication strategy with an adaptive communication interval, but also exploits the computing power of multiple nodes. As shown in fig. 7, SkipSMA reduces the per-epoch computation and communication time by 79.9% compared to SMA.
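The communication-overhead reductions discussed above follow directly from the communication frequencies: synchronizing once every τ iterations instead of every iteration removes a (1 − 1/τ) fraction of the communication rounds. A minimal sketch of this arithmetic (the iteration count K is illustrative, not taken from the experiments):

```python
def comm_rounds(iterations, tau):
    """Number of global synchronizations when communicating once every tau iterations."""
    return iterations // tau

K = 3000                         # illustrative iteration count
ssgd = comm_rounds(K, 1)         # SSGD synchronizes in every iteration
local_sgd = comm_rounds(K, 10)   # LocalSGD with communication interval tau = 10
skip_sma = comm_rounds(K, 15)    # SkipSMA with adaptively chosen tau = 15

print(1 - skip_sma / ssgd)       # 1 - 1/15, i.e. the 93.3% reduction vs. SSGD
print(1 - skip_sma / local_sgd)  # 1 - 10/15, i.e. the 33.3% reduction vs. LocalSGD
```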
The protection scope of the present invention is not limited to the above embodiments. Variations and advantages that would occur to those skilled in the art without departing from the spirit and scope of the inventive concept are included in the invention, and the protection scope is defined by the appended claims.

Claims (9)

1. A method of training a model in a distributed deep learning system with low communication overhead and high statistical efficiency, the method comprising the steps of:
Step A, a runtime data collector collects the communication time t_cm and the computation time t_cp required for the adaptive communication interval, and an adaptive communication interval selector automatically adjusts the communication interval τ according to the collected communication-time and computation-time data;
the step A further comprises the following steps:
Step A1, the system automatically initializes the communication interval τ to 1 and sets the adaptive communication interval flag;
Step A2, at runtime, the data collector collects the communication time t_cm and the computation time t_cp in the first iteration;
Step A3, the adaptive communication interval selector adjusts the communication interval τ according to the t_cm and t_cp collected in the first iteration;
Step B, performing iterative training on the model, and updating the local model with a correction technique in each iteration;
the step B further comprises the following steps:
Step B1, each training process calculates a gradient according to its local model;
Step B2, each training process calculates a correction, wherein the correction refers to the difference between the local model and the global model;
Step B3, each training process updates its local model using the calculated gradient and correction;
Step C, updating the global model with the skip-communication strategy every τ iterations;
the step C further comprises the following steps:
Step C1, taking the current iteration number modulo the adaptive communication interval τ; if the remainder is 0, the global model needs to be updated in the current iteration, and the method proceeds to step C2;
Step C2, aggregating the corrections calculated by the training processes through one network communication, and updating the global model with the aggregated correction;
in step C2, the aggregated correction is expressed by the following formula:
\bar{c}_k = \frac{1}{n} \sum_{i=1}^{n} c_k^{(i)}
where n represents the total number of training processes, k represents the iteration number, and c_k^{(i)} represents the correction calculated by the training process numbered i in the k-th iteration;
the global model is updated in the form of the formula
\tilde{w}_{k+1} = \tilde{w}_k + \mu \tilde{v}_k + \alpha \bar{c}_k
where k represents the iteration number, \tilde{w}_k and \tilde{w}_{k+1} represent the global model parameters in the k-th and (k+1)-th iterations respectively, \tilde{v}_k represents the global model momentum in the k-th iteration, \bar{c}_k represents the aggregated correction, μ represents the momentum coefficient, and α represents the correction coefficient;
the global model momentum is initially empty and is updated once every τ iterations according to the aggregated correction, the specific update form being:
\tilde{v}_{k+1} = \mu \tilde{v}_k + \alpha \bar{c}_k
where k represents the iteration number, \tilde{v}_k and \tilde{v}_{k+1} represent the global model momentum in the k-th and (k+1)-th iterations, \bar{c}_k represents the aggregated correction, and μ and α represent the momentum coefficient and the correction coefficient, respectively.
2. The method of claim 1, wherein in step A1 the adaptive communication interval flag indicates whether the adaptive communication interval selector is enabled: when the flag is set to true, the selector is enabled and the system adaptively selects an appropriate communication interval τ; when the flag is set to false, the selector is disabled and the user must specify the communication interval τ.
3. The method of claim 1, wherein after the adjustment of the communication interval τ is completed in the first iteration, that communication interval τ is used in all subsequent iterations.
4. The method of claim 1, wherein in step B1 the gradient is calculated by the following formula:
g_k^{(i)} = \frac{1}{b^{(i)}} \sum_{x \in B_k^{(i)}} \nabla L(x, w_k^{(i)})
where i represents the number of the training process, k represents the iteration number, g_k^{(i)} represents the gradient calculated by the training process numbered i in the k-th iteration, b^{(i)} represents the batch size on the training process numbered i, B_k^{(i)} represents the batch sampled by that process in the k-th iteration, w_k^{(i)} represents the local model parameters of the training process numbered i in the k-th iteration, and L(x, w) represents the loss of sample x calculated on model parameters w.
5. The method of claim 1, wherein in step B2 the correction is calculated by the following formula:
c_k^{(i)} = w_k^{(i)} - \tilde{w}_k
where i represents the number of the training process, k represents the iteration number, c_k^{(i)} represents the correction calculated by the training process numbered i in the k-th iteration, w_k^{(i)} represents the local model parameters of the training process numbered i in the k-th iteration, and \tilde{w}_k represents the global model parameters in the k-th iteration.
6. The method of claim 1, wherein in step B3 the specific update form of the local model is represented by the following formula:
w_{k+1}^{(i)} = w_k^{(i)} + \mu v_k^{(i)} - \gamma g_k^{(i)} - \alpha c_k^{(i)}
where i represents the number of the training process, k represents the iteration number, w_k^{(i)} and w_{k+1}^{(i)} represent the local model parameters of the training process numbered i in the k-th and (k+1)-th iterations, v_k^{(i)}, g_k^{(i)} and c_k^{(i)} respectively represent the local model momentum, gradient and correction of the training process numbered i in the k-th iteration, and μ, γ and α represent the momentum coefficient, learning rate and correction coefficient, respectively.
7. The method of claim 6, wherein the local model momentum is initially empty and is updated according to the gradient in each iteration, in the form:
v_{k+1}^{(i)} = \mu v_k^{(i)} - \gamma g_k^{(i)}
where i represents the number of the training process, k represents the iteration number, v_k^{(i)} and v_{k+1}^{(i)} represent the local model momentum of the training process numbered i in the k-th and (k+1)-th iterations, g_k^{(i)} represents the gradient of the training process numbered i in the k-th iteration, and μ and γ represent the momentum coefficient and the learning rate, respectively.
8. A system implementing the method of any one of claims 1 to 7, the system comprising a runtime data collector, an adaptive communication interval selector, and a model update module.
9. The system of claim 8, wherein the runtime data collector collects the communication time t_cm and the computation time t_cp of the system at runtime in the first iteration;
the adaptive communication interval selector automatically adjusts the communication interval τ according to the communication time t_cm and the computation time t_cp collected by the runtime data collector, and uses that communication interval τ in all subsequent iterations;
the model update module iteratively updates the local models and the global model according to the communication interval τ obtained by the adaptive communication interval selector, namely, the local model is updated with the correction technique in each iteration, and the global model is updated with the skip-communication strategy every τ iterations.
CN202210023028.1A 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems Active CN114565007B (en)

Publications (2)

Publication Number Publication Date
CN114565007A CN114565007A (en) 2022-05-31
CN114565007B (en) 2025-06-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant