
CN114565007B - A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems - Google Patents

Info

Publication number
CN114565007B
CN114565007B (application CN202210023028.1A)
Authority
CN
China
Prior art keywords
communication
model
communication interval
iterations
iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210023028.1A
Other languages
Chinese (zh)
Other versions
CN114565007A (en)
Inventor
徐辰
毕倪飞
陈梓浩
周傲英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202210023028.1A
Publication of CN114565007A
Application granted
Publication of CN114565007B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system. A runtime data collector gathers the communication time t_cm and computation time t_cp needed for the adaptive communication interval, and an adaptive communication interval selector automatically adjusts the communication interval τ from the collected time data. The model is then trained iteratively: a correction technique is applied in every iteration, and the global model is updated only once every τ iterations using a skip-communication strategy. Because the system selects τ adaptively from the collected data and performs only one network communication per τ iterations to update the global model, network communication overhead is reduced and the total training time is ultimately shortened.

Description

Method for training model with low communication overhead and high statistical efficiency in distributed deep learning system
Technical Field
The invention belongs to the field of distributed deep learning and relates to a method for training a model with low communication overhead and high statistical efficiency under synchronous data parallelism in a distributed deep learning system.
Background
In synchronous data-parallel model training, a distributed deep learning system starts several training processes in a cluster, each holding a complete replica of the model. The system partitions the whole dataset into shards and assigns the shards to the training processes. Training consists of a series of iterations; in each iteration, every training process selects a batch of data from its assigned shard and computes a parameter update on the model. To synchronize the parameter updates across all shards, the updates of all training processes are typically aggregated through a Parameter Server or an AllReduce communication architecture. Each training process then applies the aggregated update to its model parameters. The exact update procedure depends on the training method, and different training methods use different update procedures. At present there are four main methods for synchronous data-parallel training in distributed deep learning systems: SSGD, SkipSSGD, LocalSGD, and SMA.
SSGD is the training method most widely adopted in distributed deep learning systems. In each iteration, SSGD aggregates over the network the gradients that each training process has computed on its batch of data, and updates the model with the aggregated gradient. Figure 1 shows three successive iterations of training with SSGD. In iteration 10, two training processes compute gradients g_10^(1) and g_10^(2) from model w_10, then aggregate g_10^(1) and g_10^(2) through network communication and compute the average gradient. SSGD updates the model from w_10 to w_11 with the average gradient and, once the update is complete, moves on to the next iteration. Since every iteration involves network communication, SSGD typically suffers a communication bottleneck in a distributed environment.
Building on SSGD, SkipSSGD updates the model with a skip-communication strategy: instead of aggregating gradients over the network in every iteration, it communicates only once every τ iterations, greatly reducing communication overhead. To retain the gradient information gathered while communication is skipped, SkipSSGD keeps a gradient accumulator in each training process. Every τ iterations, SkipSSGD aggregates the gradients in the accumulators with a single network communication and updates the model with the aggregated gradient. Figure 2 shows three successive iterations of SkipSSGD with τ = 3. Starting from model w_10, the two training processes compute gradients in iterations 10, 11, and 12 and accumulate them into their respective gradient accumulators. After three consecutive iterations of accumulation, SkipSSGD aggregates the accumulated gradients with one network communication and updates the model from w_10 to w_11. Although SkipSSGD reduces communication overhead, accumulating gradients effectively enlarges the batch size, which in turn degrades statistical efficiency.
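The accumulate-then-skip behavior described above can be sketched in a few lines. The following is a simplified in-process simulation (function and variable names are ours, not from the patent; real systems would replace the averaging step with an AllReduce):

```python
import numpy as np

def skip_ssgd(w, grad_fns, tau, lr, steps):
    """Simulate SkipSSGD: each 'process' accumulates gradients locally,
    and the accumulators are aggregated only once every tau iterations."""
    accumulators = [np.zeros_like(w) for _ in grad_fns]
    for k in range(1, steps + 1):
        for acc, grad in zip(accumulators, grad_fns):
            acc += grad(w)                      # local accumulation, no communication
        if k % tau == 0:                        # one communication every tau iterations
            avg = sum(accumulators) / len(accumulators)
            w = w - lr * avg                    # update the shared model
            for acc in accumulators:
                acc[:] = 0                      # reset accumulators after aggregation
    return w
```

On a toy quadratic loss whose gradient equals w, with τ = 3 each process accumulates three gradients before a single averaged update is applied, illustrating why the effective batch grows with τ.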
LocalSGD also uses a skip-communication strategy to reduce communication overhead. Unlike SkipSSGD, which skips communication by accumulating gradients, LocalSGD maintains several local models and one global model: it updates each local model directly with its gradient in every iteration and aggregates the parameters of the local models through network communication every τ iterations. Figure 3 shows three successive iterations of LocalSGD with τ = 3. In iteration 10, one training process computes gradient g_10^(1) and uses it to update its local model to w_11^(1); in iterations 11 and 12 it computes g_11^(1) and g_12^(1) and updates the local model to w_12^(1) and w_13^(1). Likewise, the other training process updates its local model to w_11^(2), w_12^(2), and w_13^(2) in iterations 10, 11, and 12. After three consecutive local updates, LocalSGD aggregates the local models with one network communication and updates the global model from u_10 to u_13. Although LocalSGD reduces communication overhead while keeping a small batch size, the local models diverge more and more as τ grows, which degrades statistical efficiency.
To reduce the divergence between local models, SMA applies a correction technique that corrects every local-model update with the global model. Figure 4 shows three successive iterations of SMA. In iteration 10, one training process updates its local model from w_10^(1) to w_11^(1) using gradient g_10^(1) and correction c_10^(1); the other training process updates its local model from w_10^(2) to w_11^(2) using gradient g_10^(2) and correction c_10^(2). Meanwhile, to keep the global model up to date, SMA updates it in every iteration: in iteration 10, SMA aggregates the corrections computed by the two training processes (i.e., c_10^(1) and c_10^(2)) through network communication and updates the global model from u_10 to u_11. Although SMA keeps a small batch size and reduces the divergence between local models, every iteration involves network communication to update the global model, which often creates a communication bottleneck in a distributed environment.
In summary, none of the existing methods for synchronous data-parallel training ensures low communication overhead and high statistical efficiency at the same time.
Disclosure of Invention
To remedy these deficiencies of the prior art, the present invention provides a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system. Low communication overhead is ensured by a skip-communication strategy based on an adaptive communication interval; high statistical efficiency is ensured by a correction technique that reduces the divergence between the local models while keeping a small batch size. Together these shorten the total training time.
Existing training methods that skip communication (e.g., SkipSSGD and LocalSGD) require the user to specify the communication interval τ manually. However, the communication overhead incurred by the same τ differs across operating environments (GPU computing power, network bandwidth, and model size): a τ suited to one environment may be unsuitable for another. To select a suitable τ automatically in different environments, the invention therefore also proposes a method for adaptively adjusting the communication interval during training. It is realized by a runtime data collector and an adaptive communication interval selector: the collector gathers communication and computation time data in the first iteration, and the selector adjusts τ from these data so that the communication time and computation time within each training epoch are comparable, preventing communication overhead from becoming a bottleneck for the whole training.
The specific technical scheme for realizing the aim of the invention is as follows:
a method of training a model in a distributed deep learning system with low communication overhead and high statistical efficiency, the method comprising the steps of:
Step A, a data collector collects, at runtime, the communication time t_cm and computation time t_cp needed for the adaptive communication interval, and an adaptive communication interval selector automatically adjusts the communication interval τ from the collected time data;
Step B, the model is trained iteratively, and the local model is updated with a correction technique in every iteration;
Step C, the global model is updated with the skip-communication strategy every τ iterations.
Step A, in which the data collector collects the communication time t_cm and computation time t_cp at runtime and the adaptive communication interval selector adjusts the communication interval τ from the collected data, further comprises the following steps:
Step A1, the system initializes the communication interval to τ = 1 and sets an adaptive-communication-interval flag bit indicating whether the adaptive communication interval selector is enabled: flag = true means the selector is enabled, i.e., the system adaptively selects a suitable communication interval τ; flag = false means the selector is disabled, i.e., the user must specify τ;
Step A2, at runtime, the data collector measures the time t_cm spent on communication and the time t_cp spent on computation in the first iteration;
Step A3, the adaptive communication interval selector adjusts the communication interval from the t_cm and t_cp collected in the first iteration: τ = ⌈t_cm / t_cp⌉, i.e., the communication interval is set to the quotient of t_cm and t_cp rounded up.
After the communication interval τ has been adjusted in the first iteration, it is used for all subsequent iterations; neither the collection of time data nor the adjustment of τ is repeated.
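Once t_cm and t_cp have been measured, steps A1 to A3 reduce to a one-line rule. A minimal sketch (function name and the guard against τ < 1 are ours):

```python
import math

def select_interval(t_cm, t_cp):
    """Adaptive communication interval (Step A3): the quotient of the
    measured communication time over computation time, rounded up."""
    return max(1, math.ceil(t_cm / t_cp))
```

For example, hypothetical first-iteration timings t_cm = 0.3 s and t_cp = 0.02 s would yield τ = 15; when communication is cheaper than computation, the rule falls back to τ = 1, i.e., plain per-iteration communication.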
The step of iteratively training the model, updating the local model with the correction technique in every iteration, comprises the following steps:
Step B1, each training process computes a gradient from its local model: g_k^(i) = (1/b^(i)) Σ_{x∈B_k^(i)} ∇_w L(x, w_k^(i)), where i is the training-process number, k the iteration number, g_k^(i) the gradient computed by training process i in iteration k, b^(i) the batch size on training process i, B_k^(i) the batch it samples in iteration k, w_k^(i) the local model parameters of training process i in iteration k, and L(x, w) the loss of sample x under model parameters w;
Step B2, each training process computes a correction: c_k^(i) = w_k^(i) - u_k, where u_k denotes the global model parameters in iteration k; the correction is the difference between the local model parameters and the global model parameters;
Step B3, each training process updates its local model with the computed gradient and correction: w_{k+1}^(i) = w_k^(i) - μ·m_k^(i) - γ·g_k^(i) - α·c_k^(i), where w_k^(i) and w_{k+1}^(i) are the local model parameters of training process i in iterations k and k+1, m_k^(i), g_k^(i), and c_k^(i) are respectively the local model momentum, gradient, and correction of training process i in iteration k, and μ, γ, and α are the momentum coefficient, learning rate, and correction coefficient, respectively.
The local model momentum is initially zero and is also updated from the gradient in every iteration: m_{k+1}^(i) = μ·m_k^(i) + γ·g_k^(i), where m_k^(i) and m_{k+1}^(i) are the local model momentum of training process i in iterations k and k+1, g_k^(i) is its gradient in iteration k, and μ and γ are the momentum coefficient and learning rate, respectively.
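Read together, steps B1 to B3 and the momentum rule give the following per-process update. This is a sketch of our reading of the formulas, with numpy standing in for the training framework (names are ours):

```python
import numpy as np

def local_update(w, m, g, u, mu, gamma, alpha):
    """One local-model step: the correction c = w - u (Step B2) pulls the
    local model w toward the global model u; the momentum m is updated
    from the gradient g. Returns (w_next, m_next)."""
    c = w - u                                      # Step B2: correction
    w_next = w - mu * m - gamma * g - alpha * c    # Step B3: local update
    m_next = mu * m + gamma * g                    # momentum update
    return w_next, m_next
```

Note that with α = 0 this degenerates to plain momentum SGD on the local model, which is exactly the LocalSGD behavior the correction term is meant to improve on.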
The step of updating the global model with the skip-communication strategy every τ iterations comprises the following steps:
Step C1, take the current iteration number modulo the adaptive communication interval τ; if the remainder is 0, the current iteration must update the global model, so proceed to step C2;
Step C2, aggregate the corrections computed by all training processes with one network communication to obtain the aggregated correction C_k = (1/n) Σ_{i=1}^{n} c_k^(i), where n is the total number of training processes, k the iteration number, and c_k^(i) the correction computed by training process i in iteration k; then update the global model with the aggregated correction: u_{k+1} = u_k + μ·M_k + α·C_k, where u_k and u_{k+1} are the global model parameters in iterations k and k+1, M_k is the global model momentum in iteration k, μ the momentum coefficient, and α the correction coefficient. The global model momentum is initially zero and is updated once from the aggregated correction every τ iterations: M_{k+1} = μ·M_k + α·C_k, where M_k and M_{k+1} are the global model momentum in iterations k and k+1 and C_k is the aggregated correction.
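Step C2 can be sketched the same way; the one-shot aggregation over n processes is simulated here with a plain Python sum (our names, not the patent's, and a stand-in for the real network communication):

```python
import numpy as np

def global_update(u, m_glob, corrections, mu, alpha):
    """Global-model step, executed every tau iterations: average the
    per-process corrections, then apply a momentum update to the global
    model u. Returns (u_next, m_glob_next)."""
    c_agg = sum(corrections) / len(corrections)  # one aggregation (simulated)
    m_glob_next = mu * m_glob + alpha * c_agg    # global momentum update
    u_next = u + m_glob_next                     # u + mu*M + alpha*C
    return u_next, m_glob_next
```

Because the corrections are differences w - u, adding their (momentum-smoothed) average moves the global model toward the centroid of the local models.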
The communication overhead is the ratio of communication time to the total time of one epoch.
The communication interval refers to the number of iterations spaced between two adjacent iterations in which global model updates occur.
The local model is the model that each training process updates independently with its locally computed gradients and corrections.
The global model is the model updated jointly by all training processes with the corrections aggregated over the network; equivalently, it is the model used to correct the local-model updates in each iteration.
The invention also provides a system for realizing the method, which comprises a runtime data collector, an adaptive communication interval selector and a model updating module.
The runtime data collector collects the communication time t_cm and computation time t_cp of the system in the first iteration;
The adaptive communication interval selector automatically adjusts the communication interval from the collected t_cm and t_cp as τ = ⌈t_cm / t_cp⌉ and uses this τ for all subsequent iterations;
The model updating module iteratively updates the local and global models according to the communication interval τ obtained by the adaptive communication interval selector: the local model is updated with the correction technique in every iteration, and the global model is updated with the skip-communication strategy every τ iterations.
The beneficial effects of the invention include:
The invention updates the local models with a correction technique in every iteration, which reduces the divergence between local models and ensures high statistical efficiency. Meanwhile, communication and computation time data are collected at runtime and the communication interval τ is selected adaptively from them; the system then updates the global model through network communication only once every τ iterations, reducing network communication overhead and ultimately shortening the training time. The experimental results show that SkipSMA shortens the total training time by 86.76%, 64.83%, 16.37%, and 79.26% relative to SSGD, SkipSSGD, LocalSGD, and SMA, respectively. While maintaining statistical efficiency similar to these four methods, SkipSMA reduces communication overhead by 93.3%, 80%, and 33.3% relative to SSGD, SkipSSGD, and LocalSGD, respectively, and reduces the computation and communication time in each epoch by 79.9% relative to SMA.
Drawings
FIG. 1 is a schematic diagram of an embodiment of the prior art that involves training a model using the SSGD method;
FIG. 2 is a schematic diagram of an embodiment of the prior art that involves training a model using the SkipSSGD method;
FIG. 3 is a schematic diagram of an embodiment of the prior art that involves training a model using LocalSGD methods;
FIG. 4 is a schematic diagram of an embodiment of the prior art involving training a model using an SMA method;
FIG. 5 is a schematic diagram of a SkipSMA method training model of an embodiment of the present invention;
FIG. 6 is a graph comparing total training time of SkipSMA and prior training methods in the results of the practice of the present invention;
FIG. 7 is a graph comparing communication overhead of SkipSMA and the prior training method in the result of the present invention;
FIG. 8 is a graph comparing statistical efficiencies of SkipSMA and prior training methods in the results of the practice of the present invention;
Fig. 9 is a flow chart of the present invention.
Detailed Description
The invention will be described in further detail with reference to the following specific examples and drawings. Except where specifically noted below, the procedures, conditions, and experimental methods for carrying out the invention are common general knowledge in the art, and the invention is not particularly limited thereto.
To reduce the communication overhead of training models in a distributed environment, SkipSMA updates the global model with a skip-communication strategy, i.e., it performs network communication only once every several iterations. If, like SMA, the global model were updated through network communication in every iteration, a high communication overhead would be incurred in a distributed environment, leading to long training times. SkipSMA therefore updates the global model every τ iterations, reducing the frequency of network communication. The communication interval τ is the critical factor governing communication overhead, yet the same τ incurs different overhead in different environments (GPU computing power, network bandwidth, model size, etc.), so τ must be selected with an adaptive strategy. SkipSMA adapts τ by collecting the required communication and computation time data at runtime: at initialization it sets τ = 1; at runtime it measures the communication time t_cm and computation time t_cp in the first iteration; from these, the adaptive communication interval selector sets τ = ⌈t_cm / t_cp⌉, which makes the communication time and computation time in each epoch comparable. In other words, the adaptive communication interval keeps communication overhead from being a bottleneck for the whole training process.
Using the communication interval τ computed by the adaptive communication interval selector in the first iteration, SkipSMA updates the local models with the correction technique in every subsequent iteration and updates the global model with the skip-communication strategy every τ iterations. Specifically, in iterations whose number modulo τ is nonzero, SkipSMA only updates the local models with the correction technique and performs no network communication; in iterations whose number modulo τ is zero, SkipSMA additionally updates the global model through network communication.
In each iteration, every training process computes a gradient and a correction from its local model and the global model, and updates its local model with both. As shown in Figure 5, in iteration 10 one training process computes gradient g_10^(1) from its local model w_10^(1) and correction c_10^(1) from the global model u_10, then subtracts the gradient and correction to update the local model from w_10^(1) to w_11^(1); the correction here acts as a penalty when the local model deviates from the global model. Likewise, the other training process computes gradient g_10^(2) from its local model w_10^(2) and correction c_10^(2) from u_10, then subtracts them to update its local model from w_10^(2) to w_11^(2). The local models are updated in the same way in subsequent iterations.
Every τ iterations, the global model is updated through network communication between the training processes. Figure 5 illustrates, for a communication interval τ = 3, how SkipSMA updates the global model with the skip-communication strategy. In iterations 10 and 11, SkipSMA does not update the global model, i.e., it skips communication. In iteration 12, since 12 mod 3 = 0, the global model must be updated through network communication: SkipSMA aggregates the corrections computed by all training processes (i.e., c_12^(1) and c_12^(2)) and updates the global model from u_12 to u_13 according to the aggregated correction.
The above is a specific implementation of SkipSMA, a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system; the method can be implemented by the pseudocode of Method 1.
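As a minimal, runnable sketch of the training loop under our reading of the steps above (all names are ours, not the patent's listing; the cross-process communication is simulated in-process):

```python
import math
import numpy as np

def skip_sma(w0, grad_fns, lr, mu, alpha, steps, t_cm, t_cp):
    """SkipSMA sketch: adaptive interval tau = ceil(t_cm / t_cp), local
    updates with correction in every iteration, global update every tau.
    grad_fns holds one gradient function per simulated training process."""
    tau = max(1, math.ceil(t_cm / t_cp))          # Step A: adaptive interval
    n = len(grad_fns)
    u = w0.copy()                                 # global model
    ws = [w0.copy() for _ in range(n)]            # local models
    ms = [np.zeros_like(w0) for _ in range(n)]    # local momenta
    m_glob = np.zeros_like(w0)                    # global momentum
    for k in range(1, steps + 1):
        cs = []
        for i in range(n):                        # Step B: local updates
            g = grad_fns[i](ws[i])
            c = ws[i] - u                         # correction toward global model
            cs.append(c)
            ws[i] = ws[i] - mu * ms[i] - lr * g - alpha * c
            ms[i] = mu * ms[i] + lr * g
        if k % tau == 0:                          # Step C: skip-communication
            c_agg = sum(cs) / n                   # one simulated aggregation
            m_glob = mu * m_glob + alpha * c_agg
            u = u + m_glob
    return u, tau
```

In a real deployment, the `sum(cs) / n` line would be a single AllReduce (or Parameter Server round) per τ iterations, which is where the communication saving comes from.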
examples
A cluster for training models is built from 4 machines running CentOS 7, connected by a 1000 Mbps network, each with two Tesla V100 GPUs. On this cluster, a ResNet model is trained with each of five methods (SSGD, SkipSSGD, LocalSGD, SMA, and SkipSMA), and the methods are compared in terms of total training time, communication overhead, and statistical efficiency when training the ResNet model to 68% accuracy. Notably, SSGD, SkipSSGD, LocalSGD, and SkipSMA all run on the four machines of the cluster, while SMA mainly targets the single-machine multi-GPU scenario and therefore runs on only one machine. Furthermore, the communication interval τ of SkipSMA is determined at runtime by the adaptive communication interval selector, whereas for SkipSSGD and LocalSGD the interval τ is set to 1, 2, 3, 4, 5, 10, 20, 30, or 40 before running, and the value yielding the shortest training time is chosen for comparison with SkipSMA.
As shown in Fig. 6, SkipSMA shortens the total training time by 86.76% compared with SSGD, mainly because SkipSMA has much lower communication overhead: SSGD communicates over the network in every iteration, whereas SkipSMA communicates only once every τ iterations. As shown in Fig. 7, this lets SkipSMA reduce communication overhead by 93.3% relative to SSGD.
As shown in Fig. 6, SkipSMA shortens the total training time by 64.83% compared with SkipSSGD. Constrained by its effective batch size, SkipSSGD cannot train the model to 68% accuracy within 50 hours when τ > 3, and achieves its shortest total training time at τ = 3. Even at τ = 3, however, its communication overhead remains considerable: as shown in Fig. 7, SkipSMA with its adaptive communication interval reduces communication overhead by 80% relative to SkipSSGD at τ = 3.
As shown in Fig. 6, SkipSMA shortens the total training time by 16.37% compared with LocalSGD, whose total training time is shortest at τ = 10. Although LocalSGD could further cut communication overhead with τ > 10, the local models would diverge more, reducing statistical efficiency. SkipSMA, thanks to its correction technique, effectively mitigates the divergence between local models and thus retains high statistical efficiency: as shown in Fig. 8, SkipSMA with the adaptive communication interval matches the statistical efficiency of LocalSGD at τ = 10. At the same time, SkipSMA adaptively sets τ to 15, giving it low communication overhead: as shown in Fig. 7, it reduces communication overhead by 33.3% relative to LocalSGD at τ = 10.
As shown in fig. 6, SkipSMA shortens the total training time by 79.26% compared to SMA. Although the communication time per iteration of SMA is comparable to its computation time (i.e., communication is not the bottleneck), SMA exploits the computing power of only a single node. SkipSMA not only keeps communication overhead low through its skip-communication strategy with an adaptive communication interval, but also exploits the computing power of multiple nodes. As shown in fig. 7, SkipSMA reduces the per-epoch computation and communication time by 79.9% compared to SMA.
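The communication-overhead reductions discussed above follow directly from the communication frequencies: synchronizing once every τ iterations instead of every iteration removes a (1 − 1/τ) fraction of the communication rounds. A minimal sketch of this arithmetic (the iteration count K is illustrative, not taken from the experiments):

```python
def comm_rounds(iterations, tau):
    """Number of global synchronizations when communicating once every tau iterations."""
    return iterations // tau

K = 3000                         # illustrative iteration count
ssgd = comm_rounds(K, 1)         # SSGD synchronizes in every iteration
local_sgd = comm_rounds(K, 10)   # LocalSGD with communication interval tau = 10
skip_sma = comm_rounds(K, 15)    # SkipSMA with adaptively chosen tau = 15

print(1 - skip_sma / ssgd)       # 1 - 1/15, i.e. the 93.3% reduction vs. SSGD
print(1 - skip_sma / local_sgd)  # 1 - 10/15, i.e. the 33.3% reduction vs. LocalSGD
```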
The protection scope of the present invention is not limited to the above embodiments. Variations and advantages that would occur to those skilled in the art without departing from the spirit and scope of the inventive concept are included in the invention, and the protection scope is defined by the appended claims.

Claims (9)

1. A method of training a model in a distributed deep learning system with low communication overhead and high statistical efficiency, the method comprising the steps of:
Step A, a runtime data collector collects the communication time t_cm and the computation time t_cp required for the adaptive communication interval, and an adaptive communication interval selector automatically adjusts the communication interval τ according to the collected communication-time and computation-time data;
the step A further comprises the following steps:
Step A1, the system automatically initializes the communication interval τ to 1 and sets the adaptive communication interval flag;
Step A2, at runtime, the data collector collects the communication time t_cm and the computation time t_cp in the first iteration;
Step A3, the adaptive communication interval selector adjusts the communication interval τ according to the t_cm and t_cp collected in the first iteration;
Step B, performing iterative training on the model, and updating the local model with a correction technique in each iteration;
the step B further comprises the following steps:
Step B1, each training process calculates a gradient according to its local model;
Step B2, each training process calculates a correction, wherein the correction refers to the difference between the local model and the global model;
Step B3, each training process updates its local model using the calculated gradient and correction;
Step C, updating the global model with the skip-communication strategy every τ iterations;
the step C further comprises the following steps:
Step C1, taking the current iteration number modulo the adaptive communication interval τ; if the remainder is 0, the global model needs to be updated in the current iteration, and the method proceeds to step C2;
Step C2, aggregating the corrections calculated by the training processes through one network communication, and updating the global model with the aggregated correction;
in step C2, the aggregated correction is expressed by the following formula:
\bar{c}_k = \frac{1}{n} \sum_{i=1}^{n} c_k^{(i)}
where n represents the total number of training processes, k represents the iteration number, and c_k^{(i)} represents the correction calculated by the training process numbered i in the k-th iteration;
the global model is updated in the form of the formula
\tilde{w}_{k+1} = \tilde{w}_k + \mu \tilde{v}_k + \alpha \bar{c}_k
where k represents the iteration number, \tilde{w}_k and \tilde{w}_{k+1} represent the global model parameters in the k-th and (k+1)-th iterations respectively, \tilde{v}_k represents the global model momentum in the k-th iteration, \bar{c}_k represents the aggregated correction, μ represents the momentum coefficient, and α represents the correction coefficient;
the global model momentum is initially empty and is updated once every τ iterations according to the aggregated correction, the specific update form being:
\tilde{v}_{k+1} = \mu \tilde{v}_k + \alpha \bar{c}_k
where k represents the iteration number, \tilde{v}_k and \tilde{v}_{k+1} represent the global model momentum in the k-th and (k+1)-th iterations, \bar{c}_k represents the aggregated correction, and μ and α represent the momentum coefficient and the correction coefficient, respectively.
2. The method of claim 1, wherein in step A1 the adaptive communication interval flag indicates whether the adaptive communication interval selector is enabled: when the flag is set to true, the selector is enabled and the system adaptively selects an appropriate communication interval τ; when the flag is set to false, the selector is disabled and the user must specify the communication interval τ.
3. The method of claim 1, wherein after the adjustment of the communication interval τ is completed in the first iteration, that communication interval τ is used in all subsequent iterations.
4. The method of claim 1, wherein in step B1 the gradient is calculated by the following formula:
g_k^{(i)} = \frac{1}{b^{(i)}} \sum_{x \in B_k^{(i)}} \nabla L(x, w_k^{(i)})
where i represents the number of the training process, k represents the iteration number, g_k^{(i)} represents the gradient calculated by the training process numbered i in the k-th iteration, b^{(i)} represents the batch size on the training process numbered i, B_k^{(i)} represents the batch sampled by that process in the k-th iteration, w_k^{(i)} represents the local model parameters of the training process numbered i in the k-th iteration, and L(x, w) represents the loss of sample x calculated on model parameters w.
5. The method of claim 1, wherein in step B2 the correction is calculated by the following formula:
c_k^{(i)} = w_k^{(i)} - \tilde{w}_k
where i represents the number of the training process, k represents the iteration number, c_k^{(i)} represents the correction calculated by the training process numbered i in the k-th iteration, w_k^{(i)} represents the local model parameters of the training process numbered i in the k-th iteration, and \tilde{w}_k represents the global model parameters in the k-th iteration.
6. The method of claim 1, wherein in step B3 the specific update form of the local model is represented by the following formula:
w_{k+1}^{(i)} = w_k^{(i)} + \mu v_k^{(i)} - \gamma g_k^{(i)} - \alpha c_k^{(i)}
where i represents the number of the training process, k represents the iteration number, w_k^{(i)} and w_{k+1}^{(i)} represent the local model parameters of the training process numbered i in the k-th and (k+1)-th iterations, v_k^{(i)}, g_k^{(i)} and c_k^{(i)} respectively represent the local model momentum, gradient and correction of the training process numbered i in the k-th iteration, and μ, γ and α represent the momentum coefficient, learning rate and correction coefficient, respectively.
7. The method of claim 6, wherein the local model momentum is initially empty and is updated according to the gradient in each iteration, in the form:
v_{k+1}^{(i)} = \mu v_k^{(i)} - \gamma g_k^{(i)}
where i represents the number of the training process, k represents the iteration number, v_k^{(i)} and v_{k+1}^{(i)} represent the local model momentum of the training process numbered i in the k-th and (k+1)-th iterations, g_k^{(i)} represents the gradient of the training process numbered i in the k-th iteration, and μ and γ represent the momentum coefficient and the learning rate, respectively.
8. A system implementing the method of any one of claims 1 to 7, the system comprising a runtime data collector, an adaptive communication interval selector, and a model update module.
9. The system of claim 8, wherein the runtime data collector collects the communication time t_cm and the computation time t_cp of the system at runtime in the first iteration;
the adaptive communication interval selector automatically adjusts the communication interval τ according to the communication time t_cm and the computation time t_cp collected by the runtime data collector, and uses that communication interval τ in all subsequent iterations;
the model update module iteratively updates the local models and the global model according to the communication interval τ obtained by the adaptive communication interval selector, namely, the local model is updated with the correction technique in each iteration, and the global model is updated with the skip-communication strategy every τ iterations.
CN202210023028.1A 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems Active CN114565007B (en)

Publications (2)

Publication Number Publication Date
CN114565007A CN114565007A (en) 2022-05-31
CN114565007B (en) 2025-06-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant