
CN114565007A - A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems - Google Patents


Info

Publication number
CN114565007A
CN114565007A
Authority
CN
China
Prior art keywords: communication, model, iterations, communication interval, iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210023028.1A
Other languages
Chinese (zh)
Other versions
CN114565007B (en)
Inventor
徐辰
毕倪飞
陈梓浩
周傲英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202210023028.1A
Publication of CN114565007A
Application granted
Publication of CN114565007B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in wireless communication networks


Abstract

The invention discloses a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system, comprising the following steps: a runtime data collector collects, at runtime, the communication time t_cm and computation time t_cp required for adaptive communication-interval selection; an adaptive communication interval selector automatically adjusts the communication interval τ from the collected communication and computation times; the model is trained iteratively, and in each iteration the local model is updated using a correction technique; every τ iterations, the global model is updated using a skip-communication strategy. The invention adaptively selects the communication interval τ from the collected data, and based on this interval the system updates the global model through network communication only once every τ iterations, thereby reducing network communication overhead and ultimately shortening model training time.

Description

A Method for Training Models with Low Communication Overhead and High Statistical Efficiency in Distributed Deep Learning Systems

Technical Field

The invention belongs to the field of distributed deep learning, and relates to a method for training a model with low communication overhead and high statistical efficiency for synchronous data parallelism in a distributed deep learning system.

Background Art

In synchronous data-parallel model training, a distributed deep learning system starts multiple training processes in a cluster, each holding a complete replica of the model. The system partitions the entire dataset into shards and assigns the shards to the training processes. Training consists of a series of iterations; in each iteration, every training process selects a batch of data from its shard and computes a parameter update on the model. To synchronize the parameter updates across all shards, the training processes typically aggregate them through a Parameter Server or AllReduce communication architecture, and each process then applies the aggregated update to its model parameters. The concrete update procedure, however, is determined by the training method, and different training methods correspond to different update procedures. At present, there are four main synchronous data-parallel training methods in distributed deep learning systems: SSGD, SkipSSGD, LocalSGD, and SMA.

SSGD is a widely adopted training method in distributed deep learning systems. In every iteration, SSGD aggregates over the network the gradients that each training process computes on a batch of data, and updates the model with the aggregated gradient. Figure 1 shows three consecutive iterations of training with SSGD. In iteration 10, the two training processes compute gradients g_10^(1) and g_10^(2) based on model w_10, aggregate the two gradients through network communication, and compute the average gradient. SSGD then uses the average gradient to update the model from w_10 to w_11 and proceeds to the next iteration. Because every iteration involves network communication, SSGD usually suffers from a communication bottleneck in distributed environments.
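The per-iteration aggregation described above can be sketched as a minimal single-process simulation. The 1-D least-squares model, the data, and the learning rate below are illustrative assumptions, not the patent's setup:

```python
def local_gradient(w, batch):
    # Gradient of the mean squared error 0.5*(w*x - y)^2 for a scalar model.
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def ssgd_step(w, worker_batches, lr=0.1):
    # One SSGD iteration: each worker computes a gradient on its batch,
    # the gradients are aggregated (averaged), and the shared model is
    # updated. In a real cluster the averaging is the per-iteration
    # network communication that creates the bottleneck.
    grads = [local_gradient(w, batch) for batch in worker_batches]
    return w - lr * sum(grads) / len(grads)

w = 0.0
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]   # two workers, data satisfying y = 2x
for _ in range(50):
    w = ssgd_step(w, shards)
```

With this toy data the shared model converges to the target slope 2.0, with one (simulated) communication per iteration.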

Building on SSGD, SkipSSGD updates the model with a skip-communication strategy. Instead of aggregating gradients over the network in every iteration, SkipSSGD communicates only once every τ iterations, greatly reducing communication overhead. To preserve the gradient information obtained while communication is skipped, SkipSSGD maintains a gradient accumulator in each training process. Every τ iterations, SkipSSGD aggregates the accumulated gradients through a single network communication and updates the model with the aggregated gradient. Figure 2 shows three consecutive iterations of training with SkipSSGD at τ = 3. Based on model w_10, the two training processes compute gradients in iterations 10, 11, and 12 and add these gradients to their respective accumulators. After accumulating gradients for three consecutive iterations, SkipSSGD aggregates the accumulated gradients through one network communication and updates the model from w_10 to w_11. Although SkipSSGD reduces communication overhead, accumulating gradients amounts to a large batch size, which in turn leads to low statistical efficiency.
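The accumulate-then-communicate pattern can be sketched as follows; the toy 1-D model, data, τ, and learning rate are illustrative assumptions:

```python
def skip_ssgd(worker_batches, tau=3, comm_rounds=40, lr=0.05):
    # SkipSSGD sketch: each worker adds its gradient to a local
    # accumulator for tau iterations; only then are the accumulators
    # aggregated over the network and the shared model updated once.
    # Communication drops by a factor of tau, but each update now
    # reflects a tau-times larger effective batch.
    def grad(w, batch):
        return sum((w * x - y) * x for x, y in batch) / len(batch)
    n = len(worker_batches)
    w = 0.0
    for _ in range(comm_rounds):
        acc = [0.0] * n                     # per-worker gradient accumulators
        for _ in range(tau):                # tau communication-free iterations
            for i, batch in enumerate(worker_batches):
                acc[i] += grad(w, batch)
        w -= lr * sum(acc) / (n * tau)      # one aggregation per tau iterations
    return w

w = skip_ssgd([[(1.0, 2.0)], [(2.0, 4.0)]])
```

Note that the model is only refreshed at communication points, which is exactly the large-effective-batch behavior criticized above.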

LocalSGD also uses a skip-communication strategy to reduce communication overhead. However, unlike SkipSSGD, which skips communication by accumulating gradients, LocalSGD maintains multiple local models and one global model: it updates each local model directly with its gradient in every iteration, and every τ iterations it aggregates the parameters of the local models through network communication to update the global model. Figure 3 shows three consecutive iterations of training with LocalSGD at τ = 3. In iteration 10, one training process computes gradient g_10^(1) and uses it to update its local model to w_11^(1). In iterations 11 and 12, the same process computes gradients g_11^(1) and g_12^(1) and updates its local model to w_12^(1) and then w_13^(1). Similarly, the other training process updates its local model to w_11^(2), w_12^(2), and w_13^(2) in iterations 10, 11, and 12. After three consecutive rounds of local updates, LocalSGD aggregates the local models through one network communication and replaces the global model with their average. Although LocalSGD reduces communication overhead and keeps the batch size small, the local models diverge from one another more and more as τ grows, which leads to low statistical efficiency.
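The local-update-plus-periodic-averaging pattern can be sketched as follows; the toy 1-D model, data, τ, and learning rate are illustrative assumptions:

```python
def local_sgd(worker_batches, tau=3, iters=30, lr=0.1):
    # LocalSGD sketch: every worker updates its own local model in every
    # iteration; every tau iterations the local models are averaged over
    # the network and the average becomes the new global model that all
    # workers continue from.
    def grad(w, batch):
        return sum((w * x - y) * x for x, y in batch) / len(batch)
    locals_ = [0.0] * len(worker_batches)   # per-worker local models
    for k in range(1, iters + 1):
        locals_ = [w - lr * grad(w, b) for w, b in zip(locals_, worker_batches)]
        if k % tau == 0:                    # one communication per tau iterations
            avg = sum(locals_) / len(locals_)
            locals_ = [avg] * len(locals_)  # global model = average of locals
    return locals_[0]

w = local_sgd([[(1.0, 2.0)], [(2.0, 4.0)]])
```

Between communication points the local models drift apart (here, because the two workers see different data), which is the divergence that grows with τ.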

To reduce the divergence among the local models, SMA employs a correction technique: each update of every local model is corrected by the global model. Figure 4 shows three consecutive iterations of training with SMA. In iteration 10, one training process updates its local model from w_10^(1) to w_11^(1) using its gradient g_10^(1) and its correction c_10^(1); the other training process updates its local model from w_10^(2) to w_11^(2) using gradient g_10^(2) and correction c_10^(2). Meanwhile, to keep the global model up to date, SMA updates the global model in every iteration. For example, in iteration 10, SMA aggregates the corrections computed by the two training processes (i.e., c_10^(1) and c_10^(2)) through network communication and uses them to update the global model from w_10 to w_11. Although SMA keeps the batch size small and reduces the divergence among the local models, every iteration involves network communication to update the global model, which often becomes a communication bottleneck in distributed environments.
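SMA's per-iteration correction and aggregation can be sketched as follows. Momentum is omitted and the coefficients, toy model, and data are assumptions for illustration:

```python
def sma(worker_batches, iters=100, lr=0.1, alpha=0.5):
    # SMA sketch: in every iteration each worker applies its gradient
    # plus a correction pulling its local model toward the global model,
    # and the workers' corrections are aggregated over the network to
    # move the global model. Communication happens in every iteration,
    # which is the bottleneck noted above.
    def grad(w, batch):
        return sum((w * x - y) * x for x, y in batch) / len(batch)
    n = len(worker_batches)
    locals_ = [0.0] * n
    w_g = 0.0                               # global model
    for _ in range(iters):
        cs = [w - w_g for w in locals_]     # corrections: local minus global
        locals_ = [w - lr * grad(w, b) - alpha * c
                   for w, b, c in zip(locals_, worker_batches, cs)]
        w_g += alpha * sum(cs) / n          # aggregate corrections every iteration
    return w_g

w = sma([[(1.0, 2.0)], [(2.0, 4.0)]])
```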

In general, the existing synchronous data-parallel training methods are all flawed: none of them guarantees both low communication overhead and high statistical efficiency.

Summary of the Invention

To overcome the shortcomings of the prior art, the purpose of the present invention is to propose a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system. The method uses a skip-communication strategy based on an adaptive communication interval to guarantee low communication overhead, and applies a correction technique, while keeping the batch size small, to reduce the divergence among the local models, thereby guaranteeing high statistical efficiency and ultimately shortening model training time.

Existing training methods that adopt a skip-communication strategy (e.g., SkipSSGD and LocalSGD) require the user to specify a communication interval τ manually. However, the same communication interval τ often incurs different communication overheads in different runtime environments (GPU compute power, network bandwidth, model size, and so on); that is, an interval τ that suits one environment may not suit another. Therefore, to select a suitable communication interval τ automatically in different runtime environments, the present invention also proposes a method for adaptively adjusting the communication interval during model training. The method is implemented by a runtime data collector and an adaptive communication interval selector: the runtime data collector gathers communication-time and computation-time data in the first iteration, and the adaptive communication interval selector automatically adjusts the communication interval τ based on the collected data so that the communication time and the computation time within each training epoch are comparable, preventing communication overhead from becoming the bottleneck of the whole training.

The specific technical scheme realizing the object of the present invention is as follows.

A method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system comprises the following steps:

Step A: A runtime data collector collects, at runtime, the communication time t_cm and computation time t_cp required for adaptive communication-interval selection; an adaptive communication interval selector automatically adjusts the communication interval τ from the collected communication and computation times.

Step B: The model is trained iteratively, and in each iteration the local model is updated using a correction technique.

Step C: Every τ iterations, the global model is updated using a skip-communication strategy.

其中,所述运行时数据收集器在运行时采集自适应通信间隔所需的通信时间tcm和计算时间tcp数据;自适应通信间隔选择器通过上述采集获得的通信时间和计算时间数据自动调整通信间隔τ的步骤包括:Wherein, the runtime data collector collects the data of the communication time t cm and the computation time t cp required by the adaptive communication interval during the runtime; the adaptive communication interval selector automatically adjusts the communication time and computation time data obtained by the above collection The steps of communication interval τ include:

步骤A1:系统自动将通信间隔τ初始化为1;设置通信间隔τ=1,并设置自适应通信间隔标志位flag;自适应通信间隔标志位用于指示是否启用自适应通信间隔选择器,自适应通信间隔标志位flag=true表示启用自适应通信间隔选择器,即系统自适应地选择一个合适的通信间隔τ;flag=false表示禁用自适应通信间隔选择器,即需要用户指定通信间隔τ;本发明始终启用自适应通信间隔选择器;Step A1: The system automatically initializes the communication interval τ to 1; sets the communication interval τ=1, and sets the adaptive communication interval flag bit; the adaptive communication interval flag bit is used to indicate whether to enable the adaptive communication interval selector, adaptive The communication interval flag bit flag=true indicates that the adaptive communication interval selector is enabled, that is, the system adaptively selects a suitable communication interval τ; flag=false indicates that the adaptive communication interval selector is disabled, that is, the user needs to specify the communication interval τ; this Invention always enables adaptive communication interval selector;

步骤A2:在系统运行时,运行时数据收集器采集第一轮迭代中通信的耗时tcm和计算的耗时tcpStep A2: when the system is running, the runtime data collector collects the time-consuming t cm of communication and the time-consuming t cp of calculation in the first iteration;

步骤A3:自适应通信间隔选择器根据第一轮迭代中采集的tcm和tcp来调整通信间隔τ,具体来说,将通信间隔τ调整为

Figure BDA0003463260190000031
即将通信间隔调整为对tcm和tcp的商向上取整。Step A3: The adaptive communication interval selector adjusts the communication interval τ according to the t cm and t cp collected in the first round of iterations. Specifically, the communication interval τ is adjusted as
Figure BDA0003463260190000031
That is, the communication interval is adjusted to round up the quotient of t cm and t cp .

在第一轮迭代中完成通信间隔τ的调整后,该通信间隔τ用于后续所有迭代,即后续迭代中不再进行通信和计算时间数据的采集以及通信间隔τ的调整。After completing the adjustment of the communication interval τ in the first round of iterations, the communication interval τ is used for all subsequent iterations, that is, the collection of communication and calculation time data and the adjustment of the communication interval τ are no longer performed in subsequent iterations.
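Steps A1 to A3 can be sketched directly; the timing values in the example call are made up:

```python
import math

def select_interval(t_cm, t_cp, flag=True, user_tau=1):
    # flag=True enables the adaptive selector (the invention's setting);
    # flag=False falls back to a user-specified interval.
    tau = 1                                  # step A1: initialize tau to 1
    if not flag:
        return user_tau
    tau = math.ceil(t_cm / t_cp)             # step A3: tau = ceil(t_cm / t_cp)
    return max(tau, 1)                       # keep at least one iteration per sync

print(select_interval(0.30, 0.08))   # communication ~4x computation -> 4
```

Because τ ≈ t_cm / t_cp, the total time spent communicating over τ iterations roughly matches the time spent computing, which is the balance condition stated above.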

The step of training the model iteratively and updating the local model with a correction technique in each iteration comprises:

Step B1: Each training process computes a gradient from its local model:

g_k^(i) = (1 / b^(i)) Σ_{x ∈ B_k^(i)} ∇L(x, w_k^(i)),

where i is the index of the training process, k is the iteration number, g_k^(i) is the gradient computed by training process i in iteration k, b^(i) is the batch size on training process i, B_k^(i) denotes the batch drawn by process i in iteration k, w_k^(i) are the local model parameters of training process i in iteration k, and L(x, w) is the loss of sample x under model parameters w.

Step B2: Each training process computes a correction:

c_k^(i) = w_k^(i) − w_k,

where i is the index of the training process, k is the iteration number, c_k^(i) is the correction computed by training process i in iteration k, w_k^(i) are the local model parameters of training process i in iteration k, and w_k are the global model parameters in iteration k; the correction is thus the difference between the local and global model parameters.

Step B3: Each training process updates its local model with the computed gradient and correction. The specific update form is:

w_{k+1}^(i) = w_k^(i) + μ m_k^(i) − γ g_k^(i) − α c_k^(i),

where i is the index of the training process, k is the iteration number, w_k^(i) and w_{k+1}^(i) are the local model parameters of training process i in iterations k and k+1, m_k^(i), g_k^(i), and c_k^(i) are the local model momentum, gradient, and correction of training process i in iteration k, and μ, γ, and α are the momentum coefficient, the learning rate, and the correction coefficient, respectively.

The local model momentum is initially zero, and it is also updated from the gradient in every iteration. The specific update form is:

m_{k+1}^(i) = μ m_k^(i) − γ g_k^(i),

where i is the index of the training process, k is the iteration number, m_k^(i) and m_{k+1}^(i) are the local model momentum of training process i in iterations k and k+1, g_k^(i) is the gradient of training process i in iteration k, and μ and γ are the momentum coefficient and the learning rate, respectively.
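The local update of steps B1 to B3 can be sketched for a scalar parameter as follows. The original formula images are not recoverable, so the exact combination of terms below (and the coefficient values) is a reconstruction from the surrounding symbol definitions, not the patent's verbatim formulas:

```python
def local_update(w_local, w_global, m_local, grad,
                 mu=0.9, gamma=0.1, alpha=0.01):
    # Correction = gap between local and global parameters (step B2);
    # momentum is refreshed from the gradient; the local model moves
    # along the momentum, the gradient, and the correction (step B3).
    c = w_local - w_global                   # step B2: correction
    m_new = mu * m_local - gamma * grad      # momentum update
    w_new = w_local + mu * m_local - gamma * grad - alpha * c   # step B3
    return w_new, m_new
```

The correction term pulls each local model back toward the global model, which is what limits the divergence among local models while communication is skipped.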

The step of updating the global model with a skip-communication strategy every τ iterations comprises:

Step C1: The current iteration number is taken modulo the adaptively set communication interval τ; a remainder of 0 indicates that the global model must be updated in the current iteration, in which case the procedure continues with step C2.

Step C2: The corrections computed by the training processes are aggregated through a single network communication, yielding the aggregated correction

c̄_k = (1 / n) Σ_{i=1}^{n} c_k^(i),

where n is the total number of training processes, k is the iteration number, and c_k^(i) is the correction computed by training process i in iteration k. The aggregated correction is then used to update the global model. The specific update form is:

w_{k+1} = w_k + μ M_k + α c̄_k,

where k is the iteration number, w_k and w_{k+1} are the global model parameters in iterations k and k+1, M_k is the global model momentum in iteration k, μ is the momentum coefficient, and α is the correction coefficient. The global model momentum is initially zero, and it, too, is updated from the aggregated correction once every τ iterations. The specific update form is:

M_{k+1} = μ M_k + α c̄_k,

where k is the iteration number, M_k and M_{k+1} are the global model momentum in iterations k and k+1, c̄_k is the aggregated correction, and μ and α are the momentum coefficient and the correction coefficient, respectively.
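The global update of step C2 can be sketched for a scalar parameter as follows. As with the local update, the formula images are lost, so the combination of terms and the coefficients are reconstructions from the surrounding definitions, not the patent's verbatim formulas:

```python
def global_update(w_global, M_global, corrections, mu=0.9, alpha=0.01):
    # Average the workers' corrections (one network communication),
    # refresh the global momentum from the aggregated correction, and
    # advance the global model by the new momentum.
    c_bar = sum(corrections) / len(corrections)   # aggregated correction
    M_new = mu * M_global + alpha * c_bar         # global momentum update
    w_new = w_global + M_new                      # global model update
    return w_new, M_new
```

This function is invoked only when the iteration number is a multiple of τ, so the single aggregation it performs is the only network communication in each window of τ iterations.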

The communication overhead refers to the proportion of communication time in the total time of one epoch.

The communication interval refers to the number of iterations between two adjacent iterations in which the global model is updated.

The local model refers to the model that each training process updates independently using its locally computed gradient and correction.

The global model refers to the model that all training processes update jointly using the corrections aggregated through network communication; equivalently, it is the model used to correct the local model updates in every iteration.

The present invention also proposes a system for implementing the above method, comprising a runtime data collector, an adaptive communication interval selector, and a model update module.

The runtime data collector collects the communication time t_cm and computation time t_cp of the running system during the first iteration.

The adaptive communication interval selector automatically adjusts the communication interval to τ = ⌈t_cm / t_cp⌉ according to the communication time t_cm and computation time t_cp collected by the runtime data collector, and uses this interval τ for all subsequent iterations.

The model update module updates the local models and the global model iteratively according to the communication interval τ obtained by the adaptive communication interval selector: in each iteration it updates the local models with the correction technique, and every τ iterations it updates the global model with the skip-communication strategy.
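The cooperation of the three modules can be sketched end to end: measure t_cm and t_cp once, pick τ = ⌈t_cm / t_cp⌉, then run corrected local updates in every iteration and a skip-communication global update every τ iterations. The 1-D model, the data (y = 2x), the timings, and all coefficients are toy assumptions; momentum is set to zero here so the toy run has an easily checked fixed point:

```python
import math

def grad(w, batch):
    # Mean-squared-error gradient for a scalar model (toy stand-in).
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(worker_batches, t_cm=0.2, t_cp=0.1, iters=200,
          mu=0.0, gamma=0.1, alpha=0.5):
    tau = max(1, math.ceil(t_cm / t_cp))     # adaptive interval (step A)
    n = len(worker_batches)
    w_g, M = 0.0, 0.0                        # global model and momentum
    w = [0.0] * n                            # local models
    m = [0.0] * n                            # local momenta
    for k in range(1, iters + 1):
        cs = []
        for i, batch in enumerate(worker_batches):
            g = grad(w[i], batch)                            # step B1
            c = w[i] - w_g                                   # step B2
            w[i] = w[i] + mu * m[i] - gamma * g - alpha * c  # step B3
            m[i] = mu * m[i] - gamma * g
            cs.append(w[i] - w_g)            # correction reported to step C
        if k % tau == 0:                     # step C: communicate every tau iters
            c_bar = sum(cs) / n              # aggregated correction
            M = mu * M + alpha * c_bar
            w_g += M                         # global model update
    return w_g

w = train([[(1.0, 2.0)], [(2.0, 4.0)]])
```

With t_cm twice t_cp the sketch selects τ = 2, i.e., one communication per two iterations, and both the local and the global models converge to the target slope.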

The beneficial effects of the present invention include:

The present invention updates the local models with a correction technique in every iteration, reducing the divergence among the local models and thereby guaranteeing high statistical efficiency. At the same time, it collects communication and computation time data while the system is running and adaptively selects a communication interval τ from the collected data; based on this interval, the system updates the global model through network communication only once every τ iterations, which reduces network communication overhead and ultimately shortens model training time. The implementation results show that SkipSMA shortens the total training time by 86.76%, 64.83%, 16.37%, and 79.26% compared with SSGD, SkipSSGD, LocalSGD, and SMA, respectively. This is because, while maintaining statistical efficiency similar to that of SSGD, SkipSSGD, LocalSGD, and SMA, SkipSMA reduces communication overhead by 93.3%, 80%, and 33.3% compared with SSGD, SkipSSGD, and LocalSGD, respectively, and reduces per-epoch computation and communication overhead by 79.9% compared with SMA.

Description of Drawings

Figure 1 is a schematic diagram of training a model with the SSGD method in the prior art;

Figure 2 is a schematic diagram of training a model with the SkipSSGD method in the prior art;

Figure 3 is a schematic diagram of training a model with the LocalSGD method in the prior art;

Figure 4 is a schematic diagram of training a model with the SMA method in the prior art;

Figure 5 is a schematic diagram of training a model with the SkipSMA method according to an embodiment of the present invention;

Figure 6 compares the total training time of SkipSMA and the existing training methods in the implementation results of the present invention;

Figure 7 compares the communication overhead of SkipSMA and the existing training methods in the implementation results of the present invention;

Figure 8 compares the statistical efficiency of SkipSMA and the existing training methods in the implementation results of the present invention;

Figure 9 is a flowchart of the present invention.

Detailed Description

The invention is further described in detail below with reference to the following specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the processes, conditions, and experimental methods for implementing the present invention are common knowledge in the field, and the present invention places no particular limitation on them.

To reduce the communication overhead of training a model in a distributed environment, SkipSMA adopts a skip-communication strategy to update the global model, that is, network communication is performed only once every several iterations. If the global model were updated through network communication in every iteration, as in SMA, the distributed environment would incur high network communication overhead and hence a long training time. Therefore, SkipSMA updates the global model only once every τ iterations, reducing the frequency of network communication. Here, the communication interval τ is the key factor affecting communication overhead. However, the same communication interval τ tends to cause different communication overheads in different environments (GPU computing power, network bandwidth, model size, and so on). It is therefore necessary to select the communication interval τ adaptively. SkipSMA adaptively adjusts the communication interval τ by collecting the required communication and computation time data at runtime. Specifically, at system initialization SkipSMA sets the communication interval τ to 1; then, while the system runs, SkipSMA collects the communication time tcm and the computation time tcp of the first iteration. From the collected tcm and tcp, the adaptive communication interval selector adjusts the communication interval to

$\tau = \lceil t_{cm} / t_{cp} \rceil$

This makes the communication time and the computation time within each epoch similar. In other words, the adaptive communication interval prevents communication overhead from becoming the bottleneck of the whole training process.
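The interval selection just described can be sketched as follows (a minimal illustration; the function name `select_interval` and the timing values are assumptions, not the patent's actual code):

```python
import math

def select_interval(t_cm, t_cp):
    """Choose tau so that one communication (cost t_cm) per tau
    iterations (cost t_cp each) keeps the total communication time
    and the total computation time of an epoch similar."""
    return max(1, math.ceil(t_cm / t_cp))

# Illustrative first-iteration measurements, in seconds:
t_cm, t_cp = 0.73, 0.05
tau = select_interval(t_cm, t_cp)
print(tau)  # 15
```

With these illustrative timings the selector returns τ = 15; when communication is cheaper than one iteration of computation, τ falls back to 1 and the method degenerates to per-iteration communication.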

According to the communication interval τ computed by the adaptive communication interval selector in the first iteration, SkipSMA applies the correction technique to update the local model in every subsequent iteration and uses the skip-communication strategy to update the global model once every τ iterations. Specifically, in iterations whose iteration number is not divisible by the communication interval τ, SkipSMA only updates the local model with the correction technique and performs no network communication to update the global model; in iterations whose iteration number is divisible by τ, SkipSMA updates the global model through network communication in addition to updating the local model.

In each iteration, every training process computes a gradient and a correction from the local model and the global model, and uses both to update the local model. As shown in Fig. 5, in the 10th iteration one training process computes the gradient $g_{10}^{(1)}$ from its local model $w_{10}^{(1)}$, computes the correction $c_{10}^{(1)}$ by combining the local model with the global model $\bar{w}_{10}$, and then, according to the computed gradient and correction, updates the local model from $w_{10}^{(1)}$ to $w_{11}^{(1)}$. The correction here represents the penalty applied when the local model deviates from the global model. Similarly, the other training process computes the gradient $g_{10}^{(2)}$ from its local model $w_{10}^{(2)}$, computes the correction $c_{10}^{(2)}$ by combining it with the global model $\bar{w}_{10}$, and updates its local model from $w_{10}^{(2)}$ to $w_{11}^{(2)}$ according to the computed gradient and correction. Local models are updated in the same way in subsequent iterations.
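A numeric sketch of one such local step, assuming, consistently with the claims below, that the correction is the difference between local and global parameters and that μ, γ, and α denote the momentum coefficient, learning rate, and correction coefficient (the function name and the constants are illustrative only):

```python
import numpy as np

def local_step(w_local, w_global, m_local, grad,
               mu=0.9, gamma=0.01, alpha=0.001):
    """One correction-based local update: the correction term penalizes
    the local model for drifting away from the global model."""
    correction = w_local - w_global                       # c_k
    w_next = w_local + mu * m_local - gamma * grad - alpha * correction
    m_next = mu * m_local - gamma * grad                  # momentum update
    return w_next, m_next, correction

w1 = np.array([1.0, 2.0])          # local model of one training process
wg = np.array([0.8, 1.9])          # global model
m1 = np.zeros(2)                   # momentum is initially empty (zero)
g1 = np.array([0.5, -0.3])         # gradient computed on a mini-batch
w1, m1, c1 = local_step(w1, wg, m1, g1)
```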

Every τ iterations, the training processes update the global model once through network communication. Taking the communication interval τ = 3 as an example, Fig. 5 illustrates how SkipSMA updates the global model with the skip-communication strategy. In the 10th and 11th iterations SkipSMA does not update the global model, that is, it skips communication. In the 12th iteration, 12 modulo the communication interval 3 leaves a remainder of 0, so the global model must be updated through network communication. Specifically, SkipSMA aggregates the corrections computed by all training processes (i.e., $c_{12}^{(1)}$ and $c_{12}^{(2)}$) through network communication, and updates the global model from $\bar{w}_{12}$ to $\bar{w}_{13}$ according to the aggregated correction.
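The skip-communication global update can be sketched the same way (a hedged illustration under the same assumed notation; on a real cluster the mean over corrections would be computed by one collective all-reduce rather than in-process):

```python
import numpy as np

def global_step(k, tau, w_global, m_global, corrections,
                mu=0.9, alpha=0.001):
    """Update the global model only when the iteration number k is
    divisible by the communication interval tau; otherwise skip."""
    if k % tau != 0:
        return w_global, m_global            # communication skipped
    c_bar = np.mean(corrections, axis=0)     # aggregated correction
    w_next = w_global + mu * m_global + alpha * c_bar
    m_next = mu * m_global + alpha * c_bar
    return w_next, m_next

wg, mg = np.array([0.8, 1.9]), np.zeros(2)
cs = [np.array([0.2, 0.1]), np.array([-0.1, 0.3])]
wg10, mg10 = global_step(10, 3, wg, mg, cs)  # 10 % 3 != 0: skipped
wg12, mg12 = global_step(12, 3, wg, mg, cs)  # 12 % 3 == 0: updated
```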

The above is the specific implementation process of SkipSMA, the method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system. In a distributed deep learning system, the method can be implemented by the code of Method 1, shown below:

(The listing of Method 1 is rendered as images in the original publication and is not reproduced here.)
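Because the Method 1 listing survives only as images, the following is a hedged reconstruction of the overall loop from the description above; the helper names, the simulated in-process aggregation, and the hard-coded timings are assumptions, not the patent's code:

```python
import math
import numpy as np

def skip_sma(models, grad_fn, steps, mu=0.9, gamma=0.01, alpha=0.001):
    """Simulated SkipSMA over several in-process 'workers'.
    grad_fn(i, w) returns the gradient of worker i at parameters w."""
    w_loc = [w.copy() for w in models]
    m_loc = [np.zeros_like(w) for w in models]
    w_glob = np.mean(w_loc, axis=0)
    m_glob = np.zeros_like(w_glob)
    tau = 1                                   # initial interval
    for k in range(1, steps + 1):
        corr = []
        for i in range(len(w_loc)):
            g = grad_fn(i, w_loc[i])
            c = w_loc[i] - w_glob             # correction
            w_loc[i] = w_loc[i] + mu * m_loc[i] - gamma * g - alpha * c
            m_loc[i] = mu * m_loc[i] - gamma * g
            corr.append(c)
        if k == 1:                            # adapt tau after iteration 1
            t_cm, t_cp = 0.15, 0.05           # would be measured at runtime
            tau = max(1, math.ceil(t_cm / t_cp))
        if k % tau == 0:                      # skip-communication strategy
            c_bar = np.mean(corr, axis=0)
            w_glob = w_glob + mu * m_glob + alpha * c_bar
            m_glob = mu * m_glob + alpha * c_bar
    return w_loc, w_glob

# Toy usage: two workers minimizing ||w||^2 (gradient 2w).
locs, glob = skip_sma([np.array([1.0]), np.array([-1.0])],
                      lambda i, w: 2 * w, steps=30)
```

In a real deployment each worker would be a separate process and the correction aggregation would be one all-reduce every τ iterations; here both are simulated in a single process for clarity.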

Example

A cluster for training models was built from four machines running the CentOS 7 operating system, connected by a 1000 Mbps network, each equipped with two Tesla V100 GPUs. On this cluster, the ResNet50 model was trained with five methods: SSGD, SkipSSGD, LocalSGD, SMA, and SkipSMA, and the total training time, communication overhead, and statistical efficiency of training ResNet50 to 68% accuracy were compared. Note that SSGD, SkipSSGD, LocalSGD, and SkipSMA all run on the four machines of the cluster, whereas SMA mainly targets the single-machine multi-GPU scenario and therefore runs on only one machine of the cluster. In addition, the communication interval τ of SkipSMA is determined at runtime by the adaptive communication interval selector, while the communication interval τ of SkipSSGD and LocalSGD was specified before running as 1, 2, 3, 4, 5, 10, 20, 30, and 40, and the interval yielding the shortest training time was selected for comparison with SkipSMA.

As shown in Fig. 6, SkipSMA shortens the total training time by 86.76% compared with SSGD. This is mainly because SkipSMA has lower communication overhead than SSGD: SSGD performs network communication in every iteration, whereas SkipSMA communicates only once every τ iterations. As shown in Fig. 7, this lets SkipSMA reduce communication overhead by 93.3% relative to SSGD.

As shown in Fig. 6, SkipSMA shortens the total training time by 64.83% compared with SkipSSGD. This is because, for communication intervals τ > 3, SkipSSGD is affected by the batch size and cannot train the model to 68% accuracy within 50 hours; SkipSSGD achieves its shortest total training time at τ = 3. However, at τ = 3 the communication overhead of SkipSSGD is still considerable: as shown in Fig. 7, SkipSMA with the adaptive communication interval reduces communication overhead by 80% compared with SkipSSGD at τ = 3.

As shown in Fig. 6, SkipSMA shortens the total training time by 16.37% compared with LocalSGD. LocalSGD achieves its shortest total training time at τ = 10. For τ > 10, although LocalSGD can further reduce communication overhead, the local models diverge more from one another, which degrades statistical efficiency. SkipSMA, by contrast, effectively mitigates the divergence among local models through the correction technique and thus achieves high statistical efficiency. As shown in Fig. 8, SkipSMA with the adaptive communication interval maintains statistical efficiency similar to that of LocalSGD at τ = 10. Meanwhile, SkipSMA adaptively adjusts the communication interval τ to 15, giving it low communication overhead. As shown in Fig. 7, SkipSMA with the adaptive communication interval reduces communication overhead by 33.3% compared with LocalSGD at τ = 10.

As shown in Fig. 6, SkipSMA shortens the total training time by 79.26% compared with SMA. Although the communication time of SMA in each iteration is close to its computation time (that is, communication is not the bottleneck), SMA exploits the computing power of only a single node. SkipSMA not only keeps communication overhead low through the skip-communication strategy with an adaptive communication interval, but also exploits the computing power of multiple nodes. As shown in Fig. 7, SkipSMA reduces computation and communication time per epoch by 79.9% compared with SMA.

The protection scope of the present invention is not limited to the above embodiments. Variations and advantages conceivable to those skilled in the art without departing from the spirit and scope of the inventive concept are included in the present invention, and the appended claims define the scope of protection.

Claims (14)

1. A method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system, characterized in that the method comprises the following steps:

Step A: a runtime data collector collects at runtime the communication time tcm and computation time tcp data required for the adaptive communication interval, and an adaptive communication interval selector automatically adjusts the communication interval τ according to the collected communication time and computation time data;

Step B: the model is trained iteratively, and in each iteration the local model is updated with a correction technique;

Step C: every τ iterations the global model is updated with a skip-communication strategy.

2. The method according to claim 1, characterized in that step A further comprises the following steps:

Step A1: the system automatically initializes the communication interval τ to 1, i.e., sets the communication interval τ = 1, and sets the adaptive communication interval flag bit flag;

Step A2: while the system is running, the runtime data collector collects the communication time tcm and the computation time tcp of the first iteration;

Step A3: the adaptive communication interval selector adjusts the communication interval according to the tcm and tcp collected in the first iteration:

$\tau = \lceil t_{cm} / t_{cp} \rceil$.
3. The method according to claim 2, characterized in that in step A1 the adaptive communication interval flag bit indicates whether the adaptive communication interval selector is enabled; when flag = true, the adaptive communication interval selector is enabled and the system adaptively selects a suitable communication interval τ; when flag = false, the adaptive communication interval selector is disabled and the user must specify the communication interval τ.

4. The method according to claim 1, characterized in that after the adjustment of the communication interval τ is completed in the first iteration, the communication interval τ is used for all subsequent iterations.

5. The method according to claim 1, characterized in that step B further comprises the following steps:

Step B1: each training process computes a gradient from the local model;

Step B2: each training process computes a correction, the correction being the difference between the local model and the global model;

Step B3: each training process updates the local model with the computed gradient and correction.

6. The method according to claim 5, characterized in that in step B1 the gradient is computed by the following formula:

$g_k^{(i)} = \frac{1}{b^{(i)}} \sum_{x} \nabla L(x, w_k^{(i)})$

where the sum runs over the $b^{(i)}$ samples of the current mini-batch, i denotes the index of the training process, k denotes the iteration number, $g_k^{(i)}$ denotes the gradient computed by training process i in the k-th iteration, $b^{(i)}$ denotes the batch size on training process i, $w_k^{(i)}$ denotes the local model parameters of training process i in the k-th iteration, and L(x, w) denotes the loss computed for sample x with model parameters w.
7. The method according to claim 5, characterized in that in step B2 the correction is computed by the following formula:

$c_k^{(i)} = w_k^{(i)} - \bar{w}_k$

where i denotes the index of the training process, k denotes the iteration number, $c_k^{(i)}$ denotes the correction computed by training process i in the k-th iteration, $w_k^{(i)}$ denotes the local model parameters of training process i in the k-th iteration, and $\bar{w}_k$ denotes the global model parameters in the k-th iteration.
8. The method according to claim 5, characterized in that in step B3 the local model is updated as shown in the following formula:

$w_{k+1}^{(i)} = w_k^{(i)} + \mu m_k^{(i)} - \gamma g_k^{(i)} - \alpha c_k^{(i)}$

where i denotes the index of the training process, k denotes the iteration number, $w_k^{(i)}$ and $w_{k+1}^{(i)}$ denote the local model parameters of training process i in the k-th and (k+1)-th iterations, $m_k^{(i)}$, $g_k^{(i)}$, and $c_k^{(i)}$ denote the local model momentum, gradient, and correction of training process i in the k-th iteration, and μ, γ, and α denote the momentum coefficient, learning rate, and correction coefficient, respectively.
9. The method according to claim 8, characterized in that the local model momentum is initially empty and is updated according to the gradient in each iteration, in the specific form:

$m_{k+1}^{(i)} = \mu m_k^{(i)} - \gamma g_k^{(i)}$

where i denotes the index of the training process, k denotes the iteration number, $m_k^{(i)}$ and $m_{k+1}^{(i)}$ denote the local model momentum of training process i in the k-th and (k+1)-th iterations, $g_k^{(i)}$ denotes the gradient of training process i in the k-th iteration, and μ and γ denote the momentum coefficient and learning rate, respectively.
10. The method according to claim 1, characterized in that step C further comprises the following steps:

Step C1: the current iteration number is taken modulo the adaptively set communication interval τ; a remainder of 0 indicates that the current iteration needs to update the global model, after which step C2 is performed;

Step C2: the corrections computed by the training processes are aggregated through one network communication, and the global model is updated with the aggregated correction.

11. The method according to claim 10, characterized in that in step C2 the aggregated correction is expressed by the following formula:

$\bar{c}_k = \frac{1}{n} \sum_{i=1}^{n} c_k^{(i)}$

where n denotes the total number of training processes, k denotes the iteration number, and $c_k^{(i)}$ denotes the correction computed by training process i in the k-th iteration;

the global model is updated as shown in the formula

$\bar{w}_{k+1} = \bar{w}_k + \mu \bar{m}_k + \alpha \bar{c}_k$

where k denotes the iteration number, $\bar{w}_k$ and $\bar{w}_{k+1}$ denote the global model parameters in the k-th and (k+1)-th iterations, $\bar{m}_k$ denotes the global model momentum in the k-th iteration, μ denotes the momentum coefficient, and α denotes the correction coefficient.
12. The method according to claim 11, characterized in that the global model momentum is initially empty and is updated once every τ iterations according to the aggregated correction, in the specific form:

$\bar{m}_{k+1} = \mu \bar{m}_k + \alpha \bar{c}_k$

where k denotes the iteration number, $\bar{m}_k$ and $\bar{m}_{k+1}$ denote the global model momentum in the k-th and (k+1)-th iterations, $\bar{c}_k$ denotes the aggregated correction, and μ and α denote the momentum coefficient and correction coefficient, respectively.
13. A system implementing the method according to any one of claims 1-12, characterized in that the system comprises: a runtime data collector, an adaptive communication interval selector, and a model update module.

14. The system according to claim 13, characterized in that the runtime data collector collects the communication time data tcm and computation time data tcp of the running system in the first iteration;

the adaptive communication interval selector automatically adjusts the communication interval according to the communication time tcm and computation time tcp collected by the runtime data collector,

$\tau = \lceil t_{cm} / t_{cp} \rceil$,

and uses this communication interval τ for all subsequent iterations;

the model update module iteratively updates the local model and the global model according to the communication interval τ obtained by the adaptive communication interval selector, that is, in each iteration the local model is updated with the correction technique, and every τ iterations the global model is updated with the skip-communication strategy.
CN202210023028.1A 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems Active CN114565007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210023028.1A CN114565007B (en) 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210023028.1A CN114565007B (en) 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems

Publications (2)

Publication Number Publication Date
CN114565007A true CN114565007A (en) 2022-05-31
CN114565007B CN114565007B (en) 2025-06-13

Family

ID=81712664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210023028.1A Active CN114565007B (en) 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems

Country Status (1)

Country Link
CN (1) CN114565007B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829321A (en) * 2023-11-29 2024-04-05 慧之安信息技术股份有限公司 Efficient internet of things federal learning method and system based on self-adaptive differential privacy

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017156791A1 (en) * 2016-03-18 2017-09-21 Microsoft Technology Licensing, Llc Method and apparatus for training a learning machine
CN112800461A (en) * 2021-01-28 2021-05-14 深圳供电局有限公司 Network intrusion detection method for electric power metering system based on federal learning framework
US11017322B1 (en) * 2021-01-28 2021-05-25 Alipay Labs (singapore) Pte. Ltd. Method and system for federated learning
CN113098806A (en) * 2021-04-16 2021-07-09 华南理工大学 Method for compressing cooperative channel adaptability gradient of lower end in federated learning
CN113095407A (en) * 2021-04-12 2021-07-09 哈尔滨理工大学 Efficient asynchronous federated learning method for reducing communication times
WO2021158313A1 (en) * 2020-02-03 2021-08-12 Intel Corporation Systems and methods for distributed learning for wireless edge dynamics
CN113435604A (en) * 2021-06-16 2021-09-24 清华大学 Method and device for optimizing federated learning
CN113591145A (en) * 2021-07-28 2021-11-02 西安电子科技大学 Federal learning global model training method based on difference privacy and quantification


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BI NIFEI: "Communication optimization techniques for distributed synchronous data-parallel training", China Master's Theses Electronic Journal, no. 2022, 15 December 2022 (2022-12-15) *
JIAO JIAFENG; LI YUN: "A survey of typical machine learning platforms for big data", Journal of Computer Applications, no. 11, 10 November 2017 (2017-11-10), pages 7-14 *


Also Published As

Publication number Publication date
CN114565007B (en) 2025-06-13

Similar Documents

Publication Publication Date Title
CN108829441B (en) Distributed deep learning parameter updating and optimizing system
CN109902818B (en) Distributed acceleration method and system for deep learning training task
CN112862088A (en) Distributed deep learning method based on pipeline annular parameter communication
CN112712171B (en) Distributed training methods, devices and storage media for deep convolutional neural networks
US20190171952A1 (en) Distributed machine learning method and system
CN118586474B (en) Self-adaptive asynchronous federal learning method and system based on deep reinforcement learning
CN113438315B (en) Optimization method of Internet of Things information freshness based on dual-network deep reinforcement learning
CN111813858B (en) Hybrid synchronous training method of distributed neural network based on self-organized grouping of computing nodes
CN113206887A (en) Method for accelerating federal learning aiming at data and equipment isomerism under edge calculation
CN116680301A (en) Parallel strategy searching method oriented to artificial intelligence large model efficient training
CN111191728A (en) Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN113708969B (en) A collaborative embedding method for cloud data center virtual network based on deep reinforcement learning
CN118379593B (en) Semi-supervised semantic segmentation model training method and application based on multi-level label correction
WO2025112979A1 (en) Parallel strategy optimal selection method, and neural network solver training method and apparatus
CN116185604A (en) A pipelined parallel training method and system for a deep learning model
CN114565007A (en) A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems
CN116227599A (en) An optimization method, device, electronic equipment and storage medium for a reasoning model
CN118396048B (en) Distributed training system, method and apparatus, medium and computer program product
CN115129471A (en) Distributed local random gradient descent method for large-scale GPU cluster
CN118158092A (en) A computing power network scheduling method, device and electronic equipment
CN116306912A (en) A Model Parameter Update Method for Distributed Federated Learning
CN113918319A (en) Topology segmentation graph neural network training mechanism based on breadth-first segmentation
CN112306697B (en) Deep learning memory management method and system based on Tensor access
CN105681425B (en) Multi-node repair method and system based on distributed storage system
CN112651488A (en) Method for improving training efficiency of large-scale graph convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant