
CN114565007A - A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems - Google Patents


Info

Publication number
CN114565007A
CN114565007A
Authority
CN
China
Prior art keywords: communication, model, iterations, communication interval, iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210023028.1A
Other languages
Chinese (zh)
Other versions
CN114565007B (en)
Inventor
徐辰
毕倪飞
陈梓浩
周傲英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202210023028.1A
Publication of CN114565007A
Application granted
Publication of CN114565007B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in wireless communication networks


Abstract

The invention discloses a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system, comprising the following steps: a runtime data collector collects, at runtime, the communication time t_cm and computation time t_cp required for adaptive communication-interval selection; an adaptive communication interval selector automatically adjusts the communication interval τ from the collected communication and computation times; the model is trained iteratively, and in each iteration the local model is updated using a correction technique; every τ iterations, the global model is updated using a skip-communication strategy. The invention adaptively selects the communication interval τ from the collected data, and based on this interval the system updates the global model through network communication only once every τ iterations, thereby reducing network communication overhead and ultimately shortening model training time.

Description

A Method for Training Models with Low Communication Overhead and High Statistical Efficiency in Distributed Deep Learning Systems

Technical Field

The invention belongs to the field of distributed deep learning, and relates to a method for training a model with low communication overhead and high statistical efficiency for synchronous data parallelism in a distributed deep learning system.

Background Art

In synchronous data-parallel model training, a distributed deep learning system starts multiple training processes in a cluster, each holding a complete replica of the model. The system partitions the entire dataset into shards and assigns the shards to the training processes. Training consists of a series of iterations; in each iteration, every training process selects a batch of data from its shard and computes a parameter update on the model. To synchronize the parameter updates across all shards, the training processes typically aggregate them through a Parameter Server or AllReduce communication architecture, and each process then applies the aggregated update to its model parameters. The concrete update procedure, however, is determined by the training method, and different training methods correspond to different update procedures. At present, there are four main synchronous data-parallel training methods in distributed deep learning systems: SSGD, SkipSSGD, LocalSGD, and SMA.

SSGD is a widely adopted training method in distributed deep learning systems. In every iteration, SSGD aggregates over the network the gradients that each training process computes on a batch of data, and updates the model with the aggregated gradient. Figure 1 shows three consecutive iterations of training with SSGD. In iteration 10, the two training processes compute gradients g_10^(1) and g_10^(2) based on model w_10, aggregate the two gradients through network communication, and compute the average gradient. SSGD then uses the average gradient to update the model from w_10 to w_11 and proceeds to the next iteration. Because every iteration involves network communication, SSGD usually suffers from a communication bottleneck in distributed environments.
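The per-iteration aggregation described above can be sketched as a minimal single-process simulation. The 1-D least-squares model, the data, and the learning rate below are illustrative assumptions, not the patent's setup:

```python
def local_gradient(w, batch):
    # Gradient of the mean squared error 0.5*(w*x - y)^2 for a scalar model.
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def ssgd_step(w, worker_batches, lr=0.1):
    # One SSGD iteration: each worker computes a gradient on its batch,
    # the gradients are aggregated (averaged), and the shared model is
    # updated. In a real cluster the averaging is the per-iteration
    # network communication that creates the bottleneck.
    grads = [local_gradient(w, batch) for batch in worker_batches]
    return w - lr * sum(grads) / len(grads)

w = 0.0
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]   # two workers, data satisfying y = 2x
for _ in range(50):
    w = ssgd_step(w, shards)
```

With this toy data the shared model converges to the target slope 2.0, with one (simulated) communication per iteration.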

Building on SSGD, SkipSSGD updates the model with a skip-communication strategy. Instead of aggregating gradients over the network in every iteration, SkipSSGD communicates only once every τ iterations, greatly reducing communication overhead. To preserve the gradient information obtained while communication is skipped, SkipSSGD maintains a gradient accumulator in each training process. Every τ iterations, SkipSSGD aggregates the accumulated gradients through a single network communication and updates the model with the aggregated gradient. Figure 2 shows three consecutive iterations of training with SkipSSGD at τ = 3. Based on model w_10, the two training processes compute gradients in iterations 10, 11, and 12 and add these gradients to their respective accumulators. After accumulating gradients for three consecutive iterations, SkipSSGD aggregates the accumulated gradients through one network communication and updates the model from w_10 to w_11. Although SkipSSGD reduces communication overhead, accumulating gradients amounts to a large batch size, which in turn leads to low statistical efficiency.
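The accumulate-then-communicate pattern can be sketched as follows; the toy 1-D model, data, τ, and learning rate are illustrative assumptions:

```python
def skip_ssgd(worker_batches, tau=3, comm_rounds=40, lr=0.05):
    # SkipSSGD sketch: each worker adds its gradient to a local
    # accumulator for tau iterations; only then are the accumulators
    # aggregated over the network and the shared model updated once.
    # Communication drops by a factor of tau, but each update now
    # reflects a tau-times larger effective batch.
    def grad(w, batch):
        return sum((w * x - y) * x for x, y in batch) / len(batch)
    n = len(worker_batches)
    w = 0.0
    for _ in range(comm_rounds):
        acc = [0.0] * n                     # per-worker gradient accumulators
        for _ in range(tau):                # tau communication-free iterations
            for i, batch in enumerate(worker_batches):
                acc[i] += grad(w, batch)
        w -= lr * sum(acc) / (n * tau)      # one aggregation per tau iterations
    return w

w = skip_ssgd([[(1.0, 2.0)], [(2.0, 4.0)]])
```

Note that the model is only refreshed at communication points, which is exactly the large-effective-batch behavior criticized above.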

LocalSGD also uses a skip-communication strategy to reduce communication overhead. However, unlike SkipSSGD, which skips communication by accumulating gradients, LocalSGD maintains multiple local models and one global model: it updates each local model directly with its gradient in every iteration, and every τ iterations it aggregates the parameters of the local models through network communication to update the global model. Figure 3 shows three consecutive iterations of training with LocalSGD at τ = 3. In iteration 10, one training process computes gradient g_10^(1) and uses it to update its local model to w_11^(1). In iterations 11 and 12, the same process computes gradients g_11^(1) and g_12^(1) and updates its local model to w_12^(1) and then w_13^(1). Similarly, the other training process updates its local model to w_11^(2), w_12^(2), and w_13^(2) in iterations 10, 11, and 12. After three consecutive rounds of local updates, LocalSGD aggregates the local models through one network communication and replaces the global model with their average. Although LocalSGD reduces communication overhead and keeps the batch size small, the local models diverge from one another more and more as τ grows, which leads to low statistical efficiency.
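The local-update-plus-periodic-averaging pattern can be sketched as follows; the toy 1-D model, data, τ, and learning rate are illustrative assumptions:

```python
def local_sgd(worker_batches, tau=3, iters=30, lr=0.1):
    # LocalSGD sketch: every worker updates its own local model in every
    # iteration; every tau iterations the local models are averaged over
    # the network and the average becomes the new global model that all
    # workers continue from.
    def grad(w, batch):
        return sum((w * x - y) * x for x, y in batch) / len(batch)
    locals_ = [0.0] * len(worker_batches)   # per-worker local models
    for k in range(1, iters + 1):
        locals_ = [w - lr * grad(w, b) for w, b in zip(locals_, worker_batches)]
        if k % tau == 0:                    # one communication per tau iterations
            avg = sum(locals_) / len(locals_)
            locals_ = [avg] * len(locals_)  # global model = average of locals
    return locals_[0]

w = local_sgd([[(1.0, 2.0)], [(2.0, 4.0)]])
```

Between communication points the local models drift apart (here, because the two workers see different data), which is the divergence that grows with τ.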

To reduce the divergence among the local models, SMA employs a correction technique: each update of every local model is corrected by the global model. Figure 4 shows three consecutive iterations of training with SMA. In iteration 10, one training process updates its local model from w_10^(1) to w_11^(1) using its gradient g_10^(1) and its correction c_10^(1); the other training process updates its local model from w_10^(2) to w_11^(2) using gradient g_10^(2) and correction c_10^(2). Meanwhile, to keep the global model up to date, SMA updates the global model in every iteration. For example, in iteration 10, SMA aggregates the corrections computed by the two training processes (i.e., c_10^(1) and c_10^(2)) through network communication and uses them to update the global model from w_10 to w_11. Although SMA keeps the batch size small and reduces the divergence among the local models, every iteration involves network communication to update the global model, which often becomes a communication bottleneck in distributed environments.
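SMA's per-iteration correction and aggregation can be sketched as follows. Momentum is omitted and the coefficients, toy model, and data are assumptions for illustration:

```python
def sma(worker_batches, iters=100, lr=0.1, alpha=0.5):
    # SMA sketch: in every iteration each worker applies its gradient
    # plus a correction pulling its local model toward the global model,
    # and the workers' corrections are aggregated over the network to
    # move the global model. Communication happens in every iteration,
    # which is the bottleneck noted above.
    def grad(w, batch):
        return sum((w * x - y) * x for x, y in batch) / len(batch)
    n = len(worker_batches)
    locals_ = [0.0] * n
    w_g = 0.0                               # global model
    for _ in range(iters):
        cs = [w - w_g for w in locals_]     # corrections: local minus global
        locals_ = [w - lr * grad(w, b) - alpha * c
                   for w, b, c in zip(locals_, worker_batches, cs)]
        w_g += alpha * sum(cs) / n          # aggregate corrections every iteration
    return w_g

w = sma([[(1.0, 2.0)], [(2.0, 4.0)]])
```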

In general, the existing synchronous data-parallel training methods are all flawed: none of them guarantees both low communication overhead and high statistical efficiency.

Summary of the Invention

To overcome the shortcomings of the prior art, the purpose of the present invention is to propose a method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system. The method uses a skip-communication strategy based on an adaptive communication interval to guarantee low communication overhead, and applies a correction technique, while keeping the batch size small, to reduce the divergence among the local models, thereby guaranteeing high statistical efficiency and ultimately shortening model training time.

Existing training methods that adopt a skip-communication strategy (e.g., SkipSSGD and LocalSGD) require the user to specify a communication interval τ manually. However, the same communication interval τ often incurs different communication overheads in different runtime environments (GPU compute power, network bandwidth, model size, and so on); that is, an interval τ that suits one environment may not suit another. Therefore, to select a suitable communication interval τ automatically in different runtime environments, the present invention also proposes a method for adaptively adjusting the communication interval during model training. The method is implemented by a runtime data collector and an adaptive communication interval selector: the runtime data collector gathers communication-time and computation-time data in the first iteration, and the adaptive communication interval selector automatically adjusts the communication interval τ based on the collected data so that the communication time and the computation time within each training epoch are comparable, preventing communication overhead from becoming the bottleneck of the whole training.

The specific technical scheme realizing the object of the present invention is as follows.

A method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system comprises the following steps:

Step A: A runtime data collector collects, at runtime, the communication time t_cm and computation time t_cp required for adaptive communication-interval selection; an adaptive communication interval selector automatically adjusts the communication interval τ from the collected communication and computation times.

Step B: The model is trained iteratively, and in each iteration the local model is updated using a correction technique.

Step C: Every τ iterations, the global model is updated using a skip-communication strategy.

其中,所述运行时数据收集器在运行时采集自适应通信间隔所需的通信时间tcm和计算时间tcp数据;自适应通信间隔选择器通过上述采集获得的通信时间和计算时间数据自动调整通信间隔τ的步骤包括:Wherein, the runtime data collector collects the data of the communication time t cm and the computation time t cp required by the adaptive communication interval during the runtime; the adaptive communication interval selector automatically adjusts the communication time and computation time data obtained by the above collection The steps of communication interval τ include:

步骤A1:系统自动将通信间隔τ初始化为1;设置通信间隔τ=1,并设置自适应通信间隔标志位flag;自适应通信间隔标志位用于指示是否启用自适应通信间隔选择器,自适应通信间隔标志位flag=true表示启用自适应通信间隔选择器,即系统自适应地选择一个合适的通信间隔τ;flag=false表示禁用自适应通信间隔选择器,即需要用户指定通信间隔τ;本发明始终启用自适应通信间隔选择器;Step A1: The system automatically initializes the communication interval τ to 1; sets the communication interval τ=1, and sets the adaptive communication interval flag bit; the adaptive communication interval flag bit is used to indicate whether to enable the adaptive communication interval selector, adaptive The communication interval flag bit flag=true indicates that the adaptive communication interval selector is enabled, that is, the system adaptively selects a suitable communication interval τ; flag=false indicates that the adaptive communication interval selector is disabled, that is, the user needs to specify the communication interval τ; this Invention always enables adaptive communication interval selector;

步骤A2:在系统运行时,运行时数据收集器采集第一轮迭代中通信的耗时tcm和计算的耗时tcpStep A2: when the system is running, the runtime data collector collects the time-consuming t cm of communication and the time-consuming t cp of calculation in the first iteration;

步骤A3:自适应通信间隔选择器根据第一轮迭代中采集的tcm和tcp来调整通信间隔τ,具体来说,将通信间隔τ调整为

Figure BDA0003463260190000031
即将通信间隔调整为对tcm和tcp的商向上取整。Step A3: The adaptive communication interval selector adjusts the communication interval τ according to the t cm and t cp collected in the first round of iterations. Specifically, the communication interval τ is adjusted as
Figure BDA0003463260190000031
That is, the communication interval is adjusted to round up the quotient of t cm and t cp .

在第一轮迭代中完成通信间隔τ的调整后,该通信间隔τ用于后续所有迭代,即后续迭代中不再进行通信和计算时间数据的采集以及通信间隔τ的调整。After completing the adjustment of the communication interval τ in the first round of iterations, the communication interval τ is used for all subsequent iterations, that is, the collection of communication and calculation time data and the adjustment of the communication interval τ are no longer performed in subsequent iterations.
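Steps A1 to A3 can be sketched directly; the timing values in the example call are made up:

```python
import math

def select_interval(t_cm, t_cp, flag=True, user_tau=1):
    # flag=True enables the adaptive selector (the invention's setting);
    # flag=False falls back to a user-specified interval.
    tau = 1                                  # step A1: initialize tau to 1
    if not flag:
        return user_tau
    tau = math.ceil(t_cm / t_cp)             # step A3: tau = ceil(t_cm / t_cp)
    return max(tau, 1)                       # keep at least one iteration per sync

print(select_interval(0.30, 0.08))   # communication ~4x computation -> 4
```

Because τ ≈ t_cm / t_cp, the total time spent communicating over τ iterations roughly matches the time spent computing, which is the balance condition stated above.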

The step of training the model iteratively and updating the local model with a correction technique in each iteration comprises:

Step B1: Each training process computes a gradient from its local model:

g_k^(i) = (1 / b^(i)) Σ_{x ∈ B_k^(i)} ∇L(x, w_k^(i)),

where i is the index of the training process, k is the iteration number, g_k^(i) is the gradient computed by training process i in iteration k, b^(i) is the batch size on training process i, B_k^(i) denotes the batch drawn by process i in iteration k, w_k^(i) are the local model parameters of training process i in iteration k, and L(x, w) is the loss of sample x under model parameters w.

Step B2: Each training process computes a correction:

c_k^(i) = w_k^(i) − w_k,

where i is the index of the training process, k is the iteration number, c_k^(i) is the correction computed by training process i in iteration k, w_k^(i) are the local model parameters of training process i in iteration k, and w_k are the global model parameters in iteration k; the correction is thus the difference between the local and global model parameters.

Step B3: Each training process updates its local model with the computed gradient and correction. The specific update form is:

w_{k+1}^(i) = w_k^(i) + μ m_k^(i) − γ g_k^(i) − α c_k^(i),

where i is the index of the training process, k is the iteration number, w_k^(i) and w_{k+1}^(i) are the local model parameters of training process i in iterations k and k+1, m_k^(i), g_k^(i), and c_k^(i) are the local model momentum, gradient, and correction of training process i in iteration k, and μ, γ, and α are the momentum coefficient, the learning rate, and the correction coefficient, respectively.

The local model momentum is initially zero, and it is also updated from the gradient in every iteration. The specific update form is:

m_{k+1}^(i) = μ m_k^(i) − γ g_k^(i),

where i is the index of the training process, k is the iteration number, m_k^(i) and m_{k+1}^(i) are the local model momentum of training process i in iterations k and k+1, g_k^(i) is the gradient of training process i in iteration k, and μ and γ are the momentum coefficient and the learning rate, respectively.
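The local update of steps B1 to B3 can be sketched for a scalar parameter as follows. The original formula images are not recoverable, so the exact combination of terms below (and the coefficient values) is a reconstruction from the surrounding symbol definitions, not the patent's verbatim formulas:

```python
def local_update(w_local, w_global, m_local, grad,
                 mu=0.9, gamma=0.1, alpha=0.01):
    # Correction = gap between local and global parameters (step B2);
    # momentum is refreshed from the gradient; the local model moves
    # along the momentum, the gradient, and the correction (step B3).
    c = w_local - w_global                   # step B2: correction
    m_new = mu * m_local - gamma * grad      # momentum update
    w_new = w_local + mu * m_local - gamma * grad - alpha * c   # step B3
    return w_new, m_new
```

The correction term pulls each local model back toward the global model, which is what limits the divergence among local models while communication is skipped.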

The step of updating the global model with a skip-communication strategy every τ iterations comprises:

Step C1: The current iteration number is taken modulo the adaptively set communication interval τ; a remainder of 0 indicates that the global model must be updated in the current iteration, in which case the procedure continues with step C2.

Step C2: The corrections computed by the training processes are aggregated through a single network communication, yielding the aggregated correction

c̄_k = (1 / n) Σ_{i=1}^{n} c_k^(i),

where n is the total number of training processes, k is the iteration number, and c_k^(i) is the correction computed by training process i in iteration k. The aggregated correction is then used to update the global model. The specific update form is:

w_{k+1} = w_k + μ M_k + α c̄_k,

where k is the iteration number, w_k and w_{k+1} are the global model parameters in iterations k and k+1, M_k is the global model momentum in iteration k, μ is the momentum coefficient, and α is the correction coefficient. The global model momentum is initially zero, and it, too, is updated from the aggregated correction once every τ iterations. The specific update form is:

M_{k+1} = μ M_k + α c̄_k,

where k is the iteration number, M_k and M_{k+1} are the global model momentum in iterations k and k+1, c̄_k is the aggregated correction, and μ and α are the momentum coefficient and the correction coefficient, respectively.
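The global update of step C2 can be sketched for a scalar parameter as follows. As with the local update, the formula images are lost, so the combination of terms and the coefficients are reconstructions from the surrounding definitions, not the patent's verbatim formulas:

```python
def global_update(w_global, M_global, corrections, mu=0.9, alpha=0.01):
    # Average the workers' corrections (one network communication),
    # refresh the global momentum from the aggregated correction, and
    # advance the global model by the new momentum.
    c_bar = sum(corrections) / len(corrections)   # aggregated correction
    M_new = mu * M_global + alpha * c_bar         # global momentum update
    w_new = w_global + M_new                      # global model update
    return w_new, M_new
```

This function is invoked only when the iteration number is a multiple of τ, so the single aggregation it performs is the only network communication in each window of τ iterations.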

The communication overhead refers to the proportion of communication time in the total time of one epoch.

The communication interval refers to the number of iterations between two adjacent iterations in which the global model is updated.

The local model refers to the model that each training process updates independently using its locally computed gradient and correction.

The global model refers to the model that all training processes update jointly using the corrections aggregated through network communication; equivalently, it is the model used to correct the local model updates in every iteration.

The present invention also proposes a system for implementing the above method, comprising a runtime data collector, an adaptive communication interval selector, and a model update module.

The runtime data collector collects the communication time t_cm and computation time t_cp of the running system during the first iteration.

The adaptive communication interval selector automatically adjusts the communication interval to τ = ⌈t_cm / t_cp⌉ according to the communication time t_cm and computation time t_cp collected by the runtime data collector, and uses this interval τ for all subsequent iterations.

The model update module updates the local models and the global model iteratively according to the communication interval τ obtained by the adaptive communication interval selector: in each iteration it updates the local models with the correction technique, and every τ iterations it updates the global model with the skip-communication strategy.
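The cooperation of the three modules can be sketched end to end: measure t_cm and t_cp once, pick τ = ⌈t_cm / t_cp⌉, then run corrected local updates in every iteration and a skip-communication global update every τ iterations. The 1-D model, the data (y = 2x), the timings, and all coefficients are toy assumptions; momentum is set to zero here so the toy run has an easily checked fixed point:

```python
import math

def grad(w, batch):
    # Mean-squared-error gradient for a scalar model (toy stand-in).
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(worker_batches, t_cm=0.2, t_cp=0.1, iters=200,
          mu=0.0, gamma=0.1, alpha=0.5):
    tau = max(1, math.ceil(t_cm / t_cp))     # adaptive interval (step A)
    n = len(worker_batches)
    w_g, M = 0.0, 0.0                        # global model and momentum
    w = [0.0] * n                            # local models
    m = [0.0] * n                            # local momenta
    for k in range(1, iters + 1):
        cs = []
        for i, batch in enumerate(worker_batches):
            g = grad(w[i], batch)                            # step B1
            c = w[i] - w_g                                   # step B2
            w[i] = w[i] + mu * m[i] - gamma * g - alpha * c  # step B3
            m[i] = mu * m[i] - gamma * g
            cs.append(w[i] - w_g)            # correction reported to step C
        if k % tau == 0:                     # step C: communicate every tau iters
            c_bar = sum(cs) / n              # aggregated correction
            M = mu * M + alpha * c_bar
            w_g += M                         # global model update
    return w_g

w = train([[(1.0, 2.0)], [(2.0, 4.0)]])
```

With t_cm twice t_cp the sketch selects τ = 2, i.e., one communication per two iterations, and both the local and the global models converge to the target slope.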

The beneficial effects of the present invention include:

The present invention updates the local models with a correction technique in every iteration, reducing the divergence among the local models and thereby guaranteeing high statistical efficiency. At the same time, it collects communication and computation time data while the system is running and adaptively selects a communication interval τ from the collected data; based on this interval, the system updates the global model through network communication only once every τ iterations, which reduces network communication overhead and ultimately shortens model training time. The implementation results show that SkipSMA shortens the total training time by 86.76%, 64.83%, 16.37%, and 79.26% compared with SSGD, SkipSSGD, LocalSGD, and SMA, respectively. This is because, while maintaining statistical efficiency similar to that of SSGD, SkipSSGD, LocalSGD, and SMA, SkipSMA reduces communication overhead by 93.3%, 80%, and 33.3% compared with SSGD, SkipSSGD, and LocalSGD, respectively, and reduces per-epoch computation and communication overhead by 79.9% compared with SMA.

Description of Drawings

Figure 1 is a schematic diagram of training a model with the SSGD method in the prior art;

Figure 2 is a schematic diagram of training a model with the SkipSSGD method in the prior art;

Figure 3 is a schematic diagram of training a model with the LocalSGD method in the prior art;

Figure 4 is a schematic diagram of training a model with the SMA method in the prior art;

Figure 5 is a schematic diagram of training a model with the SkipSMA method according to an embodiment of the present invention;

Figure 6 compares the total training time of SkipSMA and the existing training methods in the implementation results of the present invention;

Figure 7 compares the communication overhead of SkipSMA and the existing training methods in the implementation results of the present invention;

Figure 8 compares the statistical efficiency of SkipSMA and the existing training methods in the implementation results of the present invention;

Figure 9 is a flowchart of the present invention.

Detailed Description

The invention is further described in detail below with reference to the following specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the processes, conditions, and experimental methods for implementing the present invention are common knowledge in the field, and the present invention places no particular limitation on them.

To reduce the communication overhead of training a model in a distributed environment, SkipSMA adopts a skip-communication strategy to update the global model, that is, network communication is performed only once every several iterations. If the global model were updated through network communication in every iteration, as in SMA, the distributed environment would incur high network communication overhead and hence a long training time. Therefore, SkipSMA updates the global model only once every τ iterations, reducing the frequency of network communication. Here, the communication interval τ is the key factor affecting communication overhead. However, the same communication interval τ tends to cause different communication overheads in different environments (GPU computing power, network bandwidth, model size, and so on). It is therefore necessary to select the communication interval τ adaptively. SkipSMA adaptively adjusts the communication interval τ by collecting the required communication and computation time data at runtime. Specifically, at system initialization SkipSMA sets the communication interval τ to 1; then, while the system runs, SkipSMA collects the communication time tcm and the computation time tcp of the first iteration. From the collected tcm and tcp, the adaptive communication interval selector adjusts the communication interval to

$\tau = \lceil t_{cm} / t_{cp} \rceil$

This makes the communication time and the computation time within each epoch similar. In other words, the adaptive communication interval prevents communication overhead from becoming the bottleneck of the whole training process.
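The interval selection just described can be sketched as follows (a minimal illustration; the function name `select_interval` and the timing values are assumptions, not the patent's actual code):

```python
import math

def select_interval(t_cm, t_cp):
    """Choose tau so that one communication (cost t_cm) per tau
    iterations (cost t_cp each) keeps the total communication time
    and the total computation time of an epoch similar."""
    return max(1, math.ceil(t_cm / t_cp))

# Illustrative first-iteration measurements, in seconds:
t_cm, t_cp = 0.73, 0.05
tau = select_interval(t_cm, t_cp)
print(tau)  # 15
```

With these illustrative timings the selector returns τ = 15; when communication is cheaper than one iteration of computation, τ falls back to 1 and the method degenerates to per-iteration communication.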

According to the communication interval τ computed by the adaptive communication interval selector in the first iteration, SkipSMA applies the correction technique to update the local model in every subsequent iteration and uses the skip-communication strategy to update the global model once every τ iterations. Specifically, in iterations whose iteration number is not divisible by the communication interval τ, SkipSMA only updates the local model with the correction technique and performs no network communication to update the global model; in iterations whose iteration number is divisible by τ, SkipSMA updates the global model through network communication in addition to updating the local model.

In each iteration, every training process computes a gradient and a correction from the local model and the global model, and uses both to update the local model. As shown in Fig. 5, in the 10th iteration one training process computes the gradient $g_{10}^{(1)}$ from its local model $w_{10}^{(1)}$, computes the correction $c_{10}^{(1)}$ by combining the local model with the global model $\bar{w}_{10}$, and then, according to the computed gradient and correction, updates the local model from $w_{10}^{(1)}$ to $w_{11}^{(1)}$. The correction here represents the penalty applied when the local model deviates from the global model. Similarly, the other training process computes the gradient $g_{10}^{(2)}$ from its local model $w_{10}^{(2)}$, computes the correction $c_{10}^{(2)}$ by combining it with the global model $\bar{w}_{10}$, and updates its local model from $w_{10}^{(2)}$ to $w_{11}^{(2)}$ according to the computed gradient and correction. Local models are updated in the same way in subsequent iterations.
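A numeric sketch of one such local step, assuming, consistently with the claims below, that the correction is the difference between local and global parameters and that μ, γ, and α denote the momentum coefficient, learning rate, and correction coefficient (the function name and the constants are illustrative only):

```python
import numpy as np

def local_step(w_local, w_global, m_local, grad,
               mu=0.9, gamma=0.01, alpha=0.001):
    """One correction-based local update: the correction term penalizes
    the local model for drifting away from the global model."""
    correction = w_local - w_global                       # c_k
    w_next = w_local + mu * m_local - gamma * grad - alpha * correction
    m_next = mu * m_local - gamma * grad                  # momentum update
    return w_next, m_next, correction

w1 = np.array([1.0, 2.0])          # local model of one training process
wg = np.array([0.8, 1.9])          # global model
m1 = np.zeros(2)                   # momentum is initially empty (zero)
g1 = np.array([0.5, -0.3])         # gradient computed on a mini-batch
w1, m1, c1 = local_step(w1, wg, m1, g1)
```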

Every τ iterations, the training processes update the global model once through network communication. Taking the communication interval τ = 3 as an example, Fig. 5 illustrates how SkipSMA updates the global model with the skip-communication strategy. In the 10th and 11th iterations SkipSMA does not update the global model, that is, it skips communication. In the 12th iteration, 12 modulo the communication interval 3 leaves a remainder of 0, so the global model must be updated through network communication. Specifically, SkipSMA aggregates the corrections computed by all training processes (i.e., $c_{12}^{(1)}$ and $c_{12}^{(2)}$) through network communication, and updates the global model from $\bar{w}_{12}$ to $\bar{w}_{13}$ according to the aggregated correction.
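The skip-communication global update can be sketched the same way (a hedged illustration under the same assumed notation; on a real cluster the mean over corrections would be computed by one collective all-reduce rather than in-process):

```python
import numpy as np

def global_step(k, tau, w_global, m_global, corrections,
                mu=0.9, alpha=0.001):
    """Update the global model only when the iteration number k is
    divisible by the communication interval tau; otherwise skip."""
    if k % tau != 0:
        return w_global, m_global            # communication skipped
    c_bar = np.mean(corrections, axis=0)     # aggregated correction
    w_next = w_global + mu * m_global + alpha * c_bar
    m_next = mu * m_global + alpha * c_bar
    return w_next, m_next

wg, mg = np.array([0.8, 1.9]), np.zeros(2)
cs = [np.array([0.2, 0.1]), np.array([-0.1, 0.3])]
wg10, mg10 = global_step(10, 3, wg, mg, cs)  # 10 % 3 != 0: skipped
wg12, mg12 = global_step(12, 3, wg, mg, cs)  # 12 % 3 == 0: updated
```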

The above is the specific implementation process of SkipSMA, the method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system. In a distributed deep learning system, the method can be implemented by the code of Method 1, shown below:

(The listing of Method 1 is rendered as images in the original publication and is not reproduced here.)
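Because the Method 1 listing survives only as images, the following is a hedged reconstruction of the overall loop from the description above; the helper names, the simulated in-process aggregation, and the hard-coded timings are assumptions, not the patent's code:

```python
import math
import numpy as np

def skip_sma(models, grad_fn, steps, mu=0.9, gamma=0.01, alpha=0.001):
    """Simulated SkipSMA over several in-process 'workers'.
    grad_fn(i, w) returns the gradient of worker i at parameters w."""
    w_loc = [w.copy() for w in models]
    m_loc = [np.zeros_like(w) for w in models]
    w_glob = np.mean(w_loc, axis=0)
    m_glob = np.zeros_like(w_glob)
    tau = 1                                   # initial interval
    for k in range(1, steps + 1):
        corr = []
        for i in range(len(w_loc)):
            g = grad_fn(i, w_loc[i])
            c = w_loc[i] - w_glob             # correction
            w_loc[i] = w_loc[i] + mu * m_loc[i] - gamma * g - alpha * c
            m_loc[i] = mu * m_loc[i] - gamma * g
            corr.append(c)
        if k == 1:                            # adapt tau after iteration 1
            t_cm, t_cp = 0.15, 0.05           # would be measured at runtime
            tau = max(1, math.ceil(t_cm / t_cp))
        if k % tau == 0:                      # skip-communication strategy
            c_bar = np.mean(corr, axis=0)
            w_glob = w_glob + mu * m_glob + alpha * c_bar
            m_glob = mu * m_glob + alpha * c_bar
    return w_loc, w_glob

# Toy usage: two workers minimizing ||w||^2 (gradient 2w).
locs, glob = skip_sma([np.array([1.0]), np.array([-1.0])],
                      lambda i, w: 2 * w, steps=30)
```

In a real deployment each worker would be a separate process and the correction aggregation would be one all-reduce every τ iterations; here both are simulated in a single process for clarity.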

Example

A cluster for training models was built from four machines running the CentOS 7 operating system, connected by a 1000 Mbps network, each equipped with two Tesla V100 GPUs. On this cluster, the ResNet50 model was trained with five methods: SSGD, SkipSSGD, LocalSGD, SMA, and SkipSMA, and the total training time, communication overhead, and statistical efficiency of training ResNet50 to 68% accuracy were compared. Note that SSGD, SkipSSGD, LocalSGD, and SkipSMA all run on the four machines of the cluster, whereas SMA mainly targets the single-machine multi-GPU scenario and therefore runs on only one machine of the cluster. In addition, the communication interval τ of SkipSMA is determined at runtime by the adaptive communication interval selector, while the communication interval τ of SkipSSGD and LocalSGD was specified before running as 1, 2, 3, 4, 5, 10, 20, 30, and 40, and the interval yielding the shortest training time was selected for comparison with SkipSMA.

As shown in Fig. 6, SkipSMA shortens the total training time by 86.76% compared with SSGD. This is mainly because SkipSMA has lower communication overhead than SSGD: SSGD performs network communication in every iteration, whereas SkipSMA communicates only once every τ iterations. As shown in Fig. 7, this lets SkipSMA reduce communication overhead by 93.3% relative to SSGD.

As shown in Fig. 6, SkipSMA shortens the total training time by 64.83% compared with SkipSSGD. This is because, for communication intervals τ > 3, SkipSSGD is affected by the batch size and cannot train the model to 68% accuracy within 50 hours; SkipSSGD achieves its shortest total training time at τ = 3. However, at τ = 3 the communication overhead of SkipSSGD is still considerable: as shown in Fig. 7, SkipSMA with the adaptive communication interval reduces communication overhead by 80% compared with SkipSSGD at τ = 3.

As shown in Fig. 6, SkipSMA shortens the total training time by 16.37% compared with LocalSGD. LocalSGD achieves its shortest total training time at τ = 10. For τ > 10, although LocalSGD can further reduce communication overhead, the local models diverge more from one another, which degrades statistical efficiency. SkipSMA, by contrast, effectively mitigates the divergence among local models through the correction technique and thus achieves high statistical efficiency. As shown in Fig. 8, SkipSMA with the adaptive communication interval maintains statistical efficiency similar to that of LocalSGD at τ = 10. Meanwhile, SkipSMA adaptively adjusts the communication interval τ to 15, giving it low communication overhead. As shown in Fig. 7, SkipSMA with the adaptive communication interval reduces communication overhead by 33.3% compared with LocalSGD at τ = 10.

As shown in Fig. 6, SkipSMA shortens the total training time by 79.26% compared with SMA. Although the communication time of SMA in each iteration is close to its computation time (that is, communication is not the bottleneck), SMA exploits the computing power of only a single node. SkipSMA not only keeps communication overhead low through the skip-communication strategy with an adaptive communication interval, but also exploits the computing power of multiple nodes. As shown in Fig. 7, SkipSMA reduces computation and communication time per epoch by 79.9% compared with SMA.

The protection scope of the present invention is not limited to the above embodiments. Variations and advantages conceivable to those skilled in the art without departing from the spirit and scope of the inventive concept are included in the present invention, and the appended claims define the scope of protection.

Claims (14)

1. A method for training a model with low communication overhead and high statistical efficiency in a distributed deep learning system, characterized in that the method comprises the following steps:

Step A: a runtime data collector collects at runtime the communication time tcm and computation time tcp data required for the adaptive communication interval, and an adaptive communication interval selector automatically adjusts the communication interval τ according to the collected communication time and computation time data;

Step B: the model is trained iteratively, and in each iteration the local model is updated with a correction technique;

Step C: every τ iterations the global model is updated with a skip-communication strategy.

2. The method according to claim 1, characterized in that step A further comprises the following steps:

Step A1: the system automatically initializes the communication interval τ to 1, i.e., sets the communication interval τ = 1, and sets the adaptive communication interval flag bit flag;

Step A2: while the system is running, the runtime data collector collects the communication time tcm and the computation time tcp of the first iteration;

Step A3: the adaptive communication interval selector adjusts the communication interval according to the tcm and tcp collected in the first iteration:

$\tau = \lceil t_{cm} / t_{cp} \rceil$.
3. The method according to claim 2, characterized in that in step A1 the adaptive communication interval flag bit indicates whether the adaptive communication interval selector is enabled; when flag = true, the adaptive communication interval selector is enabled and the system adaptively selects a suitable communication interval τ; when flag = false, the adaptive communication interval selector is disabled and the user must specify the communication interval τ.

4. The method according to claim 1, characterized in that after the adjustment of the communication interval τ is completed in the first iteration, the communication interval τ is used for all subsequent iterations.

5. The method according to claim 1, characterized in that step B further comprises the following steps:

Step B1: each training process computes a gradient from the local model;

Step B2: each training process computes a correction, the correction being the difference between the local model and the global model;

Step B3: each training process updates the local model with the computed gradient and correction.

6. The method according to claim 5, characterized in that in step B1 the gradient is computed by the following formula:

$g_k^{(i)} = \frac{1}{b^{(i)}} \sum_{x} \nabla L(x, w_k^{(i)})$

where the sum runs over the $b^{(i)}$ samples of the current mini-batch, i denotes the index of the training process, k denotes the iteration number, $g_k^{(i)}$ denotes the gradient computed by training process i in the k-th iteration, $b^{(i)}$ denotes the batch size on training process i, $w_k^{(i)}$ denotes the local model parameters of training process i in the k-th iteration, and L(x, w) denotes the loss computed for sample x with model parameters w.
7. The method according to claim 5, characterized in that in step B2 the correction is computed by the following formula:

$c_k^{(i)} = w_k^{(i)} - \bar{w}_k$

where i denotes the index of the training process, k denotes the iteration number, $c_k^{(i)}$ denotes the correction computed by training process i in the k-th iteration, $w_k^{(i)}$ denotes the local model parameters of training process i in the k-th iteration, and $\bar{w}_k$ denotes the global model parameters in the k-th iteration.
8. The method according to claim 5, characterized in that in step B3 the local model is updated as shown in the following formula:

$w_{k+1}^{(i)} = w_k^{(i)} + \mu m_k^{(i)} - \gamma g_k^{(i)} - \alpha c_k^{(i)}$

where i denotes the index of the training process, k denotes the iteration number, $w_k^{(i)}$ and $w_{k+1}^{(i)}$ denote the local model parameters of training process i in the k-th and (k+1)-th iterations, $m_k^{(i)}$, $g_k^{(i)}$, and $c_k^{(i)}$ denote the local model momentum, gradient, and correction of training process i in the k-th iteration, and μ, γ, and α denote the momentum coefficient, learning rate, and correction coefficient, respectively.
9. The method according to claim 8, characterized in that the local model momentum is initially empty and is updated according to the gradient in each iteration, in the specific form:

$m_{k+1}^{(i)} = \mu m_k^{(i)} - \gamma g_k^{(i)}$

where i denotes the index of the training process, k denotes the iteration number, $m_k^{(i)}$ and $m_{k+1}^{(i)}$ denote the local model momentum of training process i in the k-th and (k+1)-th iterations, $g_k^{(i)}$ denotes the gradient of training process i in the k-th iteration, and μ and γ denote the momentum coefficient and learning rate, respectively.
10. The method according to claim 1, characterized in that step C further comprises the following steps:

Step C1: the current iteration number is taken modulo the adaptively set communication interval τ; a remainder of 0 indicates that the current iteration needs to update the global model, after which step C2 is performed;

Step C2: the corrections computed by the training processes are aggregated through one network communication, and the global model is updated with the aggregated correction.

11. The method according to claim 10, characterized in that in step C2 the aggregated correction is expressed by the following formula:

$\bar{c}_k = \frac{1}{n} \sum_{i=1}^{n} c_k^{(i)}$

where n denotes the total number of training processes, k denotes the iteration number, and $c_k^{(i)}$ denotes the correction computed by training process i in the k-th iteration;

the global model is updated as shown in the formula

$\bar{w}_{k+1} = \bar{w}_k + \mu \bar{m}_k + \alpha \bar{c}_k$

where k denotes the iteration number, $\bar{w}_k$ and $\bar{w}_{k+1}$ denote the global model parameters in the k-th and (k+1)-th iterations, $\bar{m}_k$ denotes the global model momentum in the k-th iteration, μ denotes the momentum coefficient, and α denotes the correction coefficient.
12. The method according to claim 11, characterized in that the global model momentum is initially empty and is updated once every τ iterations according to the aggregated correction, in the specific form:

$\bar{m}_{k+1} = \mu \bar{m}_k + \alpha \bar{c}_k$

where k denotes the iteration number, $\bar{m}_k$ and $\bar{m}_{k+1}$ denote the global model momentum in the k-th and (k+1)-th iterations, $\bar{c}_k$ denotes the aggregated correction, and μ and α denote the momentum coefficient and correction coefficient, respectively.
13. A system implementing the method according to any one of claims 1-12, characterized in that the system comprises: a runtime data collector, an adaptive communication interval selector, and a model update module.

14. The system according to claim 13, characterized in that the runtime data collector collects the communication time data tcm and computation time data tcp of the running system in the first iteration;

the adaptive communication interval selector automatically adjusts the communication interval according to the communication time tcm and computation time tcp collected by the runtime data collector,

$\tau = \lceil t_{cm} / t_{cp} \rceil$,

and uses this communication interval τ for all subsequent iterations;

the model update module iteratively updates the local model and the global model according to the communication interval τ obtained by the adaptive communication interval selector, that is, in each iteration the local model is updated with the correction technique, and every τ iterations the global model is updated with the skip-communication strategy.
CN202210023028.1A 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems Active CN114565007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210023028.1A CN114565007B (en) 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210023028.1A CN114565007B (en) 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems

Publications (2)

Publication Number Publication Date
CN114565007A true CN114565007A (en) 2022-05-31
CN114565007B CN114565007B (en) 2025-06-13

Family

ID=81712664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210023028.1A Active CN114565007B (en) 2022-01-10 2022-01-10 A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems

Country Status (1)

Country Link
CN (1) CN114565007B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829321A (en) * 2023-11-29 2024-04-05 慧之安信息技术股份有限公司 Efficient internet of things federal learning method and system based on self-adaptive differential privacy

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017156791A1 (en) * 2016-03-18 2017-09-21 Microsoft Technology Licensing, Llc Method and apparatus for training a learning machine
CN112800461A (en) * 2021-01-28 2021-05-14 深圳供电局有限公司 Network intrusion detection method for electric power metering system based on federal learning framework
US11017322B1 (en) * 2021-01-28 2021-05-25 Alipay Labs (singapore) Pte. Ltd. Method and system for federated learning
CN113098806A (en) * 2021-04-16 2021-07-09 华南理工大学 Method for compressing cooperative channel adaptability gradient of lower end in federated learning
CN113095407A (en) * 2021-04-12 2021-07-09 哈尔滨理工大学 Efficient asynchronous federated learning method for reducing communication times
WO2021158313A1 (en) * 2020-02-03 2021-08-12 Intel Corporation Systems and methods for distributed learning for wireless edge dynamics
CN113435604A (en) * 2021-06-16 2021-09-24 清华大学 Method and device for optimizing federated learning
CN113591145A (en) * 2021-07-28 2021-11-02 西安电子科技大学 Federal learning global model training method based on difference privacy and quantification


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BI NIFEI: "Communication optimization techniques for distributed synchronous data-parallel training", China Master's Theses Electronic Journal, no. 2022, 15 December 2022 (2022-12-15) *
JIAO JIAFENG; LI YUN: "A survey of typical machine learning platforms for big data", Journal of Computer Applications, no. 11, 10 November 2017 (2017-11-10), pages 7-14 *


Also Published As

Publication number Publication date
CN114565007B (en) 2025-06-13

Similar Documents

Publication Publication Date Title
CN108829441B (en) Distributed deep learning parameter updating and optimizing system
CN109902818B (en) Distributed acceleration method and system for deep learning training task
CN112862088A (en) Distributed deep learning method based on pipeline annular parameter communication
CN112712171B (en) Distributed training methods, devices and storage media for deep convolutional neural networks
US20190171952A1 (en) Distributed machine learning method and system
CN118586474B (en) Self-adaptive asynchronous federal learning method and system based on deep reinforcement learning
CN113438315B (en) Optimization method of Internet of Things information freshness based on dual-network deep reinforcement learning
CN111813858B (en) Hybrid synchronous training method of distributed neural network based on self-organized grouping of computing nodes
CN113206887A (en) Method for accelerating federal learning aiming at data and equipment isomerism under edge calculation
CN116680301A (en) Parallel strategy searching method oriented to artificial intelligence large model efficient training
CN111191728A (en) Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN113708969B (en) A collaborative embedding method for cloud data center virtual network based on deep reinforcement learning
CN118379593B (en) Semi-supervised semantic segmentation model training method and application based on multi-level label correction
WO2025112979A1 (en) Parallel strategy optimal selection method, and neural network solver training method and apparatus
CN116185604A (en) A pipelined parallel training method and system for a deep learning model
CN114565007A (en) A method for training models with low communication overhead and high statistical efficiency in distributed deep learning systems
CN116227599A (en) An optimization method, device, electronic equipment and storage medium for a reasoning model
CN118396048B (en) Distributed training system, method and apparatus, medium and computer program product
CN115129471A (en) Distributed local random gradient descent method for large-scale GPU cluster
CN118158092A (en) A computing power network scheduling method, device and electronic equipment
CN116306912A (en) A Model Parameter Update Method for Distributed Federated Learning
CN113918319A (en) Topology segmentation graph neural network training mechanism based on breadth-first segmentation
CN112306697B (en) Deep learning memory management method and system based on Tensor access
CN105681425B (en) Multi-node repair method and system based on distributed storage system
CN112651488A (en) Method for improving training efficiency of large-scale graph convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant