CN114298277B

CN114298277B - A distributed deep learning training method and system based on layer sparsification

Info

Publication number: CN114298277B
Application number: CN202111627780.9A
Authority: CN
Inventors: 吕建成; 胡宴箐; 叶庆; 张钟宇; 郎九霖; 田煜鑫; 吕金地
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2023-09-12
Anticipated expiration: 2041-12-28
Also published as: CN114298277A

Abstract

The application discloses a distributed deep learning training method and system based on layer sparsification, which belong to the technical field of distributed training communication sparsification and comprise the following steps: obtaining a normalized window center list according to the convergence characteristic of the neural network model; obtaining a layer transmission list by using a layer sparsification method and a normalized window center list; performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight updating parameters, and completing the distributed deep learning training based on layer sparsity; the application solves the problem that the existing training frame is only thinned in the network layer, effectively improves the thinning degree and reduces the traffic.

Description

A distributed deep learning training method and system based on layer sparsification

技术领域Technical field

本发明属于分布式训练通信稀疏化技术领域，尤其涉及一种基于层稀疏化的分布式深度学习训练方法及系统。The invention belongs to the technical field of distributed training communication sparsification, and in particular relates to a distributed deep learning training method and system based on layer sparsification.

背景技术Background technique

近年来，随着深度学习不断发展，模型变得更大更复杂，在单机上训练这些复杂的大模型非常耗时。为了减少训练耗时，提出了分布式训练的方法来加速模型训练。具体来说，分布式训练主要包括模型并行和数据并行两种方法。其中模型并行由于将模型参数拆分给不同的计算节点，再加上不同层次的参数大小不平衡和节点间计算依赖性高等问题，加速较为困难。而数据并行在每个计算结点保持完整模型的同时将训练数据集进行拆分。在每一次迭代中，每个计算结点使用不同的训练数据计算局部梯度，再进行传递交换。In recent years, with the continuous development of deep learning, models have become larger and more complex, and training these complex large models on a single machine is very time-consuming. In order to reduce training time, a distributed training method is proposed to accelerate model training. Specifically, distributed training mainly includes two methods: model parallelism and data parallelism. Among them, model parallelism is difficult to accelerate due to the splitting of model parameters to different computing nodes, coupled with the imbalance of parameter sizes at different levels and high computing dependence between nodes. Data parallelism splits the training data set while maintaining the complete model at each computing node. In each iteration, each computing node uses different training data to calculate local gradients, and then transfers and exchanges them.

每个计算节点在进行局部梯度传递时，由于参数量大，网络带宽受限等原因，限制了分布式的扩大。为了解决这一瓶颈，提出了稀疏化和量化两种不同的降低通信量的方法。其中，稀疏化方法旨在减少每次迭代中传输的元素的数量，将绝大多数的元素置零，只传递最有价值的梯度来更新参数，从而保证训练的收敛性。When each computing node performs local gradient transfer, the expansion of distribution is limited due to the large amount of parameters and limited network bandwidth. In order to solve this bottleneck, two different methods of reducing communication volume, sparsification and quantization, are proposed. Among them, the sparsification method aims to reduce the number of elements transmitted in each iteration, set the vast majority of elements to zero, and only transmit the most valuable gradients to update parameters, thereby ensuring the convergence of training.

在一种不带通信压缩的基础同步分布式训练框架SSGD(Synchronous StochasticGradient Descent)中，每个计算结点必须等待所有结点完成当前迭代中所有参数的传输，没有进行压缩，过大的通信量造成的通信负载成为其最大的瓶颈，其在实际应用中，在计算资源有限的情况下，只能应用于一些较小模型的训练；在一种进行网络层内部的深度通信压缩的分布式深度学习训练框架DGC(Deep Gradient Compression)中，增加了很多现有的辅助技术，以克服网络层内部梯度稀疏化带来的损失问题，很大程度上降低了节点间的梯度通信量，但是DGC每次通信仍然需要传递模型的所有网络层，对于模型深度较深的网络，瓶颈依然存在。In a basic synchronous distributed training framework SSGD (Synchronous Stochastic Gradient Descent) without communication compression, each computing node must wait for all nodes to complete the transmission of all parameters in the current iteration. Without compression, excessive communication volume The communication load caused has become its biggest bottleneck. In practical applications, when computing resources are limited, it can only be applied to the training of some smaller models; in a distributed depth that performs deep communication compression within the network layer In the learning and training framework DGC (Deep Gradient Compression), many existing auxiliary technologies have been added to overcome the loss problem caused by gradient sparseness within the network layer and greatly reduce the gradient communication volume between nodes. However, DGC every Secondary communication still needs to pass all network layers of the model, and for networks with deeper model depth, bottlenecks still exist.

发明内容Contents of the invention

针对现有技术中的上述不足，本发明将神经网络模型收敛特性应用于分布式训练，提供的一种基于层稀疏化的分布式深度学习训练方法及系统，解决了现有的训练框架仅在网络层内部稀疏化的问题，有效提升了稀疏化程度，降低了通信量。In view of the above-mentioned deficiencies in the prior art, the present invention applies the convergence characteristics of the neural network model to distributed training, and provides a distributed deep learning training method and system based on layer sparsification, which solves the problem that the existing training framework only The problem of sparsity within the network layer effectively improves the degree of sparsity and reduces communication volume.

为了达到上述发明目的，本发明采用的技术方案为：In order to achieve the above-mentioned object of the invention, the technical solutions adopted by the present invention are:

本发明提供一种基于层稀疏化的分布式深度学习训练方法，包括如下步骤：The present invention provides a distributed deep learning training method based on layer sparsification, which includes the following steps:

S1、根据神经网络模型收敛特性，得到归一化窗口中心列表；S1. According to the convergence characteristics of the neural network model, obtain the normalized window center list;

S2、利用层稀疏化方法和归一化窗口中心列表，得到层传输列表；S2. Use the layer sparsification method and the normalized window center list to obtain the layer transmission list;

S3、根据层传输列表进行基于层稀疏化的分布式深度学习训练，得到权重更新参数，完成基于层稀疏化的分布式深度学习训练。S3. Perform distributed deep learning training based on layer sparsification according to the layer transmission list, obtain weight update parameters, and complete distributed deep learning training based on layer sparsification.

本发明的有益效果为：本发明提供的一种基于层稀疏化的分布式深度学习训练方法，根据神经网络在训练的过程中，不同层的重点学习时机是不同的，选择对当前最需要学习的网络层进行通信同步，其他层留在本地进行累积，在进行节点间梯度传递的时候，选择一部分层的梯度进行节点间通信，有效降低了通信量。The beneficial effects of the present invention are: the present invention provides a distributed deep learning training method based on layer sparseness. According to the neural network training process, the key learning opportunities of different layers are different, and the most needed learning at the moment is selected. The network layer performs communication synchronization, and other layers stay local for accumulation. When transmitting gradients between nodes, the gradients of some layers are selected for inter-node communication, which effectively reduces the communication volume.

进一步地，所述步骤S1包括如下步骤：Further, the step S1 includes the following steps:

S11、将神经网络的所有层设为层连续序列，并设置神经网络总训练次数；S11. Set all layers of the neural network as a continuous sequence of layers, and set the total number of training times of the neural network;

S12、根据神经网络模型收敛特性，设置动态窗口，所述动态窗口随神经网络训练次数增加从后向前遍历层连续序列，并设置动态窗口整体遍历次数；S12. Set a dynamic window according to the convergence characteristics of the neural network model. The dynamic window traverses the continuous sequence of layers from back to front as the number of neural network training times increases, and sets the overall number of traversals of the dynamic window;

S13、根据神经网络总训练次数和动态窗口整体遍历次数，计算得到动态窗口单次遍历神经网络过程中神经网络的训练次数和剩余神经网络训练次数；S13. Based on the total number of training times of the neural network and the number of overall traversals of the dynamic window, calculate the number of training times of the neural network and the remaining number of neural network training times during a single traversal of the neural network in the dynamic window;

S14、根据动态窗口单次遍历神经网络过程中神经网络的训练次数，得到动态窗口遍历时移动的归一化步长；S14. According to the number of training times of the neural network during a single traversal of the neural network in the dynamic window, the normalized step size of the movement during the dynamic window traversal is obtained;

S15、基于归一化步长，迭代动态窗口整体遍历次数和动态窗口单次遍历神经网络模型过程中神经网络的训练次数，得到整周期窗口中心列表；S15. Based on the normalized step size, iterate the number of overall traversals of the dynamic window and the number of training times of the neural network in the process of a single traversal of the dynamic window through the neural network model, and obtain a list of the entire cycle window centers;

S16、判断剩余神经网络训练次数是否为零，若是则归一化整周期窗口中心列表作为归一化窗口中心列表，否则进入步骤S17；S16. Determine whether the remaining number of neural network training times is zero. If so, normalize the entire period window center list as the normalized window center list. Otherwise, proceed to step S17;

S17、基于动态窗口遍历时移动的归一化步长，迭代剩余神经网络的训练次数，得到剩余窗口中心列表；S17. Based on the normalized step size moved during dynamic window traversal, iterate the training times of the remaining neural network to obtain the remaining window center list;

S18、将整周期窗口中心列表末尾添加剩余窗口中心列表，作为归一化窗口中心列表。S18. Add the remaining window center list to the end of the entire period window center list as the normalized window center list.

采用上述进一步方案的有益效果为：根据神经网络模型收敛特性，建立动态窗口，在训练初期主要传递模型底部的网络层，随着训练周期的增加，传递重点层也随之上移，每一层梯度的重要性并不是一样的，为分布式训练的层稀疏化提供基础，且通过多次遍历有效防止了部分层选中概率过低。The beneficial effects of adopting the above further solution are: according to the convergence characteristics of the neural network model, a dynamic window is established. In the early stage of training, the network layer at the bottom of the model is mainly transmitted. As the training cycle increases, the key layers of transmission also move up. Each layer The importance of gradients is not the same. It provides the basis for layer sparseness in distributed training and effectively prevents the selection probability of some layers from being too low through multiple traversals.

进一步地，所述步骤S2的具体步骤如下：Further, the specific steps of step S2 are as follows:

S21、获取神经网络的热身周期和归一化窗口中心列表；S21. Obtain the warm-up period and normalized window center list of the neural network;

S22、根据层稀疏化方法，归一化神经网络中网络层的序号，得到归一化层序号列表；S22. According to the layer sparsification method, normalize the sequence numbers of the network layers in the neural network to obtain a list of normalized layer sequence numbers;

S23、针对当前训练周期处于热身周期，则传输神经网络所有层的所有参数，否则进入步骤S24；S23. As the current training period is in the warm-up period, transmit all parameters of all layers of the neural network, otherwise proceed to step S24;

S24、根据归一化窗口中心列表和归一化层序号列表，得到动态窗口列表和当前训练周期窗口内采样列表；S24. According to the normalized window center list and the normalized layer serial number list, obtain the dynamic window list and the sampling list within the current training cycle window;

S25、根据归一化层序号列表和动态窗口列表，得到当前训练周期窗口外采样列表；S25. According to the normalized layer serial number list and the dynamic window list, obtain the sampling list outside the window of the current training cycle;

S26、合并当前训练周期窗口内采样列表和当前训练周期窗口外采样列表，得到层传输列表。S26. Merge the sampling list within the current training cycle window and the sampling list outside the current training cycle window to obtain a layer transmission list.

采用上述进一步方案的有益效果为：在神经网络训练初期，动态窗口在模型最底层，随着神经网络训练进行，动态窗口逐渐向前滑动至顶层，通过设置动态窗口内部网络层采样预设比例和动态窗口外部网络层采样预设比例，降低传输数据量，且当模型足够深时，优势明显。The beneficial effects of adopting the above further solution are: in the early stage of neural network training, the dynamic window is at the bottom of the model. As the neural network training progresses, the dynamic window gradually slides forward to the top layer. By setting the preset ratio and sampling ratio of the network layer inside the dynamic window, The preset proportion of network layer sampling outside the dynamic window reduces the amount of transmitted data, and when the model is deep enough, the advantages are obvious.

进一步地，所述步骤S24包括如下步骤：Further, the step S24 includes the following steps:

S241、根据归一化窗口中心列表，获取当前训练周期的窗口中心；S241. Obtain the window center of the current training cycle according to the normalized window center list;

S242、将当前训练周期的窗口中心和归一化层序号列表分别作为期望和自变量列表，并计算得到标准正态分布列表；S242. Use the window center and the normalization layer sequence number list of the current training cycle as the expectation and independent variable lists respectively, and calculate the standard normal distribution list;

S243、选取标准正态分布列表头部预设量对应的网络层的序号，得到动态窗口列表；S243. Select the sequence number of the network layer corresponding to the preset amount at the head of the standard normal distribution list to obtain a dynamic window list;

S244、通过对动态窗口列表随机均匀采样预设比例k_in，得到当前训练周期窗口内采样列表。S244. Obtain the sampling list within the current training cycle window by randomly and uniformly sampling the preset proportion k _in of the dynamic window list.

采用上述进一步方案的有益效果为：通过对动态窗口列表进行随机均匀采样，有效降低传输数据量，且当模型足够深时，优势明显。The beneficial effects of adopting the above further solution are: by randomly and uniformly sampling the dynamic window list, the amount of transmitted data is effectively reduced, and when the model is deep enough, the advantages are obvious.

进一步地，所述步骤S25包括如下步骤：Further, the step S25 includes the following steps:

S251、根据归一化层序号列表和动态窗口列表，得到动态窗口外部列表；S251. Obtain the dynamic window external list according to the normalized layer serial number list and the dynamic window list;

S252、通过对动态窗口外部列表随机均匀采样预设比例k_out，得到当前训练周期窗口外采样列表。S252. Obtain the sampling list outside the window of the current training cycle by randomly and uniformly sampling the preset proportion k _out of the dynamic window external list.

采用上述进一步方案的有益效果为：通过对动态窗口列表外部的神经网络网络层进行随机均匀采样，有效降低传输数据量，且当模型足够深时，优势明显。The beneficial effect of adopting the above further solution is that by randomly and uniformly sampling the neural network layer outside the dynamic window list, the amount of transmitted data is effectively reduced, and when the model is deep enough, the advantages are obvious.

进一步地，所述步骤S3包括如下步骤：Further, the step S3 includes the following steps:

S31、通过神经网络样本前馈和反馈计算，得到神经网络层梯度列表；S31. Obtain the neural network layer gradient list through neural network sample feedforward and feedback calculations;

S32、逐层遍历神经网络层梯度列表中的各层，并判断各层是否在层传输列表中，若是则得到若干选中层，并进入步骤S33，否则得到若干本地累积梯度；S32. Traverse each layer in the neural network layer gradient list layer by layer, and determine whether each layer is in the layer transmission list. If so, obtain several selected layers and enter step S33. Otherwise, obtain several local accumulated gradients;

S33、判断各选中层是否有层内压缩，若是则得到若干有层内压缩的选中层，并进入步骤S34，否则得到若干无层内压缩的选中层传输梯度；S33. Determine whether each selected layer has intra-layer compression. If so, obtain a number of selected layers with intra-layer compression and enter step S34. Otherwise, obtain a number of selected layer transmission gradients without intra-layer compression;

S34、将各有层内压的缩选中层内部依次进行层内稀疏化、节点间通信和解压同步，得到若干有层内压缩的选中层传输梯度；S34. Perform intra-layer sparsification, inter-node communication and decompression synchronization in sequence within each compressed selected layer with intra-layer compression, and obtain a number of selected layer transmission gradients with intra-layer compression;

S35、将各本地累积梯度、各无层内压缩的选中层传输梯度或各有层内压缩的选中层传输梯度进行全局平均，得到完整梯度；S35. Globally average each local accumulated gradient, each selected layer transmission gradient without intra-layer compression, or each selected layer transmission gradient with intra-layer compression, to obtain a complete gradient;

S36、根据完整梯度，得到权重更新参数，完成基于层稀疏化的分布式深度学习训练。S36. According to the complete gradient, obtain the weight update parameters and complete the distributed deep learning training based on layer sparsification.

采用上述进一步方案的有益效果为：根据判断神经网络中各层是否在层传输列表中以及是否有层内压缩的结果，分别通过本地积累，直接传输和节点间通信获取传输梯度，并通过全局平均实现梯度融合得到完整梯度，再根据完整梯度，得到权重更新参数，完成基于层稀疏化的分布式深度学习训练。The beneficial effects of adopting the above further scheme are: according to the results of judging whether each layer in the neural network is in the layer transmission list and whether there is intra-layer compression, the transmission gradient is obtained through local accumulation, direct transmission and inter-node communication respectively, and through global average Implement gradient fusion to obtain complete gradients, and then obtain weight update parameters based on the complete gradients to complete distributed deep learning training based on layer sparsification.

本发明还提供一种基于层稀疏化的分布式深度学习训练方法的系统，包括：The present invention also provides a system for a distributed deep learning training method based on layer sparsification, including:

归一化窗口中心列表获取模块，用于根据神经网络模型收敛特性，得到归一化窗口中心列表；The normalized window center list acquisition module is used to obtain the normalized window center list based on the convergence characteristics of the neural network model;

层传输列表获取模块，用于利用层稀疏化方法和归一化窗口中心列表，得到层传输列表；The layer transmission list acquisition module is used to obtain the layer transmission list using the layer sparsification method and the normalized window center list;

基于层稀疏化的分布式深度学习训练模块，根据层传输列表进行基于层稀疏化的分布式深度学习训练，得到权重更新参数，完成基于层稀疏化的分布式深度学习训练。The distributed deep learning training module based on layer sparsification performs distributed deep learning training based on layer sparsification according to the layer transmission list, obtains weight update parameters, and completes distributed deep learning training based on layer sparsification.

本发明的有益效果为：本发明提供的基于层稀疏化的分布式深度学习训练方法的系统为本发明提供的基于层稀疏化的分布式深度学习训练方法对应设置的系统，用于实现基于层稀疏化的分布式深度学习训练方法。The beneficial effects of the present invention are: the system of the distributed deep learning training method based on layer sparsification provided by the present invention is a system corresponding to the distributed deep learning training method based on layer sparse provided by the present invention, and is used to implement the distributed deep learning training method based on layer sparse. Sparse distributed deep learning training method.

附图说明Description of drawings

图1为本发明实施例中基于层稀疏化的分布式深度学习训练方法的步骤流程图。Figure 1 is a flow chart of the steps of a distributed deep learning training method based on layer sparsification in an embodiment of the present invention.

图2为本发明实施例中窗口中心随训练周期移动的示意图。Figure 2 is a schematic diagram of the window center moving with the training cycle in the embodiment of the present invention.

图3为本发明实施例中动态窗口移动的示意图。Figure 3 is a schematic diagram of dynamic window movement in an embodiment of the present invention.

图4为本发明实施例中根据标准正态分布列表得到动态窗口列表的示意图。Figure 4 is a schematic diagram of obtaining a dynamic window list based on a standard normal distribution list in an embodiment of the present invention.

图5为本发明实施例中all_reduce循环传输方法的示意图。Figure 5 is a schematic diagram of the all_reduce loop transmission method in the embodiment of the present invention.

图6为本发明实施例中Resnet110模型分别采用DGC和LS-DGC框架在Cifar10和Cifar100数据集训练的耗时情况的示意图。Figure 6 is a schematic diagram of the time-consuming situation of training the Resnet110 model on the Cifar10 and Cifar100 data sets using the DGC and LS-DGC frameworks respectively in the embodiment of the present invention.

图7为本发明实施例中Resnet18模型分别采用SSGD和LS-SSGD框架在Cifar10数据集训练的耗时情况以及Resnet50模型分别采用SSGD和LS-SSGD框架在Cifar100数据集训练的耗时情况的示意图。Figure 7 is a schematic diagram of the time-consuming situation of training the Resnet18 model in the Cifar10 data set using the SSGD and LS-SSGD frameworks respectively, and the time-consuming situation of the Resnet50 model using the SSGD and LS-SSGD frameworks respectively in training the Cifar100 data set in the embodiment of the present invention.

图8为本发明实施例中Resnet18模型分别采用SSGD和LS-SSGD框架在Cifar10上收敛情况以及Resnet50模型分别采用SSGD和LS-SSGD框架在Cifar100上收敛情况的示意图。Figure 8 is a schematic diagram of the convergence of the Resnet18 model on Cifar10 using SSGD and LS-SSGD frameworks respectively, and the convergence of the Resnet50 model on Cifar100 using SSGD and LS-SSGD frameworks respectively in the embodiment of the present invention.

图9为本发明实施例中基于层稀疏化的分布式深度学习训练方法的系统的框图。Figure 9 is a block diagram of a system of a distributed deep learning training method based on layer sparsification in an embodiment of the present invention.

具体实施方式Detailed ways

下面对本发明的具体实施方式进行描述，以便于本技术领域的技术人员理解本发明，但应该清楚，本发明不限于具体实施方式的范围，对本技术领域的普通技术人员来讲，只要各种变化在所附的权利要求限定和确定的本发明的精神和范围内，这些变化是显而易见的，一切利用本发明构思的发明创造均在保护之列。The specific embodiments of the present invention are described below to facilitate those skilled in the art to understand the present invention. However, it should be clear that the present invention is not limited to the scope of the specific embodiments. For those of ordinary skill in the technical field, as long as various changes These changes are obvious within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations utilizing the concept of the invention are protected.

根据深度表征学习动力学得出的神经网络模型收敛特性结论，本方案提出了基于层稀疏化的分布式深度学习训练方法及系统。神经网络在训练的过程中，不同层的重点学习时机是不同的，这为本方案从网络层的角度对梯度进行稀疏化提供了可行性支撑，通过对当前最需要学习的网络层进行通信同步，而其他层留在本地进行累积，考虑在训练初期主要传递模型底部的网络层，随着训练周期的增加，传递重点层也随之上移，为分布式训练的层稀疏化也提供了支撑，本发明因此提供一种有效降低通信量的方法及系统。Based on the conclusions on the convergence characteristics of the neural network model derived from the dynamics of deep representation learning, this solution proposes a distributed deep learning training method and system based on layer sparsification. During the training process of the neural network, the key learning timing of different layers is different. This provides feasible support for this solution to sparse the gradient from the perspective of the network layer, and communicates and synchronizes the network layer that currently needs to learn most. , while other layers remain local for accumulation. Consider that the network layer at the bottom of the model is mainly transferred in the early stage of training. As the training cycle increases, the transfer focus layer also moves up, which also provides support for the layer sparseness of distributed training. , the present invention therefore provides a method and system for effectively reducing communication traffic.

实施例1Example 1

如图1所示，在本发明的一个实施例中，本发明提供一种基于层稀疏化的分布式深度学习训练方法，包括如下步骤：As shown in Figure 1, in one embodiment of the present invention, the present invention provides a distributed deep learning training method based on layer sparsification, which includes the following steps:

根据神经网络模型收敛特性，建立动态窗口，在训练初期主要传递模型底部的网络层，随着训练周期的增加，传递重点层也随之上移，每一层梯度的重要性并不是一样的，为分布式训练的层稀疏化提供基础；According to the convergence characteristics of the neural network model, a dynamic window is established. In the early stage of training, the network layer at the bottom of the model is mainly transmitted. As the training cycle increases, the key layer of transmission also moves up. The importance of the gradient of each layer is not the same. Provides a basis for layer sparsification in distributed training;

在神经网络训练初期，动态窗口在模型最底层，随着神经网络训练进行，动态窗口逐渐向前滑动至顶层，通过设置动态窗口内部网络层采样预设比例和动态窗口外部网络层采样预设比例，降低传输数据量，且当模型足够深时，优势明显；In the early stage of neural network training, the dynamic window is at the bottom of the model. As the neural network training progresses, the dynamic window gradually slides forward to the top layer. By setting the preset sampling ratio of the network layer inside the dynamic window and the preset sampling ratio of the network layer outside the dynamic window , reducing the amount of data to be transmitted, and when the model is deep enough, the advantages are obvious;

根据判断神经网络中各层是否在层传输列表中以及是否有层内压缩的结果，分别通过本地积累，直接传输和节点间通信获取传输梯度，并通过全局平均实现梯度融合得到完整梯度，再根据完整梯度，得到权重更新参数，完成基于层稀疏化的分布式深度学习训练。According to the results of judging whether each layer in the neural network is in the layer transmission list and whether there is intra-layer compression, the transmission gradient is obtained through local accumulation, direct transmission and inter-node communication, and the gradient fusion is achieved through global averaging to obtain the complete gradient. Complete gradient, obtain weight update parameters, and complete distributed deep learning training based on layer sparsification.

实施例2Example 2

针对实施例1的步骤S1，其包括以下分步骤S11～S18：Regarding step S1 of Embodiment 1, it includes the following sub-steps S11 to S18:

S18、将整周期窗口中心列表末尾添加剩余窗口中心列表，作为归一化窗口中心列表；S18. Add the remaining window center list to the end of the entire period window center list as the normalized window center list;

如图2所示，本实施例中设置神经网络模型有300层，训练了103个周期epoch，同时动态窗口对模型进行了4次整体遍历和1次部分遍历(103/4的余数)；则总训练次数为103次，整体遍历次数为4次，则单次遍历神经网络过程中神经网络训练次数为25次，剩余神经网络训练次数为3次，动态窗口移动的归一化步长为1/24(从0开始计数)；考虑到不是所有时候都是刚好整数次遍历，因此对存在剩余神经网络训练次数时进行相似的迭代产生剩余窗口中心，再通过尾部添加得到归一化窗口中心列表。As shown in Figure 2, in this embodiment, the neural network model is set to have 300 layers and is trained for 103 epochs. At the same time, the dynamic window performs 4 overall traversals and 1 partial traversal of the model (the remainder of 103/4); then The total number of training times is 103 times, and the overall number of traversals is 4 times. Then the number of neural network training times in a single traversal of the neural network is 25 times, the remaining neural network training times are 3 times, and the normalized step size of dynamic window movement is 1 /24 (counting from 0); considering that not all times are exactly integer traversals, similar iterations are performed when there are remaining neural network training times to generate the remaining window centers, and then the normalized window center list is obtained by adding the tail .

实施例3Example 3

针对实施例1中的步骤S2，其包括以下分步骤S21～S26：Regarding step S2 in Embodiment 1, it includes the following sub-steps S21 to S26:

通常设置热身周期为神经网络训练的前5个epoch，且在热身周期内不进行稀疏化防止神经网络走向错误方向；Usually the warm-up period is set to the first 5 epochs of neural network training, and sparsification is not performed during the warm-up period to prevent the neural network from going in the wrong direction;

所述步骤S24包括如下步骤：The step S24 includes the following steps:

S244、通过对动态窗口列表随机均匀采样预设比例k_in，得到当前训练周期窗口内采样列表；S244. Obtain the sampling list within the current training cycle window by randomly and uniformly sampling the preset proportion k _in of the dynamic window list;

所述步骤S25包括如下步骤：The step S25 includes the following steps:

S252、通过对动态窗口外部列表随机均匀采样预设比例k_out，得到当前训练周期窗口外采样列表；S252. Obtain the sampling list outside the window of the current training cycle by randomly and uniformly sampling the preset proportion k _out of the dynamic window external list;

S26、合并当前训练周期窗口内采样列表和当前训练周期窗口外采样列表，得到层传输列表；S26. Merge the sampling list within the current training cycle window and the sampling list outside the current training cycle window to obtain the layer transmission list;

如图3所示，在本实施例中设置神经网络有20层，在训练初期，窗口在模型最底层，随着训练进行，窗口逐渐向前滑动至顶层，且设置动态窗口列表随机均匀采样预设比例k_in＝50％，而动态窗口外部列表随机均匀采样预设比例k_out＝12.5％，则整体而言模型压缩率为k＝20％，进一步降低了传输数据量，当模型足够深的时候，该方法的优势是明显的；在热身周期内不进行稀疏化防止模型走向错误方向；根据归一化窗口中心列表，获取当前训练周期的窗口中心；将当前训练周期的窗口中心和归一化层序号列表分别作为期望和自变量列表，并计算得到标准正态分布列表；再然后选择标准正态分布列表中的top-20％对应的层序号作为动态窗口列表；As shown in Figure 3, in this embodiment, the neural network is set to have 20 layers. In the early stage of training, the window is at the bottom of the model. As the training progresses, the window gradually slides forward to the top layer, and a dynamic window list is set to randomly and uniformly sample pre-sets. Suppose the proportion k _in =50%, and the dynamic window external list random uniform sampling preset proportion k _out =12.5%, then the overall model compression rate k =20%, further reducing the amount of transmitted data. When the model is deep enough At this time, the advantages of this method are obvious; no sparsification is performed during the warm-up period to prevent the model from going in the wrong direction; according to the normalized window center list, the window center of the current training period is obtained; the sum of the window centers of the current training period is normalized The layer serial number list is used as the expectation and independent variable list respectively, and the standard normal distribution list is calculated; then the layer serial number corresponding to the top-20% in the standard normal distribution list is selected as the dynamic window list;

如图4所示，标准正态分布列表的顶点即为窗口中心，只截取top-20％，所对应的横坐标即为当前窗口中心对应的动态窗口列表，即网络层序列，在动态窗口列表中进行随机均匀采样，得到当前训练周期窗口内采样列表；去除top-20％的动态窗口并进行随机均匀采样，得到当前训练周期窗口外采样列表，在合并当前训练周期窗口内采样列表和当前训练周期窗口外采样列表，得到层传输列表。As shown in Figure 4, the vertex of the standard normal distribution list is the window center, and only the top-20% is intercepted. The corresponding abscissa is the dynamic window list corresponding to the current window center, that is, the network layer sequence. In the dynamic window list Perform random uniform sampling in the current training cycle window to obtain the sampling list within the current training cycle window; remove the top-20% dynamic window and perform random uniform sampling to obtain the sampling list outside the current training cycle window, and merge the sampling list within the current training cycle window with the current training Sample the list outside the period window to obtain the layer transmission list.

实施例4Example 4

针对实施例1中的步骤S3，其包括以下分步骤S31～S36：Regarding step S3 in Embodiment 1, it includes the following sub-steps S31 to S36:

S36、根据完整梯度，得到权重更新参数，完成基于层稀疏化的分布式深度学习训练；S36. According to the complete gradient, obtain the weight update parameters and complete the distributed deep learning training based on layer sparsification;

在本实施例中，当选中层有层内压缩时，在层内稀疏化和节点间通信之前通过上述基于层稀疏化的分布式深度学习训练方法中步骤S2中相同方法进行层稀疏化，选择要进行通信的梯度层，以减少不进行节点间通信的神经网络层的稀疏化开销。在热身周期中，不进行层稀疏化即传输所有层，以防止训练走向错误方向，待层内稀疏化程度稳定后，再调用层稀疏化策略来减少节点间通讯量，且如图5所示，各层内压缩选中层内部进行节点间通信时，采用all_reduce循环传输方法，先进行节点node间的循环传输，在进行4次轮传输后，每个结点就具备了其他结点的梯度信息；再进行平均，得到平均梯度，通过解压后得到若干有层内压缩选中层传输梯度。In this embodiment, when the selected layer has intra-layer compression, layer sparsification is performed by the same method in step S2 of the distributed deep learning training method based on layer sparsification before intra-layer sparsification and inter-node communication, and the desired layer is selected. Communicating gradient layers to reduce the sparsification overhead of neural network layers that do not communicate between nodes. During the warm-up period, all layers are transmitted without layer sparsification to prevent the training from going in the wrong direction. After the degree of sparsification within the layer is stable, the layer sparsification strategy is called to reduce the communication volume between nodes, as shown in Figure 5 , When communicating between nodes within the selected layer for compression within each layer, the all_reduce cyclic transmission method is used. First, cyclic transmission between nodes is performed. After 4 rounds of transmission, each node has the gradient information of other nodes. ; Then perform averaging to obtain the average gradient. After decompression, a number of intra-layer compression selected layer transmission gradients are obtained.

实施例5Example 5

在本发明的一个实用实例中，本方案在两个图像分类数据集上进行实验，包括简单的Cifair10数据集和复杂的Cifair100数据集，来证明本方案提出的基于层稀疏化的分布式深度学习训练方法的有效性；其中Cifar10由10个类的50000个训练图像和10000个验证图像组成，而Cifar100包含100个类别，每个类别有500个训练图像和100个测试图像。所有实验都基于Pytorch分布式训练框架，并使用在一台装有4块GeForce RTX 3090图形卡的机器上运行实验。In a practical example of the present invention, this solution was tested on two image classification data sets, including the simple Cifair10 data set and the complex Cifair100 data set, to prove the distributed deep learning based on layer sparsification proposed by this solution. Effectiveness of the training method; where Cifar10 consists of 50,000 training images and 10,000 validation images of 10 classes, while Cifar100 contains 100 categories, each category has 500 training images and 100 test images. All experiments are based on the Pytorch distributed training framework and are run on a machine equipped with four GeForce RTX 3090 graphics cards.

首先进行基于深度梯度压缩DGC的层稀疏化分布式框架(LS-DGC)训练实验，在164个训练周期中仍然设置预热期为4个周期。此外，由于DGC已经在层内进行了高达99％的压缩，所以在进行层稀疏化的时候不宜过低；本方案采用模型序列的20％作为滑动窗口大小，并将窗口内层全部选中k_in＝100％，在窗口外部设置k_out＝20％的层以防止外部梯度过期，则基于层稀疏化的DGC算法通信量降低为原有算法的36％左右；First, a layer sparse distributed framework (LS-DGC) training experiment based on deep gradient compression DGC was conducted, and the warm-up period was still set to 4 cycles in 164 training cycles. In addition, since DGC has already performed up to 99% compression within the layer, it should not be too low when performing layer sparsification; this solution uses 20% of the model sequence as the sliding window size, and selects all inner layers of the window k _in =100%, setting k _out =20% of the layer outside the window to prevent external gradient expiration, then the communication volume of the DGC algorithm based on layer sparsification is reduced to about 36% of the original algorithm;

如图6所示，在使用Resnet110模型进行的两个训练上，LS-DGC框架均低于DGC框架，由于DGC已经做过较大程度的层内压缩，所以时间已经大幅缩短，而在结合基于层稀疏化算法后，时间消耗进一步得到改善；此外，在周期内时间占比方面本方案对比分析两个数据集上，两种方法在整周期、压缩与通信、解压与同步等阶段的时间变化，其结果如表1所示：As shown in Figure 6, in the two trainings using the Resnet110 model, the LS-DGC framework is lower than the DGC framework. Since DGC has done a large degree of intra-layer compression, the time has been greatly shortened, and when combined with the DGC-based After layer sparsification algorithm, the time consumption is further improved; in addition, in terms of time proportion within the cycle, this scheme comparatively analyzes two data sets, and the time changes of the two methods in the whole cycle, compression and communication, decompression and synchronization stages. , the results are shown in Table 1:

表1Table 1

根据表1得到，在耗时方面，压缩通信与解压同步的平均耗时都大幅下降，在周期内占比也下降一半左右，进一步缓解了节点间通信瓶颈问题。此外，由于DGC是基于层进行流水线通信传递的，所以当LS-DGC进行层稀疏化后，部分层不再进行通信，从而节点间通信频率也大幅降低；According to Table 1, in terms of time consumption, the average time consumption of compression communication and decompression synchronization has dropped significantly, and the proportion within the cycle has also dropped by about half, further alleviating the bottleneck problem of communication between nodes. In addition, since DGC is based on layer-based pipeline communication, when LS-DGC performs layer sparsification, some layers no longer communicate, and the frequency of communication between nodes is also greatly reduced;

相比于基线DGC框架，LS-DGC进一步降低了节点间的通讯量以缓解通信瓶颈，这必然会一定程度上影响模型收敛速度，在保持训练周期数量不变的情况下LS-DGC准确度略有降低，是由于模型还未收敛造成的，适当增加训练周期数量(总耗时不变)，待模型充分收敛，其准确度进一步提升，甚至超越基线方法结果，如表2所示：Compared with the baseline DGC framework, LS-DGC further reduces the communication volume between nodes to alleviate communication bottlenecks, which will inevitably affect the model convergence speed to a certain extent. The accuracy of LS-DGC is slightly lower while keeping the number of training cycles unchanged. The decrease is due to the fact that the model has not converged yet. Increase the number of training cycles appropriately (the total time consuming remains unchanged). When the model fully converges, its accuracy will be further improved and even surpass the results of the baseline method, as shown in Table 2:

表2Table 2

方法method Cifar10准确度Cifar10 accuracy Cifar100准确度Cifar100 accuracy DGC(基线)DGC(baseline) 93.55％93.55% 72.04％72.04% LS-DGCLS-DGC 93.08％93.08% 71.74％71.74% LS-DGC(more)LS-DGC(more) 94.15％(↑)94.15%(↑) 72.55％(↑)72.55%(↑)

其次进行基于随机快速下降法SSGD的层稀疏化分布式框架(LS-SSGD)训练实验，为了验证层稀疏化的普适性，本发明在无层内压缩的SSGD上也进行了实验，在无层内压缩的SSGD上，我们将窗口内层选择概率降低至k_in＝50％，同时将窗口外部概率降低至k_out＝10％，从而SSGD的整体通信量降低为原有算法的18％左右；由于层内无压缩，梯度损失较小，采用Resnet18训练Cifar10数据集，采用Resnet50训练Cifar100数据集也能很好的收敛；Secondly, a layer sparsification distributed framework (LS-SSGD) training experiment based on the random fast descent method SSGD is performed. In order to verify the universality of layer sparsification, the present invention also conducts experiments on SSGD without intra-layer compression. On SSGD with intra-layer compression, we reduce the layer selection probability within the window to k _in =50%, and at the same time reduce the probability outside the window to k _out =10%, thus reducing the overall communication volume of SSGD to about 18% of the original algorithm. ; Since there is no compression within the layer and the gradient loss is small, using Resnet18 to train the Cifar10 data set and Resnet50 to train the Cifar100 data set can also converge well;

如图7所示，在时间方面，同样，LS-SSGD框架仍然比SSGD的训练耗时低，且其阶段内压缩占比通信同步和整周期耗时均降低，其中通信同步耗时降低50％以上，如表3所示：As shown in Figure 7, in terms of time, similarly, the LS-SSGD framework is still less time-consuming than the training of SSGD, and the compression ratio within the stage, communication synchronization and the entire cycle time are both reduced, of which the communication synchronization time is reduced by 50%. Above, as shown in Table 3:

表3table 3

如图8所示，在准确度方面，相比于基线SSGD结果，LS-SSGD在相同训练周期时就能够超过SSGD的结果，在整个训练过程中LS-SSGD框架均优于SSGD框架，本方案认为这是因为在无层内压缩的情况下，基于层稀疏化能够起到类似于Dropout的作用，提高了模型的泛化能力，从而达到更优的性能。As shown in Figure 8, in terms of accuracy, compared with the baseline SSGD results, LS-SSGD can surpass the SSGD results in the same training cycle. The LS-SSGD framework is better than the SSGD framework throughout the training process. This solution It is believed that this is because in the absence of intra-layer compression, layer-based sparsification can play a role similar to Dropout, improving the generalization ability of the model, thereby achieving better performance.

本方案的有益效果为：将神经网络模型收敛特性应用于分布式训练，提出了一种神经网络的层稀疏化的分布式训练框架，解决了以往训练框架仅在网络层内部稀疏化的问题，进一步提升稀疏化程度；结合现有的层内深度稀疏化框架DGC和无层内稀疏化框架SSGD，通过实验实现了本方案提出的基于层稀疏化的分布式深度学习框架LS-DGC和LS-SSGD；在多个分类模型和多个图像数据集上进行实验，从整体耗时、通信量、通信占比、精确度等方面分析对比，充分证明了我们方法的有效性和先进性。The beneficial effects of this solution are: applying the convergence characteristics of the neural network model to distributed training, and proposing a distributed training framework for layer sparsification of the neural network, which solves the problem of previous training frameworks being only sparse within the network layer. Further improve the degree of sparsification; combined with the existing intra-layer deep sparsification framework DGC and no intra-layer sparsification framework SSGD, the distributed deep learning frameworks LS-DGC and LS- based on layer sparsification proposed in this scheme were implemented through experiments. SSGD; experiments were conducted on multiple classification models and multiple image data sets, and analysis and comparison were conducted in terms of overall time-consuming, communication volume, communication proportion, accuracy, etc., which fully proved the effectiveness and advancement of our method.

实施例6Example 6

如图9所示，本方案还提供一种基于层稀疏化的分布式深度学习训练方法的系统，包括：As shown in Figure 9, this solution also provides a system for distributed deep learning training methods based on layer sparsification, including:

实施例提供的基于层稀疏化的分布式深度学习训练方法的系统可以执行上述方法实施例基于层稀疏化的分布式深度学习训练方法所示的技术方案，其实现原理与有益效果类似，此处不再赘述。The system of the distributed deep learning training method based on layer sparseness provided by the embodiment can execute the technical solution shown in the above method embodiment based on the distributed deep learning training method based on layer sparseness. Its implementation principle and beneficial effects are similar. Here No longer.

本发明实施例中，本申请可以根据基于层稀疏化的分布式深度学习训练方法进行功能单元的划分，例如可以将各个功能划分为各个功能单元，也可以将两个或两个以上的功能集成在一个处理单元中。上述集成单元即可以采用硬件的形式来实现，也可以采用软件功能单元的形式来实现。需要说明的是，本发明中对单元的划分是示意性的，仅仅为一种逻辑划分，实际实现时可以有另外的划分方式。In the embodiment of the present invention, the application can divide functional units according to the distributed deep learning training method based on layer sparsification. For example, each function can be divided into functional units, or two or more functions can be integrated. in a processing unit. The above-mentioned integrated unit can be implemented in the form of hardware or in the form of software functional unit. It should be noted that the division of units in the present invention is schematic and is only a logical division. In actual implementation, there may be other division methods.

本发明实施例中，基于层稀疏化的分布式深度学习训练方法的系统为了基于层稀疏化的分布式深度学习训练方法的原理与有益效果，其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到，结合本发明所公开的实施例描述的各示意单元及算法步骤，本发明能够以硬件和/或硬件和计算机软件结合的形式来实现，某个功能以硬件还是计算机软件驱动的方式来执行，取决于技术方案的特定应用和设计约束条件，可以对每个特定的应用来使用不同的方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。In the embodiment of the present invention, the system of the distributed deep learning training method based on layer sparsification includes the hardware structure and/or corresponding hardware structure for executing each function in order to realize the principles and beneficial effects of the distributed deep learning training method based on layer sparsification. Software modules. Those skilled in the art should easily realize that, in conjunction with the schematic units and algorithm steps described in the disclosed embodiments of the present invention, the present invention can be implemented in the form of hardware and/or a combination of hardware and computer software. A certain function can be implemented in hardware. Whether it is executed in a computer software-driven manner depends on the specific application and design constraints of the technical solution. Different methods can be used to implement the described functions for each specific application, but this implementation should not be considered beyond the scope of this application. range.

Claims

1. A distributed deep learning training method based on layer sparsification, which is characterized by including the following steps:

S1. According to the convergence characteristics of the neural network model, obtain the normalized window center list;

S2. Use the layer sparsification method and the normalized window center list to obtain the layer transmission list;

The specific steps of step S2 are as follows:

S21. Obtain the warm-up period and normalized window center list of the neural network;

S22. According to the layer sparsification method, normalize the sequence numbers of the network layers in the neural network to obtain a list of normalized layer sequence numbers;

S23. As the current training period is in the warm-up period, transmit all parameters of all layers of the neural network, otherwise proceed to step S24;

S24. According to the normalized window center list and the normalized layer serial number list, obtain the dynamic window list and the sampling list within the current training cycle window;

S25. According to the normalized layer serial number list and the dynamic window list, obtain the sampling list outside the window of the current training cycle;

S26. Merge the sampling list within the current training cycle window and the sampling list outside the current training cycle window to obtain the layer transmission list;

S3. Perform distributed deep learning training based on layer sparsification according to the layer transmission list, obtain weight update parameters, and complete distributed deep learning training based on layer sparsification;

The step S3 includes the following steps:

S31. Obtain the neural network layer gradient list through neural network sample feedforward and feedback calculations;

S32. Traverse each layer in the neural network layer gradient list layer by layer, and determine whether each layer is in the layer transmission list. If so, obtain several selected layers and enter step S33. Otherwise, obtain several local accumulated gradients;

S33. Determine whether each selected layer has intra-layer compression. If so, obtain a number of selected layers with intra-layer compression and enter step S34. Otherwise, obtain a number of selected layer transmission gradients without intra-layer compression;

S34. Perform intra-layer sparsification, inter-node communication and decompression synchronization in sequence within each compressed selected layer with intra-layer compression, and obtain a number of selected layer transmission gradients with intra-layer compression;

S35. Globally average each local accumulated gradient, each selected layer transmission gradient without intra-layer compression, or each selected layer transmission gradient with intra-layer compression, to obtain a complete gradient;

S36. According to the complete gradient, obtain the weight update parameters and complete the distributed deep learning training based on layer sparsification.

2. The distributed deep learning training method based on layer sparsification according to claim 1, characterized in that the step S1 includes the following steps:

S11. Set all layers of the neural network as a continuous sequence of layers, and set the total number of training times of the neural network;

S12. Set a dynamic window according to the convergence characteristics of the neural network model. The dynamic window traverses the continuous sequence of layers from back to front as the number of neural network training times increases, and sets the overall number of traversals of the dynamic window;

S13. Based on the total number of training times of the neural network and the number of overall traversals of the dynamic window, calculate the number of training times of the neural network and the remaining number of neural network training times during a single traversal of the neural network in the dynamic window;

S14. According to the number of training times of the neural network during a single traversal of the neural network in the dynamic window, the normalized step size of the movement during the dynamic window traversal is obtained;

S15. Based on the normalized step size, iterate the number of overall traversals of the dynamic window and the number of training times of the neural network in the process of a single traversal of the dynamic window through the neural network model, and obtain a list of the entire cycle window centers;

S16. Determine whether the remaining number of neural network training times is zero. If so, normalize the entire period window center list as the normalized window center list. Otherwise, proceed to step S17;

S17. Based on the normalized step size moved during dynamic window traversal, iterate the training times of the remaining neural network to obtain the remaining window center list;

S18. Add the remaining window center list to the end of the entire period window center list as the normalized window center list.

3. The distributed deep learning training method based on layer sparsification according to claim 1, characterized in that the step S24 includes the following steps:

S241. Obtain the window center of the current training cycle according to the normalized window center list;

S242. Use the window center and the normalization layer sequence number list of the current training cycle as the expectation and independent variable lists respectively, and calculate the standard normal distribution list;

S243. Select the sequence number of the network layer corresponding to the preset amount at the head of the standard normal distribution list to obtain a dynamic window list;

S244. Obtain the sampling list within the current training cycle window by randomly and uniformly sampling the preset proportion k _in of the dynamic window list.

4. The distributed deep learning training method based on layer sparsification according to claim 1, characterized in that the step S25 includes the following steps:

S251. Obtain the dynamic window external list according to the normalized layer serial number list and the dynamic window list;

S252. Obtain the sampling list outside the window of the current training cycle by randomly and uniformly sampling the preset proportion k _out of the dynamic window external list.

5. A system of distributed deep learning training method based on layer sparsification, which is characterized by including:

The normalized window center list acquisition module is used to obtain the normalized window center list based on the convergence characteristics of the neural network model;

The layer transmission list acquisition module is used to obtain the layer transmission list using the layer sparsification method and the normalized window center list, which is specifically:

A1. Obtain the warm-up period and normalized window center list of the neural network;

A2. According to the layer sparsification method, normalize the sequence numbers of the network layers in the neural network to obtain a list of normalized layer sequence numbers;

A3. As the current training period is in the warm-up period, transmit all parameters of all layers of the neural network, otherwise proceed to step A4;

A4. According to the normalized window center list and the normalized layer sequence number list, obtain the dynamic window list and the sampling list within the current training cycle window;

A5. According to the normalized layer serial number list and the dynamic window list, obtain the sampling list outside the window of the current training cycle;

A6. Merge the sampling list within the current training cycle window and the sampling list outside the current training cycle window to obtain the layer transmission list;

The distributed deep learning training module based on layer sparsification performs distributed deep learning training based on layer sparsification according to the layer transmission list, obtains weight update parameters, and completes distributed deep learning training based on layer sparsification. The details are:

B1. Obtain the neural network layer gradient list through neural network sample feedforward and feedback calculations;

B2. Traverse each layer in the neural network layer gradient list layer by layer, and determine whether each layer is in the layer transmission list. If so, get a number of selected layers and enter step B3. Otherwise, get a number of local accumulated gradients;

B3. Determine whether each selected layer has intra-layer compression. If so, obtain a number of selected layers with intra-layer compression and enter step B4. Otherwise, obtain a number of selected layer transmission gradients without intra-layer compression;

B4. Perform intra-layer sparsification, inter-node communication and decompression synchronization in sequence within each compressed selected layer with intra-layer compression, and obtain several selected layer transmission gradients with intra-layer compression;

B5. Globally average each local accumulated gradient, each selected layer transmission gradient without intra-layer compression, or each selected layer transmission gradient with intra-layer compression, to obtain the complete gradient;

B6. Based on the complete gradient, obtain the weight update parameters and complete distributed deep learning training based on layer sparsification.