
CN111858058A - SGD load balancing method, device and storage medium based on parallel computing - Google Patents

SGD load balancing method, device and storage medium based on parallel computing

Info

Publication number
CN111858058A
CN111858058A (application number CN202010723846.3A)
Authority
CN
China
Prior art keywords
nodes
node
load balancing
model
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010723846.3A
Other languages
Chinese (zh)
Inventor
王彪
王亚强
刘魁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Cheng Xin High Tech Information Technology Co ltd
Chengdu University of Information Technology
Original Assignee
Chengdu Cheng Xin High Tech Information Technology Co ltd
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Cheng Xin High Tech Information Technology Co ltd, Chengdu University of Information Technology filed Critical Chengdu Cheng Xin High Tech Information Technology Co ltd
Priority to CN202010723846.3A
Publication of CN111858058A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses an SGD load balancing method based on parallel computing, comprising the following steps: distributed parallel GPU computing is implemented with a design that combines model parallelism and data parallelism; a semaphore mechanism provides synchronous communication between the master node and the child nodes, and the optimizer in each child container updates the weights with a stochastic gradient descent algorithm. The master node builds a minimum spanning tree using the errors in the child nodes' control tables as weights, identifies the key nodes in the graph, sequentially reclaims the non-key nodes, and reallocates their hardware resources. The method enables multiple model replicas to process different subsets of the training samples simultaneously, periodically merges the replicas interactively, and optimizes the distributed algorithm. The invention provides a new architectural approach to load-balanced computing that improves model development efficiency and reduces development cost; the algorithm adapts well to data scale while dynamically managing asynchronous communication between the child containers.

Description

SGD load balancing method, device and storage medium based on parallel computing

Technical Field

The present invention relates to the field of machine learning, and in particular to an SGD load balancing method, device and storage medium based on parallel computing.

Background Art

Artificial intelligence has already demonstrated major advantages in many fields. Machine learning is a key part of artificial intelligence: by modeling and training on massive amounts of data, it helps people make decisions.

However, with the rise of big data, data sets keep growing, and the storage and computing capacity of a single machine can no longer meet the demands of massive data. Distributed machine learning emerged to meet this need, and using it to speed up model convergence has become the mainstream approach in industry. Two approaches to distributed machine learning are common today: model parallelism and data parallelism.

However, current parallel computing is limited by the bucket (straggler) effect: the next step often cannot proceed until the slowest node finishes its computation. Having multiple model replicas process different subsets of the training samples simultaneously, and periodically merging the replicas' results interactively to achieve computational efficiency on large-scale data, is technically demanding.

Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide an SGD load balancing method, device and storage medium based on parallel computing that combines the model-parallel and data-parallel modes. Compared with the prior art, the present invention effectively enables multiple model replicas to process different subsets of the training samples at the same time, periodically merges the results of each replica interactively, and optimizes the distributed algorithm.

The purpose of the present invention is achieved through the following technical solutions:

The SGD load balancing method based on parallel computing includes the following steps:

Step 1: Build a parallel GPU computing architecture. Combining the model-parallel and data-parallel modes, construct a one-way connected graph, circulate the models between the graph nodes periodically so that the models cover the data set, and preferentially allocate hardware devices to the graph nodes.

Step 2: Dynamically manage node hardware resources. Use a semaphore mechanism for synchronous communication between the master node and the child nodes, and let the optimizer in each child container update the weights with a stochastic gradient descent algorithm.

Specifically, building the parallel GPU computing architecture in step 1 includes the following sub-steps:

S101: Configure a management node (Manager), create N containers deployed on different machines, denoted as nodes (Node), and create a node control table on each child node that records the node ID, the node's data set, and the current batch error.

S102: Establish connections between the child nodes to form a one-way connected graph, build a neural network in each child node, and set a time slice T for one period.
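One minimal way to represent such a one-way connected graph in code is a directed adjacency map. The ring topology below is an assumption for illustration only (the patent does not fix a topology); it is chosen so that circulating the model edge by edge visits every node's data shard within N periods:

```python
def ring_graph(n):
    """Directed adjacency map for N nodes. A ring is one example of a
    one-way connected graph: each node has a single downstream neighbour,
    so a model circulated along the edges covers all N data shards in N
    periods. The actual topology is not specified by the patent."""
    return {i: [(i + 1) % n] for i in range(n)}

g = ring_graph(4)
# g == {0: [1], 1: [2], 2: [3], 3: [0]}
```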

S103: Divide the data samples evenly into N parts and feed them to the nodes in order. Train on the different nodes with the SGD algorithm; each data sample yields a local gradient value through forward propagation and back propagation, and the gradient is updated.
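A minimal sketch of S103, substituting a toy linear model with squared-error loss for the patent's (unspecified) neural network; `shard` and `sgd_step` are illustrative names, not from the source:

```python
import numpy as np

def shard(samples, n_nodes):
    """S103: split the training samples evenly into N parts, one per node."""
    return np.array_split(samples, n_nodes)

def sgd_step(w, x, y, lr=0.01):
    """One local SGD update on one node: forward propagation, local
    gradient via back propagation, then the weight update. A toy linear
    stand-in for the per-node training the patent describes."""
    pred = x @ w                        # forward propagation
    grad = x.T @ (pred - y) / len(y)    # local gradient (back propagation)
    return w - lr * grad

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(80, 3)), rng.normal(size=80)
shards = shard(np.arange(80), 4)        # 4 nodes, 20 sample indices each
w = np.zeros(3)
for idx in shards:                      # each node updates on its own shard
    w = sgd_step(w, X[idx], Y[idx])
```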

S104: In each training period, traverse the graph level by level, record an unbiased estimate of the model error, and write the error value into the node control table.

Specifically, the traversal of the graph in sub-step S104 includes: encapsulating the parameters output by an upper-layer node, such as the weights and biases, into an NN object for transmission; after the current node receives the NN object from the upper-layer node, training with the NN object as a hidden layer; and, if the current node has several upper-layer nodes, merging the NN objects received from them and using the mean of the NN objects as the hidden layer for training.
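The merging rule for multiple upstream NN objects can be sketched as an element-wise mean of the received parameters. `NNObject` is a hypothetical container, since the patent does not define the transmission format:

```python
import numpy as np

class NNObject:
    """Hypothetical container for the weights and biases a node
    transmits downstream during the graph traversal."""
    def __init__(self, weights, biases):
        self.weights = np.asarray(weights, dtype=float)
        self.biases = np.asarray(biases, dtype=float)

def merge(nn_objects):
    """Element-wise mean of the parameters received from multiple
    upper-layer nodes; the result serves as the hidden layer."""
    return NNObject(
        np.mean([o.weights for o in nn_objects], axis=0),
        np.mean([o.biases for o in nn_objects], axis=0),
    )

a = NNObject([[1.0, 2.0]], [0.0])
b = NNObject([[3.0, 4.0]], [2.0])
m = merge([a, b])
# m.weights == [[2.0, 3.0]], m.biases == [1.0]
```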

Specifically, dynamically managing node hardware resources in step 2 includes the following sub-steps:

S201: In each period, the master node queries the node control table, constructs a minimum spanning tree using the errors in the node control table as weights, and sorts the weights in the minimum spanning tree.

S202: When the training model is about to converge, the master node sorts the nodes by weight according to each period's minimum spanning tree in the node control table and sends a synchronization signal to the key nodes.

S203: The master node reclaims, in order, the tasks of the nodes in the one-way connected graph that have not received a synchronization signal, and allocates those nodes' hardware resources to the adjacent key nodes to speed up their computation, until all nodes have completed the training task.
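S201's minimum spanning tree can be built with a standard algorithm such as Kruskal's. The sketch below assumes the control-table errors have been attached to the graph's edges as weights; the patent does not spell out the edge set, so the example edges are made up:

```python
def kruskal_mst(n, edges):
    """Minimum spanning tree over an n-node graph via Kruskal's algorithm.
    `edges` is a list of (weight, u, v) tuples, where each weight is an
    error taken from the node control table (S201). Returns the chosen
    edges in ascending weight order."""
    parent = list(range(n))
    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):     # sort edges by error weight
        ru, rv = find(u), find(v)
        if ru != rv:                  # keep the edge if it joins two trees
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

# toy 4-node graph; error weights are hypothetical control-table values
edges = [(0.9, 0, 1), (0.2, 1, 2), (0.5, 2, 3), (0.7, 0, 3), (0.4, 0, 2)]
tree = kruskal_mst(4, edges)
# tree == [(0.2, 1, 2), (0.4, 0, 2), (0.5, 2, 3)]
```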

A computing device includes a memory in which computer-executable instructions are stored, and a processor that implements the steps of the above load balancing method when executing the computer program.

A computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the above load balancing method.

Beneficial effects of the present invention: the present invention proposes a new architectural approach to load-balanced computing that improves model development efficiency and reduces development cost, gives the algorithm good adaptability to data scale, and dynamically manages asynchronous communication between the child containers.

Brief Description of the Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of the parallel computing architecture of the present invention.

FIG. 3 is a schematic diagram of the present invention using a semaphore mechanism to dynamically manage node hardware resources.

Detailed Description of the Embodiments

For a clearer understanding of the technical features, objects and effects of the present invention, specific embodiments of the present invention are now described with reference to the accompanying drawings.

In this embodiment, as shown in FIG. 1, the SGD load balancing method based on parallel computing mainly includes the following steps:

Step 1: Build a parallel GPU computing architecture. Combining the model-parallel and data-parallel modes, construct a one-way connected graph, circulate the models between the graph nodes periodically so that the models cover the data set, and preferentially allocate hardware devices to the graph nodes.

Step 2: Dynamically manage node hardware resources. Use a semaphore mechanism for synchronous communication between the master node and the child nodes, and let the optimizer in each child container update the weights with a stochastic gradient descent algorithm.

In this embodiment, as shown in FIG. 2, the present invention provides a structural diagram of the SGD load balancing method based on parallel computing. The specific implementation process is as follows. First, configure a management node (Manager) and create N containers deployed on different machines, denoted as nodes (Node); create a node control table on each child node to record the node ID, the node's data set, and the current batch error. Establish connections between the child nodes to form a one-way connected graph (the graph nodes are GPU hardware devices), build a neural network in each child node, and set a time slice T for one period. Divide the data samples evenly into N parts, feed them to the nodes in order, and train on the different nodes with the SGD algorithm; each data sample yields a local gradient value through forward propagation and back propagation, and the gradient is updated. In each training period, traverse the graph level by level, record an unbiased estimate of the model error, and write the error value into the node control table. During the traversal, adjacent nodes must exchange their weights and biases; because the neural network is complex and has many parameters, the parameters are encapsulated into an NN object for transmission. After a node receives the NN object from its upper-layer node, it trains with the NN object as a hidden layer. If a node has several upper-layer nodes, the NN objects received from them are merged, and the mean of the NN objects serves as the hidden layer for training. The models are circulated periodically so that every model runs on all of the data.

With the architecture described in step 1, after training for a while the error of some nodes decreases very slowly; they need a very long training time to converge, which severely hurts training efficiency and also produces a large amount of useless computation, wasting hardware resources. The present invention therefore introduces a semaphore mechanism for synchronous communication between the master node and the child nodes, and dynamically manages the nodes' hardware resources.

In this embodiment, FIG. 3 illustrates how the present invention uses a semaphore mechanism to dynamically manage node hardware resources. The specific process is as follows. In each period, the master node queries the node control table, constructs a minimum spanning tree using the errors in the node control table as weights, and sorts the weights in the minimum spanning tree. After a certain number of training periods (when the model is about to converge), the master node sorts the nodes by weight according to each period's minimum spanning tree in the node control table and sends a synchronization signal to the key nodes. The master node then reclaims, in order, the tasks of the nodes that have not received a synchronization signal and allocates their hardware resources to the adjacent key nodes, speeding up the adjacent nodes' computation and improving the efficiency of the whole model.
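The master-to-key-node signalling described above can be sketched with counting semaphores. Below is a minimal single-process analogue using Python threads, with `sync[i]` standing in for the per-node synchronization signal; a real deployment would signal across containers, and the node/key-node assignments here are made up for illustration:

```python
import threading

N = 4
sync = [threading.Semaphore(0) for _ in range(N)]  # master -> child signals
done = threading.Semaphore(0)                      # child -> master completion
finished = []

def child(i, is_key):
    if is_key:
        sync[i].acquire()   # a key node waits for the master's sync signal (S202)
        # ...continue training with the reclaimed hardware resources...
    # a non-key node stops here; its task is handed back to the master (S203)
    finished.append(i)
    done.release()

def master(key_nodes):
    for i in key_nodes:
        sync[i].release()   # send the synchronization signal to each key node
    for _ in range(N):
        done.acquire()      # wait until every node has finished

threads = [threading.Thread(target=child, args=(i, i in {0, 2})) for i in range(N)]
for t in threads:
    t.start()
master({0, 2})              # nodes 0 and 2 are the (hypothetical) key nodes
for t in threads:
    t.join()
```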

The architectural approach adopted by the present invention effectively reduces the loss value, improves model development efficiency, reduces development cost, and adapts well to data scale.

In addition, the present invention also provides a computing device and a computer-readable storage medium. The computing device includes a memory in which computer-executable instructions are stored, and a processor that, when executing the computer program, carries out all the processes and steps of the load balancing method in the embodiment. The computer-readable storage medium stores a computer program that, when executed by a processor, implements all the methods and steps of the above load balancing method.

The basic principles, main features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited by the above embodiments, which, together with the description, merely illustrate its principles; various changes and improvements may be made without departing from the spirit and scope of the present invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection of the present invention is defined by the appended claims and their equivalents.

Claims (6)

1. The SGD load balancing method based on parallel computing is characterized by comprising the following steps of:
step 1: constructing a parallel GPU computing architecture, constructing a one-way connected graph by adopting a mode of combining a model parallel mode and a data parallel mode, periodically carrying out model circulation among graph nodes, enabling the models to cover the data set, and preferentially distributing hardware equipment for the graph nodes;
step 2: dynamically managing node hardware resources, realizing synchronous communication between the main node and the sub-nodes by adopting a semaphore mechanism, and updating the weights by adopting a stochastic gradient descent algorithm in the optimizer in the sub-container.
2. The SGD load balancing method based on parallel computing according to claim 1, wherein the building of the parallel GPU computing architecture in step 1 specifically includes the following sub-steps:
s101, configuring a management Node Manager, creating N containers to be deployed on different machines, marking as Node nodes, creating a Node control table on a child Node, and recording a Node ID, a Node data set and a current batch error;
s102, establishing connection among the sub-nodes to form a one-way connection graph, building a neural network in the sub-nodes, and setting a time slice T of one period;
s103, evenly dividing the data samples into N parts, sending them into the nodes in sequence, training on different nodes by using an SGD algorithm, obtaining a local gradient value for each part of the data samples through forward propagation and backward propagation, and updating the gradient;
s104, traversing according to the hierarchy of the graph in each training period, recording the unbiased estimate of the model error, and recording the error value in the node control table.
3. The SGD load balancing method according to claim 2, wherein the traversal process of the graph in the sub-step S104 specifically includes: packing parameters such as the weights and biases output by an upper node into an NN object for transmission; after the current node receives the NN object transmitted by the upper node, training with the NN object as a hidden layer; and if the current node has a plurality of upper nodes, merging the NN objects transmitted from the upper nodes and taking the mean value of the NN objects as a hidden layer for training.
4. The SGD load balancing method based on parallel computing according to claim 1, wherein the step 2 of dynamically managing hardware resources of nodes specifically comprises the following sub-steps:
s201, in each period, inquiring a node control table through a main node, constructing a minimum spanning tree by taking an error in the node control table as a weight, and sequencing the weights in the minimum spanning tree;
s202, when the training model is about to converge, the main node sorts the nodes by weight according to the minimum spanning tree of each period in the node control table, and sends a synchronization signal to the key nodes;
and S203, the main node sequentially recovers the tasks of the nodes which have not received the synchronization signal in the one-way connected graph, and distributes the hardware resources of those nodes to the adjacent key nodes to accelerate their calculation speed, until all the nodes finish the training tasks.
5. A computing device, comprising
A memory having computer-executable instructions stored therein;
a processor for implementing the steps of the load balancing method according to any one of claims 1 to 4 when executing the computer program.
6. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the load balancing method according to any one of claims 1 to 4.
CN202010723846.3A 2020-07-24 2020-07-24 SGD load balancing method, device and storage medium based on parallel computing Pending CN111858058A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010723846.3A CN111858058A (en) 2020-07-24 2020-07-24 SGD load balancing method, device and storage medium based on parallel computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010723846.3A CN111858058A (en) 2020-07-24 2020-07-24 SGD load balancing method, device and storage medium based on parallel computing

Publications (1)

Publication Number Publication Date
CN111858058A (en) 2020-10-30

Family

ID=72950115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010723846.3A Pending CN111858058A (en) 2020-07-24 2020-07-24 SGD load balancing method, device and storage medium based on parallel computing

Country Status (1)

Country Link
CN (1) CN111858058A (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339351A (en) * 2016-08-30 2017-01-18 浪潮(北京)电子信息产业有限公司 SGD (Stochastic Gradient Descent) algorithm optimization system and method
CN108304918A (en) * 2018-01-18 2018-07-20 中兴飞流信息科技有限公司 A kind of the parameter exchange method and system of the deep learning of data parallel
CN108921196A (en) * 2018-06-01 2018-11-30 南京邮电大学 A kind of semantic segmentation method for improving full convolutional neural networks
CN110678843A (en) * 2017-04-17 2020-01-10 微软技术许可有限责任公司 Dynamically partitioning workloads in deep neural network modules to reduce power consumption
CN110795228A (en) * 2018-08-03 2020-02-14 伊姆西Ip控股有限责任公司 Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets
CN111178486A (en) * 2019-11-27 2020-05-19 湖州师范学院 An Asynchronous Parallel Search Method for Hyperparameters Based on Population Evolution
WO2020102526A1 (en) * 2018-11-14 2020-05-22 North Carolina State University Deep neural network with compositional grammatical architectures
US20200175422A1 (en) * 2018-11-29 2020-06-04 International Business Machines Corporation Asynchronous gradient weight compression


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG DANING et al.: "Weighted parallel SGD for distributed unbalanced-workload training system", Journal of Parallel and Distributed Computing *
LU Shuxia et al.: "Weighted zeroth-order stochastic gradient descent algorithm with variance reduction", Journal of Hebei University (Natural Science Edition) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112598118A (en) * 2021-03-03 2021-04-02 成都晓多科技有限公司 Method, device, storage medium and equipment for processing abnormal labeling in supervised learning
CN112598118B (en) * 2021-03-03 2021-06-25 成都晓多科技有限公司 Method, device, storage medium and equipment for processing abnormal labeling in supervised learning
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114167828A (en) * 2021-12-03 2022-03-11 润电能源科学技术有限公司 External hanging control method of DCS controller and related device

Similar Documents

Publication Publication Date Title
CN109032671B (en) Distributed deep learning method and system based on data parallel strategy
CN110533183B (en) A Task Placement Method for Heterogeneous Network Awareness in Pipelined Distributed Deep Learning
CN114756383B (en) Distributed computing method, system, equipment and storage medium
Rashidi et al. Astra-sim: Enabling sw/hw co-design exploration for distributed dl training platforms
US20220129302A1 (en) Data processing system and method for heterogeneous architecture
US11481627B2 (en) Distributed learning of composite machine learning models
CN111858058A (en) SGD load balancing method, device and storage medium based on parallel computing
Van Tendeloo et al. PythonPDEVS: a distributed Parallel DEVS simulator
CN107578094A (en) The method that the distributed training of neutral net is realized based on parameter server and FPGA
Sun et al. Gradientflow: Optimizing network performance for large-scale distributed dnn training
CN110347636B (en) Data execution body and data processing method thereof
Zhan et al. Pipe-torch: Pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking
CN107807983A (en) A kind of parallel processing framework and design method for supporting extensive Dynamic Graph data query
WO2025112979A1 (en) Parallel strategy optimal selection method, and neural network solver training method and apparatus
CN111241301A (en) Knowledge graph representation learning-oriented distributed framework construction method
CN118644225B (en) A substation operation and maintenance decision-making method based on multi-agent reinforcement learning
Liu et al. Aedfl: efficient asynchronous decentralized federated learning with heterogeneous devices
Addanki et al. Placeto: Efficient progressive device placement optimization
WO2025081828A1 (en) Training model distribution method and apparatus, and computer device and storage medium
CN110868461B (en) Data distribution method facing heterogeneous bandwidth between nodes in Gaia cluster
Zhang et al. The optimization of model parallelization strategies for multi-GPU training
CN119166298A (en) Heterogeneous intelligent computing power optimization management and scheduling system to accelerate large model training tasks
CN106201985B (en) A kind of distributed parallel load flow calculation system development approach based on PQ method
Wang et al. A coordinated two-stages virtual network embedding algorithm based on reinforcement learning
CN114358859A (en) Graph-based large-scale embedding model training method and system for click-through rate prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
Effective date of abandoning: 20221209