
CN112350842B - Method for resetting data transmission network in distributed training task training process - Google Patents

Method for resetting data transmission network in distributed training task training process

Info

Publication number
CN112350842B
CN112350842B (application CN201910731784.8A)
Authority
CN
China
Prior art keywords
training
network
rdma
cluster
environment configuration
Prior art date
Legal status
Active
Application number
CN201910731784.8A
Other languages
Chinese (zh)
Other versions
CN112350842A (en)
Inventor
张翔宇
郭昊
张曼妮
孙军欢
赵来松
Current Assignee
Shenzhen Zhixing Technology Co Ltd
Original Assignee
Shenzhen Zhixing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhixing Technology Co Ltd filed Critical Shenzhen Zhixing Technology Co Ltd
Priority to CN201910731784.8A priority Critical patent/CN112350842B/en
Publication of CN112350842A publication Critical patent/CN112350842A/en
Application granted granted Critical
Publication of CN112350842B publication Critical patent/CN112350842B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0803Configuration setting
    • H04L41/0813Configuration setting characterised by the conditions triggering a change of settings
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45595Network integration; Enabling network access in virtual machine instances
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Small-Scale Networks (AREA)

Abstract

The invention provides a method for resetting the data transmission network during training of a distributed training task. After the distributed training task is scheduled to a training cluster equipped with an RDMA network, the RDMA network IP of each subtask's computing node is obtained, and one computing node is determined to be the master node (the rest become slave nodes); the master node aggregates the RDMA network information of the training cluster. Once aggregation is complete, the environment configuration parameters of each subtask are updated according to the cluster RDMA network information, so that communication during distributed training follows the updated parameters and the data transmission network is thereby reset to the RDMA network.

Description

Method for resetting data transmission network in training process of distributed training task
Technical Field
The invention relates to the field of distributed machine learning and to container cloud technology, and more particularly to a method for resetting the data transmission network during training of a distributed training task.
Background
Machine learning, particularly deep learning, has enjoyed wide success in artificial-intelligence-driven services. As models become more complex, training them becomes increasingly expensive. Achieving efficient and timely training therefore requires exploiting the parallel computing power of distributed systems. Industry-leading enterprises such as Microsoft, Facebook, and Google have begun running distributed machine learning training tasks on production clusters of hundreds or thousands of servers.
However, a physical cluster for distributed training is, from build-out and deployment through operation and maintenance, a highly specialized, complex, and often cumbersome undertaking. Applying container cloud technology to distributed machine learning can undoubtedly simplify the construction, deployment, and operations work.
Container cloud technology not only enables rapid deployment of container clusters; as a lightweight solution it can also effectively integrate and manage bare-metal resources. Taking running a distributed machine learning training task on the Kubernetes platform as an example: Kubernetes provides a consistent way to package applications, ensures that applications run consistently on different devices, isolates the resources of each application's runtime environment, abstracts away the complexity of the underlying hardware and of node management, and supports GPU scheduling.
However, whether the training cluster is a physical cluster built from multiple host servers or one deployed on a container cloud platform, data transmission between computing nodes is usually carried over network communication based on the TCP/IP protocol (the protocol commonly used by today's wide-area and local-area networks). This kind of communication requires the intervention of the operating system and its protocol stack; as training sets grow ever larger, parameter exchange inevitably consumes a large amount of CPU resources and introduces significant network latency, severely constraining training efficiency.
Remote Direct Memory Access (RDMA) is a direct memory access technology: it transfers data directly from the memory of one computer to that of another without the intervention of either party's operating system. Compared with a conventional network based on the general TCP/IP protocol, RDMA communication therefore avoids heavy CPU consumption during network transmission while also reducing network latency. Building or deploying a training cluster with an RDMA network for a distributed training task, and carrying the training data (e.g., the data exchanged during parameter exchange) over RDMA, is thus an effective way to break through the communication bottleneck of the parameter-exchange network and improve distributed training efficiency.
During distributed training, the dependencies among the subtasks assigned to the computing nodes, and the data consistency among those subtasks, are generally governed by environment configuration parameters. Typically, the environment configuration parameters for each subtask include information about all subtasks as well as about the current subtask (e.g., subtask number, network connection parameters, etc.). In actual deployment and training, these parameters are used not only to schedule the distributed task onto the training cluster (i.e., to distribute each subtask to a computing node of the cluster), but also, through their network connection parameters, to establish data communication between the training applications running on different computing nodes.
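As a concrete illustration (a hypothetical sketch, not the patented format: the field names follow TensorFlow's TF_CONFIG convention mentioned later in this document, and the host names and port are invented), such an environment configuration parameter set can be expressed as a JSON document that lists the connection parameters of every subtask plus the identity of the current one:

```python
import json

# Hypothetical environment configuration for a three-subtask training job.
# "cluster" lists the network connection parameters of every subtask;
# "task" identifies the current subtask (its role name and index).
env_config = {
    "cluster": {
        "worker": ["worker-0.default.svc:2222", "worker-1.default.svc:2222"],
        "ps": ["ps-0.default.svc:2222"],
    },
    "task": {"type": "worker", "index": 0},
}

serialized = json.dumps(env_config)  # as it would sit in an env variable
parsed = json.loads(serialized)

# The current subtask can locate its own connection parameter:
own_addr = parsed["cluster"][parsed["task"]["type"]][parsed["task"]["index"]]
print(own_addr)
```

The key point for the rest of the document is that the addresses under `"cluster"` are exactly the network connection parameters that the invention later replaces with RDMA network IPs.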
Accordingly, in practice, take deploying a distributed training task on a physical cluster with an RDMA network as an example: to achieve efficient distributed training in the RDMA network environment, the RDMA network IPs of the cluster's computing nodes are usually obtained first, environment configuration parameters containing those RDMA network IPs (as the network connection parameters) are generated manually or by script, and efficient distributed training then proceeds after the task is scheduled onto the training cluster.
However, deploying distributed training on a container cloud platform is often considered a more efficient use of platform resources. To make better use of resources, a container cloud platform typically deploys a training task as follows: it first decomposes the task into several subtasks and generates environment configuration parameters for them, and then creates a corresponding container or container group for each subtask (the container/container group is the smallest unit of orchestration in a container cluster: a container runs a single application in a containerized environment, while a container group is a logical host that runs one or more tightly coupled application containers, such as a Pod on the Kubernetes platform). Once distributed training starts, the subtask training applications on the computing nodes communicate through a connection-access service over the conventional network (i.e., the TCP/IP-based network that usually serves as the default network of a multi-network cluster). This communication mechanism requires exactly the system-kernel intervention that efficient RDMA communication is designed to avoid.
In summary, because RDMA network information (e.g., the RDMA network IPs) cannot be obtained in advance, the training applications running on the computing nodes (i.e., the containers/container groups used for training) can neither discover nor efficiently use the RDMA network, even when the container training cluster has one.
In addition, although the aforementioned approach can deploy training tasks on an RDMA-equipped physical cluster, manually writing the RDMA network IPs into the environment configuration parameters as network connection parameters makes errors hard to avoid; even a slight configuration mistake prevents the whole training cluster from providing a usable RDMA network for the task's data transmission, and the entire deployment fails.
Disclosure of Invention
In view of this, the present invention provides a method for resetting the data transmission network during training of a distributed training task. After the task is scheduled to a training cluster with an RDMA network, the RDMA network IP of each subtask's computing node is obtained, a master node is determined from among the computing nodes (the rest become slave nodes) to aggregate the cluster's RDMA network information, and the environment configuration parameters of each subtask are then updated according to the aggregated RDMA network information, so that the updated parameters reset the data transmission network to the RDMA network for the distributed training process.
In one aspect, an embodiment of the present invention provides a method for resetting a data transmission network during a training process of a distributed training task.
The method for resetting the data transmission network in the training process of the distributed training task comprises the following steps:
when a distributed training task is scheduled to a training cluster with an RDMA network,
before the distributed training is started,
or after it is started, that is, after each computing node of the training cluster has launched the training application of its subtask and before any subtask application begins executing training,
the RDMA network IP of each computing node of the cluster is obtained;
and the role of each computing node in the network reset is determined:
that is, one computing node is determined to be the master node and the remaining computing nodes are slave nodes;
each slave node transmits the RDMA network IP of its computing node to the master node over the conventional network (i.e., the TCP/IP-based network), using the network connection parameters in its own environment configuration parameters;
the master node then aggregates the RDMA network information of the training cluster:
recording the master node and its RDMA network IP, and each slave node and its RDMA network IP;
once all RDMA network information of the training cluster has been aggregated, the environment configuration parameters of each subtask are updated according to it:
the RDMA network IPs in the cluster's RDMA network information replace the default network connection parameters in the environment configuration parameters;
the data transmission network of the distributed training task is thus reset;
after the training application of each subtask begins executing training, the subtask's data transmission, especially the transmission of large volumes of training data, can then use the training cluster's RDMA network according to the updated environment configuration parameters, achieving efficient data transfer.
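The master node's aggregation step above can be sketched as follows (a minimal illustration, not the patented implementation: the in-memory dictionary stands in for the master's record, and the completeness check assumes the expected node count is known from the environment configuration parameters):

```python
# Master-side aggregation of cluster RDMA network information.
# "reports" holds the (task_id, rdma_ip) pairs that slave nodes sent to
# the master over the conventional TCP/IP network.

def aggregate_rdma_info(own_task_id, own_rdma_ip, reports, expected_nodes):
    """Build the cluster RDMA table from the master's own entry plus the
    entries reported by the slave nodes; fail if any node is missing."""
    table = {own_task_id: own_rdma_ip}
    for task_id, rdma_ip in reports:
        table[task_id] = rdma_ip
    if len(table) != expected_nodes:
        raise RuntimeError("RDMA info still incomplete: "
                           f"{len(table)}/{expected_nodes} nodes reported")
    return table

cluster = aggregate_rdma_info(
    own_task_id=0, own_rdma_ip="192.168.100.10",
    reports=[(1, "192.168.100.11"), (2, "192.168.100.12")],
    expected_nodes=3)
print(sorted(cluster.items()))
```

Keying the table by task ID rather than by arrival order lets every node later derive an identical, deterministic address list from the same table.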
In another aspect, an embodiment of the present invention provides a system for resetting a data transmission network during a training process of a distributed training task.
The system for resetting the data transmission network during training of the distributed training task comprises:
an RDMA network information gathering unit and a task environment configuration parameter updating unit; wherein,
the RDMA network information gathering unit is used to obtain the RDMA network IP of each computing node of the training cluster and to determine a master node that aggregates the cluster's RDMA network information;
and the task environment configuration parameter updating unit updates the environment configuration parameters of each subtask by replacing their default network connection parameters with the RDMA network IPs from the cluster RDMA network information aggregated by the gathering unit.
In the method and system of the above embodiments, after the distributed training task is scheduled to a training cluster with an RDMA network, the RDMA network IP of each computing node is obtained and the determined master node aggregates the cluster's RDMA network information; the environment configuration parameters of each subtask are then updated accordingly, so that communication during distributed training follows the updated parameters and the data transmission network is reset to the RDMA network. The method and system remove the communication bottleneck in the training process while avoiding modifications that would complicate the task scheduling mechanism.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings related to a part of the embodiments of the present invention or the description in the prior art will be briefly introduced below.
Fig. 1 is a schematic flow diagram of updating the TF_CONFIG of each subtask (i.e., the corresponding Pod) while deploying a distributed TensorFlow task on the Kubernetes platform, based on the method for resetting the data transmission network during distributed training provided in a preferred embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the described embodiments are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of protection of the present invention.
Some preferred embodiments of the invention are as follows. Some of these preferred embodiments provide a method of resetting a data transmission network during training of a distributed training task. The method comprises the following steps:
when a distributed training task is scheduled to a training cluster with an RDMA network,
before distributed training is started, a process is launched on each computing node, and the RDMA network IP of each computing node is obtained;
a master node for the network reset is determined from among the computing nodes according to a (preset) task ID in the environment configuration parameters or by means of ZooKeeper, the remaining nodes being slave nodes;
each slave node transmits the RDMA network IP of its computing node to the master node over the conventional network (i.e., the TCP/IP-based network), using the network connection parameters in its own environment configuration parameters;
the master node then aggregates the RDMA network information of the training cluster:
recording the master node and its RDMA network IP, and each slave node and its RDMA network IP;
once all RDMA network information of the training cluster has been aggregated, the environment configuration parameters of each subtask are updated according to it:
the RDMA network IPs in the cluster's RDMA network information replace the default network connection parameters in the environment configuration parameters;
the data transmission network of the distributed training task is thus reset; once the reset is complete, the start of distributed training can be signaled.
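The master election in the steps above can be as simple as a deterministic rule over the preset task IDs (a hypothetical sketch; the rank-0 convention mirrors the Fig. 1 embodiment, where the Pod with task ID 0 acts as master, and no coordination service is needed because every node computes the same result locally):

```python
def elect_master(task_ids):
    """Deterministically pick the master node: the node whose task ID is
    smallest (task ID 0 in the Fig. 1 embodiment). Every node can run
    this locally on the shared task-ID list and reach the same answer."""
    return min(task_ids)

def my_role(own_task_id, task_ids):
    """Return this node's role in the network-reset procedure."""
    return "master" if own_task_id == elect_master(task_ids) else "slave"

print(my_role(0, [0, 1, 2]))  # the task-ID-0 node becomes the master
print(my_role(2, [0, 1, 2]))
```

The ZooKeeper alternative mentioned in the text would replace `elect_master` with a leader-election recipe, which is useful when task IDs are not preassigned.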
Still others of these preferred embodiments provide another method of resetting the data transmission network during training of a distributed training task. This method can reset the transmission network even after distributed training has started, specifically as follows:
when a distributed training task is scheduled to a training cluster with an RDMA network,
after distributed training is started, that is, after each computing node of the training cluster has launched the training application of its subtask, and before any subtask application begins executing training,
each training application obtains the RDMA network IP of its computing node;
each training application determines the master node for the network reset from among the computing nodes according to the (preset) task ID in the environment configuration parameters of its subtask, the remaining nodes being slave nodes;
or the master node is determined from among the computing nodes by means of ZooKeeper, the remaining nodes being slave nodes;
after the master and slave nodes are confirmed, the training application of each slave node transmits the RDMA network IP of its computing node to the master node over the conventional network (i.e., the TCP/IP-based network), using the network connection parameters in its own environment configuration parameters;
the training application of the master node creates a list recording the master node and its RDMA network IP, and each slave node and its RDMA network IP (i.e., the RDMA network information of the training cluster);
after the master node has aggregated all RDMA network information of the training cluster, the environment configuration parameters of each subtask are updated according to that information:
the RDMA network IPs in the cluster's RDMA network information replace the default network connection parameters in the environment configuration parameters;
each node's training application can then begin executing training, and data transmission during training uses the RDMA network through the updated environment configuration parameters; that is, the reset of the data transmission network is accomplished.
In some of the methods provided by the foregoing preferred embodiments, the environment configuration parameters of each subtask may be updated from the cluster RDMA network information aggregated by the master node as follows: the master node generates the corresponding new environment configuration parameters for each subtask from the cluster RDMA network information and transmits them to the corresponding slave nodes over the conventional network, replacing the original environment configuration parameters on each computing node;
or the master node transmits the cluster RDMA network information to each slave node over the conventional network, and each computing node generates the new environment configuration parameters for its own subtask, replacing its original ones.
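The first update strategy above (the master node regenerating the environment configuration parameters for every subtask) can be sketched like this (a hypothetical illustration, not the patented code: the field layout reuses the TF_CONFIG-style structure of the Fig. 1 embodiment, and the port number and all-worker role map are invented):

```python
def regenerate_configs(rdma_table, roles, port=2222):
    """Given the aggregated {task_id: rdma_ip} table and each task's role,
    build a new per-subtask environment configuration whose connection
    parameters point at the RDMA network IPs instead of the defaults."""
    # Addresses ordered by task ID so every node derives the same list.
    addresses = [f"{rdma_table[tid]}:{port}" for tid in sorted(rdma_table)]
    configs = {}
    for tid in sorted(rdma_table):
        configs[tid] = {
            "cluster": {"worker": addresses},
            "task": {"type": roles[tid], "index": tid},
        }
    return configs

rdma_table = {0: "192.168.100.10", 1: "192.168.100.11"}
configs = regenerate_configs(rdma_table, roles={0: "worker", 1: "worker"})
print(configs[1]["cluster"]["worker"])
```

In the second strategy, the master would instead send `rdma_table` itself over the conventional network and each node would call `regenerate_configs` locally, building only its own entry.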
In some of the methods for resetting a data transmission network during training of a distributed training task provided by the foregoing preferred embodiments, the computing nodes of the training cluster include, besides the worker nodes (Workers) that perform data-parallel computation during training, a parameter server node (Parameter Server) responsible for maintaining the globally shared parameters.
Some of the above preferred embodiments provide a method in which the distributed training task is deployed on a container cloud platform; its computing nodes are then not physical hosts in the usual sense but virtualized computing resources such as containers/container groups.
Fig. 1 is a schematic flow diagram of updating the TF_CONFIG of each subtask (i.e., the corresponding Pod) while deploying a distributed TensorFlow task on the Kubernetes platform, based on the method for resetting the data transmission network provided in a preferred embodiment of the present invention. As shown in Fig. 1,
after the distributed TensorFlow task is scheduled to a container cluster on the Kubernetes platform that is configured with both a conventional network and an RDMA network,
the distributed TensorFlow task is started: a distributed TensorFlow training process is launched in each Pod of the container cluster to execute the subtask scheduled onto that Pod;
before each training process begins executing its subtask's training,
the training process running on each Pod obtains the (virtual) RDMA network card IP allocated to that Pod;
each Pod parses its own environment variable TF_CONFIG (the form the environment configuration parameters take in this example deployment) to obtain the task name and task ID of its subtask;
the Pod with task ID 0 serves as the master node and the other Pods as slave nodes;
the training process running on the master node waits for all other computing nodes to establish TCP connections with it and creates an array recording the task ID and RDMA network card IP of every Pod in the cluster;
the training process running on each slave node obtains the conventional-network IP of the master node and establishes a TCP connection to it using that IP; it then sends its previously obtained local RDMA network card IP together with its task ID to the master node over the TCP connection;
the training process running on the master node, besides recording the local task ID and RDMA network card IP, inserts each (task ID, RDMA network card IP) pair sent by a slave node into the array as it is received;
after receiving all task IDs and RDMA network card IPs of the cluster, the master node replaces the default network connection parameter, the connection-access service name in TF_CONFIG, with the RDMA network card IPs, regenerating TF_CONFIG to replace the original one:
for the master node itself, the master training process generates the TF_CONFIG for task ID 0 and applies it directly; for each slave node, the master training process generates the corresponding TF_CONFIG according to its task ID and transmits it to that node, which applies the update;
after a slave node finishes updating, it notifies the master node and closes the TCP connection;
each Pod can then begin executing TensorFlow training and enter the graph-construction and data-parallel computation stage.
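The TF_CONFIG rewrite performed by the master node in the Fig. 1 flow can be illustrated as follows (a hedged sketch, not the patented code: the service names and RDMA addresses are invented, and the rewrite simply swaps each default connection-access service name for the corresponding RDMA network card IP while keeping the port):

```python
import json

def rewrite_tf_config(tf_config_json, rdma_ips):
    """Replace the host part of every address in TF_CONFIG's cluster spec
    with the corresponding RDMA network card IP (rdma_ips is indexed by
    role name and task index), preserving the original ports."""
    cfg = json.loads(tf_config_json)
    for role, addrs in cfg["cluster"].items():
        for i, addr in enumerate(addrs):
            port = addr.rsplit(":", 1)[1]
            addrs[i] = f"{rdma_ips[role][i]}:{port}"
    return json.dumps(cfg)

original = json.dumps({
    "cluster": {"worker": ["trainer-svc-0:2222", "trainer-svc-1:2222"]},
    "task": {"type": "worker", "index": 1},
})
updated = rewrite_tf_config(
    original, rdma_ips={"worker": ["192.168.100.10", "192.168.100.11"]})
print(json.loads(updated)["cluster"]["worker"])
```

Only the `"cluster"` addresses change; the `"task"` identity is left intact, which is why the training process can resume with the new TF_CONFIG without rescheduling.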
Still other preferred embodiments of the present invention provide a system for resetting the data transmission network during training of a distributed training task. The system comprises an RDMA network information gathering unit and a task environment configuration parameter updating unit; wherein,
the RDMA network information gathering unit, which runs on each computing node of the training cluster,
before the distributed training is started,
or after it is started, that is, after each computing node of the training cluster has launched the training application of its subtask and before any subtask application begins executing training,
obtains the RDMA network IP of each computing node of the training cluster;
a master node is determined from among the computing nodes according to a (preset) task ID in the environment configuration parameters or by means of ZooKeeper, and aggregates the cluster's RDMA network information (the other computing nodes are slave nodes and transmit their RDMA network IPs to the master node over the conventional network, using the network connection parameters in their environment configuration parameters);
and the task environment configuration parameter updating unit updates the environment configuration parameters of each computing node of the cluster, i.e., of each subtask, by replacing their default network connection parameters with the RDMA network IPs from the cluster RDMA network information aggregated by the gathering unit.
Some of the above preferred embodiments provide systems in which the data transmission network can be reset during the training process itself. Such a system further includes a distributed training process control unit: after the distributed training task has been scheduled to a training cluster with an RDMA network and training has started, this unit suspends the distributed training process on each computing node while the RDMA network information gathering unit aggregates the cluster's RDMA network information and the task environment configuration parameter updating unit updates the environment configuration parameters of each subtask accordingly; that is, the distributed training process resumes once the data transmission network has been reset.
In some of the systems provided by the foregoing preferred embodiments, the task environment configuration parameter updating unit may update the configuration parameters of the subtasks from the training cluster's RDMA network information as follows: the master node generates the corresponding new environment configuration parameters for each subtask from the cluster RDMA network information and transmits them to the corresponding slave nodes over the conventional network, replacing the original environment configuration parameters on each computing node;
or the master node transmits the cluster RDMA network information to each slave node over the conventional network, and each computing node generates the new environment configuration parameters for its own subtask, replacing its original ones.
In some of the systems for resetting a data transmission network during training of a distributed training task provided by the foregoing preferred embodiments, the computing nodes of the training cluster include, besides the worker nodes (Workers) that perform data-parallel computation during training, a parameter server (Parameter Server) node responsible for maintaining the globally shared parameters.
Some of the above preferred embodiments provide a system for resetting a data transmission network during distributed training in which the distributed training task is deployed on a container cloud platform; a computing node is then not a physical host in the usual sense, but a virtualized compute resource such as a container or container group.
The above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto.

Claims (10)

1. A method for resetting a data transmission network during training of a distributed training task, comprising:
when a distributed training task is scheduled to a training cluster with an RDMA network,
before the distributed training is started up,
or after distributed training is initiated and before each subtask application performs training,
respectively acquiring RDMA network IP of each computing node of the cluster;
and determining the role of each computing node in the network resetting process:
determining one of the computing nodes as a master node and the other computing nodes as slave nodes;
the slave nodes transmit the RDMA network IP of their respective computing nodes to the master node via the conventional network, according to the network connection parameters in their respective environment configuration parameters;
the master node then aggregates the RDMA network information of the training cluster:
recording the master node and its RDMA network IP, and each slave node and its RDMA network IP;
after all the RDMA network information of the training cluster has been collected, updating the environment configuration parameters of each subtask according to that information:
replacing the default network connection parameters in the environment configuration parameters with the RDMA network IPs from the training cluster RDMA network information.
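A minimal sketch of the collection step recited above, in which the slave nodes report their RDMA network IPs to the master over the conventional TCP network and the master aggregates them into a cluster table, might look as follows; the port, the JSON message format, and the function names are illustrative assumptions.

```python
import json
import socket

def master_collect(bind_addr, expected_nodes):
    """Listen on the conventional network and aggregate slave RDMA IPs
    into a {task_id: rdma_ip} table for the training cluster."""
    table = {}
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(bind_addr)
    srv.listen(expected_nodes)
    while len(table) < expected_nodes:
        conn, _ = srv.accept()
        msg = json.loads(conn.recv(4096).decode())
        table[msg["task_id"]] = msg["rdma_ip"]  # record node and its RDMA IP
        conn.close()
    srv.close()
    return table

def slave_report(master_addr, task_id, rdma_ip):
    """Send this node's RDMA network IP to the master via the conventional network."""
    with socket.create_connection(master_addr) as conn:
        conn.sendall(json.dumps({"task_id": task_id, "rdma_ip": rdma_ip}).encode())
```

Once the table is complete, its RDMA IPs replace the default network connection parameters in each subtask's environment configuration.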
2. The method of resetting a data transmission network during training of a distributed training task of claim 1,
wherein the master node and the slave nodes are determined according to the task ID in the environment configuration parameters or by means of ZooKeeper.
3. The method of resetting a data transmission network during training of a distributed training task of claim 1,
generating, by the master node, corresponding new environment configuration parameters for each subtask according to the cluster RDMA network information, and transmitting them to the corresponding slave nodes via the conventional network to replace the original environment configuration parameters of each computing node;
or
transmitting, by the master node, the cluster RDMA network information to each slave node via the conventional network, with each computing node generating new environment configuration parameters for its own task to replace its original environment configuration parameters.
4. The method of resetting a data transmission network during training of a distributed training task of claim 1,
wherein the computing nodes further comprise a parameter server node for maintaining the globally shared parameters.
5. The method of resetting a data transmission network during training of a distributed training task of claim 1,
wherein, when the distributed training task is deployed on a container cloud platform, the computing nodes are containers/container groups.
6. A system for resetting a data transmission network during training of a distributed training task, comprising:
an RDMA network information collection unit and a task environment configuration parameter updating unit; wherein
the RDMA network information collection unit is configured to respectively acquire the RDMA network IP of each computing node of the training cluster and to determine a master node that aggregates the RDMA network information of the training cluster;
and the task environment configuration parameter updating unit updates the environment configuration parameters of each subtask by replacing the default network connection parameters in those parameters with the RDMA network IPs from the training cluster RDMA network information collected by the RDMA network information collection unit.
7. The system for resetting a data transmission network during training of a distributed training task of claim 6, further comprising
a distributed training process control unit which,
after the distributed training task has been scheduled to a training cluster with an RDMA network and distributed training has started, suspends the distributed training process on each computing node of the cluster,
and resumes the distributed training process after the RDMA network information collection unit has collected the RDMA network information of the training cluster and the task environment configuration parameter updating unit has updated the environment configuration parameters of each subtask according to that information.
8. The system for resetting a data transmission network during distributed training task training according to claim 6,
wherein, when the task environment configuration parameter updating unit updates the subtask environment configuration parameters according to the training cluster RDMA network information:
the master node generates corresponding new environment configuration parameters for each subtask according to the cluster RDMA network information and transmits them to the corresponding slave nodes via the conventional network to replace the original environment configuration parameters of each computing node;
or
the master node transmits the cluster RDMA network information to each slave node via the conventional network, and each computing node generates new environment configuration parameters for its own task to replace its original environment configuration parameters.
9. The system for resetting a data transmission network during distributed training task training according to claim 6,
wherein the computing nodes further comprise parameter server nodes for maintaining the globally shared parameters.
10. The system for resetting a data transmission network during distributed training task training according to claim 6,
wherein, when the distributed training task is deployed on a container cloud platform, the computing nodes are containers/container groups.
CN201910731784.8A 2019-08-08 2019-08-08 Method for resetting data transmission network in distributed training task training process Active CN112350842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910731784.8A CN112350842B (en) 2019-08-08 2019-08-08 Method for resetting data transmission network in distributed training task training process


Publications (2)

Publication Number Publication Date
CN112350842A CN112350842A (en) 2021-02-09
CN112350842B true CN112350842B (en) 2023-04-07

Family

ID=74366879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910731784.8A Active CN112350842B (en) 2019-08-08 2019-08-08 Method for resetting data transmission network in distributed training task training process

Country Status (1)

Country Link
CN (1) CN112350842B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114022B * 2022-06-24 2024-10-15 Suzhou Inspur Intelligent Technology Co., Ltd. Method, system, device and medium for using GPU resources

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103763173A * 2013-12-31 2014-04-30 Huawei Technologies Co., Ltd. Data transmission method and computing node
CN109067752A * 2018-08-15 2018-12-21 Wuxi Jiangnan Institute of Computing Technology A method for implementing TCP/IP protocol compatibility using RDMA messages
CN109634735A * 2018-12-18 2019-04-16 Zhengzhou Yunhai Information Technology Co., Ltd. Method and device for scheduling Pods

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9378179B2 (en) * 2012-11-21 2016-06-28 International Business Machines Corporation RDMA-optimized high-performance distributed cache
US10257273B2 (en) * 2015-07-31 2019-04-09 Netapp, Inc. Systems, methods and devices for RDMA read/write operations
US20170034267A1 (en) * 2015-07-31 2017-02-02 Netapp, Inc. Methods for transferring data in a storage cluster and devices thereof


Also Published As

Publication number Publication date
CN112350842A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
US11588675B2 (en) Systems and methods for selectively implementing services on virtual machines and containers
CN108924217B (en) A method for automatic deployment of distributed cloud system
CN109976774B (en) Block link point deployment method, device, equipment and storage medium
US10496503B2 (en) Healing cloud services during upgrades
CN106708622B (en) Cluster resource processing method and system and resource processing cluster
EP2849064B1 (en) Method and apparatus for network virtualization
CN107463582B (en) Distributed Hadoop cluster deployment method and device
WO2022141727A1 (en) Resource deployment system and method based on cloud cost
US20170289060A1 (en) Model driven process for automated deployment of domain 2.0 virtualized services and applications on cloud infrastructure
US20170373931A1 (en) Method for updating network service descriptor nsd and apparatus
US11093296B2 (en) System, virtualization control apparatus, method for controlling a virtualization control apparatus, and program
CN107145380A (en) Virtual resource method of combination and device
CN110297670B (en) Method and system for improving training efficiency of distributed tasks on container cloud
KR102438214B1 (en) ICT service provision method and system
CN113660316B (en) Network resource adaptive configuration method, system and medium based on container cloud platform
CN114615268B (en) Service network, monitoring node, container node and equipment based on Kubernetes cluster
CN110308987B (en) Method for updating connection parameters of distributed training tasks on container cloud
CN110308986A (en) The method of distributed training data communication on container cloud based on Optimized Operation
Csoma et al. Management and orchestration for network function virtualization: An open source MANO approach
CN112698838A (en) Multi-cloud container deployment system and container deployment method thereof
CN112350842B (en) Method for resetting data transmission network in distributed training task training process
CN113138831B (en) Network resetting method and acceleration distributed training method and system based on same
CN112348196A (en) Distributed machine learning system and method of self-adaptive RDMA (remote direct memory Access) network
CN110300192A Method for updating connection parameters of a distributed training task according to an IP allocation table
EP4213468A1 (en) Automated deployment of control nodes at remote locations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant