
CN112350842B - Method for resetting data transmission network in distributed training task training process - Google Patents

Method for resetting data transmission network in distributed training task training process

Info

Publication number
CN112350842B
CN112350842B (application CN201910731784.8A)
Authority
CN
China
Prior art keywords
training
network
rdma
cluster
environment configuration
Prior art date
Legal status
Active
Application number
CN201910731784.8A
Other languages
Chinese (zh)
Other versions
CN112350842A (en)
Inventor
张翔宇
郭昊
张曼妮
孙军欢
赵来松
Current Assignee
Shenzhen Zhixing Technology Co Ltd
Original Assignee
Shenzhen Zhixing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhixing Technology Co Ltd filed Critical Shenzhen Zhixing Technology Co Ltd
Priority to CN201910731784.8A priority Critical patent/CN112350842B/en
Publication of CN112350842A publication Critical patent/CN112350842A/en
Application granted granted Critical
Publication of CN112350842B publication Critical patent/CN112350842B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0803Configuration setting
    • H04L41/0813Configuration setting characterised by the conditions triggering a change of settings
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45595Network integration; Enabling network access in virtual machine instances
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Small-Scale Networks (AREA)

Abstract

The invention provides a method for resetting the data transmission network during training of a distributed training task. After the distributed training task is scheduled to a training cluster equipped with an RDMA network, the RDMA network IP of each subtask's computing node is obtained, and one computing node is determined to be the master node (the rest become slave nodes); the master node aggregates the RDMA network information of the training cluster. Once aggregation is complete, the environment configuration parameters of each subtask are updated according to the cluster RDMA network information, so that communication during distributed training follows the updated parameters and the data transmission network is thereby reset to the RDMA network.

Description

Method for resetting data transmission network in training process of distributed training task
Technical Field
The invention relates to the field of distributed machine learning and to container cloud technology, and more particularly to a method for resetting the data transmission network during training of a distributed training task.
Background
Machine learning, particularly deep learning, has enjoyed wide success in artificial-intelligence-driven services. As models become more complex, training them becomes increasingly expensive. Achieving efficient and timely training therefore requires exploiting the parallel computing power of distributed systems. Industry-leading enterprises such as Microsoft, Facebook, and Google have begun running distributed machine learning training tasks on production clusters of hundreds or thousands of servers.
However, a physical cluster for distributed training is, from build-out and deployment through operation and maintenance, a highly specialized, complex, and often cumbersome undertaking. Applying container cloud technology to distributed machine learning can undoubtedly simplify the construction, deployment, and operations work.
Container cloud technology not only enables rapid deployment of container clusters; as a lightweight solution it can also effectively integrate and manage bare-metal resources. Taking running a distributed machine learning training task on the Kubernetes platform as an example: Kubernetes provides a consistent way to package applications, ensures that applications run consistently on different devices, isolates the resources of each application's runtime environment, abstracts away the complexity of the underlying hardware and of node management, and supports GPU scheduling.
However, whether the training cluster is a physical cluster built from multiple host servers or one deployed on a container cloud platform, data transmission between computing nodes is usually carried over network communication based on the TCP/IP protocol (the protocol commonly used by today's wide-area and local-area networks). This kind of communication requires the intervention of the operating system and its protocol stack; as training sets grow ever larger, parameter exchange inevitably consumes a large amount of CPU resources and introduces significant network latency, severely constraining training efficiency.
Remote Direct Memory Access (RDMA) is a direct memory access technology: it transfers data directly from the memory of one computer to that of another without the intervention of either party's operating system. Compared with a conventional network based on the general TCP/IP protocol, RDMA communication therefore avoids heavy CPU consumption during network transmission while also reducing network latency. Building or deploying a training cluster with an RDMA network for a distributed training task, and carrying the training data (e.g., the data exchanged during parameter exchange) over RDMA, is thus an effective way to break through the communication bottleneck of the parameter-exchange network and improve distributed training efficiency.
During distributed training, the dependencies among the subtasks assigned to the computing nodes, and the data consistency among those subtasks, are generally governed by environment configuration parameters. Typically, the environment configuration parameters for each subtask include information about all subtasks as well as about the current subtask (e.g., subtask number, network connection parameters, etc.). In actual deployment and training, these parameters are used not only to schedule the distributed task onto the training cluster (i.e., to distribute each subtask to a computing node of the cluster), but also, through their network connection parameters, to establish data communication between the training applications running on different computing nodes.
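As a concrete illustration (a hypothetical sketch, not the patented format: the field names follow TensorFlow's TF_CONFIG convention mentioned later in this document, and the host names and port are invented), such an environment configuration parameter set can be expressed as a JSON document that lists the connection parameters of every subtask plus the identity of the current one:

```python
import json

# Hypothetical environment configuration for a three-subtask training job.
# "cluster" lists the network connection parameters of every subtask;
# "task" identifies the current subtask (its role name and index).
env_config = {
    "cluster": {
        "worker": ["worker-0.default.svc:2222", "worker-1.default.svc:2222"],
        "ps": ["ps-0.default.svc:2222"],
    },
    "task": {"type": "worker", "index": 0},
}

serialized = json.dumps(env_config)  # as it would sit in an env variable
parsed = json.loads(serialized)

# The current subtask can locate its own connection parameter:
own_addr = parsed["cluster"][parsed["task"]["type"]][parsed["task"]["index"]]
print(own_addr)
```

The key point for the rest of the document is that the addresses under `"cluster"` are exactly the network connection parameters that the invention later replaces with RDMA network IPs.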
Accordingly, in practice, take deploying a distributed training task on a physical cluster with an RDMA network as an example: to achieve efficient distributed training in the RDMA network environment, the RDMA network IPs of the cluster's computing nodes are usually obtained first, environment configuration parameters containing those RDMA network IPs (as the network connection parameters) are generated manually or by script, and efficient distributed training then proceeds after the task is scheduled onto the training cluster.
However, deploying distributed training on a container cloud platform is often considered a more efficient use of platform resources. To make better use of resources, a container cloud platform typically deploys a training task as follows: it first decomposes the task into several subtasks and generates environment configuration parameters for them, and then creates a corresponding container or container group for each subtask (the container/container group is the smallest unit of orchestration in a container cluster: a container runs a single application in a containerized environment, while a container group is a logical host that runs one or more tightly coupled application containers, such as a Pod on the Kubernetes platform). Once distributed training starts, the subtask training applications on the computing nodes communicate through a connection-access service over the conventional network (i.e., the TCP/IP-based network that usually serves as the default network of a multi-network cluster). This communication mechanism requires exactly the system-kernel intervention that efficient RDMA communication is designed to avoid.
In summary, because RDMA network information (e.g., the RDMA network IPs) cannot be obtained in advance, the training applications running on the computing nodes (i.e., the containers/container groups used for training) can neither discover nor efficiently use the RDMA network, even when the container training cluster has one.
In addition, although the aforementioned approach can deploy training tasks on an RDMA-equipped physical cluster, manually writing the RDMA network IPs into the environment configuration parameters as network connection parameters makes errors hard to avoid; even a slight configuration mistake prevents the whole training cluster from providing a usable RDMA network for the task's data transmission, and the entire deployment fails.
Disclosure of Invention
In view of this, the present invention provides a method for resetting the data transmission network during training of a distributed training task. After the task is scheduled to a training cluster with an RDMA network, the RDMA network IP of each subtask's computing node is obtained, a master node is determined from among the computing nodes (the rest become slave nodes) to aggregate the cluster's RDMA network information, and the environment configuration parameters of each subtask are then updated according to the aggregated RDMA network information, so that the updated parameters reset the data transmission network to the RDMA network for the distributed training process.
In one aspect, an embodiment of the present invention provides a method for resetting a data transmission network during a training process of a distributed training task.
The method for resetting the data transmission network in the training process of the distributed training task comprises the following steps:
when a distributed training task is scheduled to a training cluster with an RDMA network,
before the distributed training is started,
or after it is started, that is, after each computing node of the training cluster has launched the training application of its subtask and before any subtask application begins executing training,
the RDMA network IP of each computing node of the cluster is obtained;
and the role of each computing node in the network reset is determined:
that is, one computing node is determined to be the master node and the remaining computing nodes are slave nodes;
each slave node transmits the RDMA network IP of its computing node to the master node over the conventional network (i.e., the TCP/IP-based network), using the network connection parameters in its own environment configuration parameters;
the master node then aggregates the RDMA network information of the training cluster:
recording the master node and its RDMA network IP, and each slave node and its RDMA network IP;
once all RDMA network information of the training cluster has been aggregated, the environment configuration parameters of each subtask are updated according to it:
the RDMA network IPs in the cluster's RDMA network information replace the default network connection parameters in the environment configuration parameters;
the data transmission network of the distributed training task is thus reset;
after the training application of each subtask begins executing training, the subtask's data transmission, especially the transmission of large volumes of training data, can then use the training cluster's RDMA network according to the updated environment configuration parameters, achieving efficient data transfer.
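The master node's aggregation step above can be sketched as follows (a minimal illustration, not the patented implementation: the in-memory dictionary stands in for the master's record, and the completeness check assumes the expected node count is known from the environment configuration parameters):

```python
# Master-side aggregation of cluster RDMA network information.
# "reports" holds the (task_id, rdma_ip) pairs that slave nodes sent to
# the master over the conventional TCP/IP network.

def aggregate_rdma_info(own_task_id, own_rdma_ip, reports, expected_nodes):
    """Build the cluster RDMA table from the master's own entry plus the
    entries reported by the slave nodes; fail if any node is missing."""
    table = {own_task_id: own_rdma_ip}
    for task_id, rdma_ip in reports:
        table[task_id] = rdma_ip
    if len(table) != expected_nodes:
        raise RuntimeError("RDMA info still incomplete: "
                           f"{len(table)}/{expected_nodes} nodes reported")
    return table

cluster = aggregate_rdma_info(
    own_task_id=0, own_rdma_ip="192.168.100.10",
    reports=[(1, "192.168.100.11"), (2, "192.168.100.12")],
    expected_nodes=3)
print(sorted(cluster.items()))
```

Keying the table by task ID rather than by arrival order lets every node later derive an identical, deterministic address list from the same table.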
In another aspect, an embodiment of the present invention provides a system for resetting a data transmission network during a training process of a distributed training task.
The system for resetting the data transmission network during training of the distributed training task comprises:
an RDMA network information gathering unit and a task environment configuration parameter updating unit; wherein,
the RDMA network information gathering unit is used to obtain the RDMA network IP of each computing node of the training cluster and to determine a master node that aggregates the cluster's RDMA network information;
and the task environment configuration parameter updating unit updates the environment configuration parameters of each subtask by replacing their default network connection parameters with the RDMA network IPs from the cluster RDMA network information aggregated by the gathering unit.
In the method and system of the above embodiments, after the distributed training task is scheduled to a training cluster with an RDMA network, the RDMA network IP of each computing node is obtained and the determined master node aggregates the cluster's RDMA network information; the environment configuration parameters of each subtask are then updated accordingly, so that communication during distributed training follows the updated parameters and the data transmission network is reset to the RDMA network. The method and system remove the communication bottleneck in the training process while avoiding modifications that would complicate the task scheduling mechanism.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings related to a part of the embodiments of the present invention or the description in the prior art will be briefly introduced below.
Fig. 1 is a schematic flow diagram of updating the TF_CONFIG of each subtask (i.e., the corresponding Pod) while deploying a distributed TensorFlow task on the Kubernetes platform, based on the method for resetting the data transmission network during distributed training provided in a preferred embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the described embodiments are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of protection of the present invention.
Some preferred embodiments of the invention are as follows. Some of these preferred embodiments provide a method of resetting a data transmission network during training of a distributed training task. The method comprises the following steps:
when a distributed training task is scheduled to a training cluster with an RDMA network,
before distributed training is started, a process is launched on each computing node, and the RDMA network IP of each computing node is obtained;
a master node for the network reset is determined from among the computing nodes according to a (preset) task ID in the environment configuration parameters or by means of ZooKeeper, the remaining nodes being slave nodes;
each slave node transmits the RDMA network IP of its computing node to the master node over the conventional network (i.e., the TCP/IP-based network), using the network connection parameters in its own environment configuration parameters;
the master node then aggregates the RDMA network information of the training cluster:
recording the master node and its RDMA network IP, and each slave node and its RDMA network IP;
once all RDMA network information of the training cluster has been aggregated, the environment configuration parameters of each subtask are updated according to it:
the RDMA network IPs in the cluster's RDMA network information replace the default network connection parameters in the environment configuration parameters;
the data transmission network of the distributed training task is thus reset; once the reset is complete, the start of distributed training can be signaled.
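The master election in the steps above can be as simple as a deterministic rule over the preset task IDs (a hypothetical sketch; the rank-0 convention mirrors the Fig. 1 embodiment, where the Pod with task ID 0 acts as master, and no coordination service is needed because every node computes the same result locally):

```python
def elect_master(task_ids):
    """Deterministically pick the master node: the node whose task ID is
    smallest (task ID 0 in the Fig. 1 embodiment). Every node can run
    this locally on the shared task-ID list and reach the same answer."""
    return min(task_ids)

def my_role(own_task_id, task_ids):
    """Return this node's role in the network-reset procedure."""
    return "master" if own_task_id == elect_master(task_ids) else "slave"

print(my_role(0, [0, 1, 2]))  # the task-ID-0 node becomes the master
print(my_role(2, [0, 1, 2]))
```

The ZooKeeper alternative mentioned in the text would replace `elect_master` with a leader-election recipe, which is useful when task IDs are not preassigned.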
Still others of these preferred embodiments provide another method of resetting the data transmission network during training of a distributed training task. This method can reset the transmission network even after distributed training has started, specifically as follows:
when a distributed training task is scheduled to a training cluster with an RDMA network,
after distributed training is started, that is, after each computing node of the training cluster has launched the training application of its subtask, and before any subtask application begins executing training,
each training application obtains the RDMA network IP of its computing node;
each training application determines the master node for the network reset from among the computing nodes according to the (preset) task ID in the environment configuration parameters of its subtask, the remaining nodes being slave nodes;
or the master node is determined from among the computing nodes by means of ZooKeeper, the remaining nodes being slave nodes;
after the master and slave nodes are confirmed, the training application of each slave node transmits the RDMA network IP of its computing node to the master node over the conventional network (i.e., the TCP/IP-based network), using the network connection parameters in its own environment configuration parameters;
the training application of the master node creates a list recording the master node and its RDMA network IP, and each slave node and its RDMA network IP (i.e., the RDMA network information of the training cluster);
after the master node has aggregated all RDMA network information of the training cluster, the environment configuration parameters of each subtask are updated according to that information:
the RDMA network IPs in the cluster's RDMA network information replace the default network connection parameters in the environment configuration parameters;
each node's training application can then begin executing training, and data transmission during training uses the RDMA network through the updated environment configuration parameters; that is, the reset of the data transmission network is accomplished.
In some of the methods provided by the foregoing preferred embodiments, the environment configuration parameters of each subtask may be updated from the cluster RDMA network information aggregated by the master node as follows: the master node generates the corresponding new environment configuration parameters for each subtask from the cluster RDMA network information and transmits them to the corresponding slave nodes over the conventional network, replacing the original environment configuration parameters on each computing node;
or the master node transmits the cluster RDMA network information to each slave node over the conventional network, and each computing node generates the new environment configuration parameters for its own subtask, replacing its original ones.
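The first update strategy above (the master node regenerating the environment configuration parameters for every subtask) can be sketched like this (a hypothetical illustration, not the patented code: the field layout reuses the TF_CONFIG-style structure of the Fig. 1 embodiment, and the port number and all-worker role map are invented):

```python
def regenerate_configs(rdma_table, roles, port=2222):
    """Given the aggregated {task_id: rdma_ip} table and each task's role,
    build a new per-subtask environment configuration whose connection
    parameters point at the RDMA network IPs instead of the defaults."""
    # Addresses ordered by task ID so every node derives the same list.
    addresses = [f"{rdma_table[tid]}:{port}" for tid in sorted(rdma_table)]
    configs = {}
    for tid in sorted(rdma_table):
        configs[tid] = {
            "cluster": {"worker": addresses},
            "task": {"type": roles[tid], "index": tid},
        }
    return configs

rdma_table = {0: "192.168.100.10", 1: "192.168.100.11"}
configs = regenerate_configs(rdma_table, roles={0: "worker", 1: "worker"})
print(configs[1]["cluster"]["worker"])
```

In the second strategy, the master would instead send `rdma_table` itself over the conventional network and each node would call `regenerate_configs` locally, building only its own entry.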
In some of the methods for resetting a data transmission network during training of a distributed training task provided by the foregoing preferred embodiments, the computing nodes of the training cluster include, besides the worker nodes (Workers) that perform data-parallel computation during training, a parameter server node (Parameter Server) responsible for maintaining the globally shared parameters.
Some of the above preferred embodiments provide a method in which the distributed training task is deployed on a container cloud platform; its computing nodes are then not physical hosts in the usual sense but virtualized computing resources such as containers/container groups.
Fig. 1 is a schematic flow diagram of updating the TF_CONFIG of each subtask (i.e., the corresponding Pod) while deploying a distributed TensorFlow task on the Kubernetes platform, based on the method for resetting the data transmission network provided in a preferred embodiment of the present invention. As shown in Fig. 1,
after the distributed TensorFlow task is scheduled to a container cluster on the Kubernetes platform that is configured with both a conventional network and an RDMA network,
the distributed TensorFlow task is started: a distributed TensorFlow training process is launched in each Pod of the container cluster to execute the subtask scheduled onto that Pod;
before each training process begins executing its subtask's training,
the training process running on each Pod obtains the (virtual) RDMA network card IP allocated to that Pod;
each Pod parses its own environment variable TF_CONFIG (the form the environment configuration parameters take in this example deployment) to obtain the task name and task ID of its subtask;
the Pod with task ID 0 serves as the master node and the other Pods as slave nodes;
the training process running on the master node waits for all other computing nodes to establish TCP connections with it and creates an array recording the task ID and RDMA network card IP of every Pod in the cluster;
the training process running on each slave node obtains the conventional-network IP of the master node and establishes a TCP connection to it using that IP; it then sends its previously obtained local RDMA network card IP together with its task ID to the master node over the TCP connection;
the training process running on the master node, besides recording the local task ID and RDMA network card IP, inserts each (task ID, RDMA network card IP) pair sent by a slave node into the array as it is received;
after receiving all task IDs and RDMA network card IPs of the cluster, the master node replaces the default network connection parameter, the connection-access service name in TF_CONFIG, with the RDMA network card IPs, regenerating TF_CONFIG to replace the original one:
for the master node itself, the master training process generates the TF_CONFIG for task ID 0 and applies it directly; for each slave node, the master training process generates the corresponding TF_CONFIG according to its task ID and transmits it to that node, which applies the update;
after a slave node finishes updating, it notifies the master node and closes the TCP connection;
each Pod can then begin executing TensorFlow training and enter the graph-construction and data-parallel computation stage.
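The TF_CONFIG rewrite performed by the master node in the Fig. 1 flow can be illustrated as follows (a hedged sketch, not the patented code: the service names and RDMA addresses are invented, and the rewrite simply swaps each default connection-access service name for the corresponding RDMA network card IP while keeping the port):

```python
import json

def rewrite_tf_config(tf_config_json, rdma_ips):
    """Replace the host part of every address in TF_CONFIG's cluster spec
    with the corresponding RDMA network card IP (rdma_ips is indexed by
    role name and task index), preserving the original ports."""
    cfg = json.loads(tf_config_json)
    for role, addrs in cfg["cluster"].items():
        for i, addr in enumerate(addrs):
            port = addr.rsplit(":", 1)[1]
            addrs[i] = f"{rdma_ips[role][i]}:{port}"
    return json.dumps(cfg)

original = json.dumps({
    "cluster": {"worker": ["trainer-svc-0:2222", "trainer-svc-1:2222"]},
    "task": {"type": "worker", "index": 1},
})
updated = rewrite_tf_config(
    original, rdma_ips={"worker": ["192.168.100.10", "192.168.100.11"]})
print(json.loads(updated)["cluster"]["worker"])
```

Only the `"cluster"` addresses change; the `"task"` identity is left intact, which is why the training process can resume with the new TF_CONFIG without rescheduling.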
Still other preferred embodiments of the present invention provide a system for resetting the data transmission network during training of a distributed training task. The system comprises an RDMA network information gathering unit and a task environment configuration parameter updating unit; wherein,
the RDMA network information gathering unit, which runs on each computing node of the training cluster,
before the distributed training is started,
or after it is started, that is, after each computing node of the training cluster has launched the training application of its subtask and before any subtask application begins executing training,
obtains the RDMA network IP of each computing node of the training cluster;
a master node is determined from among the computing nodes according to a (preset) task ID in the environment configuration parameters or by means of ZooKeeper, and aggregates the cluster's RDMA network information (the other computing nodes are slave nodes and transmit their RDMA network IPs to the master node over the conventional network, using the network connection parameters in their environment configuration parameters);
and the task environment configuration parameter updating unit updates the environment configuration parameters of each computing node of the cluster, i.e., of each subtask, by replacing their default network connection parameters with the RDMA network IPs from the cluster RDMA network information aggregated by the gathering unit.
Some of the above preferred embodiments provide systems in which the data transmission network can be reset during the training process itself. Such a system further includes a distributed training process control unit: after the distributed training task has been scheduled to a training cluster with an RDMA network and training has started, this unit suspends the distributed training process on each computing node while the RDMA network information gathering unit aggregates the cluster's RDMA network information and the task environment configuration parameter updating unit updates the environment configuration parameters of each subtask accordingly; that is, the distributed training process resumes once the data transmission network has been reset.
In some of the systems provided by the foregoing preferred embodiments, the task environment configuration parameter updating unit may update the configuration parameters of the subtasks from the training cluster's RDMA network information as follows: the master node generates the corresponding new environment configuration parameters for each subtask from the cluster RDMA network information and transmits them to the corresponding slave nodes over the conventional network, replacing the original environment configuration parameters on each computing node;
or the master node transmits the cluster RDMA network information to each slave node over the conventional network, and each computing node generates the new environment configuration parameters for its own subtask, replacing its original ones.
In some of the systems for resetting a data transmission network during training of a distributed training task provided by the foregoing preferred embodiments, the computing nodes of the training cluster include, besides the worker nodes (Workers) that perform data-parallel computation during training, a parameter server (Parameter Server) node responsible for maintaining the globally shared parameters.
Some of the above preferred embodiments provide a system for resetting a data transmission network during distributed training in which the distributed training task is deployed on a container cloud platform; a computing node is then not a physical host in the usual sense, but a virtualized compute resource such as a container or container group.
The above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto.

Claims (10)

1. A method for resetting a data transmission network during training of a distributed training task, comprising:
when a distributed training task is scheduled to a training cluster with an RDMA network,
before the distributed training is started up,
or after distributed training is initiated and before each subtask application performs training,
respectively acquiring RDMA network IP of each computing node of the cluster;
and determining the role of each computing node in the network resetting process:
determining one of the computing nodes as a master node and the other computing nodes as slave nodes;
the slave nodes transmit the RDMA network IP of their respective computing nodes to the master node via the conventional network, according to the network connection parameters in their respective environment configuration parameters;
the master node then aggregates the RDMA network information of the training cluster:
recording the master node and its RDMA network IP, and each slave node and its RDMA network IP;
after all the RDMA network information of the training cluster has been collected, updating the environment configuration parameters of each subtask according to that information:
replacing the default network connection parameters in the environment configuration parameters with the RDMA network IPs from the training cluster RDMA network information.
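A minimal sketch of the collection step recited above, in which the slave nodes report their RDMA network IPs to the master over the conventional TCP network and the master aggregates them into a cluster table, might look as follows; the port, the JSON message format, and the function names are illustrative assumptions.

```python
import json
import socket

def master_collect(bind_addr, expected_nodes):
    """Listen on the conventional network and aggregate slave RDMA IPs
    into a {task_id: rdma_ip} table for the training cluster."""
    table = {}
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(bind_addr)
    srv.listen(expected_nodes)
    while len(table) < expected_nodes:
        conn, _ = srv.accept()
        msg = json.loads(conn.recv(4096).decode())
        table[msg["task_id"]] = msg["rdma_ip"]  # record node and its RDMA IP
        conn.close()
    srv.close()
    return table

def slave_report(master_addr, task_id, rdma_ip):
    """Send this node's RDMA network IP to the master via the conventional network."""
    with socket.create_connection(master_addr) as conn:
        conn.sendall(json.dumps({"task_id": task_id, "rdma_ip": rdma_ip}).encode())
```

Once the table is complete, its RDMA IPs replace the default network connection parameters in each subtask's environment configuration.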
2. The method of resetting a data transmission network during training of a distributed training task of claim 1,
wherein the master node and the slave nodes are determined according to the task ID in the environment configuration parameters or by means of ZooKeeper.
3. The method of resetting a data transmission network during training of a distributed training task of claim 1,
generating, by the master node, corresponding new environment configuration parameters for each subtask according to the cluster RDMA network information, and transmitting them to the corresponding slave nodes via the conventional network to replace the original environment configuration parameters of each computing node;
or
transmitting, by the master node, the cluster RDMA network information to each slave node via the conventional network, with each computing node generating new environment configuration parameters for its own task to replace its original environment configuration parameters.
4. The method of resetting a data transmission network during training of a distributed training task of claim 1,
wherein the computing nodes further comprise a parameter server node for maintaining the globally shared parameters.
5. The method of resetting a data transmission network during training of a distributed training task of claim 1,
wherein, when the distributed training task is deployed on a container cloud platform, the computing nodes are containers/container groups.
6. A system for resetting a data transmission network during training of a distributed training task, comprising:
an RDMA network information collection unit and a task environment configuration parameter updating unit; wherein
the RDMA network information collection unit is configured to respectively acquire the RDMA network IP of each computing node of the training cluster and to determine a master node that aggregates the RDMA network information of the training cluster;
and the task environment configuration parameter updating unit updates the environment configuration parameters of each subtask by replacing the default network connection parameters in those parameters with the RDMA network IPs from the training cluster RDMA network information collected by the RDMA network information collection unit.
7. The system for resetting a data transmission network during training of a distributed training task of claim 6, further comprising
a distributed training process control unit which,
after the distributed training task has been scheduled to a training cluster with an RDMA network and distributed training has started, suspends the distributed training process on each computing node of the cluster,
and resumes the distributed training process after the RDMA network information collection unit has collected the RDMA network information of the training cluster and the task environment configuration parameter updating unit has updated the environment configuration parameters of each subtask according to that information.
8. The system for resetting a data transmission network during distributed training task training according to claim 6,
wherein, when the task environment configuration parameter updating unit updates the subtask environment configuration parameters according to the training cluster RDMA network information:
the master node generates corresponding new environment configuration parameters for each subtask according to the cluster RDMA network information and transmits them to the corresponding slave nodes via the conventional network to replace the original environment configuration parameters of each computing node;
or
the master node transmits the cluster RDMA network information to each slave node via the conventional network, and each computing node generates new environment configuration parameters for its own task to replace its original environment configuration parameters.
9. The system for resetting a data transmission network during distributed training task training according to claim 6,
wherein the computing nodes further comprise parameter server nodes for maintaining the globally shared parameters.
10. The system for resetting a data transmission network during distributed training task training according to claim 6,
wherein, when the distributed training task is deployed on a container cloud platform, the computing nodes are containers/container groups.
CN201910731784.8A 2019-08-08 2019-08-08 Method for resetting data transmission network in distributed training task training process Active CN112350842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910731784.8A CN112350842B (en) 2019-08-08 2019-08-08 Method for resetting data transmission network in distributed training task training process


Publications (2)

Publication Number Publication Date
CN112350842A CN112350842A (en) 2021-02-09
CN112350842B true CN112350842B (en) 2023-04-07

Family

ID=74366879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910731784.8A Active CN112350842B (en) 2019-08-08 2019-08-08 Method for resetting data transmission network in distributed training task training process

Country Status (1)

Country Link
CN (1) CN112350842B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114022B * 2022-06-24 2024-10-15 Suzhou Inspur Intelligent Technology Co., Ltd. Method, system, device and medium for using GPU resources

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103763173A * 2013-12-31 2014-04-30 Huawei Technologies Co., Ltd. Data transmission method and computing node
CN109067752A * 2018-08-15 2018-12-21 Wuxi Jiangnan Institute of Computing Technology A method for implementing TCP/IP protocol compatibility using RDMA messages
CN109634735A * 2018-12-18 2019-04-16 Zhengzhou Yunhai Information Technology Co., Ltd. Method and device for scheduling Pods

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9378179B2 (en) * 2012-11-21 2016-06-28 International Business Machines Corporation RDMA-optimized high-performance distributed cache
US10257273B2 (en) * 2015-07-31 2019-04-09 Netapp, Inc. Systems, methods and devices for RDMA read/write operations
US20170034267A1 (en) * 2015-07-31 2017-02-02 Netapp, Inc. Methods for transferring data in a storage cluster and devices thereof


Also Published As

Publication number Publication date
CN112350842A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
US11588675B2 (en) Systems and methods for selectively implementing services on virtual machines and containers
CN108924217B (en) A method for automatic deployment of distributed cloud system
CN109976774B (en) Block link point deployment method, device, equipment and storage medium
US10496503B2 (en) Healing cloud services during upgrades
CN106708622B (en) Cluster resource processing method and system and resource processing cluster
EP2849064B1 (en) Method and apparatus for network virtualization
CN107463582B (en) Distributed Hadoop cluster deployment method and device
WO2022141727A1 (en) Resource deployment system and method based on cloud cost
US20170289060A1 (en) Model driven process for automated deployment of domain 2.0 virtualized services and applications on cloud infrastructure
US20170373931A1 (en) Method for updating network service descriptor nsd and apparatus
US11093296B2 (en) System, virtualization control apparatus, method for controlling a virtualization control apparatus, and program
CN107145380A (en) Virtual resource method of combination and device
CN110297670B (en) Method and system for improving training efficiency of distributed tasks on container cloud
KR102438214B1 (en) ICT service provision method and system
CN113660316B (en) Network resource adaptive configuration method, system and medium based on container cloud platform
CN114615268B (en) Service network, monitoring node, container node and equipment based on Kubernetes cluster
CN110308987B (en) Method for updating connection parameters of distributed training tasks on container cloud
CN110308986A (en) The method of distributed training data communication on container cloud based on Optimized Operation
Csoma et al. Management and orchestration for network function virtualization: An open source MANO approach
CN112698838A (en) Multi-cloud container deployment system and container deployment method thereof
CN112350842B (en) Method for resetting data transmission network in distributed training task training process
CN113138831B (en) Network resetting method and acceleration distributed training method and system based on same
CN112348196A (en) Distributed machine learning system and method of self-adaptive RDMA (remote direct memory Access) network
CN110300192A Method for updating connection parameters of a distributed training task according to an IP allocation table
EP4213468A1 (en) Automated deployment of control nodes at remote locations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant