CN120560811A

CN120560811A - NUMA scheduling method, device, equipment and medium in large model training scenarios

Info

Publication number: CN120560811A
Application number: CN202510706034.0A
Authority: CN
Inventors: 韩坤; 康泽民; 薛娇; 陈尧
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2025-05-29
Filing date: 2025-05-29
Publication date: 2025-08-29

Abstract

The application discloses a NUMA scheduling method, device, equipment and medium in a large model training scene, which relate to the technical field of artificial intelligence and comprise the steps of acquiring a target cluster topological relation configuration file, acquiring a target affinity strategy corresponding to the requirements of a graphic processor of a current large model training task, if the target affinity strategy is a first affinity strategy, screening candidate processor nodes comprising candidate NUMA nodes from all processor nodes based on the topological relation configuration file, wherein the candidate NUMA nodes are NUMA nodes with the idle number of the graphic processors meeting the requirements of the graphic processor under a single NUMA node, determining the target NUMA nodes from the candidate NUMA nodes according to the performance communication scores of the candidate NUMA nodes, and scheduling the graphic processors under the target NUMA nodes to finish the current large model training task when a training container is started. And the efficiency of large model training is improved, and the cost is reduced.

Description

NUMA scheduling method, device, equipment and medium in large model training scene

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a NUMA scheduling method, device, equipment and medium in a large model training scene.

Background

The current large-scale deep learning model training has entered into a billion parameter era, and the integrated communication efficiency becomes a key bottleneck for limiting the training performance. The gradient synchronization algorithm represented by AllReduce requires the completion of a full-scale parameter aggregation across the graphics processor (Graphics Processing Unit, i.e., GPU), with a communication overhead that occupies 30% -50% of the iteration time in a typical training task. In heterogeneous computing clusters based on PCIe (PERIPHERAL COMPONENT INTERCONNECT EXPRESS, a high-speed serial computer expansion bus standard) 4.0 architecture, there is significant performance degradation in inter-GPU communications across NUMA (Non-Uniform Memory Access, i.e., non-coherent memory access) nodes, resulting in reduced effective bandwidth utilization, non-linear increases in transmission delay, and increased interconnect resource contention due to the communication path that needs to traverse PCIe switch (PCIE SWITCH) and transit through central processor (Central Processing Unit, i.e., CPU).

On the other hand, in large-model distributed training, kubernetes cluster distributed training orchestration platform has become a de facto standard, but the default scheduler of Kubernetes adopts a coarse-grained resource allocation policy based on the number of GPUs, and cannot perceive physical constraints of hardware topology. In a typical scenario, for example, in a dual-path CPU server architecture, 8 GPU devices are divided into two physical nodes, namely NUMA0 and NUMA1, but the existing Device plug in (Device Plugin, such as NVIDIA DEVICE plug in) only reports the discrete GPU resource quantity, lacks key topology information such as NUMA affinity and PCIe topology level, and the like, and the topology-agnostic scheduling mechanism causes that training task training efficiency cannot maintain an efficient state.

The deficiency of the existing scheduling mechanism directly causes bandwidth attenuation, the bandwidth utilization rate is lower than that of NUMA nodes due to the overheads such as complex paths, protocol conversion and the like in the NUMA-crossing communication, delay accumulation is realized, a Ring-AllReduce algorithm is taken as an example, if a task GPU is distributed across NUMA, the hop count of a communication link is increased, and single iteration delay is improved.

Overall, how to improve the affinity of NUMA scheduling so that the efficiency of large model training is improved and the cost is reduced is a problem to be solved in the field.

Disclosure of Invention

In view of the above, the present invention aims to provide a NUMA scheduling method, device, equipment and medium in a large model training scenario, which improves the affinity of NUMA scheduling to improve the efficiency of large model training and reduce the cost. The specific scheme is as follows:

In a first aspect, the application discloses a NUMA scheduling method in a large model training scene, comprising the following steps:

collecting membership relations between different NUMA nodes and each graphic processor under each processor node in the target cluster to generate a topological relation configuration file;

acquiring a target affinity strategy corresponding to the requirements of a graphic processor of a current large model training task;

If the target affinity policy is a first affinity policy, screening candidate processor nodes comprising candidate NUMA nodes meeting the first affinity policy from the processor nodes based on the topological relation configuration file, wherein the candidate NUMA nodes meeting the first affinity policy are NUMA nodes with the idle number of the graphic processors under a single NUMA node not smaller than the number of the graphic processors corresponding to the requirements of the graphic processors;

determining a target NUMA node from the candidate NUMA nodes according to the performance communication scores of the candidate NUMA nodes;

A training container is bound to the target NUMA node so that each graphics processor under the target NUMA node is scheduled to complete the current large model training task when the training container is started.

Optionally, collecting membership between different NUMA nodes under each processor node in the target cluster and each graphics processor to generate a topology configuration file includes:

The Device plug application is deployed to the graphics processor of the target cluster in DaemonSet mode, and membership relations between different NUMA nodes under each processor node and each graphics processor are obtained through NVML library;

And encoding each graphic processor number and the corresponding NUMA node number into key value pairs according to the membership to generate a topological relation configuration file.

Optionally, the determining a target NUMA node from the candidate NUMA nodes according to the performance traffic score of the candidate NUMA nodes includes:

Scoring each candidate NUMA node based on the continuity condition of the graphics processor numbers corresponding to the candidate NUMA nodes so as to obtain the communication score of each candidate NUMA node;

Scoring each candidate NUMA node based on the load balancing condition of each candidate NUMA node to obtain the performance score of each candidate NUMA node;

And determining the sum of the communication score and the performance score as a performance communication score of each candidate NUMA node, and determining a target NUMA node from each candidate NUMA node according to the performance communication score.

Optionally, the NUMA scheduling method in the large model training scenario further includes:

And if the target affinity policy is a second affinity policy, determining the total number of idle graphics processors under all NUMA nodes in each processor node based on the topological relation configuration file, and screening candidate processor nodes of which the total number of idle graphics processors meets the requirements of the graphics processors from the processor nodes.

Arranging the NUMA nodes in the current candidate processor nodes in descending order according to the idle quantity of the graphic processor under a single NUMA node to obtain a NUMA node sequence, and determining the current NUMA node from the NUMA node sequence;

judging whether the total number of idle graphics processors from the first NUMA node in the NUMA node sequence to the current NUMA node meets the requirements of the graphics processors or not;

If the total number of the idle states does not meet the requirements of the graphic processor, determining a new current NUMA node from the NUMA node sequence, and re-jumping to the step of judging whether the total number of the idle states from the first NUMA node in the NUMA node sequence to all the graphic processors in the current NUMA node meets the requirements of the graphic processor;

If the total number of idles meets the graphics processor requirement, determining the first NUMA node in the sequence of NUMA nodes to the current NUMA node as candidate NUMA nodes that meet the second affinity policy.

monitoring a scheduling request of a training container based on LISTWATCH mechanisms through a target processor node, and binding the training container under the target NUMA node to limit the training container to only access a graphics processor bound under the target NUMA node, wherein the target processor node is a processor node corresponding to the target NUMA node.

Optionally, in the process of scheduling each graphics processor under the target NUMA node to complete the current large model training task, the method further includes:

Issuing a numbering list of a graphic processor bound by a target NUMA node to a container runtime through the target processor node;

And mounting the equipment file of the graphic processor bound by the target NUMA node to the training container through the container running process, and setting the visibility environment variable of the graphic processor according to the number list.

In a second aspect, the present application discloses a NUMA scheduling device in a large model training scenario, including:

The configuration generation module is used for collecting membership relations between different NUMA nodes and each graphic processor under each processor node in the target cluster so as to generate a topological relation configuration file;

the strategy acquisition module is used for acquiring a target affinity strategy corresponding to the requirement of a graphic processor of the current large model training task;

A first node determining module, configured to screen candidate processor nodes including candidate NUMA nodes satisfying the first affinity policy from the processor nodes based on the topology relation configuration file if the target affinity policy is the first affinity policy, where the candidate NUMA nodes satisfying the first affinity policy are NUMA nodes with the number of idle graphics processors being not less than the number of graphics processors corresponding to the graphics processor requirement in a single NUMA node;

a second node determining module configured to determine a target NUMA node from among the candidate NUMA nodes according to the performance communication score of each of the candidate NUMA nodes;

and the scheduling module is used for binding a training container to the target NUMA node so as to schedule each graphic processor under the target NUMA node to complete the current large model training task when the training container is started.

In a third aspect, the present application discloses an electronic device, comprising:

A memory for storing a computer program;

A processor for executing the computer program to implement the steps of the NUMA scheduling method in the large model training scenario disclosed previously.

In a fourth aspect, the present application discloses a computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the aforementioned disclosed NUMA scheduling method in a large model training scenario.

The method has the advantages that membership relations between different NUMA nodes under each processor node in a target cluster and each graphic processor are collected to generate a topological relation configuration file, a target affinity strategy corresponding to the graphic processor requirement of a current large model training task is obtained, candidate processor nodes comprising candidate NUMA nodes meeting the first affinity strategy are screened out of the processor nodes based on the topological relation configuration file if the target affinity strategy is a first affinity strategy, the number of idle graphics processors under the condition that the candidate NUMA nodes meeting the first affinity strategy are single NUMA nodes is not smaller than the number of graphic processors corresponding to the graphic processor requirement is selected, the target NUMA nodes are determined from the candidate NUMA nodes according to the performance communication scores of the candidate NUMA nodes, and a training container is bound to the target NUMA nodes so that each graphic processor under the target NUMA nodes can complete the current large model training task when the training container is started. It can be seen that the present application collects membership between NUMA nodes and graphics processors, and then screens out suitable candidate processor nodes from the processor nodes according to a target affinity policy corresponding to the graphics processor requirement of the current large model training task, specifically, if the target affinity policy is the first affinity policy, screens out suitable candidate processor nodes from the processor nodes based on the topology relationship configuration file, that is, there are candidate processor nodes of NUMA nodes with the idle number of graphics processors satisfying the graphics processor requirement under a single NUMA node, that is, the candidate processor nodes include NUMA node with the idle number of graphics processors satisfying the graphics processor requirement under a single NUMA node, determining NUMA nodes with the idle quantity of the graphic processor under a single NUMA node meeting the requirement of the graphic processor as candidate NUMA nodes, namely, removing the graphic processor needing to cross the NUMA nodes, then, binding a training container to the target NUMA nodes under the condition that the graphic processor under the target NUMA node is not subjected to cross-NUMA nodes, and therefore, when the training container is started, scheduling each graphic processor under the target NUMA node to complete the current large model training task, and improving the affinity of NUMA scheduling, namely, because the cross-NUMA nodes are not required to be communicated in the process of completing the large model training task, each graphic processor under the target NUMA nodes is all affiliated to the target NUMA nodes, the communication efficiency is higher, and the problem that the path is complicated due to cross-NUMA communication is avoided In addition, the application determines the target NUMA node through two-stage screening, namely firstly screening according to the requirements of a graphic processor, so that each candidate NUMA node meets the requirements of the graphic processor, and secondly screening according to performance communication scores, thus, the finally determined target NUMA node not only meets the requirements of the graphic processor, but also ensures good performance communication conditions during large model training, and can further improve the efficiency of the large model training.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a NUMA scheduling method in a large model training scenario in accordance with the present application;

FIG. 2 is a diagram illustrating a specific NUMA affinity scheduling architecture according to the present disclosure;

FIG. 3 is a schematic diagram of a NUMA scheduling device in a large model training scenario according to the present application;

fig. 4 is a block diagram of an electronic device according to the present disclosure.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The current large-scale deep learning model training has entered into a billion parameter era, and the integrated communication efficiency becomes a key bottleneck for limiting the training performance. The gradient synchronization algorithm, represented by AllReduce, requires the completion of a full-scale parameter aggregation across the graphics processor, with a communication overhead that takes up 30% -50% of the iterative time in a typical training task. In a heterogeneous computing cluster based on PCIe4.0 architecture, there is significant performance degradation in inter-GPU communications across NUMA nodes, resulting in reduced effective bandwidth utilization, non-linear propagation of transmission delay, and increased contention for interconnect resources due to the communication path being required to traverse the PCIe switch and transit through the central processor.

On the other hand, in large-model distributed training, kubernetes cluster distributed training orchestration platform has become a de facto standard, but the default scheduler of Kubernetes adopts a coarse-grained resource allocation policy based on the number of GPUs, and cannot perceive physical constraints of hardware topology. In a typical scenario, for example, in a dual-path CPU server architecture, 8 GPU devices are divided into two physical nodes, namely NUMA0 and NUMA1, but the existing Device plug only reports the discrete GPU resource number, lacks key topology information such as NUMA affinity and PCIe topology level, and the like, and the topology-agnostic scheduling mechanism causes that the training task training efficiency cannot maintain an efficient state.

Therefore, the application correspondingly provides a NUMA scheduling scheme in a large model training scene, and improves the affinity of NUMA scheduling so as to improve the efficiency of large model training and reduce the cost.

Referring to fig. 1, the embodiment of the application discloses a NUMA scheduling method in a large model training scene, which comprises the following steps:

And S11, collecting membership relations between different NUMA nodes and each graphic processor under each processor node in the target cluster to generate a topological relation configuration file.

In this embodiment, the method for collecting membership between different NUMA nodes under each processor node in a target cluster and each graphic processor to generate a topology configuration file includes deploying a Device plug in application to the graphic processor of the target cluster in DaemonSet form, obtaining membership between different NUMA nodes under each processor node and each graphic processor through a NVML library, and encoding each graphic processor number and a corresponding NUMA node number as a key value pair according to the membership to generate the topology configuration file.

The processor node comprises a plurality of NUMA nodes, and different graphic processors may belong to different NUMA nodes, and before large model training, the topological relation between the GPU and the NUMA needs to be acquired, so that node annotation initialization of the processor node is completed. The customized Device plug in module is realized based on NVIDIA DEVICE plug in, the plug in is deployed on each GPU node of a target cluster (specifically, may be a Kubernetes cluster) in DaemonSet mode, and nvidia.com/GPU resources are registered with a Kubelet equipment manager, and the customized extension module is started in a NVIDIA DEVICE plug in main method in Goroutine mode. After the custom module is started, a Go language binding library (Go bindings) of NVML is called to acquire the relation between the node GPU and the NUMA node, and a node level NUMA topology configuration file is generated. After the custom extension module generates node GPU and NUMA node topology configuration, the topology relation configuration file is written into node annotation (Annotations) of each processor node through PATCH interface of API (Application Programming Interface ) Server (service) in the target cluster in an annotation mode, and the NUMA node number in the topology relation configuration file is used as a node annotation key, and a list formed by the graphics processor numbers under different NUMA nodes is used as a node annotation value, and examples are as follows:

Annotations:

nvidia.com/gpu-topology.NUMA0:"gpu0,gpu1,gpu2,gpu3";

nvidia.com/gpu-topology.NUMA1:"gpu4,gpu5,gpu6,gpu7";

After the Device plug reports NUMA topology annotation, the Device monitors GPU state change (such as offline or hot plug of the Device) in real time, and adopts an event-driven batch update mechanism to reduce API SERVER loads.

And step S12, acquiring a target affinity strategy corresponding to the requirement of a graphic processor of the current large model training task.

The GPU affinity scheduling policy, i.e. the target affinity policy, is configured, the NUMA scheduling policy (i.e. the target affinity policy) is declared in metadata. Interactions of Pod, the annotated key is NUMA. Affinity. Policy (i.e. the target affinity policy), and the values corresponding to the annotations contain "struct (i.e. the first affinity policy)", and "besteffort (i.e. the second affinity policy)". When the value is strict, the nodes are strictly bound to the same NUMA node, when all GPU nodes in the cluster cannot meet the resource allocation requirement, pod is in a Pending state until the resource allocation requirement is met, when the value is besteffort, the nodes which can meet the scheduling requirement are preferentially scheduled, and when the nodes which cannot meet the resource allocation requirement exist, the nodes which can meet the GPU number requirement exist, and the nodes are scheduled to the corresponding nodes. The resource request is configured, the number of GPUs required is specified in spec. When 4 GPUs are requested, the value of spec.

And step S13, if the target affinity policy is a first affinity policy, screening candidate processor nodes comprising candidate NUMA nodes meeting the first affinity policy from the processor nodes based on the topological relation configuration file, wherein the candidate NUMA nodes meeting the first affinity policy are NUMA nodes with the idle number of the graphics processors under a single NUMA node not smaller than the number of the graphics processors corresponding to the graphics processor requirement.

The configured target affinity policy may be specifically a first affinity policy or a second affinity policy, and in either affinity policy, candidate NUMA nodes meeting the requirements of the graphics processor need to be screened out, the number of idle graphics processors in the candidate NUMA nodes under the first affinity policy is not less than the number of NUMA nodes corresponding to the requirements of the graphics processor in the single NUMA node, and the total number of idle graphics processors in the candidate NUMA nodes under the second affinity policy is all candidate NUMA nodes meeting the requirements of the graphics processor.

In a first specific embodiment, the target affinity policy is a first affinity policy, where the first affinity policy is a struct policy, and indicates that the target affinity policy is strictly bound to the same NUMA node, when all GPU nodes in the cluster cannot meet the resource allocation requirement, pod is in a Pending state until the resource allocation requirement is met, for example, membership in the topology configuration file indicates that GPU0, GPU1, GPU2, GPU3 belongs to the NUMA0 node, GPU4, GPU5, GPU6 belongs to the NUMA1 node, and the number of idle graphics processors in the NUMA0 node is 4, the number of idle graphics processors in the NUMA1 node is 2, and the graphics processor requirement indicates that 3 idle graphics processors are required for the current large model training task, and then the NUMA0 node is a candidate NUMA node that meets the first affinity policy. As another example, the target cluster includes a processor node A and a processor node B, where the processor node A includes an A-NUMA0 node and an A-NUMA1 node, the A-NUMA0 node and the A-NUMA1 node each include 2 idle graphics processors, the processor node B includes a B-NUMA0 node and a B-NUMA1 node, the B-NUMA0 node includes 4 idle graphics processors, the B-NUMA1 node includes 2 idle graphics processors, and where the graphics processor requirement indicates that 3 idle graphics processors are needed for the current large model training task, then only the processor node B is a candidate processor node and only the B-NUMA0 node is a candidate NUMA node that satisfies the first affinity policy, so that the target NUMA node is subsequently screened from the candidate NUMA nodes and the graphics processor in the target NUMA node can complete the current large model training task without crossing the graphics processor of the NUMA node to complete the training task. It should be noted that under the first affinity policy, when all GPU nodes in the cluster cannot meet the resource allocation requirement, the Pod is in a Pending state, so in order to prevent this situation, a target affinity policy corresponding to the graphics processor requirement of the current large model training task is acquired, and if the target affinity policy is the first affinity policy, that is, the graphics processor requirement of the current large model training task can be met under the first affinity policy, so that the situation that the Pod is in the Pending state after binding is avoided.

In a second embodiment, the method further comprises determining a total number of idles of the graphics processor under all NUMA nodes in each processor node based on the topology configuration file if the target affinity policy is a second affinity policy, and screening candidate processor nodes of which the total number of idles meets the graphics processor requirement from the processor nodes.

It will be appreciated that, taking into account the graphics processor requirements of the current large model training task, determining the target affinity policy to be the second affinity policy may account for the fact that Pod is Pending if it is under the first affinity policy, so to avoid this, the target affinity policy is determined to be the second affinity policy. The second affinity policy indicates that the nodes meeting the scheduling requirement are preferentially scheduled, when the nodes meeting the resource allocation requirement do not exist, the nodes meeting the GPU number requirement exist, and the nodes are scheduled to the corresponding nodes, that is, if the target affinity policy is the second affinity policy, the total idle number of graphics processors under all NUMA nodes in each processor node is determined based on the topological relation configuration file, candidate processor nodes, of which the total idle number meets the graphics processor requirement, are selected from the processor nodes, that is, the second affinity policy is not as strict as the first affinity policy, as long as the total idle graphics processors in the candidate processor nodes meet the graphics processor requirement, and the idle graphics processors can be different NUMA nodes. Idle graphics processors meeting graphics processor requirements are screened from candidate processor nodes, then NUMA nodes to which the idle graphics processors belong are determined to be candidate NUMA nodes, for example, the candidate processor nodes comprise a B-NUMA0 node and a B-NUMA1 node, the B-NUMA0 node comprises 2 idle graphics processors, the B-NUMA1 node comprises 2 idle graphics processors, and the graphics processor requirements indicate that 3 idle graphics processors are required for the current large model training task, then 2 idle graphics processors in the B-NUMA0 node and 1 idle graphics processor in the B-NUMA1 node are required.

The method further comprises the steps of arranging NUMA nodes in the current candidate processor node in descending order according to the idle quantity of a graphic processor under a single NUMA node, obtaining a NUMA node sequence, determining whether the idle total quantity of all graphic processors from the first NUMA node to the current NUMA node in the NUMA node sequence meets the graphic processor requirement or not, determining a new current NUMA node from the NUMA node sequence if the idle total quantity does not meet the graphic processor requirement, and re-jumping to the step of determining whether the idle total quantity of all graphic processors from the first NUMA node to the current NUMA node in the NUMA node sequence meets the graphic processor requirement or not, and determining the first NUMA node to the first NUMA node in the NUMA node sequence as the candidate NUMA node meeting the second affinity policy if the idle total quantity does not meet the graphic processor requirement. That is, while the second affinity policy is less strict than the first affinity policy, so long as the total number of all free graphics processors in the candidate processor nodes meets the graphics processor requirements, the candidate NUMA nodes are also preferably selected to have more free graphics processors, i.e., the NUMA nodes are arranged in descending order according to the free number of graphics processors to obtain a sequence of NUMA nodes, e.g., the NUMA node sequence [ NUMA1, NUMA0, NUMA2], NUMA1 includes 4 free graphics processors, NUMA0 includes 3 free graphics processors, NUMA2 includes 2 free graphics processors, and the graphics processor requirements indicate that 6 free graphics processors are needed for the current large model training task, although 2 free graphics processors in NUMA2 plus 3 free graphics processors in NUMA0 and 1 free graphics processor in NUMA1 can meet the graphics processor requirements, such that the graphics processors need to span 3 NUMA nodes, if only 4 free graphics processors in NUMA1 and 2 free graphics processors are used as the less efficient graphics processors for the graphics processor, some of which may be more efficient than some of the NUMA processors.

Step S14, determining target NUMA nodes from the candidate NUMA nodes according to performance communication scores of the candidate NUMA nodes.

In this embodiment, the determining the target NUMA node from the candidate NUMA nodes according to the performance communication score of each candidate NUMA node includes scoring each candidate NUMA node based on the continuity condition of the graphics processor number corresponding to each candidate NUMA node to obtain the communication score of each candidate NUMA node, scoring each candidate NUMA node based on the load balancing condition of each candidate NUMA node to obtain the performance score of each candidate NUMA node, determining the sum of the communication score and the performance score as the performance communication score of each candidate NUMA node, and determining the target NUMA node from each candidate NUMA node according to the performance communication score.

Scoring the screened candidate NUMA nodes, including communication scoring and performance scoring. The communication score reflects the continuity condition of the graphics processor numbers corresponding to each candidate NUMA node, the more the number of the graphics processor numbers in the same candidate NUMA node is, the fewer the number of the graphics processor numbers is, the higher the communication score is, for example, 40 points are added to each 4 continuous pieces of GPU, the more the number of the continuous numbered GPUs in the same NUMA node is, the higher the score is, and when n pieces of GPU are allocated continuously, the score value is increased by a weight score related to n, the performance score reflects the load balancing condition of each candidate NUMA node, the load balancing condition is determined by combining CPU or memory utilization in the NUMA node, and the lower the CPU or memory utilization is, the lower the load is, the higher the performance score of the candidate NUMA node is. The problems of resource fragmentation and uneven load of the traditional scheduling strategy can be solved by introducing a continuity score (GPU with continuous numbers is preferentially distributed to reduce PCIe sub-link switching) and a load balancing score (the priority is dynamically adjusted by combining CPU/memory utilization in NUMA). Further, the sum of the communication score and the performance score is determined as the performance communication score of each candidate NUMA node, and a target NUMA node is determined from each candidate NUMA node according to the performance communication score, namely, the node with the highest score is selected as the target NUMA node.

Step S15, binding a training container to the target NUMA node so as to schedule each graphic processor under the target NUMA node to complete the current large model training task when the training container is started.

Information (e.g., allocated-NUMA: "0") of a target NUMA node is written into an annotation of a training container to enable binding of the training container to the target NUMA node.

The method further comprises the steps of monitoring a scheduling request of a training container through a target processor node based on LISTWATCH mechanisms, and binding the training container under the target NUMA node to limit the training container to only access a graphics processor bound under the target NUMA node, wherein the target processor node is a processor node corresponding to the target NUMA node.

The resource binding is triggered Kubelet by API SERVER updating the spec.nodename field of Pod. The processor node to which the target NUMA node belongs is the target processor node, the target processor node (i.e. the target Kubelet node) monitors the scheduling request of the training container based on the LISTWATCH mechanism, namely Kubelet monitors that the Pod is scheduled to the node based on the LISTWATCH mechanism, and then invokes the allocation interface of the custom Device plug in according to the number of the nvidia/com/GPU resources applied by the container in the Pod to Allocate GPU resources to the Pod container, and binds the training container under the target NUMA node to limit the training container to only access the graphics processor bound under the target NUMA node.

In this embodiment, the process of scheduling each graphics processor under the target NUMA node to complete the training task of the current large model further includes issuing, by the target processor node, a numbered list of the graphics processor bound by the target NUMA node to a container runtime, mounting, by the container runtime, a device file of the graphics processor bound by the target NUMA node to the training container, and setting a graphics processor visibility environment variable according to the numbered list.

After receiving the allocation request, the custom Device plug in returns a GPU list (such as GPU0, GPU1, GPU2, GPU 3) in the target NUMA node, that is, when the target processor node issues the number list of the graphics processor bound by the target NUMA node to the container running, the container running automatically sets an nvidia_visible_devices environment variable (i.e., a graphics processor visibility environment variable) according to the GPU list allocated by the Device plug in, so as to ensure that the container only identifies the target GPU.

Binding the training container to the target NUMA node so as to schedule each graphic processor under the target NUMA node to complete the current large model training task when the training container is started, wherein the specific process is as follows:

First Kubelet applies for the resources required for Pod containers. The implementation comprises the following steps:

1) And monitoring a Pod scheduling event, namely, kubelet, after monitoring that the Pod is scheduled to the node through a LISTWATCH mechanism, calling an allocation interface of the custom Device plug in according to the number of nvidia.com/GPU resources applied by the container in the Pod to Allocate GPU resources for the Pod container.

2) Device assignment and binding, namely, after receiving an allocation request, the custom Device plug returns a GPU list (such as GPU0, GPU1, GPU2, GPU 3) in the target NUMA node.

3) And when the container is called to run, starting the container, namely acquiring resources required by the container according to the container resource application of the Pod, and when the container is called to run, starting the container of the Pod statement.

Secondly, the container is started up when the container is running. The implementation comprises the following steps:

1) The container runtime (e.g., containerd) calls NVIDIA Container Runtime to mount the GPU device file (e.g.,/dev/nvidia 0-3) into the container according to the device information issued by Kubelet.

2) And returning a Device list according to an allocation interface of NVIDIA DEVICE Plugin, and automatically setting an NVIDIA_VISIBLE_DEVICES environment variable according to a GPU list distributed by the Device Plugin when the container runs, so as to ensure that the container only identifies the target GPU.

The following is a detailed description of the present application, taking a specific NUMA affinity scheduling architecture as an example, as shown in fig. 2. The overall architecture includes custom Device plug in modules, kubelet modules, kube-apiserver modules, custom scheduler plug-in modules, and container runtime modules.

And the self-defined Device plug in (i.e. self-defined Device plug-in) module is used for realizing GPU NUMA topology detection and generating hardware configuration description on the node by expanding NVIDIA DEVICE plug in, registering Device resources to the Kubelet module, updating notes of the node containing NUMA node and GPU numbering relation to the Kube-apiserver module based on the mapping relation between the physical GPU position and the NUMA node, and enabling the self-defined scheduler plug-in module to realize GPU hardware resource topology perception and resource allocation according to the notes.

Kubelet (processor nodes in the target cluster) module for executing the NUMA resource binding and isolation policy. The LISTWATCH mechanism based on the Kubernetes observes the Pod load scheduled to the node, calls the custom Device plug in interface to apply for the target GPU resource based on the decision result of the custom scheduler and the NUMA topology configuration file, and limits the CPU, the memory and the GPU of the Pod in the same NUMA node through the cgroup mechanism, so that the physical consistency of the access of the resources in the running process is ensured.

Kube-apiserver (API server) module, which is used as a global control center and a unified access entry of the Kubernetes cluster and externally exposes a cluster operation interface based on RESTful API. Based on the Watch mechanism, the resource state change events are synchronized in real time, and components such as a drive scheduler (kube-scheduler), a controller (kube-controller-manager) and a node Kubelet work cooperatively to ensure that the cluster state is consistent with the expected state declared by a user. The method is used for receiving and storing the GPU topology information reported by the custom Device plug in and Kubelet modules, and synchronizing the resource information and the change event to the custom scheduler plug-in module and the Kubelet module in real time.

And the custom scheduler plug-in module is used for analyzing the NUMA topology configuration file and executing the affinity scheduling policy. Based on NUMA constraint conditions (such as strict binding and cross-node fault tolerance) declared by the Pod, combining with the GPU topological state of the target node, screening candidate nodes meeting the conditions, calculating priorities, and finally deciding an optimal node to realize scheduling of the Pod.

A container runtime module for loading NUMA resource quarantine configuration during a container startup phase. Based on the device mount instruction issued by Kubelet and the cgroup policy file, the GPU device, the CPU core, and the memory region are bound to a container process.

Referring to fig. 3, the embodiment of the application discloses a NUMA scheduling device under a large model training scene, which comprises:

The configuration generating module 11 is used for collecting membership relations between different NUMA nodes under each processor node in the target cluster and each graphic processor so as to generate a topology relation configuration file;

A policy acquisition module 12, configured to acquire a target affinity policy corresponding to a graphics processor requirement of a current large model training task;

A first node determining module 13, configured to screen candidate processor nodes including candidate NUMA nodes satisfying the first affinity policy from the processor nodes based on the topology relation configuration file if the target affinity policy is the first affinity policy, where the candidate NUMA nodes satisfying the first affinity policy are NUMA nodes having a number of idle graphics processors not less than a number of graphics processors corresponding to the graphics processor requirement in a single NUMA node;

A second node determination module 14 for determining a target NUMA node from among the candidate NUMA nodes based on the performance traffic scores of the candidate NUMA nodes;

a scheduling module 15, configured to bind a training container to the target NUMA node, so as to schedule, when the training container starts, each graphics processor under the target NUMA node to complete the current large model training task.

Further, the embodiment of the application also provides electronic equipment. Fig. 4 is a block diagram of an electronic device 20, according to an exemplary embodiment, and the contents of the diagram should not be construed as limiting the scope of use of the present application in any way.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Specifically, the system comprises at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25 and a communication bus 26. The memory 22 is configured to store a computer program that is loaded and executed by the processor 21 to implement relevant steps in the NUMA scheduling method in a large model training scenario, which is performed by the electronic device and disclosed in any of the foregoing embodiments.

In this embodiment, the power supply 23 is configured to provide working voltages for each hardware device on the electronic device, the communication interface 24 is configured to create a data transmission channel with an external device for the electronic device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein, and the input/output interface 25 is configured to obtain external input data or output data to the outside, where the specific interface type may be selected according to the needs of the specific application, which is not specifically limited herein.

Processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 21 may be implemented in at least one hardware form of DSP (DIGITAL SIGNAL Processing), FPGA (Field-Programmable gate array), PLA (Programmable Logic Array ). The processor 21 may also include a main processor, which is a processor for processing data in a wake-up state, also called a CPU (Central Processing Unit ), and a coprocessor, which is a low-power processor for processing data in a standby state. In some embodiments, the processor 21 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and drawing of content required to be displayed by the display screen. In some embodiments, the processor 21 may also include an AI (ARTIFICIAL INTELLIGENCE ) processor for processing computing operations related to machine learning.

The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon include an operating system 221, a computer program 222, and data 223, and the storage may be temporary storage or permanent storage.

The operating system 221 is used for managing and controlling various hardware devices on the electronic device and the computer program 222, so as to implement the operation and processing of the processor 21 on the mass data 223 in the memory 22, which may be Windows, unix, linux. The computer program 222 may further comprise a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the NUMA scheduling method performed by the electronic device in a large model training scenario as disclosed in any of the previous embodiments. The data 223 may include, in addition to data received by the electronic device and transmitted by the external device, data collected by the input/output interface 25 itself, and so on.

Furthermore, the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program is executed by a processor to realize the NUMA scheduling method under the large model training scene. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.

In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in random access Memory (Random Access Memory, i.e., RAM), memory, read-Only Memory (ROM), electrically programmable EPROM (Erasable Programmable Read Only Memory), electrically erasable programmable EEPROM (Electrically Erasable Programmable Read Only Memory), registers, hard disk, removable disk, CD-ROM (Compact Disc Read-Only Memory), or any other form of storage medium known in the art.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.

The above detailed description of a NUMA scheduling method, apparatus, device and medium in a large model training scenario provided by the present invention is provided, and specific examples are used herein to illustrate the principles and embodiments of the present invention, where the above examples are only used to help understand the method and core idea of the present invention, and meanwhile, according to the idea of the present invention, those skilled in the art may change the details and application scope, so that the disclosure should not be construed as limiting the present invention.

Claims

1. A NUMA scheduling method for large model training scenarios, characterized by comprising:

Collect the affiliation relationships between different NUMA nodes and each graphics processor under each processor node in the target cluster to generate a topology relationship configuration file;

Get the target affinity policy corresponding to the GPU requirements of the current large model training task;

If the target affinity policy is the first affinity policy, candidate processor nodes including a candidate NUMA node that satisfies the first affinity policy are screened from the processor nodes based on the topology relationship profile; wherein the candidate NUMA node that satisfies the first affinity policy is a NUMA node in which the number of idle graphics processors under a single NUMA node is not less than the number of graphics processors corresponding to the graphics processor demand;

Determine a target NUMA node from each of the candidate NUMA nodes according to the performance communication score of each of the candidate NUMA nodes;

The training container is bound to the target NUMA node so that when the training container is started, each graphics processor under the target NUMA node is scheduled to complete the current large model training task.

2. The NUMA scheduling method for large model training scenarios according to claim 1, wherein collecting the affiliation between different NUMA nodes and each graphics processor under each processor node in the target cluster to generate a topology relationship configuration file comprises:

Deploy the Device Plugin application to the GPU of the target cluster as a DaemonSet, and use the NVML library to obtain the affiliation between different NUMA nodes and each GPU under each processor node.

According to the affiliation relationship, each graphics processor number and the corresponding NUMA node number are encoded into a key-value pair to generate a topology relationship configuration file.

3. The NUMA scheduling method in a large model training scenario according to claim 2, wherein determining a target NUMA node from each candidate NUMA node based on the performance communication score of each candidate NUMA node comprises:

Scoring each candidate NUMA node based on the continuity of the graphics processor numbers corresponding to each candidate NUMA node to obtain a communication score for each candidate NUMA node;

Scoring each of the candidate NUMA nodes based on the load balancing status of each of the candidate NUMA nodes to obtain a performance score of each of the candidate NUMA nodes;

The sum of the communication score and the performance score is determined as the performance communication score of each candidate NUMA node, and a target NUMA node is determined from each candidate NUMA node according to the performance communication score.

4. The NUMA scheduling method in a large model training scenario according to any one of claims 1 to 3, further comprising:

If the target affinity policy is the second affinity policy, the total number of idle graphics processors under all NUMA nodes in each processor node is determined based on the topology relationship configuration file, and candidate processor nodes whose total idle number meets the graphics processor requirement are screened out from each processor node.

5. The NUMA scheduling method in a large model training scenario according to claim 4, further comprising:

Arrange the NUMA nodes in the current candidate processor nodes in descending order according to the number of idle GPUs under each NUMA node to obtain a NUMA node sequence, and determine the current NUMA node from the NUMA node sequence;

Determine whether the total number of idle graphics processors from the first NUMA node to the current NUMA node in the NUMA node sequence meets the graphics processor requirement;

If the total number of idle GPUs does not meet the GPU requirement, a new current NUMA node is determined from the NUMA node sequence, and the process jumps again to the step of determining whether the total number of idle GPUs from the first NUMA node to the current NUMA node in the NUMA node sequence meets the GPU requirement.

If the total number of idle nodes meets the graphics processor requirement, the first NUMA node to the current NUMA node in the NUMA node sequence are determined as candidate NUMA nodes that meet the second affinity policy.

6. The NUMA scheduling method in a large model training scenario according to claim 1, further comprising:

The target processor node monitors the scheduling request of the training container based on the ListWatch mechanism, and binds the training container to the target NUMA node to limit the training container to access only the graphics processor bound to the target NUMA node; wherein, the target processor node is the processor node corresponding to the target NUMA node.

7. The NUMA scheduling method for large model training scenarios according to claim 6, wherein the process of scheduling each GPU under the target NUMA node to complete the current large model training task further comprises:

Sending the number list of the graphics processors bound to the target NUMA node to the container runtime through the target processor node;

The device file of the graphics processor bound to the target NUMA node is mounted to the training container during container runtime, and the graphics processor visibility environment variable is set according to the number list.

8. A NUMA scheduling device for large model training scenarios, characterized by comprising:

The configuration generation module is used to collect the affiliation relationship between different NUMA nodes and each graphics processor under each processor node in the target cluster to generate a topology relationship configuration file;

A strategy acquisition module is used to obtain the target affinity strategy corresponding to the GPU requirements of the current large model training task;

a first node determination module configured to, if the target affinity policy is a first affinity policy, filter out candidate processor nodes from each of the processor nodes based on the topology relationship configuration file, including a candidate NUMA node that satisfies the first affinity policy; wherein the candidate NUMA node that satisfies the first affinity policy is a NUMA node having a number of idle graphics processors under a single NUMA node that is not less than the number of graphics processors corresponding to the graphics processor demand;

A second node determination module is configured to determine a target NUMA node from each of the candidate NUMA nodes according to a performance communication score of each of the candidate NUMA nodes;

The scheduling module is used to bind the training container to the target NUMA node so that when the training container is started, the graphics processors under the target NUMA node are scheduled to complete the current large model training task.

9. An electronic device, comprising:

Memory, used to store computer programs;

A processor is used to execute the computer program to implement the steps of the NUMA scheduling method in the large model training scenario as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is used to store a computer program; wherein, when the computer program is executed by a processor, the steps of the NUMA scheduling method in a large model training scenario as described in any one of claims 1 to 7 are implemented.