CN114625482B

CN114625482B - Device management method and device

Info

Publication number: CN114625482B
Application number: CN202210294026.6A
Authority: CN
Inventors: 安仲奇; 董建波; 唐小川; 张正俣
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2022-03-23
Filing date: 2022-03-23
Publication date: 2025-09-23
Anticipated expiration: 2042-03-23
Also published as: CN114625482A

Abstract

An embodiment of the present application provides a device management method and apparatus. The method includes: mounting N GPUs on each of a plurality of containers, with a preset link existing between the N GPUs, where N is an integer greater than 1; virtualizing the GPUs that can be called by each container to obtain one or more vGPU instances corresponding to each container; and providing the virtualized vGPU instances to the corresponding container for use. By mounting N GPUs on each container and virtualizing the GPUs that can be called by each container, the isolation of GPUs between containers is ensured while avoiding blocking the preset links between GPUs, thereby allowing communication between GPUs using the preset links.

Description

Equipment management method and device

Technical Field

The present application relates to the field of computers, and more particularly, to a device management method and apparatus.

Background

With the continued development of computer technology, more and more artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) deep training tasks are deployed and run in containers, while AI deep training tasks rely largely on graphics processing units (graphics processing unit, GPUs).

Currently, when a GPU is mounted on a container, a system or a user configures the GPU that can be invoked by each container. The software responsible for managing the GPUs, i.e., GPU runtime (runtime), may mount a corresponding GPU for each container according to the configuration of the system or user. For each container, only the GPU mounted in the present container can be used to ensure isolation between containers.

In some scenarios, such as distributed training, the same tenant may use multiple containers to perform the same task, so as to improve the execution efficiency. Because data sharing may be required between multiple containers, high-speed data transmission between multiple containers is required, and the high-speed data transmission may be implemented through a GPU high-speed interconnection technology with a communication bandwidth far higher than that of a common network. But the different containers are isolated from each other, meaning that high-speed interconnections between GPUs of different containers cannot be used. Therefore, at present, data transmission between containers is mainly implemented by means of shared memory or network transmission. But this may affect overall performance. For example, the data needs to be transferred through the system main memory in a shared memory mode, and the data is copied for multiple times, so that the communication efficiency is low, the communication performance is poor, the execution efficiency is limited, and the expandability of training is limited.

Disclosure of Invention

The application provides a device management method and device, which aim to improve communication speed and execution efficiency while realizing container isolation.

In a first aspect, the application provides a device management method, which comprises the steps of mounting a GPU on each container in N containers, wherein a preset link exists between the N GPUs, N is an integer greater than 1, virtualizing the GPUs which can be called by each container to obtain one or more vGPU examples corresponding to each container, and providing the vGPU examples obtained by the virtualization for the corresponding container.

The application provides a device management device, which comprises a control module and a virtualization module, wherein the control module is used for mounting N Graphic Processing Units (GPUs) on each container in a plurality of containers, preset links exist among the N GPUs, N is an integer larger than 1, the virtualization module is used for virtualizing the GPUs which can be called by each container to obtain one or more vGPU examples corresponding to each container, and the control module is also used for providing the vGPU examples obtained by the virtualization for the corresponding container.

It should be appreciated that the respective modules may implement the respective functions by executing the computer program.

In a third aspect, the present application provides a device management apparatus comprising a processor for executing program code to cause the apparatus to implement the method of the first aspect.

In a fourth aspect, the present application provides a chip comprising at least one processor for implementing the functions referred to in the first aspect, for example, virtualizing a GPU.

In a fifth aspect, the present application provides a computing device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect described above when executing the computer program.

In a sixth aspect, the present application provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor causes the processor to implement the method of the first aspect described above.

Seventh method, the present application provides a computer program product comprising a computer program, which when run implements the method of the first aspect described above.

According to the scheme, on one hand, according to the GPUs which can be called by each container, the virtualized vGPU examples are provided for the containers through a virtualization technology, so that each container can only access the GPUs which can be called by the container, and inter-container GPU isolation can be ensured, and on the other hand, N GPUs are respectively mounted on each container, so that when the container is started, preset links between the GPUs mounted on different containers are prevented from being invalid, namely, the preset links between the GPUs are prevented from being blocked, and therefore, the GPUs are allowed to communicate by using high-speed interconnection. Because the communication efficiency of the shared memory or network transmission is far lower than that of the preset link, the communication speed can be greatly improved, the execution efficiency is improved, and good communication performance is ensured.

Drawings

FIG. 1 is a schematic diagram of communication between GPUs provided by an embodiment of the present application;

FIG. 2 is a schematic flow chart of a device management method provided by an embodiment of the present application;

FIG. 3 is another schematic diagram of communication between GPUs provided by an embodiment of the present application;

FIG. 4 is a schematic block diagram of a device management apparatus according to an embodiment of the present application;

Fig. 5 is another schematic block diagram of a device management apparatus provided in an embodiment of the present application.

Detailed Description

The technical scheme of the application will be described below with reference to the accompanying drawings.

The technical scheme provided by the application can be applied to the fields of artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) and deep learning (DEEP LEARNING, DL). Among them, the AI field is a new technical science for researching, developing theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. DL is a new research direction in the field of machine learning (MACHINE LEARNING) that was introduced to machine learning to bring it closer to the original AI goal.

Fig. 1 is a schematic diagram of communication between GPUs, which is suitable for use in embodiments of the present application.

In the communication scenario shown in fig. 1, two containers, container 0 and container 1, respectively, are activated. Each container contains a plurality of work processes (workers) and corresponding communication libraries. Wherein container 0 contains work process 0, work process 1, and work process 2, and container 1 contains work process 3 and work process 4. It should be appreciated that multiple work processes may constitute a collective process group, completing the same task in parallel. Each container has a GPU mounted thereon. As shown in fig. 1, GPU0, GPU1 and GPU2 are mounted on container 0, and GPU3 and GPU4 are mounted on container 1. Each GPU may be provided for use by a corresponding work process. For example, GPU0 may be provided for use by work process 0, GPU1 may be provided for use by work process 1, GPU2 may be provided for use by work process 2, GPU3 may be provided for use by work process 3, and GPU4 may be provided for use by work process 4. When a GPU is used by a work process in a container, the work process may use the GPU by way of a call, so that the work process using the GPU may need to communicate sets frequently.

It should be appreciated that a container may be a collection of processes that isolate other resources of the system, with its own independent view of resources.

It should also be appreciated that mounting is the process of exposing certain GPUs of a host to a container so that the container can access and use those GPUs.

It should also be appreciated that aggregate communications are a group of inter-process communications. Collective communication differs from point-to-point communication in that it requires all within a particular group to engage in simultaneous communication, there may be one-to-many, many-to-one, many-to-many. The communication library referred to in the present application is a communication library for collective communication. The related content of the aggregate communication and the communication library can refer to the prior art, and will not be described herein.

Currently, when GPUs are mounted on the container 0 and the container 1, the system or the user configures GPUs that can be used by the container 0 and the container 1 for each, for example, GPU0, GPU1 and GPU2 are configured for the container 0, and GPU3 and GPU4 are configured for the container 1. And (3) when the GPU runs, GPU0 to GPU2 are mounted on the container 0, and GPU3 and GPU4 are mounted on the container 1. Then, container 0 may access GPU0 through GPU2 and container 1 may access GPU3 and GPU4. Since container 0 and container 1 are isolated from each other, it may be guaranteed at the GPU runtime level that container 0 cannot access GPU3 and GPU4 inside container 1, and container 1 cannot access GPU0 through GPU2 inside container 0.

However, researchers have found that since container 0 and container 1 are isolated from each other, GPU0, GPU1, and GPU2 in container 0 cannot communicate with GPU3, GPU4 in container 1at high speed using GPU high speed interconnection techniques, i.e., cannot achieve high speed communication of GPUs between containers. Although data transfer between the containers may be implemented by means of shared memory, network transfer, or the like, GPU1 and GPU4 may be interconnected by means of shared memory. But this may affect overall performance. For example, the data needs to be transferred through the system main memory in a shared memory mode, and the data is copied for multiple times, so that the communication efficiency is low, the communication performance is poor, and the extensibility of training is limited.

In view of the above, the present application provides a device management method, on one hand, according to the GPUs that each container can call, the virtualized vGPU instances are provided to the containers through the virtualization technology, so that each container can only access the GPUs that it can call, and the inter-container GPU isolation can be ensured, and on the other hand, N GPUs are respectively mounted on each container, so as to avoid invalidating preset links between GPUs mounted on different containers when the container is started, that is, to avoid blocking the preset links between GPUs, and to allow the GPUs to communicate with each other using high-speed interconnection. If the preset link is designed as a high-speed communication link for communication between GPUs, the communication efficiency of shared memory or network transmission is far lower than that of the high-speed communication link, so that the communication efficiency can be greatly improved, the execution efficiency can be improved, and the overall performance can be improved.

The device management method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings.

Referring to fig. 2, fig. 2 is a schematic flowchart of a device management method according to an embodiment of the present application. The method 200 shown in FIG. 2 may be applied to a CPU or to system software on a CPU. The system software may include an operating system and a resource scheduling system, wherein the resource scheduling system is operable for resource scheduling of the GPU.

The method 200 shown in fig. 2 may include steps 201 to 203. The steps in the method 200 shown in fig. 2 are described in detail below using the method 200 as applied to a CPU.

In step 201, N GPUs are mounted on each of the plurality of containers, where a preset link exists between the N GPUs, and N is an integer greater than 1.

Wherein the preset link may be used for communication between GPUs. In the embodiment of the application, the preset link can be designed as a high-speed communication link for communication between GPUs, and the transmission bandwidth of the high-speed communication link is far higher than that of a common network, so that higher communication speed can be provided. By way of example and not limitation, the high-speed communication link may be NVLink of Injeida, or may also be a high-speed communication link provided by other GPU vendors for inter-GPU communications, including but not limited to.

When a CPU starts a container, N GPUs, where N GPUs refer to all GPUs configured for multiple containers, may be mounted on each container. The plurality of containers may be, for example, a plurality of containers for the same tenant to perform the same task. Because all the GPUs are mounted on the containers, each container can not invalidate the high-speed communication links among the mounted different GPUs when being started, and the high-speed communication links among the GPUs are not blocked, so that the high-speed communication links can be used for data transmission among the GPUs.

As shown in fig. 3, the CPU starts two containers, container 0 and container 1, respectively, each of which container 0 and container 1 mounts GPUs 0 to 3. No isolation exists between GPU0, GPU1, and GPU3, and there is a high speed communication link between GPUs, whether for container 0 or container 1. Thus, high-speed interconnections between GPU0, GPU1, GPU2, and GPU3 may be achieved.

Alternatively, the GPU mounted by each of the plurality of containers may be configured by the system or by the user.

Specifically, the mountable GPU may be configured for the container by the system or a user. The system may be a resource scheduling system on the CPU that may be used to configure the mountable GPU for the container. It should be appreciated that the resource scheduling system may also be provided on the system software of the CPU.

The following exemplarily gives a specific procedure for configuring a mountable GPU for a container.

For example, a resource scheduling system or user may uniformly configure a mountable GPU for multiple containers. For example, both container 0 and container 1 may be configured by the resource scheduling system on the CPU to mount GPU0, GPU1, GPU2, and GPU3. Thus, the CPU mounts GPU0 to GPU3 for both container 0 and container 1 according to the configured GPUs.

Step 202, virtualizing the GPUs that each container can call to obtain one or more vGPU instances corresponding to each container.

The GPUs that each container is able to invoke, i.e., the GPUs that each container is actually able to use. For example, container 0 in FIG. 3 may call GPU0, GPU1, and GPU2, and container 1 may call GPU3. The GPUs that each container can call can be configured by a mapping relationship. The mapping relationship may be a mapping relationship between a container manually configured by a user or a resource scheduling system and the GPU. The mapping relationship may be configured separately for each container, and the mapping relationship configured for each container is used to indicate the GPU that can be invoked by the container, or may be configured uniformly for all containers, and the mapping relationship may be used to indicate the GPU that can be invoked by each container in the plurality of containers.

When the mapping relationship is configured separately for each container, the resource scheduling system or the user may generate one mapping relationship for each container, respectively. For example, the resource scheduling system generates a mapping relationship #1 for the GPUs that can be called by the container 0, where the mapping relationship #1 indicates that the GPUs that can be called by the container 0 are GPU0, GPU1 and GPU2, and the resource scheduling system also generates a mapping relationship #2 for the GPUs that can be called by the container 1, where the mapping relationship #2 indicates that the GPUs that can be called by the container 1 are GPU3. It should be understood that the mapping relationship #1 and the mapping relationship #2 are specific examples of the mapping relationship, respectively. When the mapping relationship is configured individually for each container, the mapping relationship configured for each container is different from each other.

When uniformly configured for all containers, the resource scheduling system or user uniformly generates a mapping relationship for all containers. For example, the mapping relationship indicates that GPUs that container 0 can call are GPU0, GPU1, and GPU3, and GPUs that container 1 can call are GPU3. In other words, the map is the complete set of the map #1 and the map # 2.

After knowing the callable GPUs of each container according to the mapping relation, the CPU can determine the callable GPUs from the N GPUs mounted in each container, and virtualize the callable GPUs.

It should be understood that GPU virtualization refers to packaging a single GPU device into several logical vGPU instances for concurrent use by different work processes.

Optionally, each container of the plurality of containers includes one or more work processes, each work process being provided with one or more vGPU instances.

When the GPU is virtualized, whether to virtualize one vGPU instance or a plurality of vGPU instances can be determined according to the configuration situation of the GPU. When the configuration of the GPU is low, the GPU can be virtualized into a vGPU instance when the requirements of a plurality of working processes cannot be met at the same time.

For example, in fig. 3, if the CPU learns that the GPUs callable by the container 0 are GPU0, GPU1 and GPU2 according to the mapping relationship, the GPUs callable by the container 1 are GPU3. Although GPU0 through GPU3 are mounted in both container 0 and container 1, for container 0, the CPU can virtualize GPU0 through GPU2 without virtualizing GPU3, and for container 1, the CPU can virtualize GPU3 without virtualizing GPU0 through GPU 2. Assuming that the configurations of GPU0, GPU1 and GPU2 are low, and cannot meet the requirements of multiple working processes at the same time, GPU0, GPU1 and GPU2 can be virtualized into vGPU instance-0, vGPU instance-1 and vGPU instance-2 respectively, while assuming that the configuration of GPU3 is high, and can meet the requirements of multiple working processes at the same time, GPU3 can be virtualized into vGPU instance-3 and vGPU instance-4. Thus, work process 0 in container 0 may be provided with vGPU instance-0, work process 1 may be provided with vGPU instance-1, work process 2 may be provided with vGPU instance-2, and work process 3 in container 1 may be provided with vGPU instance-3, and work process 4 may be provided with vGPU instance-4.

It can be seen that while the GPUs are not isolated from each other from the container, the container's calls to the resources are still isolated from each other by using virtualization techniques.

The vGPU instance may be obtained by virtualizing each GPU based on the mapping relationship described above when the vGPU is running. The vGPU runtime can be understood as software for virtualizing the GPU and managing the virtualized vGPU instances.

One possible implementation is to inject a vGPU runtime for each container, which is used to virtualize the callable GPU.

For example, at the time of starting the containers, the CPU may inject a vGPU runtime into each container, and the CPU may virtualize the GPU that each container can call into one or more vGPU instances by invoking the vGPU runtime in each container according to the above-described mapping relationship. In particular implementations, the CPU may inject vGPU runtime into each container by mounting the host volume.

Optionally, the method further comprises providing the above mapping relationship to the vGPU runtime of each container, the mapping relationship being used to indicate the GPU that each container can invoke.

For example, after generating the mapping relationship, the resource scheduling system in the CPU may provide the mapping relationship to the vGPU runtime, and the vGPU runtime virtualizes the GPU that the container may call according to the mapping relationship. The resource scheduling system can provide the mapping relation to the vGPU operation time in the forms of configuration files, environment variables, command line parameters and the like.

As described above, the mapping relationship may be configured individually for each container or may be configured uniformly for all containers. When the configuration is unified for all containers, the resource scheduling system provides the GPU which can be called by each container to each vGPU, namely, the content of the mapping relation which can be called by the container which belongs to each vGPU in the running process of the vGPU is the same, and when the configuration is unified for all containers, the resource scheduling system provides the GPU which can be called by each container to each vGPU in the running process of the vGPU, namely, the content of the mapping relation which can be called by each container is the same, and the GPU which can be called by the container which belongs to the vGPU is searched in the mapping relation in the running process of the vGPU.

It should be appreciated that the mapping relationship may also be provided by the user.

And 203, providing the virtualized vGPU instance for a corresponding container.

After the CPU virtualizes the GPU that the container can call by calling the vGPU runtime, the container may use the vGPU instance in the following manner. It should be appreciated that the use of the vGPU instance by the container may specifically be the use of the vGPU instance by a work process in the container.

It should be appreciated that since the vGPU instances are virtualized by the GPU, the use of the vGPU instances corresponding to the GPUs by the work processes in the container is also equivalent to the use of the GPUs by the work processes in the container.

Optionally, step 203 may specifically include hijacking calls to the first application program interface API by the vGPU runtime injected into each container and providing a second API.

The first API may be an API provided by a GPU vendor, and may specifically be a GPU user mode API or a GPU kernel driver API. The second API is an API provided by the vGPU runtime and having exactly the same appearance as the name, appearance, etc. of the API provided by the GPU vendor, and is used to call the vGPU instance in each container.

It should be appreciated that hijacking (hijack) a call to an API may be understood as modifying the entry of the original API to jump to another API. In the embodiment of the application, the call to the first API is jumped to the second API by hijacking the call to the first API and providing the second PAI. Specifically, when a work process in a container makes a call to a first API, the vGPU runtime can block the call to the first API by the work process and provide a second API to the work process.

Specifically, when the GPU is to be used by the work process in the container, an API provided by the manufacturer, i.e., the first API, is usually called, at this time, the CPU controls the vGPU runtime in the container to block the call of the work process to the first API, and the vGPU runtime provides the API for calling the vGPU instance in the container, i.e., the second API, to the work process. Because the appearance of the first API is completely consistent with that of the second API, the working process in the container can be induced to call the second API, and the corresponding vGPU can be used by calling the second API. When the vGPU runs and hives the APIs provided by manufacturers, the vGPU can hijack the APIs in the position of the user state of the GPU, and the vGPU can hijack the APIs in the position of the driving APIs of the GPU core.

Optionally, the vGPU runtime provides the functionality to inject the mapping or modify GPU runtime environment variables.

The GPU runtime can be understood as software that manages the GPU. The environment variables are typically parameters in the operating system that specify the operating system operating environment. The environment variables involved in the embodiments of the present application may be originally used to describe the GPU mounted on each container, for example, "CUDE _visible_devices", "hip_visible_devices", etc. It will be appreciated that in this embodiment, the GPUs mounted on each container are the N GPUs described above. Through modification of the environment variable, what is rendered by the environment variable is a GPU that each container can call.

In one example, the vGPU runtime may inject (object) a mapping into the second API, and the work process in the container may use the corresponding vGPU instance by calling the second API.

For example, if the vGPU runtime of container 0 injects a mapping relationship that characterizes container 0 as being able to call GPU0 to GPU2 into the second API, then work process 0 may use vGPU instance-0 with a call to the second API, work process 1 may use vGPU instance-1 with a call to the second API, and work process 2 may use vGPU instance-2 with a call to the second API. That is, work process 0 may use GPU0, work process 1 may use GPU1, and work process 2 may use GPU2 through a call to the second interface. While the vGPU runtime of container 1 injects the mapping relation characterizing that container 1 is able to call GPU3 into the second API, work process 3 may use vGPU instance-3 by calling the second API, and work process 4 may use vGPU instance-4 by calling the second API. That is, work process 3 and work process 4 multiplex GPU3.

As another example, if all GPUs of the configuration are mounted on each container, and the vGPU is running, only the GPUs that the container can use are virtualized. Since all GPUs of the configuration are mounted in the container, the container may access all GPUs mounted, and for non-virtualized GPUs, there may be cases where the container bypasses the second API to access the non-virtualized GPUs. For example, although GPU0 to GPU3 are mounted in container 0 and only GPU0 to GPU2 are virtualized, container 0 may access GPU3, and may bypass the second API by an illegal way to use GPU3 that should be used by container 1. For another example, the container 1 also mounts GPU0 to GPU3, and although only GPU3 is virtualized, the container 1 may access GPU0 to GPU2, and may bypass the second API by an illegal way to use GPU3 that should be used by the container 0. If this occurs, isolation between container 0 and container 1 is not guaranteed.

Thus, to avoid this, the vGPU runtime may provide the ability to modify GPU runtime environment variables, changing the container from "accessible" to "inaccessible" to the un-virtualized GPU. For example, modifying the GPU runtime environment variables describing the GPU installed on container 0 to make GPU3 "inaccessible", container 0 "cannot use GPU3. Likewise, modifying the GPU runtime environment variables describing the GPUs mounted on container 1 to render GPU0 through GPU2 "inaccessible" renders container 1 unable to use GPU0 through GPU2.

It can be known that, although both the container 0 and the container 1 mount the GPUs 0 to 3, the container 0 and the container 1 can only use the vGPU instance corresponding to the GPU that can be called by themselves due to the isolation guarantee between the containers when the vGPU runs, the container 0 cannot use the GPU3 in the container 1, and the container 1 cannot use the GPUs 0 to 2 in the container 0. Thus, the isolation between container 0 and container 1 is guaranteed at the level of the vGPU runtime.

It was noted above that when the configuration of the GPU is high, the GPU may be virtualized into multiple vGPU instances for use by the container. And deadlock may occur when the communication library is dominated by the GPU for the entire logic. Specifically, when the communication library is dominated by the GPU and the GPU resources are tense or the GPU utilization is high, i.e. the GPU has little available resources, if multiple nodes in the container multiplex one GPU and there are interdependencies between the multiple nodes, a deadlock phenomenon is very easy to occur.

It should be understood that interdependence may refer to a signal that one work process (e.g., denoted as work process a) needs to wait for another work process (e.g., denoted as work process b) to continue running on the GPU, while work process b needs to be scheduled after work process a releases the GPU's resources. However, the work process b can only run on the GPU after being scheduled, and signals the work process a, and the mode of the GPU long-resident persistence kernel determines that the work process a needs to release resources after completing tasks. In this way, the working process a waits for the signal of the working process b, the working process b waits for the working process a to release the resource, and the working process a waits for each other, so that the deadlock phenomenon occurs. The long-resident persistence kernel mode may be specifically expressed in that a work process running on the GPU needs to release resources after execution ends.

It should also be appreciated that deadlock does not occur when the GPU has available resources.

For example, as shown in fig. 3, when the vGPU in the container 1 runs, the GPU3 is virtualized into the vGPU instance-3, and the vGPU instance-4 is provided for the working process 3 and the working process 4 to use respectively, so that the working process 3 and the working process 4 use the GPU3 essentially. Assuming that the work process 3 is first run on the GPU3, a signal of the work process 4 is required to continue running when the work process 3 runs to a certain node on the GPU3 for the corresponding work process 3. However, if the resources of GPU3 are already full at this time, work process 4 cannot run on GPU3. Since the work process 4 cannot be operated, a signal for continuing the operation of the work process 3 cannot be given, so that the work process 3 is in a waiting state. For the work process 4, only when the GPU3 releases the resources occupied by the work process 3, the work process 4 can run on the GPU3. However, the work process 3 waits for the signal of the work process 4, and the GPU3 cannot release the resource, so that the work process 4 is always in a waiting state. Therefore, the work process 3 and the work process 4 wait for each other, and a deadlock phenomenon occurs.

Thus, to avoid the phenomenon that multiplexing a plurality of work processes with one GPU may cause deadlock, the method 200 may further include:

scheduling a work process based on control logic in the communication library such that the work process invokes resources in the CPU for computation in response to the scheduling of the CPU.

Specifically, the control logic of the communication library is offloaded from the GPU to the CPU, and the operating system of the CPU may be responsible for scheduling the work process based on the control logic of the communication library. Because the CPU runs with the operating system, the operating system can ensure that the resource is less likely to be exhausted. Thus, interdependent work processes may be scheduled on the CPU for computation. On the other hand, unlike the long-resident persistent kernel mode of the GPU, the CPU controls the communication logic so as not to load complex logic, for example, a working process depending on external conditions, and the operation can be finished when the GPU has available resources, so that the problem of deadlock is solved when the GPU occupies the resources of the GPU but waits for another working process which is not scheduled. Instead, the CPU may schedule the interdependent work processes in turn as resources are exhausted. Therefore, the deadlock phenomenon caused by mutual waiting of the mutually-dependent working processes can be avoided.

In one implementation, all work processes are handed to the CPU for processing. That is, both interdependent and independent work processes may be scheduled by the CPU.

For interdependent work processes, the CPU may employ a round robin scheduling mechanism. For example, assuming that the working process 3 and the working process 4 in the container 1 are mutually dependent, the operating system of the CPU may schedule the working process 3 first, and when the working process 3 runs to a certain node, a signal of the working process 4 is required to continue running and the resources of the CPU are currently occupied, the CPU may schedule the working process 3, schedule the working process 4, and let the working process 4 run. At this time, the scheduled work process 4 can give a signal to the work process 3 to continue running, and the CPU can schedule the work process 4, schedule the work process 3, and allow the work process 3 to continue running. The cycle is performed in such a way, so that the work process 3 and the work process 4 are scheduled to run in turn, and the problem that resources occupied by a certain work process in the GPU are not released can be avoided, namely the deadlock phenomenon can be avoided.

For independent working processes, the CPU can schedule according to the occupation condition of resources. For example, assuming that the work process 3 and the work process 4 in the container 1 are independent, the operating system of the CPU may schedule the work process 3 first, so that the work process 3 runs. If the CPU has remaining resources, the work process 4 may continue to be scheduled. If the CPU has no residual resources, the CPU can schedule the working process 3, schedule the working process 4, or schedule the working process 4 after the working process 3 runs to release the resources.

In another implementation, the independent work processes are handed to the CPU for processing and the independent work processes are handed to the GPU for processing. The manner in which the CPU processes the interdependent work processes is the same as that previously achievable, and will not be described in detail here. For the mode of processing independent working processes by the GPU, as the GPU does not have the capability of alternately scheduling the working processes, the former working process can be operated, and after resources are released, the latter working process can be operated.

It should also be appreciated that the control logic is determined by the CPU, and the application is not limited to specific executors such as data transfer, and may be a direct memory access (direct memory access, DMA) engine, a network card, a GPU program, or the CPU itself, etc.

According to the scheme, on one hand, according to the GPUs which can be called by each container, the virtualized vGPU examples are provided for the containers through a virtualization technology, so that each container can only access the GPUs which can be called by the container, and the inter-container GPU isolation can be ensured, and on the other hand, N GPUs are respectively mounted on each container, so that when the container is started, the preset links between the GPUs mounted on different containers are prevented from being invalid, namely, the preset links between the GPUs are prevented from being blocked, and communication is allowed between the GPUs by using the preset links. Since the preset link between GPUs may be designed as a high-speed communication link, communication between GPUs is allowed using the high-speed communication link. Because the communication efficiency of the shared memory or network transmission is far lower than that of the high-speed link, the communication speed can be greatly improved, the execution efficiency is improved, and good communication performance is ensured. In addition, by unloading the control logic of the aggregate communication library from the GPU to the CPU, the phenomenon that a plurality of working processes multiplex the same GPU and deadlock is likely to occur is avoided.

The method provided by the embodiment of the application is described in detail above with reference to fig. 2 to 3. The following describes in detail the apparatus provided in the embodiment of the present application with reference to fig. 4 to 5.

Fig. 4 is a schematic block diagram of an apparatus provided by an embodiment of the present application. As shown in fig. 4, the apparatus 400 may include a control module 410 and a virtualization module 420. The modules in the apparatus 400 may be used to implement the corresponding flow of the CPU in the method 200 shown in fig. 2. For example, control module 410 may be used to perform steps 201 and 203 in method 200 and virtualization module 420 may be used to perform step 202 in method 200.

Specifically, the control module 410 may be configured to mount N GPUs on each container of the plurality of containers, where a preset link exists between the N GPUs, where N is an integer greater than 1, the virtualization module 420 may be configured to virtualize the GPUs that can be invoked by each container to obtain one or more vGPU instances corresponding to each container, and the control module 410 is further configured to provide the virtualized vGPU instances to the corresponding containers for use.

Optionally, the control module 410 may be further configured to inject a vGPU runtime for each container, where the vGPU runtime injected into each container is used to virtualize the callable GPUs.

Alternatively, the control module 410 may be specifically configured to hijack calls to a first application program interface API by the vGPU runtime injected into each container, and provide a second API for calling the vGPU instance within each container, where the first API is a GPU custom API or a GPU kernel driver API provided by a GPU vendor.

Optionally, the vGPU runtime provides the functionality to inject mappings or modify GPU runtime environment variables.

Optionally, the control module 410 may be further configured to provide a mapping relationship to the vGPU runtime of each container, where the mapping relationship is used to indicate GPUs that each container can invoke.

Optionally, the GPU mounted by each container of the plurality of containers is configured by the system or by a user.

Optionally, the control module 410 may be further configured to schedule a work process based on control logic in the communication library, such that the work process invokes resources in the CPU for computation in response to the scheduling of the CPU.

It should be understood that the division of the modules in the embodiment of the present application is illustrative, and is merely a logic function division, and other division manners may be implemented in practice. In addition, each functional module in the embodiments of the present application may be integrated in one processor, or may exist alone physically, or two or more modules may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules.

Fig. 5 is another schematic block diagram of an apparatus provided by an embodiment of the present application. The apparatus 500 may be used to implement the functions of the CPU in the method 200 described above. The apparatus 500 may be a system-on-chip. In the embodiment of the application, the chip system can be formed by a chip, and can also comprise the chip and other discrete devices.

As shown in fig. 5, the apparatus 500 may include at least one processor 510 for implementing the CPU functions in the method 200 provided in the embodiment of the present application.

Illustratively, when the apparatus 500 is used to implement the functions of the CPU in the method 200 provided by the embodiment of the present application, the processor 510 may be configured to mount N GPUs on each container of the plurality of containers, where N is an integer greater than 1 and a preset link exists between the N GPUs, virtualize the GPUs that can be invoked by each container to obtain one or more vGPU instances corresponding to each container, and provide the vcpu instances obtained by the virtualization to the corresponding container for use. Reference is made specifically to the detailed description in the method examples, and details are not described here.

The apparatus 500 may also include at least one memory 520 for storing program instructions and/or data. Memory 520 is coupled to processor 510. The coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units, or modules, which may be in electrical, mechanical, or other forms for information interaction between the devices, units, or modules. Processor 510 may operate in conjunction with memory 520. Processor 510 may execute program instructions stored in memory 520. At least one of the at least one memory may be included in the processor.

The apparatus 500 may also include a communication interface 530 for communicating with other devices over a transmission medium, such that the apparatus 500 may communicate with other devices. The communication interface 530 may be, for example, a transceiver, an interface, a bus, a circuit, or a device capable of implementing a transceiver function. Processor 510 may utilize communication interface 530 to transceive data and/or information and may be used to implement methods performed by the CPU in the corresponding embodiment of fig. 2.

The specific connection medium between the processor 510, the memory 520, and the communication interface 530 is not limited to the above embodiments of the present application. The embodiment of the present application is illustrated in fig. 5 as a bus connection between processor 510, memory 520, and communication interface 530. The bus is shown in bold lines in fig. 5, and the manner in which other components are connected is merely illustrative and not limiting. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 5, but not only one bus or one type of bus.

It should be appreciated that the processor in embodiments of the present application may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be implemented by integrated logic circuits of hardware in a processor or instructions in software form. The processor may be a general purpose processor, a digital signal processor (DIGITAL SIGNAL processor, DSP), an Application SPECIFIC INTEGRATED Circuit (ASIC), a field programmable gate array (field programmable GATE ARRAY, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.

It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an erasable programmable ROM (erasable PROM), an electrically erasable programmable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (STATIC RAM, SRAM), dynamic random access memory (DYNAMIC RAM, DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (double DATA RATE SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (ENHANCED SDRAM, ESDRAM), synchronous link dynamic random access memory (SYNCHLINK DRAM, SLDRAM), and direct memory bus random access memory (direct rambus RAM, DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

The application also provides a chip comprising at least one processor for implementing the functions involved in the CPU in the embodiment shown in fig. 2.

In one possible design, the chip may further include a memory for holding program instructions and data, the memory being located within the processor or external to the processor.

The application also provides a computing device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, which processor implements the method of the embodiment shown in fig. 2 when executing the computer program.

The present application also provides a computer-readable storage medium storing a computer program (which may also be referred to as code, or instructions). The computer program, when executed, causes the computer to perform the method of the embodiment shown in fig. 2.

The application also provides a computer program product comprising a computer program which, when allowed, implements the method of the embodiment shown in fig. 2.

The terms "unit," "module," and the like as used in this specification may be used to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution.

Those of ordinary skill in the art will appreciate that the various illustrative logical blocks (illustrative logical block) and steps (steps) described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. In the several embodiments provided by the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

In the above-described embodiments, the functions of the respective functional units may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions (programs). When the computer program instructions (program) are loaded and executed on a computer, the processes or functions according to embodiments of the present application are fully or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL), or wireless (e.g., infrared, wireless, microwave, etc.), the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc., that contains an integration of one or more available media, the available media may be magnetic media, (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., digital video disc (digital video disc, DVD), or semiconductor media (e.g., solid state hard disk (solid STATE DISK, SSD)), or the like.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present application. The storage medium includes various media capable of storing program codes such as a U disk, a mobile hard disk, a ROM, a RAM, a magnetic disk or an optical disk.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A device management method, characterized in that the method comprises:

Mounting N graphics processing units (GPUs) corresponding to all containers on each of the multiple containers, wherein a preset link exists between the N GPUs, and the GPUs communicate with each other using a high-speed interconnect. Each container corresponds to the same GPU, and N is an integer greater than 1;

Virtualizing the GPU that can be called by each container to obtain one or more vGPU instances corresponding to each container;

The vGPU instances obtained by virtualizing each GPU are provided to corresponding different containers for use; the vGPU runtime provides the function of injecting mapping relationships or modifying GPU runtime environment variables.

2. The method according to claim 1, further comprising:

A vGPU runtime is injected into each container, and the vGPU runtime injected into each container is used to virtualize a callable GPU.

3. The method according to claim 2, wherein providing the virtualized vGPU instance to the corresponding container for use comprises:

The vGPU runtime injected into each container hijacks the call to the first application program interface API and provides a second API, where the first API is a GPU user-mode API or a GPU kernel driver API provided by the GPU manufacturer, and the second API is used to call the vGPU instance in each container.

4. The method according to claim 2 or 3, further comprising:

A mapping relationship is provided to the vGPU runtime of each container, where the mapping relationship is used to indicate the GPU that can be called by each container.

5. The method of claim 1, wherein the GPU mounted on each of the multiple containers is configured by the system or by the user.

6 . The method of claim 1 , wherein each of the plurality of containers comprises one or more worker processes, and each worker process is provided with one or more vGPU instances.

7. The method according to claim 6, characterized in that it is applied to a central processing unit (CPU), and further comprises:

The work process is scheduled based on the control logic in the communication library, so that the work process responds to the scheduling of the CPU and calls the resources in the CPU for calculation.

8. A device management apparatus, characterized by comprising a module for executing the method according to any one of claims 1 to 7.

9. A device management apparatus, comprising a processor, wherein the processor is configured to execute program code so that the apparatus implements the method according to any one of claims 1 to 7.

10. A chip, comprising: at least one processor, configured to implement the functions involved in the method according to any one of claims 1 to 7.

11. A computing device, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the computer program.

12. A computer program product, comprising a computer program, wherein when the computer program is executed, the method according to any one of claims 1 to 7 is implemented.

13. A computer-readable storage medium storing a computer program, wherein when the computer program is executed, the method according to any one of claims 1 to 7 is implemented.