Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean that a exists alone, while a and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
In the scenario of large model training based on a cluster of graphics processing units (GPUs), use of the file system mainly involves three types of Input/Output (IO) operations: dataset reads, checkpoint data writes, and checkpoint data reads.
Dataset reading: at the beginning of each training iteration on the GPU cluster, the training framework (e.g., PyTorch) loads the dataset using a data loader. Dataset reading consists of a large number of random IO operations, predominantly 4K random reads. The dataset may also be converted in advance into a different data format by some means, so that dataset reads become sequential rather than random read operations.
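By way of illustration only, the following minimal sketch shows the dataset-read pattern described above using PyTorch's DataLoader; the file path, record size, and loader parameters are assumptions chosen for the example, not part of the disclosed system.

```python
# A minimal sketch of the dataset-read pattern, assuming a dataset packed
# into one large file of fixed-size records; each sample access becomes a
# random 4K read. Path and sizes are illustrative.
import os
import torch
from torch.utils.data import Dataset, DataLoader

class PackedFileDataset(Dataset):
    def __init__(self, path, record_size=4096):
        self.path = path
        self.record_size = record_size
        self.num_records = os.path.getsize(path) // record_size

    def __len__(self):
        return self.num_records

    def __getitem__(self, idx):
        with open(self.path, "rb") as f:
            f.seek(idx * self.record_size)   # random 4K-aligned offset
            buf = f.read(self.record_size)
        return torch.frombuffer(bytearray(buf), dtype=torch.uint8)

# shuffle=True yields the random-read pattern; a dataset converted in
# advance could instead be consumed with shuffle=False as sequential reads.
loader = DataLoader(PackedFileDataset("/mnt/gpufs/train.bin"),
                    batch_size=32, shuffle=True, num_workers=4)
```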
Checkpoint data writing: after the GPU cluster training has completed a certain number of iterations, the training framework needs to persist the output of the training process, i.e., the checkpoint data. Stored checkpoint data mainly serves two purposes: on the one hand, directly outputting the model for fine-tuning and continued training; on the other hand, resuming training from the checkpoint data after fault recovery if a fault occurs later in training. Checkpoint data is typically large; for example, the checkpoint data corresponding to a 70B model is approximately 980GB. Checkpoint data is therefore typically saved and read in a sharded, distributed manner. Checkpoint data writing is characterized by a large number of sequential write operations, predominantly 1M sequential writes.
Checkpoint data reading: in the course of large-scale model training on a large GPU cluster, training interruptions caused by software or hardware faults are inevitable. Training must continue after fault recovery, at which point the previously stored checkpoint data needs to be read. Checkpoint data reading is characterized by a large number of sequential read operations, predominantly 1M sequential reads.
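The following hedged sketch illustrates the sharded (slice-distributed) checkpoint write and read pattern described in the two preceding paragraphs; it assumes an initialized torch.distributed process group, and the checkpoint directory and file naming are illustrative assumptions.

```python
# A hedged sketch of sharded checkpoint IO: each rank persists and later
# restores only its own shard. Assumes torch.distributed is initialized;
# directory and file names are illustrative.
import os
import torch
import torch.distributed as dist

CKPT_DIR = "/mnt/gpufs/checkpoints/step_1000"  # assumed mount point

def save_checkpoint_shard(model, optimizer):
    os.makedirs(CKPT_DIR, exist_ok=True)
    rank = dist.get_rank()
    # Large sequential write, matching the 1M sequential-write pattern.
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict()},
               os.path.join(CKPT_DIR, f"shard_{rank}.pt"))

def load_checkpoint_shard(model, optimizer):
    rank = dist.get_rank()
    # Large sequential read after fault recovery.
    state = torch.load(os.path.join(CKPT_DIR, f"shard_{rank}.pt"),
                       map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
```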
Storage systems currently used for dataset reading and checkpoint data reading/writing in GPU cluster training mainly follow one of three schemes. 1. General network file systems or distributed file systems, such as NFS and CephFS: these ensure data reliability but require independent deployment and operation/maintenance, and their performance is relatively poor. 2. Parallel file systems, such as Lustre, GPFS, and BeeGFS: these perform well but also require independent deployment and operation/maintenance; their physical form is mainly either an all-in-one appliance (a single storage device) or a distributed form (multiple storage servers). 3. Object storage: this scales well but performs poorly; some scenarios can be accelerated with a distributed cache, e.g., JuiceFS or Alluxio, but these require data to be downloaded in advance to a local file system in the GPU cluster, or require the GPU cluster to locally provide an s3fs-like client file system offering access with POSIX semantics.
The storage systems currently used for dataset reading and checkpoint data reading/writing in GPU cluster training are large and complex, and most of their functions are not required by the three core IO service scenarios of GPU cluster training. The embodiments of the present disclosure therefore provide a distributed file system adapted to GPU cluster training, which fits the three core IO service scenarios of the GPU cluster training scenario, is simple to operate and maintain, saves cost, and improves performance. The distributed file system adapted to GPU cluster training provided by embodiments of the present disclosure is described in detail below.
Fig. 1 illustrates a block diagram of a distributed file system adapted to GPU cluster training according to an embodiment of the present disclosure. As shown in Fig. 1, the system comprises a management module, a server side, and a client side.
The management module is configured to, after a target GPU cluster is allocated to a target training task, create servers on a plurality of first computer nodes included in the target GPU cluster and create clients on a plurality of second computer nodes included in the target GPU cluster, wherein at least one hardware GPU is deployed on each computer node included in the target GPU cluster.
A GPU cluster is a computer cluster comprising a plurality of computer nodes. Each computer node is provided not only with at least one hardware GPU but also with CPU computing resources, whether a single-core CPU, a multi-core CPU, or even multiple CPUs. GPU clusters leverage the computing power of hardware GPUs to execute computing tasks very quickly.
For a target training task, a target GPU cluster can be allocated to the task based on a Kubernetes platform according to the task requirements. The number of computer nodes included in the target GPU cluster depends on the task requirements of the target training task, which is not specifically limited by the present disclosure.
After the target GPU cluster is allocated to the target training task and before the execution of the target training task is started, the management module may create a server for a plurality of first computer nodes included in the target GPU cluster according to the actual IO requirements of the target training task.
The management module may reuse the Operator mechanism for cluster management in the Kubernetes platform to create and distribute servers to the plurality of first computer nodes in the target GPU cluster.
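As a sketch of what such Operator-driven creation might look like, the following example uses the Kubernetes Python client to start one server pod per first computer node; the image name, namespace, and pod layout are assumptions for illustration, not the disclosed implementation.

```python
# A sketch, assuming the official Kubernetes Python client, of creating
# one server pod per first computer node. Image, namespace, and naming
# are illustrative assumptions.
from kubernetes import client, config

def create_servers(first_nodes, namespace="gpu-fs"):
    config.load_incluster_config()  # management module runs in-cluster
    api = client.CoreV1Api()
    for node in first_nodes:
        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(name=f"fs-server-{node}"),
            spec=client.V1PodSpec(
                node_name=node,  # pin the server to this first node
                containers=[client.V1Container(
                    name="fs-server",
                    image="example/gpu-fs-server:latest")]))
        api.create_namespaced_pod(namespace=namespace, body=pod)
```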
The server is configured to construct a GPU cluster distributed file system adapted to the target training task using the preset local memory space deployed on the corresponding first computer node.
The plurality of servers can communicate over a remote direct memory access (Remote Direct Memory Access, RDMA) network, so that the preset local memory spaces deployed on the plurality of first computer nodes serve as the file system back end, jointly forming a GPU cluster distributed file system adapted to the target training task.
The number of first computer nodes can be determined from the actual IO requirements of the target training task and the size of the preset local memory space deployed on each first computer node, as long as the total size of the preset local memory spaces on all the first computer nodes, i.e., the storage capacity of the constructed GPU cluster distributed file system, meets the actual IO requirements of the target training task. The first computer nodes may be all of the computer nodes in the target GPU cluster, or only some of them, which is not specifically limited in this disclosure.
As shown in Fig. 1, only some of the computer nodes in the target GPU cluster serve as first computer nodes, on which servers are deployed.
In one possible implementation, the memory size of the preset local memory space deployed on each first computer node is the same.
To ensure the uniformity of the GPU cluster distributed file system constructed as described above, the preset local memory spaces deployed on the first computer nodes may be set to the same size.
In an example, the preset local memory space deployed on the first computer node may be local NVMe (Non-Volatile Memory Express) disk space deployed on the first computer node, shared memory space deployed on the first computer node (for example, the shared memory /dev/shm in a Linux operating system), or another local memory space deployed on the first computer node, which is not specifically limited in the embodiments of the present disclosure.
For example, a target GPU cluster including N computer nodes is allocated to a target training task. Before the target training task starts, the management module creates a server on every computer node included in the target GPU cluster; that is, every computer node in the cluster serves as a first computer node. If the preset local memory space deployed on each first computer node is 15.36TB, the N servers can communicate over an RDMA network to jointly construct a GPU cluster distributed file system adapted to the target training task with a storage capacity of 15.36TB × N.
The management module is also configured to create clients on the plurality of second computer nodes included in the target GPU cluster.
The number of second computer nodes may be determined based on the actual data reading requirements of each computer node in the target GPU cluster. When the target training task is executed on the target GPU cluster, each computer node performs a different training process, and based on the training process performed by each computer node, it can be determined whether that node has a data reading requirement. The management module may create clients only on the plurality of second computer nodes in the target GPU cluster that have data reading requirements, without creating clients on the computer nodes that have no data reading requirements. The second computer nodes may be all of the computer nodes in the target GPU cluster, or only some of them, which is not specifically limited in this disclosure.
As shown in Fig. 1, only some of the computer nodes in the target GPU cluster serve as second computer nodes, on which clients are deployed.
Furthermore, some computer nodes in the target GPU cluster serve both as first computer nodes and as second computer nodes; that is, both a server and a client are deployed on these computer nodes.
As shown in Fig. 1, a server and a client are both deployed on some of the computer nodes included in the target GPU cluster.
The client is configured to perform IO operations on the GPU cluster distributed file system during execution of the target training task on the target GPU cluster.
The client is integrated into the target training task, so that the GPU cluster distributed file system constructed for the target training task can be accessed through the client, and IO operations can be performed on the GPU cluster distributed file system through the client while the target training task is executed on the target GPU cluster.
In one possible implementation, the GPU cluster distributed file system is used to store the data set required in the target training task, as well as checkpoint data during training.
Because the GPU cluster distributed file system can store the dataset required by the target training task and the checkpoint data produced during training, the client can perform IO operations on it such as dataset reads, checkpoint data writes, and checkpoint data reads, thereby fitting the three core IO service scenarios of GPU cluster training.
In the embodiments of the present disclosure, a distributed file system adapted to GPU cluster training is designed, comprising a management module, servers, and clients. After a target GPU cluster is allocated to a target training task, the management module creates servers on a plurality of first computer nodes included in the target GPU cluster and creates clients on a plurality of second computer nodes included in the target GPU cluster, according to the data read/write requirements of the target training task. The servers can thus elastically construct a GPU cluster distributed file system adapted to the target training task using the preset local memory spaces deployed on the corresponding first computer nodes, and, during execution of the target training task on the target GPU cluster, the clients can perform IO operations on the GPU cluster distributed file system, effectively meeting the IO requirements of the GPU cluster training scenario.
In one possible implementation, the client comprises a system call interception module and a client communication module, and the server comprises a server communication module. The system call interception module is configured to intercept a target file system call in the second computer node where the client is located and to convert the target file system call into an IO request, the target file system call being used to access the GPU cluster distributed file system. The client communication module is configured to send the IO request to the server communication module.
Fig. 2 shows a schematic diagram of interactions between a client and a server according to an embodiment of the present disclosure. As shown in Fig. 2, each client includes a system call interception module.
When the second computer node where the client is located has a data reading requirement while executing its training process, it initiates a target file system call to the constructed GPU cluster distributed file system. The system call interception module included in the client intercepts the target file system call, which can then be converted into an IO request through random hashing, and the client communication module sends the IO request to the server communication module included in the server.
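Interception itself is typically performed by hooking libc file system calls (e.g., via an LD_PRELOAD shim) rather than in Python; the sketch below therefore illustrates only the conversion step, hashing an intercepted path to choose a server and build an IO request. The request fields and server list are assumptions.

```python
# Illustration of converting an intercepted file system call into an IO
# request by hashing the path onto a server; fields and the server list
# are assumptions.
import hashlib
from dataclasses import dataclass

SERVERS = ["node-a:9000", "node-b:9000", "node-c:9000"]  # assumed

@dataclass
class IORequest:
    op: str       # e.g., "read" or "write"
    path: str
    offset: int
    length: int
    server: str   # chosen by hashing the path

def to_io_request(op, path, offset, length):
    digest = hashlib.sha1(path.encode()).digest()
    index = int.from_bytes(digest[:4], "little") % len(SERVERS)
    return IORequest(op, path, offset, length, SERVERS[index])
```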
In one possible implementation, the client communication module includes a remote procedure call (Remote Procedure Call, RPC) client module, the server communication module includes an RPC server module, and the RPC client module is configured to send the IO request to the RPC server module in a non-blocking RPC network communication mode.
The client and the server can communicate across computer nodes through a high-performance RPC network, and in addition, in order to further improve the performance, a non-blocking RPC network communication mode is adopted for communication between the client and the server.
In the non-blocking RPC network communication mode, after the RPC client module initiates a remote call to the RPC server module, it does not wait for the remote call to finish executing but returns immediately and executes subsequent code. When the remote call completes, the RPC client may obtain the remote call result in some manner (e.g., a callback function or an event notification).
The RPC client module in the embodiments of the present disclosure may use a non-blocking RPC network communication mode to communicate with any one RPC server module.
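A minimal sketch of these non-blocking call semantics, using Python's asyncio: the client initiates the remote call, continues with local work, and later collects the result via a callback. The send_rpc function is an assumed stand-in for the real RPC client module, not its actual interface.

```python
# Non-blocking RPC semantics sketched with asyncio; send_rpc is an
# assumed placeholder for the real RPC client module.
import asyncio

async def send_rpc(server, request):
    await asyncio.sleep(0.01)  # stands in for the network round trip
    return f"result of {request!r} from {server}"

def do_local_work():
    print("local work proceeds while the RPC is in flight")

async def main():
    # Initiate the remote call without waiting for it to complete...
    task = asyncio.create_task(send_rpc("node-a:9000", "read chunk 3"))
    task.add_done_callback(lambda t: print("callback:", t.result()))
    do_local_work()   # ...and immediately execute subsequent code.
    await task        # the result may also be obtained by awaiting

asyncio.run(main())
```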
As shown in Fig. 2, each client includes an RPC client module, and each server includes an RPC server module; each RPC client module may communicate with each RPC server module based on a high-performance network (e.g., an RPC network).
In one possible implementation, the client communication module comprises an IPC client module, and the server communication module comprises an IPC server module, wherein the IPC client module is configured to send the IO request to the IPC server module when the first computer node where the server is located and the second computer node where the client is located are the same computer node.
When the client and the server are located in the same computer node, in order to improve performance, an Inter-process communication (Inter-Process Communication, IPC) mode may be used to perform communication inside the computer node.
As shown in Fig. 2, the client includes an IPC client module, and the server includes an IPC server module. A client and a server located in the same computer node can communicate within the node based on the IPC client module and the IPC server module.
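A hedged sketch of such same-node IPC between a client and a server, using a Unix domain socket with length-prefixed framing; the socket path and framing are assumptions, since the disclosure does not specify the IPC transport.

```python
# Same-node IPC sketched with a Unix domain socket and length-prefixed
# messages; socket path and framing are assumptions.
import socket

SOCK_PATH = "/tmp/gpu-fs.sock"  # assumed per-node socket path

def _recv_exactly(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("IPC server closed the connection")
        buf += chunk
    return buf

def ipc_send(payload: bytes) -> bytes:
    """Send one length-prefixed request and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCK_PATH)
        s.sendall(len(payload).to_bytes(4, "big") + payload)
        reply_len = int.from_bytes(_recv_exactly(s, 4), "big")
        return _recv_exactly(s, reply_len)
```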
In one possible implementation, the client includes a file map, the file map storing access records of the client to the GPU cluster distributed file system.
The file map in the client stores access records of the client to the GPU cluster distributed file system, e.g., records which files of the GPU cluster distributed file system the client currently has access to.
In one example, the file map may store data in a map data structure.
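For illustration, a client-side file map might be as simple as a dictionary keyed by file descriptor; the record fields below are assumptions.

```python
# A toy file map: file descriptor -> access record; field names assumed.
file_map = {}

def record_open(fd, path, mode):
    file_map[fd] = {"path": path, "mode": mode, "offset": 0}

def record_close(fd):
    file_map.pop(fd, None)
```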
In the embodiments of the present disclosure, the high-performance network is abstracted into an RPC layer and a bulk layer. Fig. 3 illustrates a schematic diagram of metadata information and data transmission between a client and a server according to an embodiment of the present disclosure. As shown in Fig. 3, the RPC layer is used to transfer metadata information point-to-point between the client and the server, and the bulk layer is used to transfer the actual data, which it may do in an RDMA manner.
In one example, the underlying implementation of RPC transport may invoke the OFED interface through a Fabric API, which in turn calls the Verbs or TCP interface, with the data finally transmitted through a high-performance network card driver. For the specific transmission procedure, reference may be made to RPC transmission procedures in the related art, which are not specifically limited in the embodiments of the present disclosure.
In an example, when transmitting data based on the RPC layer, a non-blocking RPC communication mode may be employed to improve data transmission performance.
In an example, a thread pool may be pre-constructed, and further, data transmission between the client and the server is implemented by using threads in the thread pool, so as to improve data transmission performance.
The data transmission between the client and the server is described in detail below.
In one possible implementation, the IO request comprises a metadata IO request, the client comprises a data segmentation module, and the server comprises a metadata database. The data segmentation module is configured to segment target data to be written into the GPU cluster distributed file system, determining a plurality of data blocks corresponding to the target data and the metadata information of each data block, where the metadata information corresponding to each data block indicates the target server to which that data block needs to be written. The client communication module is configured to send the metadata IO request to each server communication module, the metadata IO request requesting that the metadata information of each data block corresponding to the target data be written into each server. The metadata database in each server is configured to store, in response to the metadata IO request, the metadata information of each data block corresponding to the target data.
In the embodiments of the present disclosure, the GPU cluster distributed file system is a fully distributed system; that is, each client may communicate with every server in the GPU cluster distributed file system, so that each client can independently resolve every server in the system, without any central data structure tracking the location of metadata information or data.
To achieve balanced data distribution for large files, before a client requests that target data be written into the GPU cluster distributed file system for storage, the data segmentation module may be used to segment the target data into a plurality of equally sized data blocks and to determine the metadata information of each data block, where the metadata information corresponding to each data block indicates the target server to which that data block needs to be written.
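A minimal sketch of such a data segmentation module, assuming a fixed block size and round-robin assignment of blocks to servers (the disclosure does not fix either choice):

```python
# Data segmentation sketch: equal-size blocks plus per-block metadata
# naming the target server. Block size and round-robin placement are
# assumptions (the final block may be shorter if the data is not an
# exact multiple of the block size).
BLOCK_SIZE = 1 << 20  # 1 MiB, matching the 1M sequential-IO pattern

def segment(data: bytes, servers: list):
    blocks, metadata = [], []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        target = servers[len(blocks) % len(servers)]  # balanced spread
        metadata.append({"index": len(blocks),
                         "offset": offset,
                         "length": len(block),
                         "target_server": target})
        blocks.append(block)
    return blocks, metadata
```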
As shown in Fig. 2, each server includes a metadata database for storing metadata information.
In one possible implementation, the number of specified data blocks corresponding to the target data written to each target server is the same.
In order to realize balanced data distribution of large files, the number of designated data blocks corresponding to target data written into each target server is the same.
In an example, any one client may choose to store the data blocks corresponding to the target data in a balanced, distributed manner across all the servers, or across only some of the servers.
For example, the GPU cluster distributed file system corresponds to 10 servers, and the client stores all the data blocks corresponding to the target data in a balanced, distributed manner across all the servers; that is, all 10 servers are target servers. The client may divide the target data into 20 equally sized data blocks using the data segmentation module and then store 2 data blocks on each target server.
For example, the GPU cluster distributed file system corresponds to 10 servers, and the client stores all the data blocks corresponding to the target data in a balanced, distributed manner across 8 of the servers; that is, 8 of the 10 servers (e.g., selected randomly or specified in configuration information) are the target servers. The client may divide the target data into 8 equally sized data blocks using the data segmentation module and then store 1 data block on each target server.
After determining the metadata information of each data block obtained by segmenting the target data, the client may generate a metadata IO request and send it to every server using the client communication module and the server communication modules, requesting that the metadata information of each data block corresponding to the target data be written into every server. That is, the metadata information of each data block corresponding to the target data is stored in full on every server in the GPU cluster distributed file system, making the GPU cluster distributed file system a fully distributed system.
Because the GPU cluster distributed file system provided by the embodiments of the present disclosure needs to support only the three core IO service scenarios of GPU cluster training, the size of the metadata information can be reduced, so that metadata information can be processed with high performance even at the scale of hundreds or thousands of computer nodes.
In one possible implementation, the IO requests comprise data IO requests, and the server comprises a persistence module. The client communication module is configured to send the data IO requests to a plurality of target servers, where the data IO request corresponding to each target server requests that the designated data blocks corresponding to the target data be written into that target server. The persistence module in each target server is configured to, in response to the received data IO request and according to the metadata information stored in the metadata database of that target server, persistently store the designated data blocks to be written into that target server into the preset local memory space corresponding to that target server.
After the metadata information of each data block corresponding to the target data has been stored in full on every server in the GPU cluster distributed file system, the client generates data IO requests from that metadata information and sends them to each target server using the client communication module and the server communication modules, so that each target server, according to the metadata information stored in its metadata database, persistently stores the designated data blocks to be written to it into its corresponding preset local memory space.
Fig. 4 illustrates a schematic diagram of a client writing a file to the GPU cluster distributed file system according to an embodiment of the present disclosure. As shown in Fig. 4, when a training process on the computer node where the client is located needs to write a file to the GPU cluster distributed file system, the target data is first written into a cache in the client and then divided by the data segmentation module into 6 equally sized data blocks, data block 0 to data block 5. Data block 0 and data block 5 are packed and sent via thread 1 to the preset local memory space in server 1, data block 2 and data block 4 are packed and sent via thread 2 to the preset local memory space in server 2, and data block 1 and data block 3 are packed and sent via thread 3 to the preset local memory space in server 3.
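A hedged sketch of this write path: blocks destined for the same server are grouped and sent concurrently, one worker per target server, from a pre-built thread pool as noted earlier; send_to_server is an assumed stand-in for the actual RPC/RDMA transport.

```python
# Write-path sketch: blocks grouped per target server are sent
# concurrently, one worker per server; send_to_server stands in for the
# actual RPC/RDMA transport.
from concurrent.futures import ThreadPoolExecutor

def send_to_server(server, packed_blocks):
    ...  # actual transport elided in this sketch

def write_blocks(assignment):  # assignment: {server: [block, ...]}
    with ThreadPoolExecutor(max_workers=len(assignment)) as pool:
        futures = [pool.submit(send_to_server, srv, blocks)
                   for srv, blocks in assignment.items()]
        for f in futures:
            f.result()  # surface any transfer error
```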
In one possible implementation, the data IO request corresponding to each target server includes client memory region information for the designated data blocks to be written into that target server, and the persistence module in each target server reads the designated data blocks, in an RDMA manner, from the memory space specified by the client memory region information included in the received data IO request.
Where the underlying network transport protocol supports it, the client can expose to the target server, through the data IO request, the client memory region information corresponding to the designated data blocks to be written into that target server, so that the target server can read the designated data blocks directly, in an RDMA manner, from the memory space specified by the client memory region information, improving data transmission performance.
In one possible implementation, the persistence module in each target server stores each designated data block to be written into that target server in a segment (chunk) file in the preset local memory space corresponding to the target server.
The persistence module in each target server stores each designated data block to be written into that target server into a segment (chunk) file in the preset local memory space corresponding to the target server; the chunk file ultimately resides in the node-local storage underlying that memory space.
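A minimal sketch of such a persistence module, assuming the preset local memory space is a /dev/shm-backed directory and that each block is written at its offset within a per-file chunk file:

```python
# Persistence sketch: write each designated block at its offset within a
# per-file chunk file in an assumed /dev/shm-backed directory.
import os

CHUNK_DIR = "/dev/shm/gpu-fs/chunks"  # assumed preset local memory space

def persist_block(file_id: str, offset: int, block: bytes) -> None:
    os.makedirs(CHUNK_DIR, exist_ok=True)
    path = os.path.join(CHUNK_DIR, f"{file_id}.chunk")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.pwrite(fd, block, offset)  # positioned write into the chunk
    finally:
        os.close(fd)
```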
In one possible implementation, the GPU cluster distributed file system is further configured to asynchronously send all or part of the checkpoint data to a parallel file system (PFS) external to the target GPU cluster.
In actual use, the GPU cluster distributed file system may serve only as temporary storage, and all or part of the checkpoint data it temporarily stores may be sent asynchronously to a PFS (e.g., GPFS, Lustre, etc.) external to the target GPU cluster during or after execution of the target training task, so that such data is not affected by failures of the target GPU cluster.
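A minimal sketch of asynchronously draining checkpoint data to an external PFS, assuming both the GPU cluster distributed file system and the PFS are visible as mounted directories; the paths and file pattern are illustrative assumptions.

```python
# Asynchronous drain sketch: a background thread copies checkpoint files
# from the cluster file system to an external PFS mount; paths assumed.
import shutil
import threading
from pathlib import Path

def async_copy_to_pfs(local_dir="/mnt/gpufs/checkpoints",
                      pfs_dir="/mnt/pfs/checkpoints"):
    def drain():
        Path(pfs_dir).mkdir(parents=True, exist_ok=True)
        for src in Path(local_dir).glob("**/*.pt"):
            shutil.copy2(src, Path(pfs_dir) / src.name)
    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t  # the caller may join() after training completes
```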
In one possible implementation, the management module is configured to delete the client and the server after completing the target training task using the target GPU cluster.
After the target training task has been completed using the target GPU cluster, the management module may delete the clients and servers to release resources.
In an example, the management module may reuse the Operator mechanism for cluster management in the Kubernetes platform to create and delete servers on the computer nodes according to the target training tasks.
In an example, the management module is further configured to store the configuration information of each server and the monitoring information collected during training, so as to help ensure that the target training task is successfully executed on the target GPU cluster.
In the embodiments of the present disclosure, a distributed file system adapted to GPU cluster training is designed, comprising a management module, servers, and clients. After a target GPU cluster is allocated to a target training task, the management module creates servers on a plurality of first computer nodes included in the target GPU cluster and creates clients on a plurality of second computer nodes included in the target GPU cluster, according to the data read/write requirements of the target training task. The servers can thus elastically construct a GPU cluster distributed file system adapted to the target training task using the preset local memory spaces deployed on the corresponding first computer nodes, and, during execution of the target training task on the target GPU cluster, the clients can perform IO operations on the GPU cluster distributed file system, effectively meeting the IO requirements of the GPU cluster training scenario.
Compared with schemes that use main memory or a local file system, or with traditional parallel file systems, the distributed file system adapted to GPU cluster training provided by the embodiments of the present disclosure can provide a unified storage space as well as greater capacity and linearly increasing throughput. Moreover, it supports only the file system calls of the GPU cluster training scenario, discarding the other capabilities of traditional file systems and traditional parallel file systems and simplifying the call chain, so that GPU cluster training can achieve 6 to 7 times or more of the performance with very simple operation and maintenance. In addition, the distributed file system adapted to GPU cluster training provided by the embodiments of the present disclosure can use the shared memory on each computer node in the GPU cluster as a back-end storage medium, achieving 10 times or more of the performance.
The distributed file system adapted to GPU cluster training provided by the embodiments of the present disclosure can be elastically created and deleted along with the starting and deletion of training tasks, deploying a GPU cluster distributed file system that meets the task requirements, fully utilizing the CPU, storage, PCIe bandwidth, and network resources of the computing nodes in the GPU cluster, saving cost, and improving performance.
Fig. 5 illustrates a flowchart of a method of creating a distributed file system adapted to GPU cluster training according to an embodiment of the present disclosure. As shown in Fig. 5, the method includes:
In step S51, after a target GPU cluster is allocated to a target training task, servers are created on a plurality of first computer nodes included in the target GPU cluster, and clients are created on a plurality of second computer nodes included in the target GPU cluster, wherein at least one hardware GPU is deployed on each computer node included in the target GPU cluster.
In step S52, a GPU cluster distributed file system adapted to the target training task is constructed, based on the servers, using the preset local memory spaces deployed on the corresponding first computer nodes.
In step S53, based on the clients, IO operations are performed on the GPU cluster distributed file system during execution of the target training task on the target GPU cluster.
For the specific process by which the client performs IO operations on the GPU cluster distributed file system, reference may be made to the related descriptions of the embodiments shown in Figs. 1 to 5, which are not repeated here.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principles and logic, which, due to space limitations, are not repeated in the present disclosure. It will be appreciated by those skilled in the art that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure further provides an electronic device, a computer readable storage medium, and a program, each of which can be used to implement any of the methods provided in the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions of the method parts, which are not repeated here.
The method has a specific technical association with the internal structure of a computer system and can solve technical problems of improving hardware operation efficiency or execution effect (including reducing the amount of stored data, reducing the amount of transmitted data, and increasing hardware processing speed), thereby obtaining a technical effect, in conformity with the laws of nature, of improving the internal performance of the computer system.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides electronic equipment, which comprises a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to call the instructions stored by the memory so as to execute the method.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 6 shows a block diagram of an electronic device according to an embodiment of the present disclosure. Referring to Fig. 6, an electronic device 1900 may be provided as a server or a terminal device. The electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions, such as application programs, executable by the processing component 1922. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system from Apple (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised-in-groove structure having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
The foregoing descriptions of the various embodiments emphasize the differences between the embodiments; for the parts that are the same or similar, the embodiments may refer to one another, which, for brevity, are not repeated here.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
If the technical solution of the present application involves personal information, a product applying the technical solution of the present application shall clearly inform the personal information processing rules and obtain the individual's autonomous consent before processing the personal information. If the technical solution of the present application involves sensitive personal information, a product applying the technical solution of the present application shall obtain the individual's separate consent before processing the sensitive personal information, and shall also satisfy the requirement of "explicit consent". For example, a clear and conspicuous sign may be set at a personal information collection device such as a camera to inform that the personal information collection range has been entered and that personal information will be collected; if an individual voluntarily enters the collection range, this is deemed consent to the collection of his or her personal information. Alternatively, where a personal information processing device uses conspicuous signs/information to inform of the personal information processing rules, personal authorization may be obtained through pop-up information or by asking the individual to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the manner of processing, and the types of personal information processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.