
CN118153711A - A model checkpoint saving method, device, equipment and storage medium - Google Patents

A model checkpoint saving method, device, equipment and storage medium

Info

Publication number
CN118153711A
CN118153711A (application CN202410302871.2A)
Authority
CN
China
Prior art keywords
model
parameters
storage
transmitted
target model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410302871.2A
Other languages
Chinese (zh)
Inventor
陆游游
曾少勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202410302871.2A priority Critical patent/CN118153711A/en
Publication of CN118153711A publication Critical patent/CN118153711A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)

Abstract


The present invention discloses a model checkpoint saving method, apparatus, device and storage medium. The method is applied to a training end and comprises: when it is determined that a target model checkpoint needs to be saved, transmitting the current model training parameters of the target model to a local preset storage space; in the process of transmitting the current model training parameters of the target model to the local preset storage space, starting to monitor whether a preset transmission condition is met, the monitoring continuing at least until the current model training parameters of the target model have been successfully transmitted to a storage end; when the preset transmission condition is met, determining, from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end, a group of parameters to be transmitted to the storage end; the storage end being used to save the target model checkpoint according to the received model training parameters.

Description

Model checkpoint saving method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a method, an apparatus, a device, and a storage medium for storing a model checkpoint.
Background
Training a machine learning model can take a long time, so if the training device fails, training is interrupted and the model parameters are lost. To reduce the risk of parameter loss, a model checkpointing mechanism may be employed to persist the model state at certain points in time.
Then, even if the training device fails and the model parameters are lost, they can be recovered from a saved model state, in particular from the most recently saved checkpoint, so that the original model state can be restored at a relatively low computational cost.
However, the existing model checkpoint saving mechanism requires the training device to send the current state of the model being trained to an external device for storage: the training device must first extract the model state into a local storage space, wait until the model state has been completely extracted, then copy the complete model state as a whole into a contiguous address space, serialize it, and finally send it to the external device.
The conventional model checkpoint saving method is therefore inefficient.
Disclosure of Invention
The present invention provides a model checkpoint saving method, apparatus, device and storage medium to address the above shortcomings of the related art.
According to a first aspect of an embodiment of the present invention, there is provided a model checkpoint saving method, applied to a training end, including:
when it is determined that a target model checkpoint needs to be saved, transmitting the current model training parameters of the target model to a local preset storage space;
in the process of transmitting the current model training parameters of the target model to the local preset storage space, starting to monitor whether a preset transmission condition is met, the monitoring continuing at least until the current model training parameters of the target model have been successfully transmitted to a storage end;
when the preset transmission condition is met, determining, from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end, a group of parameters to be transmitted to the storage end;
the storage end being used to save the target model checkpoint according to the received model training parameters.
According to a second aspect of an embodiment of the present invention, there is provided another model checkpoint saving method, including:
the training end transmits the current model training parameters of the target model to a local preset storage space when it is determined that a target model checkpoint needs to be saved;
in the process of transmitting the current model training parameters of the target model to the local preset storage space, the training end starts to monitor whether a preset transmission condition is met, the monitoring continuing at least until the current model training parameters of the target model have been successfully transmitted to a storage end;
when the preset transmission condition is met, the training end determines, from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end, a group of parameters to be transmitted to the storage end;
and the storage end saves the target model checkpoint according to the received model training parameters.
According to a third aspect of an embodiment of the present invention, there is provided a model checkpoint saving apparatus, applied to a training end, including:
a storage unit, used to transmit the current model training parameters of the target model to a local preset storage space when it is determined that a target model checkpoint needs to be saved;
a monitoring unit, used to start monitoring whether a preset transmission condition is met in the process of transmitting the current model training parameters of the target model to the local preset storage space, the monitoring continuing at least until the current model training parameters of the target model have been successfully transmitted to a storage end;
a transmission unit, used to determine, when the preset transmission condition is met, a group of parameters to be transmitted to the storage end from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end; the storage end being used to save the target model checkpoint according to the received model training parameters.
According to a fourth aspect of an embodiment of the present invention, there is provided a model checkpoint saving system, including a storage end and at least one training end;
any training end is used to: transmit the current model training parameters of the target model to a local preset storage space when it is determined that a target model checkpoint needs to be saved; start monitoring whether a preset transmission condition is met in the process of transmitting the current model training parameters of the target model to the local preset storage space, the monitoring continuing at least until the current model training parameters of the target model have been successfully transmitted to the storage end; and, when the preset transmission condition is met, determine a group of parameters to be transmitted to the storage end from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end;
the storage end is used to: save the target model checkpoint according to the received model training parameters.
According to a fifth aspect of an embodiment of the present invention, there is provided an electronic apparatus including:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect described above.
According to a sixth aspect of embodiments of the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect described above.
According to a seventh aspect of embodiments of the present invention, there is provided a computer program product comprising a computer program/instruction which, when executed by a processor, implements the method of the first aspect described above.
According to the above embodiments, while transmitting the model training parameters to the local preset storage space, the training end monitors a preset transmission condition and, once it is met, transmits the parameters directly to the storage end so as to save the target model checkpoint. The training end therefore does not need to wait until all model training parameters have reached the local preset storage space before transmission begins, which saves at least that waiting time and improves the saving efficiency of the model checkpoint.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a model checkpoint saving method in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating model training parameter transmission in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart illustrating another model checkpoint saving method in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a model checkpoint saving method in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of a model checkpoint saving apparatus in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of a hardware structure of a computer device for configuring a method according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the embodiments of the present invention are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide a corresponding operation entry for the user to select authorization or rejection.
Training a machine learning model can take a long time, so if the training device fails, training is interrupted and the model parameters are lost. To reduce the risk of parameter loss, a model checkpointing mechanism may be employed to persist the model state at certain points in time.
Then, even if the training device fails and the model parameters are lost, they can be recovered from a saved model state, in particular from the most recently saved checkpoint, so that the original model state can be restored at a relatively low computational cost.
However, the existing model checkpoint saving mechanism requires the training device to send the current state of the model being trained to an external device for storage: the training device must first extract the model state into a local storage space, wait until the model state has been completely extracted, then copy the complete model state as a whole into a contiguous address space, serialize it, and finally send it to the external device.
The conventional model checkpoint saving method is therefore inefficient.
To solve these problems, an embodiment of the present invention discloses a model checkpoint saving method.
In the method, in order to improve the model checkpoint storage efficiency, an existing model checkpoint storage mechanism is first analyzed.
The model state comprises a model structure and model training parameters. The model training parameters may specifically be parameters used in the model training process.
Optionally, the model training parameters may include at least one of: model tensors, model parameters, and model training optimizer parameters.
A model tensor is a multidimensional-array data structure, one of the most common data representations in neural networks. A tensor can be seen as a generalization of vectors, matrices and higher-dimensional arrays; it is a basic unit for storing and processing data and, in particular, can represent the data structure of model parameters. Here, model tensors represent model training parameters, and may include model parameters in tensor form and/or model training optimizer parameters. A model tensor may contain part of the model parameters, or part of the model training optimizer parameters.
Regarding the model training optimizer parameters: during model training, a model training optimizer (also called a model parameter optimizer) may be used to optimize how the model parameters are updated. This optimizer does not participate in model prediction after training is completed.
Of course, in some model training processes no model training optimizer is used, or the optimizer parameters are not updated; this embodiment is not limited in this respect.
Analysis of the existing model checkpoint saving mechanism shows that model training parameters are saved directly into a contiguous address space in the local storage space; that is, the model training parameters are typically maintained as a whole in a contiguous address space.
In one specific example, since model training parameters are usually in tensor form, they are typically stored as a whole in a contiguous address space.
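The contiguous layout described above is what makes the copy-free transfer possible. The following is a minimal illustrative sketch (not the patent's implementation) using the Python standard library's `array` type to stand in for a framework tensor:

```python
from array import array

# A "model tensor" is typically laid out as one contiguous block of memory.
# Here a flat parameter tensor is represented with the stdlib array module;
# a real system would use a framework tensor, this is only an illustration.
params = array('f', [0.1, -0.2, 0.3, 0.4])   # float32 parameters

# Because the buffer is already contiguous, its raw bytes can be taken
# directly, without first gathering elements into a new address space.
raw = params.tobytes()
assert len(raw) == 4 * len(params)           # 4 bytes per float32 element
```

Since `raw` is already a flat byte string, it can be handed to a transport layer as-is, which is the property the method exploits.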
Based on the characteristics, the method can be used for rapidly transmitting the model training parameters so as to improve the storage efficiency of the model check points.
Specifically, after the training device transmits the model training parameters to the local storage space, it can send them directly to the external device without an additional copy into a contiguous address space, saving the copy time and thus improving the saving efficiency of the model checkpoint.
In addition, the model training parameters are stored locally in binary form, which is already a transmittable form. They therefore need neither serialization nor conversion before transmission: the model training parameters in the local storage space are sent directly to the external device, which saves the serialization time and further improves the saving efficiency of the model checkpoint.
Furthermore, the model training parameters in the local storage space can be transmitted to the external device in a pipelined fashion: there is no need to wait until all model training parameters have been transmitted to the local storage space before forwarding them. Instead, while the model training parameters are still being transmitted to the local storage space, the part that has already arrived can be forwarded directly to the external device, and this continues until all model training parameters have been transmitted to the external device. This saves the time spent waiting for all model training parameters to reach the local storage space and improves the saving efficiency of the model checkpoint.
In summary, based on the analysis of the existing mechanism, the present method forwards the model training parameters directly to the external device in a pipelined fashion while the training device is still transmitting them to the local storage space.
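The pipelined transfer can be sketched as follows. This is an illustrative sketch only: the chunking, the queue, and the `send_to_storage` callback are assumptions, not the patent's concrete implementation.

```python
import queue
import threading

def checkpoint_pipelined(param_chunks, send_to_storage):
    """Forward each parameter chunk to the storage end as soon as it reaches
    the local staging buffer, instead of waiting for the full copy."""
    staged = queue.Queue()

    def stage():
        # Simulates copying chunks from training memory into the local
        # preset storage space.
        for chunk in param_chunks:
            staged.put(chunk)
        staged.put(None)  # sentinel: staging finished

    t = threading.Thread(target=stage)
    t.start()
    while True:
        chunk = staged.get()
        if chunk is None:
            break
        send_to_storage(chunk)  # overlaps with staging of later chunks
    t.join()

received = []
checkpoint_pipelined([b'c0', b'c1', b'c2'], received.append)
assert received == [b'c0', b'c1', b'c2']
```

The key property is that `send_to_storage` starts running while later chunks are still being staged, which is where the waiting time is saved.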
The following explains in detail a model checkpoint saving method provided by the embodiment of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a method for saving a model checkpoint in accordance with an embodiment of the present invention.
The method flow can be applied to a training end, which performs model training and saves model checkpoints to a storage end. The training end is not limited to a specific device; it may be any computing device, for example a server or a terminal.
The method flow may include the following steps.
S101: and under the condition that the target model check point is determined to be required to be saved, transmitting the current model training parameters of the target model to a local preset storage space.
S102: in the process of transmitting the current model training parameters of the target model to a local preset storage space, starting to monitor whether preset transmission conditions are met; monitoring is continued at least until current model training parameters of the target model are successfully transmitted to the storage end.
S103: under the condition that the preset transmission condition is met, determining a group of parameters to be transmitted to a storage end aiming at model training parameters which are not transmitted to the storage end in a local preset storage space; the storage end is used for: and according to the received model training parameters, storing a target model check point.
In this method flow, while transmitting the model training parameters to the local preset storage space, the training end monitors the preset transmission condition and transmits the parameters directly to the storage end once it is met. Compared with the existing model checkpoint saving mechanism, this saves at least the time spent waiting for all model training parameters to reach the local preset storage space, and thereby improves the saving efficiency of the model checkpoint.
Some additions to the above process flow are made below.
Regarding the training end and the storage end, this method flow does not limit the relationship between them, nor the specific device used as the storage end.
In an alternative embodiment, the training end saves the model checkpoints on the storage end, that is, on an external device rather than locally. As model parameter scales grow, the training end may have difficulty storing multiple versions of model checkpoints locally, so the checkpoints can be saved to an external storage device instead.
In addition, model training is nowadays commonly performed on distributed devices to increase computing power and improve training efficiency. Accordingly, the model checkpoints produced by all devices in the distributed setup can be saved on the external storage device, which makes them easier to manage and store.
Thus, multiple training ends may each save their model checkpoints on an external storage device. This method flow describes how any one training end saves a model checkpoint; the other training ends can follow the same description. Different training ends may train different models, whose model checkpoints are saved separately on the storage end.
Regarding the saving of model checkpoints, the present method flow is not limited to a specific saving mode.
The above steps S101-S103 explain the transmission of the model training parameters. When saving the model checkpoint, the model structure can additionally be saved together with the model training parameters, so that the complete model state is preserved.
This method flow does not limit the specific form of the model structure, nor the way the model structure is stored or transmitted.
In an alternative embodiment, since the model structure is typically not updated during model training, the model structure may be transmitted to the storage end only once; the storage end can then save different model checkpoints based on the fixed model structure together with different versions of the model training parameters. This embodiment reduces the amount of data to be transmitted, saves the time of repeatedly transmitting the model structure, and improves the saving efficiency of the model checkpoints.
In another alternative embodiment, the model structure may be transmitted to the storage end each time a model checkpoint needs to be saved. Since the model structure may also be updated during training, for example when convolution layers are added or removed, transmitting the model structure whenever a checkpoint needs to be saved improves the reliability of the model checkpoints.
Therefore, optionally, the storage end saving the model checkpoint according to the received model training parameters may specifically mean: the storage end saves the target model checkpoint according to the received model training parameters and the model structure of the target model transmitted by the training end.
In this embodiment, the model training parameters are combined with the model structure of the target model when saving the model checkpoint, improving the completeness of the model checkpoint.
This embodiment does not limit the specific manner in which the training end transmits the model structure of the target model to the storage end: the model structure may be transmitted to the storage end in advance, or in real time.
Optionally, the method for transmitting the model structure of the object model may include any one of the following:
1) And under the condition that the check point of the target model is required to be saved, the training end transmits the model structure of the target model to the storage end.
2) The training end transmits the model structure of the target model to the storage end in advance.
3) Under the condition that the target model check point needs to be stored for the first time, the training end transmits the model structure of the target model to the storage end.
In case 1), the training end transmits the model structure of the target model to the storage end whenever it is determined that the target model checkpoint needs to be saved; that is, the target model structure is transmitted once for every checkpoint save.
Optionally, since the training end needs to transmit both the model structure of the target model and the model training parameters to the storage end, the two may be transmitted in parallel to improve transmission efficiency, for example over different transmission connections.
Of course, serial transmission is also possible, for example transmitting the model training parameters first and then the model structure.
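The parallel option can be sketched with two threads, one per transmission connection. The function and callback names are illustrative assumptions, not the patent's API:

```python
import threading

def save_checkpoint_parallel(structure, params, send_structure, send_params):
    """Sketch of the parallel option: the model structure and the model
    training parameters travel over separate connections at the same time."""
    t1 = threading.Thread(target=send_structure, args=(structure,))
    t2 = threading.Thread(target=send_params, args=(params,))
    t1.start()
    t2.start()
    t1.join()  # the checkpoint is complete once both transfers finish
    t2.join()

out = {}
save_checkpoint_parallel('model-config', b'param-bytes',
                         lambda s: out.__setitem__('structure', s),
                         lambda p: out.__setitem__('params', p))
assert out == {'structure': 'model-config', 'params': b'param-bytes'}
```

Joining both threads before declaring the checkpoint saved mirrors the requirement that the storage end needs both the structure and the parameters.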
For case 2), where the training end transmits the model structure of the target model to the storage end in advance, this embodiment does not limit when the advance transmission happens.
Optionally, the training end may transmit the target model structure to the storage end for storage before the target model checkpoint is saved, or before model training begins.
For case 3), the training end may transmit the model structure of the target model to the storage end the first time the target model checkpoint needs to be saved.
When the target model structure is transmitted once in advance, it need not be transmitted again for later checkpoint saves: the storage end keeps the previously received target model structure and, after receiving model training parameters later, combines them with the stored structure to save the model checkpoint.
Compared with transmitting the target model structure for every checkpoint save, this reduces the amount of data to be transmitted, improves data transmission efficiency, saves the time of transmitting the target model structure, and improves the saving efficiency of the model checkpoints.
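The transmit-once behavior of cases 2) and 3) amounts to a simple send-on-first-use flag on the training side. A minimal sketch (class and message names are illustrative assumptions):

```python
class StorageClient:
    """Sketch: the model structure is sent only with the first checkpoint;
    later checkpoints send the model training parameters only."""

    def __init__(self, send):
        self._send = send              # transport callback (assumed)
        self._structure_sent = False

    def save_checkpoint(self, structure, params):
        if not self._structure_sent:
            self._send(('structure', structure))
            self._structure_sent = True
        self._send(('params', params))

log = []
client = StorageClient(log.append)
client.save_checkpoint('resnet-config', b'params-v1')
client.save_checkpoint('resnet-config', b'params-v2')
# The structure travels once; every checkpoint carries fresh parameters.
assert log == [('structure', 'resnet-config'),
               ('params', b'params-v1'),
               ('params', b'params-v2')]
```

On the storage end, the mirror of this flag is keeping the first received structure and pairing it with every later parameter version.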
In addition, the above method flow describes how the training end saves a single model checkpoint; it can be understood that the same flow can be used for saving any model checkpoint.
The steps in the above-described process flow are explained in detail below.
1. S101: and under the condition that the target model check point is determined to be required to be saved, transmitting the current model training parameters of the target model to a local preset storage space.
First, regarding the training end, this method flow does not limit how the training end performs model training. Optionally, the training end may train one or more models simultaneously, and training may take place in the graphics card or in main memory.
The target model is a model trained by the training end. This method flow does not limit its specific structure: it may be a neural network model, a convolutional model, a language model, and so on.
This method flow also does not limit the specific training setup of the target model. The target model may be trained on the training end, specifically in its memory or graphics card, and the training mode may be unsupervised training, supervised training, contrastive learning, adversarial learning, and so on.
This method flow does not limit when it is determined that the target model checkpoint needs to be saved. Optionally, this may be determined periodically, or after the model parameters have been updated a certain number of times, or at random moments during the model training process.
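One of the trigger policies mentioned above, saving after a fixed number of parameter updates, can be expressed as a one-line predicate (the interval of 100 steps is an illustrative assumption):

```python
def should_save_checkpoint(step, save_every=100):
    """Sketch of one trigger policy: determine that the target model
    checkpoint needs to be saved after every save_every parameter updates."""
    return step > 0 and step % save_every == 0

assert should_save_checkpoint(100)
assert should_save_checkpoint(200)
assert not should_save_checkpoint(99)
assert not should_save_checkpoint(0)   # no checkpoint before any update
```

A periodic (wall-clock) trigger or a random trigger would replace only this predicate; the rest of the S101-S103 flow is unchanged.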
Regarding the model training parameters, these may specifically be parameters used in the training process of the target model. Optionally, the model training parameters may include at least one of: model tensors, model parameters, and model training optimizer parameters. For detailed explanations, see above.
A model tensor may specifically be a model training parameter in tensor form; the model parameters may be parameters of the target model itself; and the model training optimizer parameters may include parameters of the parameter optimizer used in training the target model.
Regarding the transmission of the model training parameters to the local preset storage space, the method flow is not limited to a specific transmission mode or to a specific form of the local preset storage space.
For convenience of distinction, the storage space used for model training may be referred to as the training storage space; it may specifically be the training-end memory or the memory in the graphics card.
The model training parameters may be copied from the training storage space to the local preset storage space, mainly in order to reduce the influence of model checkpoint saving on the model training process.
Since model training requires updating the parameters many times, if the model training parameters were transmitted directly from the training storage space to the storage end, the training end would need to wait for the storage end's response after the transmission completed before the parameters could continue to be updated. Moreover, if a problem occurred during transmission, for example network fluctuation or a small network bandwidth leading to a slow transmission rate, the continued training of the model would be affected.
Therefore, the model training parameters are first copied locally: the parameters can continue to be updated as soon as the copy is complete, and the copied model training parameters can then be transmitted to the storage end. Compared with transmitting the model training parameters directly from the training storage space to the storage end, the parameters can resume updating sooner, and model training can continue.
In addition, after the model training parameters have been copied to the local preset storage space, they are transmitted from that space to the storage end, so that even if a problem occurs during transmission, it hardly affects the training of the model in the training storage space, thereby reducing the influence of data transmission on model training.
The local preset storage space may specifically be memory of the training end, distinguished from the training storage space. It may also be a snapshot memory of the training end, so that the model training parameters are stored in a snapshot manner.
Optionally, transmitting the current model training parameters of the target model to the local preset storage space may specifically be transmitting them to a contiguous address space in the local preset storage space, which facilitates subsequent transmission.
For ease of understanding, in a specific example, the target model may be trained in a graphics card at the training end. In the case where it is determined that a target model checkpoint needs to be saved, for example after the target model parameters have been updated 3 times, the current model training parameters of the target model may be copied to a local preset storage space, for example the local memory, and in particular into a contiguous address space in that local preset storage space.
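As a minimal sketch of S101, the local snapshot into a contiguous address space can be illustrated as follows. The parameter representation (a name-to-bytes mapping) and the layout bookkeeping are assumptions for illustration, not the specification's actual data format:

```python
def snapshot_to_local(params: dict) -> tuple[bytearray, dict]:
    """Copy current model training parameters into one contiguous
    local buffer, recording each parameter's (offset, length) so the
    buffer can later be transmitted, or sliced, without serialization."""
    buf = bytearray()
    layout = {}
    for name, data in params.items():
        layout[name] = (len(buf), len(data))
        buf.extend(data)  # contiguous address space eases later transfer
    return buf, layout
```

Once the copy completes, training can resume immediately; the buffer is what gets transmitted to the storage end.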
On the basis of the above explanation, the training end may also transmit the model structure of the target model to the storage end; the method is not limited to a specific model structure transmission flow.
Optionally, the model structure of the target model may be transmitted directly from the training storage space to the storage end, considering that the model structure of the target model is typically small in data volume and has little impact on model training.
Alternatively, the model structure of the target model may first be transmitted from the training end to the local preset storage space, or to another local storage space, and then transmitted on to the storage end.
Specifically, the model structure of the target model and the current model training parameters of the target model may be transmitted in parallel to the local preset storage space; the model structure of the target model and the current model training parameters of the target model may likewise be transmitted in parallel to the storage end.
Parallel transmission to the local preset storage space reduces the time for transmitting data locally and improves the saving efficiency of the model checkpoint; parallel transmission to the storage end reduces the time for transmitting data to the storage end and improves the transmission efficiency of the model checkpoint.
In addition, the model structure of the target model may be transmitted to a local contiguous address space to facilitate transmission; in particular, it may be transmitted into the local contiguous address space together with the model training parameters.
These embodiments of transmitting the model structure to the local preset storage space may be combined with the above embodiments of the model structure transmission mode.
In an alternative embodiment, the method for transmitting the target model structure may include any one of the following:
1) In the case where it is determined that the target model checkpoint needs to be saved, the training end transmits the model structure of the target model to the local preset storage space, and then to the storage end.
2) In the case where it is determined that the target model checkpoint needs to be saved, the training end transmits the model structure of the target model from the training storage space to the storage end.
3) The training end transmits the model structure of the target model to the local preset storage space in advance, and then to the storage end.
4) The training end transmits the model structure of the target model from the training storage space to the storage end in advance.
5) In the case where the target model checkpoint needs to be saved for the first time, the training end transmits the model structure of the target model to the local preset storage space, and then to the storage end.
The above 5 embodiments are for illustrative purposes only; combinations of the various embodiments obtainable by simple reasoning also fall within the scope of the disclosure of the embodiments of the present specification.
It can be appreciated that the transmission mode of the model structure may specifically refer to the transmission mode of the model training parameters.
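Option 5) above can be sketched as sending the structure exactly once, on the first checkpoint save. The `state` flag and `send` callback are illustrative assumptions:

```python
def maybe_send_structure(state: dict, structure_bytes: bytes, send) -> bool:
    """One of the listed options: transmit the model structure only the
    first time a checkpoint of this model is saved; later checkpoints
    reuse the structure already held by the storage end."""
    if not state.get("structure_sent"):
        send(structure_bytes)          # structure is small, so this is cheap
        state["structure_sent"] = True
        return True
    return False                       # structure already at the storage end
```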
In an alternative embodiment, the model training parameters of the target model may be all the model training parameters of the target model, which facilitates the subsequent saving of model checkpoints.
Considering that, between saved model checkpoints, some model training parameters are updated while others may remain unchanged, only the updated model training parameters may be transmitted; the model training parameters of the model checkpoint that currently needs to be saved can then be determined based on the model training parameters in the last saved model checkpoint.
Therefore, optionally, in the case where the target model checkpoint is not being saved for the first time, the updated model training parameters generated between the current model training parameters of the target model and the model training parameters in the last saved model checkpoint are transmitted to the local preset storage space, and the subsequent steps are then executed, so that the storage end saves the model checkpoint according to the received model training parameters and the last saved model checkpoint.
By transmitting only the updated model training parameters, this embodiment can reduce the amount of transmitted data and improve the data transmission efficiency and the saving efficiency of model checkpoints.
In an optional embodiment, the method flow may further include: in the process of updating the model training parameters of the target model, determining the updated model training parameters.
Since the training end performs the model training and updates the model parameters, it can directly determine which model training parameters were updated.
Optionally, the determined model training parameters may be stored directly in the local preset storage space as the current model training parameters; alternatively, a judgment may be made according to the data amount of the determined model training parameters.
If the data amount of the determined model training parameters is large, the positions of the updated model training parameters would also need to be recorded, so the full set of model training parameters may instead be saved directly to the local preset storage space in the original manner.
If the data amount of the determined model training parameters is small, the updated model training parameters may be saved directly.
Therefore, optionally, transmitting the current model training parameters of the target model to the local preset storage space may specifically be: in the case where the data amount of the updated model training parameters is smaller than an update data amount threshold, transmitting the currently updated model training parameters of the target model to the local preset storage space.
The present embodiment does not limit the specific size of the update data amount threshold. For example, the update data amount threshold may be 30% of the full data amount of the model training parameters, and so on.
The manner in which the determined model training parameters are stored is not limited in this embodiment. The model training parameters may be stored in tensor form, with the model training parameters that were not updated stored using a specified mark; the determined model training parameters may also be stored in correspondence with the positions of those parameters in the target model, so that the updated model training parameters can conveniently be determined later.
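The incremental-versus-full decision above can be sketched as follows. The name-to-bytes parameter representation and the 30% default threshold are illustrative assumptions taken from the example in the text:

```python
def select_params_to_send(current: dict, last_saved: dict,
                          threshold_ratio: float = 0.3) -> dict:
    """Pick only the updated parameters when the delta is small; fall
    back to a full snapshot when the delta is large enough that
    recording updated positions is not worthwhile."""
    updated = {k: v for k, v in current.items() if last_saved.get(k) != v}
    full_size = sum(len(v) for v in current.values())
    delta_size = sum(len(v) for v in updated.values())
    if delta_size < threshold_ratio * full_size:
        return updated       # incremental save: storage end merges with last checkpoint
    return dict(current)     # full save in the original manner
```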
Correspondingly, optionally, the storage end saving the model checkpoint according to the received model training parameters may specifically be: the storage end saves the model checkpoint according to the received model training parameters and the last saved model checkpoint. The storage end may specifically be configured to save the target model checkpoint according to the received model training parameters and the last saved target model checkpoint.
By transmitting only the updated model training parameters, this embodiment can reduce the amount of transmitted data and improve the data transmission efficiency and the saving efficiency of model checkpoints.
2. S102: in the process of transmitting the current model training parameters of the target model to a local preset storage space, starting to monitor whether preset transmission conditions are met; monitoring is continued at least until current model training parameters of the target model are successfully transmitted to the storage end.
3. S103: under the condition that the preset transmission condition is met, determining a group of parameters to be transmitted to a storage end aiming at model training parameters which are not transmitted to the storage end in a local preset storage space; the storage end is used for: and according to the received model training parameters, storing a target model check point.
Since S102 and S103 are closely associated, these 2 steps are explained together.
In the process of transmitting the current model training parameters to the local preset storage space in S101, S102 and S103 may also be performed.
It should be noted that S101 to S103 are not performed in a fixed order: S102 and S103 may be executed while S101 is being executed, S102 and S103 may be performed multiple times, and the monitoring in S102 is a continuously performed operation.
In the process of transmitting the current model training parameters in S101, the current model training parameters may be transmitted to the local preset storage space through data transmission within the training end, so as to save them to the local preset storage space.
During this transmission, monitoring of whether the preset transmission condition is satisfied can begin. If the preset transmission condition is not satisfied, monitoring continues; if it is satisfied, data transmission is required, so S103 can be performed.
It should be emphasized that steps S102 and S103 implement pipelined data transmission, so that the model training parameters can be transmitted faster: there is no need to wait for the model training parameters to be fully transmitted to the local preset storage space; instead, while the model training parameters are being transmitted locally, those already in the local preset storage space are transmitted to the storage end in real time.
This improves the transmission efficiency of the model training parameters and the saving efficiency of model checkpoints.
The storage end may save the target model checkpoint in real time according to the model training parameters received each time; alternatively, it may be configured to save the target model checkpoint once the full set of model training parameters has been received, that is, when receiving the model training parameters is complete.
With respect to monitoring, the method flow is not limited to a specific monitoring mode.
Optionally, a periodic monitoring mode may be adopted, or monitoring may be implemented as a loop in which each iteration determines whether the preset transmission condition is satisfied.
The method flow is not limited to a specific monitoring object; monitoring may be performed according to the specific preset transmission condition.
The method flow is also not limited to the time at which monitoring starts; monitoring may start at any point during the transmission of the current model training parameters of the target model to the local preset storage space.
Optionally, monitoring of whether the preset transmission condition is satisfied may start at the moment the current model training parameters of the target model begin to be transmitted to the local preset storage space; or a specified duration after that transmission begins; or at the moment a specified data amount of the current model training parameters of the target model has been transmitted to the local preset storage space.
In this embodiment, by starting to monitor whether the preset transmission condition is satisfied during the local transmission of the model training parameters, and then executing S103, the model training parameters in the local preset storage space can be transmitted to the storage end without waiting for all model training parameters to be transmitted and stored locally first, thereby reducing the transmission time of the model training parameters, improving their transmission efficiency, and improving the saving efficiency of model checkpoints.
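The overlap of S101 with S102/S103 can be sketched as a producer/consumer pair: one thread performs the local copy while a monitoring loop forwards whatever has already landed locally. The queue-per-parameter granularity and the `send` callback are simplifying assumptions for illustration:

```python
import queue
import threading

def pipelined_checkpoint(params: dict, send) -> list:
    """Overlap the local copy (S101) with transmission to the storage
    end (S102/S103): each parameter is forwarded as soon as it lands in
    the local staging area, instead of after the full copy completes."""
    staged = queue.Queue()

    def copy_locally():
        for name, data in params.items():
            staged.put((name, bytes(data)))  # local snapshot of one parameter
        staged.put(None)                     # sentinel: local copy finished

    t = threading.Thread(target=copy_locally)
    t.start()
    sent = []
    while True:                              # monitoring loop (S102)
        item = staged.get()
        if item is None:
            break                            # everything copied and forwarded
        send(item)                           # S103: forward to the storage end
        sent.append(item[0])
    t.join()
    return sent
```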
During continuous monitoring, the preset transmission condition may be satisfied multiple times, so S103 may be triggered and executed multiple times, each time determining parameters to be transmitted and transmitting them to the storage end.
Therefore, optionally, S103 may specifically be: each time it is monitored that the preset transmission condition is satisfied, trigger the following step: determining, from the model training parameters in the local preset storage space that have not been transmitted to the storage end, a group of parameters to be transmitted.
Optionally, S103 may specifically be: in the case where it is monitored at any time that the preset transmission condition is satisfied, trigger the same step.
The method flow is not limited to the specific case in which monitoring stops. Optionally, monitoring may stop when the current model training parameters have all been transmitted to the storage end; specifically, monitoring may continue until the current model training parameters of the target model have been successfully transmitted to the storage end, or until all of them have been successfully transmitted. Monitoring may also stop when the storage end has successfully saved the target model checkpoint; specifically, monitoring may continue until the storage end successfully saves the target model checkpoint. Since successfully saving the target model checkpoint requires all the current model training parameters of the target model, the storage end can only save the checkpoint successfully after all the current model training parameters have been successfully transmitted to it.
Optionally, monitoring may stop in the case where the current model training parameters of the target model have all been transmitted to the storage end, and likewise in the case where the storage end has successfully saved the target model checkpoint.
It should be noted that, during the continuous monitoring of whether the preset transmission condition is satisfied, the condition may be satisfied multiple times, triggering S103 multiple times, each time determining parameters to be transmitted and transmitting them to the storage end. This implements pipelined transmission, reduces the transmission time of the model training parameters, improves their transmission efficiency, and improves the saving efficiency of model checkpoints.
Regarding the preset transmission condition, the method flow is not limited to a specific preset transmission condition.
Optionally, the preset transmission condition may be whether there are model training parameters in the local preset storage space that have not been transmitted to the storage end. The model training parameters determined as parameters to be transmitted may then be transmitted to the storage end based on step S103 above. The model training parameters not yet transmitted to the storage end may include model training parameters newly transmitted to the local preset storage space and model training parameters whose transmission to the storage end failed.
If it is monitored that there are model training parameters in the local preset storage space that have not been transmitted to the storage end, S103 may be executed.
Of course, the preset transmission condition may be another condition. For example, S103 may be performed periodically; monitoring may be performed according to a duration, such that if no parameters to be transmitted have been determined for a preset duration, S103 is performed; or a judgment may be made according to the data amount of the model training parameters not yet transmitted to the storage end.
Different specific preset transmission conditions may also be combined; this embodiment is not limited in this respect.
Thus, optionally, the preset transmission condition may include at least one of:
1) In the local preset storage space, the data amount of the model training parameters not yet transmitted to the storage end is larger than a first preset data amount.
2) No parameters to be transmitted have been determined for a preset duration.
3) Since the parameters to be transmitted were last determined, the data amount of model training parameters newly added to the local preset storage space is larger than a second preset data amount.
The specific values of the first preset data amount, the preset duration and the second preset data amount are not limited in this embodiment.
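The three conditions above can be sketched as a single predicate, any clause of which triggers S103. The default thresholds (40MB, 1 second) are illustrative assumptions, since the text leaves the specific values open:

```python
import time

def should_transmit(untransmitted_bytes: int, new_bytes_since_last: int,
                    last_pick_time: float, *,
                    min_bytes: int = 40 << 20,      # first preset data amount (assumed)
                    min_new_bytes: int = 40 << 20,  # second preset data amount (assumed)
                    max_wait_s: float = 1.0) -> bool:
    """Preset transmission condition: any one of the three clauses
    (backlog size, elapsed time, newly added data) suffices."""
    return (untransmitted_bytes > min_bytes
            or time.monotonic() - last_pick_time > max_wait_s
            or new_bytes_since_last > min_new_bytes)
```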
It should be noted that, after part of the model training parameters in the local preset storage space have been determined as parameters to be transmitted, they may then be transmitted to the storage end.
The model training parameters not yet transmitted to the storage end, which may be used when monitoring whether the preset transmission condition is satisfied, may include model training parameters newly transmitted to the local preset storage space and model training parameters whose transmission to the storage end failed.
This embodiment is not limited to a specific manner of determining the model training parameters not yet transmitted to the storage end. Optionally, after model training parameters are transmitted to the storage end, the model training parameters successfully transmitted and those whose transmission failed can be determined according to the response message returned by the storage end.
Moreover, newly stored model training parameters may be directly determined as not yet transmitted to the storage end, or model training parameters that have not been determined as parameters to be transmitted may be determined as model training parameters not yet transmitted to the storage end.
Optionally, the model training parameters not transmitted to the storage end, that is, those not successfully transmitted to the storage end, may include at least one of the following: 1) model training parameters not yet determined as parameters to be transmitted; 2) parameters to be transmitted whose transmission to the storage end failed; 3) model training parameters newly transmitted to the local preset storage space.
By setting specific preset transmission conditions, this embodiment improves the saving efficiency of model checkpoints.
In the case where the preset transmission condition is satisfied, the parameters to be transmitted can be determined and transmitted to the storage end.
Specifically, the parameters to be transmitted should be determined from the model training parameters currently stored in the local preset storage space that have not yet been transmitted to the storage end.
Regarding the parameters to be transmitted, the method is not limited to a specific manner of determining them.
Optionally, all model training parameters not yet transmitted to the storage end may be determined as the parameters to be transmitted; alternatively, a subset of the model training parameters not yet transmitted to the storage end may be determined as the parameters to be transmitted.
Thus, optionally, the method of determining the parameters to be transmitted may include any one of the following:
1) From the model training parameters in the local preset storage space not yet transmitted to the storage end, select model training parameters of a fixed data amount and determine them as a group of parameters to be transmitted.
2) Determine all model training parameters in the local preset storage space not yet transmitted to the storage end as a group of parameters to be transmitted.
The present embodiment is not limited to a specific size of the fixed data amount, which may specifically be 40MB or 30MB.
The fixed data amount may be the same as the first preset data amount or the second preset data amount.
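Option 1) above, grouping untransmitted parameters up to a fixed data amount, can be sketched as follows. Representing the untransmitted set as an ordered list of `(name, bytes)` pairs is an illustrative assumption:

```python
def next_group(untransmitted: list, fixed_bytes: int = 40 << 20) -> list:
    """Take model training parameters from the untransmitted set until
    the fixed data amount is reached, forming one group of parameters
    to be transmitted; the remainder stays for later groups."""
    group, size = [], 0
    while untransmitted and size < fixed_bytes:
        name, data = untransmitted.pop(0)
        group.append((name, data))
        size += len(data)
    return group
```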
In an alternative embodiment, since there may be a case where the preset transmission condition is satisfied multiple times, a set of parameters to be transmitted may be determined multiple times and transmitted to the storage end.
The embodiment is not limited to the transmission cases of the parameters to be transmitted in different groups, and may be serial transmission or parallel transmission.
The parameters to be transmitted in different groups can be transmitted to the storage end in parallel without mutual interference.
Specifically, since the next group of parameters to be transmitted may already have been determined while a previous group is still being transmitted to the storage end, transmitting different groups of parameters in parallel reduces the waiting time of the next group and improves the transmission efficiency of the model training parameters.
The specific parallel transmission mode is not limited; different transmission connections may be used to transmit different groups of parameters to be transmitted.
Optionally, determining a group of parameters to be transmitted and transmitting them to the storage end may specifically be: determining a group of parameters to be transmitted, determining an idle transmission connection with the storage end, and transmitting the determined group of parameters to be transmitted over the determined transmission connection, wherein different groups of parameters to be transmitted are transmitted to the storage end in parallel.
The present embodiment is not limited to a specific manner of determining an idle transmission connection.
Optionally, the determined transmission connection may also be a newly established transmission connection with the storage end.
By transmitting different groups of parameters to be transmitted to the storage end in parallel, this embodiment can improve the transmission efficiency of the model training parameters and the saving efficiency of model checkpoints.
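The parallel transmission of groups over multiple connections can be sketched with a thread pool standing in for the pool of transmission connections; collecting failed groups for retry reflects the earlier note that transmission failures are detected from the storage end's response. The pool size and `send_one` callback are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def send_groups_in_parallel(groups: list, send_one, max_connections: int = 4) -> list:
    """Transmit different groups of parameters to the storage end in
    parallel, each on its own (here simulated) connection; groups whose
    transmission raised are reported back so they can be retried."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        futures = {pool.submit(send_one, g): g for g in groups}
    # the with-block waits for all transmissions to finish
    return [g for f, g in futures.items() if f.exception() is not None]
```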
With respect to the storage end, the method flow is not limited to a specific way of saving the model checkpoint.
Optionally, the storage end may save the target model checkpoint according to all the received model training parameters of the target model; it may also integrate the current model training parameters of the target model with the model structure of the target model and save the target model checkpoint.
In an alternative embodiment, since the storage end may save multiple model checkpoints, an index may be constructed to facilitate querying model checkpoints.
Optionally, the storage end may further be configured to: store, in association with the storage address information of the saved target model checkpoint, at least one of: the identification information of the target model and the current version information of the saved target model checkpoint.
The present embodiment is not limited to a specific form of the identification information of the target model.
Optionally, the identification information of the target model may be used to uniquely identify the target model, and further to uniquely identify the target model on the training end.
Since the storage end can save model checkpoints of multiple models, identification information of the models can be set, which may specifically be the number of the model.
In addition, training may be distributed across multiple devices, so identification information of the training end can also be set, which may specifically be the number of the training-end device.
Thus, optionally, the identification information of the target model may include at least one of: the identifier of the training-end device and the identifier of the target model itself.
For the target model, the storage end may save model checkpoints of different versions, so version information can be set to distinguish different model checkpoints and uniquely identify each one.
Of course, the identification information of the target model may specifically be a result obtained by hash algorithm processing, namely the result of applying a hash algorithm to at least one of the following: the identifier of the training-end device and the identifier of the target model itself.
The identification information of the target model can thus be stored in association with the storage address information of the model checkpoint, so that the training end can conveniently query later; specifically, the target model checkpoint can be queried according to the identification information of the target model.
The identification information of the target model and the current version information of the saved model checkpoint may also be stored in association with the storage address information of the model checkpoint, so that the training end can later query according to both the identification information of the target model and the version information of the model checkpoint.
This embodiment is likewise not limited to the specific form of the associated storage. Optionally, the association may be stored in table form, in key-value pair form, in index form, or the like.
Thus, optionally, the storage side may also be configured to: under the condition that the index containing the identification information of the target model is not queried, creating an index containing the identification information of the target model, wherein the created index also contains a linked list, and newly adding and storing the storage address information of the stored target model check point to the tail part of the linked list in the created index; and under the condition that the index containing the identification information of the target model is inquired, the storage address information of the stored target model check point is newly added and stored to the tail part of the linked list in the inquired index.
The created index can contain model identification information and a linked list for storing model checkpoint storage address information.
The interpretation of the model identification information may be found in the above explanation. Version information of the model checkpoints can be not used in the index, and the context of the version can be indicated through the context of the nodes in the linked list. Of course, version information of the model checkpoint may be used in the index, and specifically, version information of the model checkpoint may be newly added to the end of the linked list stored in the index together with the storage address information.
And the single model identification information can be uniquely corresponding to one index or a linked list, and a plurality of model check point information of the model is stored.
Optionally, the storage address information of the saved model checkpoint is newly added to the tail part of the linked list in the index, specifically, a node is newly added to the tail part of the linked list, and the storage address information of the saved model checkpoint is stored.
By adding nodes at the tail of the linked list to store model checkpoint information, the ordering of model checkpoints, that is, the old-to-new relation of checkpoint versions, is made explicit, which facilitates subsequent queries.
For example, if the training end needs to query the most recently saved model checkpoint, it can be retrieved from the node at the tail of the linked list.
The storage address information about the model checkpoint may be storage address information allocated to the model checkpoint by the storage end when the model checkpoint is stored.
Optionally, a block address allocator may be included in the storage end, so that block addresses can be allocated to saved model checkpoints for storage and subsequent querying.
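For ease of understanding, the linked-list index described above can be sketched as follows. This is a minimal illustration in Python; the class and method names are assumptions of this sketch, and a plain list stands in for the linked list (appending at the tail keeps versions in old-to-new order):

```python
class CheckpointIndex:
    """Illustrative sketch of the storage end's per-model checkpoint index."""

    def __init__(self):
        # model_id -> list of (version, block_address); the list plays the
        # role of the linked list, with the newest checkpoint at the tail.
        self._index = {}

    def record(self, model_id, version, block_address):
        # Create the index entry on first use (the "index not found" case),
        # then append at the tail, preserving old-to-new version order.
        self._index.setdefault(model_id, []).append((version, block_address))

    def latest(self, model_id):
        # The most recently saved checkpoint is always at the tail.
        chain = self._index.get(model_id)
        return chain[-1] if chain else None
```

A query for the latest checkpoint then simply reads the tail node, as described above.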
For ease of understanding, the present description embodiments also provide a specific example of transmitting model training parameters.
As shown in fig. 2, fig. 2 is a schematic diagram illustrating a model training parameter transmission according to an embodiment of the present invention.
Wherein the training process of the target model may cycle through the steps of forward propagation, backward propagation and parameter updating.
After one parameter update, it may be determined that the model checkpoint needs to be saved, so that saving of the model checkpoint may begin.
For ease of comparison, fig. 2 also shows the conventional saving method: the target model is first stored locally, and only after the local store completes is the model serialized and then transmitted to the storage end. Fig. 2 uses the simplified labels "store to local", "serialize" and "transmit to storage end".
As can be seen, conventional preservation methods are time consuming and inefficient.
The pipelined saving method of this embodiment, i.e. method flow S101-S103, stores the current model training parameters of the target model into a local preset storage space (the "memory" in fig. 2) and, while the store is still in progress, can directly begin transmitting the already-stored portion of the model training parameters (the "transmit" in fig. 2) without serialization, thereby achieving pipelined transmission. Specifically, transmission during the store is driven by monitoring whether the preset transmission condition is satisfied. Fig. 2 uses the simplified labels "store" and "transmit". For the specific steps, see the explanation of method flows S101-S103 above.
Compared with the conventional saving method, the pipelined saving method reduces the time required to transmit the data and improves the saving efficiency of model checkpoints.
Of course, fig. 2 is for illustrative purposes only and does not limit the scope of the disclosure herein.
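The pipelined "store while transmitting" idea of fig. 2 can be sketched roughly as follows. This is an illustrative simplification: byte-string chunks stand in for model training parameters, `send` stands in for the network transfer to the storage end, and the tiny threshold is chosen only so the example is easy to follow:

```python
import queue
import threading

def pipelined_save(param_chunks, send, threshold_bytes=4):
    """Copy parameter chunks into a local buffer while a background
    thread transmits them as soon as enough untransmitted data piles up."""
    buffer = queue.Queue()

    def transmitter():
        pending = []
        while True:
            chunk = buffer.get()
            if chunk is None:                  # sentinel: copying finished
                if pending:
                    send(b"".join(pending))    # flush the remainder
                break
            pending.append(chunk)
            # Preset transmission condition: enough untransmitted bytes.
            if sum(len(c) for c in pending) >= threshold_bytes:
                send(b"".join(pending))
                pending = []

    t = threading.Thread(target=transmitter)
    t.start()
    for chunk in param_chunks:                 # "store to local memory"
        buffer.put(chunk)
    buffer.put(None)
    t.join()
```

Transmission thus overlaps the local store instead of waiting for it to finish.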
For ease of understanding, the present description embodiments also provide another model checkpoint preservation method.
FIG. 3 is a flow chart illustrating another method of model checkpoint preservation according to an embodiment of the present invention.
The method flow can be described from both sides of the training end and the storage end.
The method flow may include the following steps.
S201: and the training end transmits the current model training parameters of the target model to a local preset storage space under the condition that the target model check point is required to be stored.
S202: the training end starts to monitor whether preset transmission conditions are met or not in the process of transmitting current model training parameters of the target model to a local preset storage space; monitoring is continued at least until current model training parameters of the target model are successfully transmitted to the storage end.
S203: under the condition that the preset transmission condition is met, the training end determines a group of parameters to be transmitted to the storage end according to the model training parameters which are not transmitted to the storage end in the local preset storage space.
S204: and the storage end stores the target model check point according to the received model training parameters.
In this method flow, by monitoring the preset transmission condition while model training parameters are being transmitted to the local preset storage space, the training end transmits model training parameters directly to the storage end. Compared with the existing model checkpoint saving mechanism, this saves at least the time spent waiting for all model training parameters to be transmitted to the local preset storage space, and improves the saving efficiency of the model checkpoint.
It should be noted that S201 to S204 need not be performed strictly in numerical order: S202 and S203 may be performed while S201 is in progress, and S202 and S203 may be performed multiple times.
Optionally, the storage end stores the target model check point according to the received model training parameters, which may specifically be: and storing a target model check point according to the received model training parameters and the model structure of the target model transmitted by the training end.
Optionally, the method for transmitting the model structure of the object model may include any one of the following:
1) And under the condition that the check point of the target model is required to be saved, the training end transmits the model structure of the target model to the storage end.
2) The training end transmits the model structure of the target model to the storage end in advance.
3) Under the condition that the target model check point needs to be stored for the first time, the training end transmits the model structure of the target model to the storage end.
Optionally, determining a set of parameters to be transmitted to the storage end may specifically be: determining a group of parameters to be transmitted, determining idle transmission connection with a storage end, and transmitting the determined group of parameters to be transmitted through the determined transmission connection; wherein, different groups of parameters to be transmitted are transmitted to the storage end in parallel.
Optionally, the model training parameters may include at least one of: model tensors, model parameters, and model training optimizer parameters.
Optionally, the preset transmission condition may include at least one of:
1) In the local preset storage space, the data quantity of the model training parameters which are not transmitted to the storage end currently is larger than the first preset data quantity.
2) And a group of parameters to be transmitted is not determined for a preset duration.
3) And after the last time a group of parameters to be transmitted are determined, the data quantity of the newly added model training parameters in the local preset storage space is larger than the second preset data quantity.
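The three preset transmission conditions above can be combined as a simple disjunction, as sketched below. All threshold values here are illustrative placeholders; the embodiment does not fix concrete values for the first and second preset data quantities or the preset duration:

```python
import time

def transmission_due(untransmitted_bytes, new_bytes_since_last,
                     last_group_time, *, first_limit=64 << 20,
                     second_limit=16 << 20, max_idle_s=0.05, now=None):
    """Return True when any of the three preset conditions holds."""
    now = time.monotonic() if now is None else now
    return (
        untransmitted_bytes > first_limit            # 1) backlog too large
        or (now - last_group_time) >= max_idle_s     # 2) idle too long
        or new_bytes_since_last > second_limit       # 3) too much new data
    )
```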
Optionally, the method for determining the parameters to be transmitted may include any one of the following:
1) The training end selects, from the model training parameters in the local preset storage space that have not been transmitted to the storage end, model training parameters of a fixed data volume, and determines them as a set of parameters to be transmitted.
2) The training end determines model training parameters which are not transmitted to the storage end in a local preset storage space as a group of parameters to be transmitted.
Optionally, the storage end stores at least one of the following in association with the storage address information of the saved target model checkpoint: identification information of the target model and current version information of the saved target model checkpoint.
Optionally, when the storage end does not find an index containing the identification information of the target model, it creates an index containing the identification information of the target model, the created index also containing a linked list, and appends the storage address information of the saved target model checkpoint to the tail of the linked list in the created index; when an index containing the identification information of the target model is found, the storage address information of the saved target model checkpoint is appended to the tail of the linked list in the found index.
Optionally, the method flow further includes: the training end determines updated model training parameters in the process of updating the model training parameters of the target model;
The current model training parameters of the target model are transmitted to a local preset storage space, which can be specifically: transmitting the model training parameters of the target model, which are updated currently, to a local preset storage space under the condition that the updated model training parameter data quantity is determined to be smaller than an updated data quantity threshold value;
the storage end can store the target model check point according to the received model training parameters and the target model check point stored last time.
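The incremental path above, where the training end sends only the updated parameters and the storage end merges them with the last saved checkpoint, can be sketched as follows. Plain dicts stand in for the parameter sets, and the threshold policy is an assumption of this sketch:

```python
def merge_incremental(previous_checkpoint, updated_params):
    """Storage-side merge: overlay the received updated parameters on the
    previously saved checkpoint to form the new full checkpoint."""
    merged = dict(previous_checkpoint)   # start from the last checkpoint
    merged.update(updated_params)        # apply only the changed entries
    return merged

def should_send_incremental(updated_params, full_size, ratio=0.5):
    """Training-side decision: send only the delta when it is small enough.

    `ratio` stands in for the 'updated data amount threshold'; the actual
    threshold policy is not specified by the embodiment.
    """
    return len(updated_params) < full_size * ratio
```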
The explanation of the flow of the method can be found in the explanation of the flow S101-S103 of the method described above.
For easy understanding, the embodiment of the invention also provides an application embodiment.
During training of a machine learning model, training may be interrupted by software or hardware faults, causing model parameters to be lost.
To prevent such parameter loss, the checkpointing mechanism may persist model state at some point in time into the persistent medium to prevent the need to train the model from scratch.
In the event of failure, training can be restored from the closest checkpoint, so that less computational effort is required to restore to the pre-crash model state.
In addition, checkpoints can also be used to analyze historical training data of the model, to better study model convergence and model interpretation.
Checkpointing of a model is divided into two steps.
First, the model state is transferred from the GPU's video memory to host memory; this process is called the snapshot stage.
The snapshot of the model in memory is then transferred over the network to a remote storage medium, a process known as the persistence phase.
The existing model state saving method saves the whole model state into a persistent medium. Because the model state comprises complex data structures, this usually involves software serialization: the state is first serialized in host memory and then persisted. The remote storage system provides a file interface, and each checkpoint is saved in the form of multiple files.
The serialization process and the remote storage system incur additional software overhead because existing checkpoint saving mechanisms cannot perceive the tensors in the model. Specifically, on the one hand, the snapshot stage and the persistence stage can only proceed sequentially, which lengthens the time to save a checkpoint; on the other hand, the remote storage system is a general-purpose file system, whose complex directory-tree maintenance and address-space management are redundant for checkpoint saving and harm system performance.
The embodiment provides a method for storing a machine learning model check point of tensor perception.
In this method, when the machine learning model saves a checkpoint during training, the checkpoint can be written into a remote storage system for later checkpoint recovery and reading. Saving a checkpoint is divided into a snapshot stage and a persistence stage. In the snapshot stage, tensors in the model are extracted and transmitted without serialization and without extra copies (zero-serialization, zero-copy); in particular, pipelined transmission can be used, with the extracted partial tensors of the model transmitted to the remote end in real time while tensor extraction is still in progress. In the persistence stage, checkpoints are persisted in the remote storage medium.
The remote end comprises a tensor storage pool; the two-layer hash index in the tensor storage pool is designed around tensors to meet their unique range-lookup requirement. The snapshot stage and the persistence stage can take the form of a pipeline: once the snapshot (i.e. the model tensors) transferred into memory reaches a certain size, it can be asynchronously persisted without serializing the tensors and written to a remote flash array via a network storage protocol.
The method aims at tensor design, simplifies the design of hash index, and improves the access performance of check points, thereby improving the efficiency of machine learning training.
The method specifically comprises the following steps:
When the machine learning model saves a checkpoint during training, the checkpoint is written into a remote storage system for later checkpoint recovery and reading. Saving a checkpoint is divided into a snapshot stage and a persistence stage: in the snapshot stage, tensors in the model are extracted and transmitted without serialization and without extra copies; in the persistence stage, checkpoints are persisted in the remote storage medium.
Further, model checkpoints are snapshots of model states, including model structures and tensors. Wherein the model structure defines the structure and hierarchy of the model and the tensor includes model parameters and the weights of the optimizer.
Further, the two-tier hash index of the remote tensor storage pool is designed for tensors to meet its unique range lookup requirements. The range lookup requirement of checkpoints is to retrieve the checkpoint closest to a version number within a given interval.
Further, the snapshot stage and the persistence stage can take the form of a pipeline, because a snapshot that has been transferred into memory and reached a certain size can be asynchronously persisted without serializing the tensors, written to a remote flash array via a network storage protocol.
Further, the method aims at tensor design, simplifies the design of hash indexes, and improves the access performance of check points, thereby improving the efficiency of machine learning training.
Further, the remote tensor storage pool includes two layers of hash index and block address allocators, the former providing a global index view and the latter providing allocation of remote block addresses.
Further, checkpoints are persistently stored in the flash array, so they are not lost on power failure or machine crash.
According to the tensor-aware machine learning model check point preservation method, when the machine learning model trains and preserves check points, the check points are written into a remote storage system and used for later recovery and reading of the check points. The method aims at tensor design, simplifies the design of hash index, and improves the access performance of check points, thereby improving the efficiency of machine learning training.
For ease of understanding, as shown in fig. 4, fig. 4 is a schematic diagram of a model checkpoint preservation method according to an embodiment of the present invention.
As shown in fig. 4, the local may be a computing server, i.e., a training end. The remote end may be a storage server, i.e. a storage end.
The compute node server extracts the tensor from the model state and transmits the tensor in parallel with the model structure to the host-side memory.
Tensors need no serialization because their addresses are contiguous.
Tensor snapshots in memory can be pipelined for persistence.
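The point that a contiguous tensor needs no serialization can be illustrated as follows: its raw bytes can be viewed and shipped directly, with no pickling pass. Here `array.array` stands in for a framework tensor with a contiguous buffer; the function names are assumptions of this sketch:

```python
import array

def tensor_to_wire(tensor):
    """A contiguous tensor can be shipped as its raw bytes directly;
    cast("B") yields a zero-copy byte view over the same buffer."""
    return memoryview(tensor).cast("B")

def wire_to_tensor(raw, typecode="f"):
    """Rebuild the tensor on the receiving side from the raw bytes."""
    t = array.array(typecode)
    t.frombytes(raw)
    return t
```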
The remote tensor storage pool is responsible for persistently storing tensors in a persistent medium, and includes two parts, a two-layer hash index and a block address allocator.
The first layer of the two-layer hash index is a hash table whose keys are the model number and process number and whose values are linked lists of historical iteration information; the second layer is such a linked list, recording historical iteration information, organized in increasing iteration order (from old to new).
The block address allocator is responsible for allocating, for a tensor of a given size, the flash array addresses at which its slices are stored.
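Putting the two parts together, a minimal sketch of the remote tensor storage pool might look as follows. The bump-pointer allocation policy, the tuple layout of the linked-list nodes and the "closest version" lookup below are assumptions of this sketch, intended only to illustrate the two-layer index plus block allocator described above:

```python
class TensorStorePool:
    """Sketch of the remote tensor pool: a first-layer hash table keyed by
    (model id, process id), second-layer per-model iteration lists playing
    the role of the linked list, plus a trivial block address allocator."""

    def __init__(self):
        self._table = {}      # (model_id, proc_id) -> [(iteration, addr, size)]
        self._next_block = 0  # bump-pointer block address allocator

    def allocate(self, size):
        addr = self._next_block
        self._next_block += size
        return addr

    def put(self, model_id, proc_id, iteration, size):
        addr = self.allocate(size)
        chain = self._table.setdefault((model_id, proc_id), [])
        chain.append((iteration, addr, size))   # newest at the tail
        return addr

    def closest_at_or_below(self, model_id, proc_id, version):
        """Range lookup: newest checkpoint whose iteration <= version."""
        chain = self._table.get((model_id, proc_id), [])
        hits = [entry for entry in chain if entry[0] <= version]
        return hits[-1] if hits else None
```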
This embodiment is optimized in terms of reduced checkpoint save time. Checkpoints include snapshot and persistence phases.
Conventional approaches require waiting for the snapshot stage to end before the persistence stage can begin, because the snapshot stage involves a serialization operation on the checkpoint.
With this embodiment, the snapshot and persistence stages can adopt a pipeline form: tensor-aware checkpointing extracts the tensors from the complex model state, and because tensors are stored in a contiguous address space, no serialization operation is needed and tensors can be asynchronously persisted once a certain size (e.g. 40 MB) has accumulated in memory. Pipelining the snapshot and persistence processes shortens the time to save a checkpoint.
The various technical features of the above embodiments may be arbitrarily combined as long as there is no conflict or contradiction between the features, but are not described in detail, and therefore, the arbitrary combination of the various technical features of the above embodiments is also within the scope of the present disclosure.
Corresponding to the method embodiment, the embodiment of the invention also provides a device embodiment.
Fig. 5 is a schematic structural view of a model checkpoint saving device according to an embodiment of the present invention. The device can be applied to a training end.
The apparatus may include the following units.
The storage unit 301 is configured to transmit, when it is determined that the target model checkpoint needs to be stored, current model training parameters of the target model to a local preset storage space.
A monitoring unit 302, configured to start monitoring whether a preset transmission condition is satisfied in a process of transmitting a current model training parameter of a target model to a local preset storage space; monitoring is continued at least until current model training parameters of the target model are successfully transmitted to the storage end.
The transmission unit 303 is configured to determine, for the model training parameters that are not transmitted to the storage end in the local preset storage space, a set of parameters to be transmitted to the storage end when it is monitored that the preset transmission condition is satisfied; the storage end is used for: and according to the received model training parameters, storing a target model check point.
Optionally, the storage terminal is specifically configured to: and according to the received model training parameters and the model structure of the target model transmitted by the training end, storing model check points.
Optionally, the method for transmitting the model structure of the object model includes any one of the following:
1) And under the condition that the check point of the target model is required to be saved, the training end transmits the model structure of the target model to the storage end.
2) The training end transmits the model structure of the target model to the storage end in advance.
3) Under the condition that the target model check point needs to be stored for the first time, the training end transmits the model structure of the target model to the storage end.
Optionally, the transmission unit 303 is configured to: determining a group of parameters to be transmitted, determining idle transmission connection with a storage end, and transmitting the determined group of parameters to be transmitted through the determined transmission connection; wherein, different groups of parameters to be transmitted are transmitted to the storage end in parallel.
Optionally, the model training parameters include at least one of: model tensors, model parameters, and model training optimizer parameters.
Optionally, the preset transmission condition includes at least one of:
1) In the local preset storage space, the data quantity of the model training parameters which are not transmitted to the storage end currently is larger than the first preset data quantity.
2) And a group of parameters to be transmitted is not determined for a preset duration.
3) And after the last time a group of parameters to be transmitted are determined, the data quantity of the newly added model training parameters in the local preset storage space is larger than the second preset data quantity.
Optionally, the method for determining the parameters to be transmitted includes any one of the following:
1) For model training parameters which are not transmitted to a storage end in a local preset storage space, selecting model training parameters with fixed data quantity, and determining the model training parameters as a group of parameters to be transmitted.
2) And determining model training parameters which are not transmitted to a storage end in a local preset storage space as a group of parameters to be transmitted.
Optionally, the storage end is further configured to: storing in association with the stored address information of the saved object model checkpoint at least one of: identification information of the target model and current version information of the saved target model checkpoint.
Optionally, the storage end is further configured to: when no index containing the identification information of the target model is found, create an index containing the identification information of the target model, the created index also containing a linked list, and append the storage address information of the saved target model checkpoint to the tail of the linked list in the created index; and when an index containing the identification information of the target model is found, append the storage address information of the saved target model checkpoint to the tail of the linked list in the found index.
Optionally, the apparatus further comprises an updating unit 304 for determining updated model training parameters during updating of model training parameters of the target model.
The holding unit 301 is configured to: and under the condition that the data quantity of the updated model training parameters is smaller than the threshold value of the updated data quantity, transmitting the model training parameters of the current updated target model to a local preset storage space.
The storage end is specifically used for: and storing the target model check point according to the received model training parameters and the target model check point stored last time.
Specific explanation can be found in the method examples described above.
Corresponding to the method embodiment, the embodiment of the invention also provides a system embodiment.
The system may include a storage side and at least one training side.
Wherein any training end is used for: transmitting current model training parameters of the target model to a local preset storage space when it is determined that the target model checkpoint needs to be saved; starting, during the transmission of the current model training parameters of the target model to the local preset storage space, to monitor whether a preset transmission condition is satisfied, the monitoring lasting at least until the current model training parameters of the target model are successfully transmitted to the storage end; and, when the preset transmission condition is satisfied, determining, from the model training parameters in the local preset storage space that have not been transmitted to the storage end, a set of parameters to be transmitted to the storage end;
The storage end is used for: and according to the received model training parameters, storing a target model check point.
Specific explanation can be found in the method examples described above.
The embodiment of the invention also provides computer equipment, which at least comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize any method embodiment.
The embodiment of the invention also provides electronic equipment, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the one processor to enable the at least one processor to perform any one of the method embodiments described above.
Fig. 6 is a schematic diagram of a hardware structure of a computer device for configuring a method according to an embodiment of the present invention, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, etc., for executing related programs to implement the technical solutions provided by the embodiments of the present invention.
The memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage, dynamic storage, etc. The memory 1020 may store an operating system and other application programs; when the embodiments of the present invention are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary for implementing the embodiments of the present invention, and not all the components shown in the drawings.
The present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements any of the method embodiments described above.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transmission media), such as modulated data signals and carrier waves.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that embodiments of the present invention may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solution of the embodiments of the present invention may be embodied essentially or in contributing parts in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the embodiments or some parts of the embodiments of the present invention.
Embodiments of the present invention also provide a computer program product comprising a computer program/instruction which, when executed by a processor, implements any of the method embodiments described above.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the functions of the modules may be implemented in the same piece or pieces of software and/or hardware when implementing embodiments of the present invention. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing is merely illustrative of the principles of this invention and it will be appreciated by those skilled in the art that numerous modifications and variations could be made without departing from the principles of this invention.
In the present invention, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" refers to two or more, unless explicitly defined otherwise.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles, including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (16)

1. A model checkpoint saving method, characterized in that the method is applied to a training end and comprises:
transmitting current model training parameters of a target model to a local preset storage space when it is determined that a target model checkpoint needs to be saved;
in the process of transmitting the current model training parameters of the target model to the local preset storage space, starting to monitor whether a preset transmission condition is met, the monitoring continuing at least until the current model training parameters of the target model are successfully transmitted to a storage end;
when the preset transmission condition is met, determining, from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end, a group of parameters to be transmitted to the storage end;
wherein the storage end is configured to save a target model checkpoint according to the received model training parameters.
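The flow of claim 1 — copy parameters into a local buffer, monitor a transmission condition during the copy, and flush groups to the storage end — can be illustrated with a minimal Python sketch. All class and method names here are hypothetical, and the simple byte-count threshold is only one of the conditions the claims allow; this is not the patented implementation.

```python
class StorageEnd:
    """Stand-in for the remote storage end that assembles the checkpoint."""

    def __init__(self):
        self.checkpoint = {}

    def receive(self, group):
        # Save the received model training parameters into the checkpoint.
        for name, blob in group:
            self.checkpoint[name] = blob


class TrainingEnd:
    """Sketch: buffer checkpoint parameters locally, flush groups to storage."""

    def __init__(self, storage, flush_threshold_bytes):
        self.storage = storage                  # the storage end
        self.flush_threshold = flush_threshold_bytes
        self.pending = []                       # copied locally, not yet sent
        self.pending_bytes = 0

    def save_checkpoint(self, named_params):
        # Step 1: copy current parameters into the local preset storage space.
        for name, blob in named_params:
            self.pending.append((name, blob))
            self.pending_bytes += len(blob)
            # Step 2: monitor the transmission condition while copying.
            if self.pending_bytes > self.flush_threshold:
                self._flush_group()
        # Keep going until everything has reached the storage end.
        if self.pending:
            self._flush_group()

    def _flush_group(self):
        # Step 3: hand one group of parameters to the storage end.
        self.storage.receive(list(self.pending))
        self.pending.clear()
        self.pending_bytes = 0
```

The point of the two-stage copy is that the local buffer absorbs the checkpoint quickly, so groups can stream to remote storage while training resumes.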
2. The method of claim 1, wherein the storage end is configured to save a target model checkpoint according to the received model training parameters and a model structure of the target model transmitted by the training end.
3. The method according to claim 2, wherein the transmission of the model structure of the target model comprises any one of the following:
the training end transmits the model structure of the target model to the storage end when it is determined that a target model checkpoint needs to be saved;
the training end transmits the model structure of the target model to the storage end in advance; or
the training end transmits the model structure of the target model to the storage end when a target model checkpoint needs to be saved for the first time.
4. The method according to any one of claims 1 to 3, wherein determining a group of parameters to be transmitted to the storage end comprises:
determining a group of parameters to be transmitted, determining an idle transmission connection with the storage end, and transmitting the determined group of parameters to be transmitted over the determined transmission connection;
wherein different groups of parameters to be transmitted are transmitted to the storage end in parallel.
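Claim 4's parallel transmission over idle connections can be approximated with a worker pool, where each worker stands in for one transmission connection. This is a hedged sketch: `send_fn` and the pool size are hypothetical, not prescribed by the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def send_groups_parallel(groups, send_fn, num_connections=4):
    # Each pool worker models one transmission connection to the storage end;
    # submitted groups are sent in parallel, each on whichever worker is idle.
    with ThreadPoolExecutor(max_workers=num_connections) as pool:
        futures = [pool.submit(send_fn, group) for group in groups]
        # Collect per-group results in submission order.
        return [f.result() for f in futures]
```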
5. A method according to any one of claims 1 to 3, wherein the model training parameters comprise at least one of: model tensors, model parameters, and model training optimizer parameters.
6. The method according to any one of claims 1 to 3, wherein the preset transmission condition comprises at least one of the following:
in the local preset storage space, the data amount of model training parameters not yet transmitted to the storage end is larger than a first preset data amount;
no group of parameters to be transmitted has been determined within a preset duration; or
since a group of parameters to be transmitted was last determined, the data amount of model training parameters newly added to the local preset storage space is larger than a second preset data amount.
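The three alternative preset transmission conditions of claim 6 combine naturally into a single predicate. The thresholds, time base, and function name below are illustrative assumptions, not values from the patent:

```python
import time


def should_flush(pending_bytes, new_bytes_since_last_group, last_group_time,
                 now=None, first_threshold=4 << 20, second_threshold=1 << 20,
                 max_wait_s=0.5):
    """Return True if any of the three claimed conditions holds."""
    now = time.monotonic() if now is None else now
    if pending_bytes > first_threshold:
        return True          # condition 1: untransmitted backlog too large
    if now - last_group_time > max_wait_s:
        return True          # condition 2: no group determined for too long
    if new_bytes_since_last_group > second_threshold:
        return True          # condition 3: enough new data since last group
    return False
```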
7. The method according to any one of claims 1 to 3, wherein determining the parameters to be transmitted comprises any one of the following:
selecting, from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end, model training parameters of a fixed data amount, and determining them as a group of parameters to be transmitted; or
determining all the model training parameters in the local preset storage space that have not yet been transmitted to the storage end as a group of parameters to be transmitted.
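The two grouping strategies of claim 7 — a fixed data amount, or everything not yet transmitted — might be sketched as one helper. The name and the byte-based size measure are assumptions for illustration:

```python
def take_group(untransmitted, fixed_bytes=None):
    """Select either a fixed data amount of parameters (first branch) or all
    untransmitted parameters (second branch) as one group, removing the
    selected entries from the untransmitted list in place."""
    if fixed_bytes is None:
        # Second branch: everything not yet transmitted becomes one group.
        group, untransmitted[:] = list(untransmitted), []
        return group
    # First branch: accumulate parameters until the fixed size is reached.
    group, size = [], 0
    while untransmitted and size < fixed_bytes:
        name, blob = untransmitted.pop(0)
        group.append((name, blob))
        size += len(blob)
    return group
```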
8. The method according to any one of claims 1 to 3, wherein the storage end is further configured to:
store, in association with storage address information of the saved target model checkpoint, at least one of the following: identification information of the target model, and current version information of the saved target model checkpoint.
9. The method according to any one of claims 1 to 3, wherein the storage end is further configured to:
when no index containing the identification information of the target model is found, create an index containing the identification information of the target model, the created index further containing a linked list, and append the storage address information of the saved target model checkpoint to the tail of the linked list in the created index; and
when an index containing the identification information of the target model is found, append the storage address information of the saved target model checkpoint to the tail of the linked list in the found index.
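Claim 9's index, keyed by model identification with checkpoint addresses appended at the tail of a linked list, can be sketched with a Python dict of append-only lists standing in for the linked list. This is a simplification; the patent does not prescribe a concrete data-structure implementation.

```python
class CheckpointIndex:
    """Index keyed by model id; each entry keeps an append-only list
    (standing in for the claimed linked list) of checkpoint addresses."""

    def __init__(self):
        self.index = {}

    def record(self, model_id, address):
        # Create the index entry on first sight of this model id (the
        # "index not found" branch), then append the new checkpoint
        # address at the tail (both branches).
        self.index.setdefault(model_id, []).append(address)

    def latest(self, model_id):
        # The tail of the list is the most recently saved checkpoint.
        versions = self.index.get(model_id)
        return versions[-1] if versions else None
```

Appending at the tail keeps checkpoint versions in save order, so the most recent checkpoint is always retrievable in constant time.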
10. The method according to any one of claims 1 to 3, further comprising: in the process of updating the model training parameters of the target model, determining the updated model training parameters;
wherein transmitting the current model training parameters of the target model to the local preset storage space comprises:
transmitting the currently updated model training parameters of the target model to the local preset storage space when the data amount of the updated model training parameters is smaller than an update data amount threshold;
and the storage end is configured to save the target model checkpoint according to the received model training parameters and the previously saved target model checkpoint.
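Claim 10 describes an incremental checkpoint: when the update delta is small enough, only the updated parameters are transmitted and merged onto the previously saved checkpoint. A hedged sketch, with the threshold handling and merge semantics assumed rather than taken from the patent:

```python
def apply_incremental(previous_checkpoint, updated_params,
                      update_threshold_bytes):
    """If the delta is below the update data amount threshold, merge it onto
    the previous full checkpoint; otherwise return None so the caller falls
    back to taking a full checkpoint."""
    delta_bytes = sum(len(blob) for blob in updated_params.values())
    if delta_bytes >= update_threshold_bytes:
        return None  # delta too large: take a full checkpoint instead
    merged = dict(previous_checkpoint)  # start from the last saved state
    merged.update(updated_params)       # overlay only the changed parameters
    return merged
```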
11. A model checkpoint saving method, comprising:
a training end transmitting current model training parameters of a target model to a local preset storage space when it is determined that a target model checkpoint needs to be saved;
the training end, in the process of transmitting the current model training parameters of the target model to the local preset storage space, starting to monitor whether a preset transmission condition is met, the monitoring continuing at least until the current model training parameters of the target model are successfully transmitted to a storage end;
the training end, when the preset transmission condition is met, determining, from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end, a group of parameters to be transmitted to the storage end; and
the storage end saving the target model checkpoint according to the received model training parameters.
12. A model checkpoint saving device, characterized in that the device is applied to a training end and comprises:
a storage unit, configured to transmit current model training parameters of a target model to a local preset storage space when it is determined that a target model checkpoint needs to be saved;
a monitoring unit, configured to start monitoring, in the process of transmitting the current model training parameters of the target model to the local preset storage space, whether a preset transmission condition is met, the monitoring continuing at least until the current model training parameters of the target model are successfully transmitted to a storage end; and
a transmission unit, configured to determine, when the preset transmission condition is met, from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end, a group of parameters to be transmitted to the storage end; wherein the storage end is configured to save a target model checkpoint according to the received model training parameters.
13. A model checkpoint saving system, wherein the system comprises a storage end and at least one training end;
any training end is configured to: transmit current model training parameters of a target model to a local preset storage space when it is determined that a target model checkpoint needs to be saved; start monitoring, in the process of transmitting the current model training parameters of the target model to the local preset storage space, whether a preset transmission condition is met, the monitoring continuing at least until the current model training parameters of the target model are successfully transmitted to the storage end; and determine, when the preset transmission condition is met, from the model training parameters in the local preset storage space that have not yet been transmitted to the storage end, a group of parameters to be transmitted to the storage end; and
the storage end is configured to save a target model checkpoint according to the received model training parameters.
14. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 10.
15. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 10.
16. A computer program product comprising a computer program/instructions which, when executed by a processor, implement the method of any one of claims 1 to 10.
CN202410302871.2A 2024-03-15 2024-03-15 A model checkpoint saving method, device, equipment and storage medium Pending CN118153711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410302871.2A CN118153711A (en) 2024-03-15 2024-03-15 A model checkpoint saving method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410302871.2A CN118153711A (en) 2024-03-15 2024-03-15 A model checkpoint saving method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118153711A true CN118153711A (en) 2024-06-07

Family

ID=91296542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410302871.2A Pending CN118153711A (en) 2024-03-15 2024-03-15 A model checkpoint saving method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118153711A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118821973A (en) * 2024-09-12 2024-10-22 支付宝(杭州)信息技术有限公司 System and method for model training and checkpoint file storage
CN119623585A (en) * 2024-12-02 2025-03-14 华中科技大学 A pipeline checkpoint operation method and operating system thereof

Similar Documents

Publication Publication Date Title
US11614867B2 (en) Distributed storage system-based data processing method and storage device
CN109086388B (en) Block chain data storage method, device, equipment and medium
US9372908B2 (en) Merging an out of synchronization indicator and a change recording indicator in response to a failure in consistency group formation
US20150213100A1 (en) Data synchronization method and system
CN107329704B (en) Cache mirroring method and controller
CN118153711A (en) A model checkpoint saving method, device, equipment and storage medium
CN113570460B (en) Method and device for concurrently executing transactions in a blockchain
CN104246767A (en) Telemetry system for cloud synchronization system
US11455117B2 (en) Data reading method, apparatus, and system, avoiding version rollback issues in distributed system
CN111737331B (en) Transaction consistency processing method and system for database and object storage
CN109697140B (en) Data backup method and device, data recovery method and device and storage medium
CN117075821A (en) Distributed storage method and device, electronic equipment and storage medium
US9330153B2 (en) System, method, and computer readable medium that coordinates between devices using exchange of log files
CN114297196A (en) Metadata storage method and device, electronic equipment and storage medium
CN116186033B (en) Data archiving method, system, device and storage medium
US20180137055A1 (en) Log-Structured Storage Method and Server
CN112948281B (en) Data processing method, device, equipment and storage medium
CN110287164A (en) A kind of data reconstruction method, device and computer equipment
EP4500353A1 (en) Maintaining a record data structure using page metadata of a bookkeeping page
CN115604290B (en) Kafka message execution method, device, equipment and storage medium
CN112069067B (en) Data testing method and device based on block chain and computer readable storage medium
CN111984460B (en) Metadata recovery methods and devices
CN120029721A (en) A transaction processing method, device and computer equipment for multi-version concurrent control
CN119249505A (en) Data verification method, device, electronic device, readable medium and program product
CN119149087A (en) Configuration item updating and querying method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination