
CN119848050B - Data block processing method, system, equipment, storage medium and product - Google Patents


Info

Publication number
CN119848050B
CN119848050B (Application CN202510332559.2A)
Authority
CN
China
Prior art keywords
data block
node
data
metadata
storage table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510332559.2A
Other languages
Chinese (zh)
Other versions
CN119848050A (en)
Inventor
王继玉
陈培
荆荣讯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202510332559.2A priority Critical patent/CN119848050B/en
Publication of CN119848050A publication Critical patent/CN119848050A/en
Application granted granted Critical
Publication of CN119848050B publication Critical patent/CN119848050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present application discloses a data block processing method, system, device, storage medium and product, relating to the field of data storage technology, including: obtaining a tree structure storage table, a leaf node storage table and a data block storage table; when a node identifier matching the first node identifier is found in both the leaf node storage table and the data block storage table, establishing a first hierarchical relationship between the first node and its attribute information and data block information; determining the first small file list information and adding it to the first hierarchical relationship to obtain a second hierarchical relationship; and generating a metadata snapshot based on the second hierarchical relationship and at least one third hierarchical relationship of other nodes, and sending it to the client. The present application sends the metadata snapshot to the client, so that the client reads the metadata snapshot directly from the local cache, which reduces the access pressure on the metadata management center, improves metadata access efficiency, and solves the problem of low supply efficiency of small file data sets in deep learning training jobs.

Description

Data block processing method, system, equipment, storage medium and product
Technical Field
The present application relates to the field of data storage technologies, and in particular, to a method, a system, an apparatus, a storage medium, and a product for processing a data block.
Background
At present, more and more artificial intelligence training jobs run in cloud data centers. The elastic GPU computing capacity provided by a cloud data center can meet the diversified computing resource demands of artificial intelligence training jobs. For training data, especially large-scale small file data sets, supply efficiency directly affects the computing performance and efficiency of the distributed deep learning job systems of a cloud data center.
In the related art, two approaches are commonly used for large-scale small file data sets. In the first, high-performance storage is provided on the computing nodes of a deep learning task, and the small file data set held in a remote storage system is copied onto that storage; however, because the capacity of the high-performance storage is limited, data supply cannot be accelerated this way for data sets exceeding that capacity. In the second, the small file data set is stored in a parallel file system, and its data is pulled from the parallel file system and fed to the computing nodes of the deep learning task; however, because the parallel file system has a significant performance bottleneck in data query and random reading, the data supply efficiency does not match the computing speed of the computing nodes. Therefore, neither mode effectively improves the supply efficiency of small file data sets in deep learning training jobs.
Disclosure of Invention
The application provides a data block processing method, a system, equipment, a storage medium and a product, which at least solve the problem of low supply efficiency of small file data sets in deep learning training operation in the related technology.
The application provides a data block processing method applied to a metadata management center, comprising the step of obtaining a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores a first node identifier of a first node; the leaf node storage table comprises the attribute information and node identifier of each leaf node in the tree structure corresponding to a plurality of data blocks; the data block storage table comprises the data block information, data block identifier, and the node identifier of the storing leaf node for each data block stored in each of the plurality of leaf nodes; the plurality of data blocks are obtained by aggregating a plurality of small files required by a training job; and the first node is the root node or any leaf node in the tree structure.
The method further comprises searching the leaf node storage table and the data block storage table respectively for node identifiers matching the first node identifier; if matching node identifiers are found, establishing a first hierarchical relationship between the first node and its attribute information and data block information; determining, according to the first data block identifier of the data block corresponding to the first node, the first small file list information included in that data block, and adding the first small file list information to the first hierarchical relationship to obtain a second hierarchical relationship; and generating a metadata snapshot according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes, and sending the metadata snapshot to a client, so that the client stores the metadata snapshot in a local cache and obtains it when a training job is started, wherein the other nodes are at least one leaf node other than the first node in the tree structure.
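As a rough illustration (not the patent's actual implementation), the snapshot-construction steps above can be sketched with in-memory stand-ins for the three storage tables. All table contents, field names, and helper functions here are hypothetical:

```python
import time

# Hypothetical stand-ins for the three storage tables described above.
tree_table = {"root": {"node_id": "n1"}}  # tree structure storage table
leaf_table = {  # leaf node storage table: node id -> attribute info
    "n1": {"name": "dataset-part-0", "mode": "read-only"},
    "n2": {"name": "dataset-part-1", "mode": "read-only"},
}
chunk_table = {  # data block storage table: node id -> (block id, block info)
    "n1": {"chunk_id": "c100", "info": {"size": 64 << 20, "path": "/chunks/c100"}},
    "n2": {"chunk_id": "c101", "info": {"size": 64 << 20, "path": "/chunks/c101"}},
}
# block id -> list of small files aggregated into that block (illustrative)
smallfile_lists = {"c100": ["a.jpg", "b.jpg"], "c101": ["c.jpg"]}

def build_hierarchy(node_id):
    """Match the node id in both tables, link the node to its attribute
    and data block info (first relationship), then attach the small file
    list (second relationship)."""
    if node_id not in leaf_table or node_id not in chunk_table:
        return None  # no matching identifier in one of the tables
    entry = chunk_table[node_id]
    rel = {"node": node_id,
           "attrs": leaf_table[node_id],
           "chunk": entry["info"]}
    rel["files"] = smallfile_lists[entry["chunk_id"]]  # second relationship
    return rel

def build_snapshot(first_node_id):
    # Combine the first node's relationship with the third relationships
    # of the other leaf nodes, and stamp the snapshot with a timestamp.
    relations = [build_hierarchy(first_node_id)]
    relations += [build_hierarchy(n) for n in leaf_table if n != first_node_id]
    return {"timestamp": time.time(), "nodes": relations}

snapshot = build_snapshot("n1")
```

The snapshot dictionary would then be serialized in a preset format and pushed to the client's local cache.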
The application further provides a data block processing method applied to a client, comprising the step of loading, after a training job is started, the metadata snapshot corresponding to a plurality of data blocks from a local cache, the metadata snapshot being generated based on a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores the first node identifier of a first node; the leaf node storage table comprises the attribute information and node identifier of each leaf node in the tree structure; the data block storage table comprises the data block information, data block identifier, and the node identifier of the storing leaf node for each data block stored in each of the plurality of leaf nodes; the first node is the root node or any leaf node in the tree structure; and the plurality of data blocks are obtained by aggregating a plurality of small files required by the training job. The method further comprises detecting whether the snapshot timestamp of the metadata snapshot is consistent with the latest snapshot timestamp of the metadata management center; if so, after the metadata snapshot is loaded, reading the plurality of data blocks from the back-end aggregate file system based on the metadata snapshot, and performing a deaggregation operation on the plurality of data blocks to obtain the plurality of small files.
The application also provides a data block processing device comprising a first transceiver module and a first processing module. The first transceiver module is used to acquire a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores the first node identifier of a first node; the leaf node storage table comprises the attribute information and node identifier of each leaf node in the tree structure corresponding to a plurality of data blocks; the data block storage table comprises the data block information, data block identifier, and the node identifier of the storing leaf node for each data block stored in each leaf node; the first node is the root node or any leaf node in the tree structure; and the plurality of data blocks are obtained by aggregating a plurality of small files required by a training job.
The first processing module is used for searching whether node identifiers matched with the first node identifiers exist in the leaf node storage table and the data block storage table respectively, if the node identifiers matched with the first node identifiers are found, a first hierarchical relationship between the attribute information of the first node and the data block information of the first node is established, the first small file list information included in the data block corresponding to the first node is determined according to the first data block identifier of the data block corresponding to the first node, the first small file list information is added to the first hierarchical relationship to obtain a second hierarchical relationship, a metadata snapshot is generated according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes, and the metadata snapshot is sent to the client, so that the client is stored in a local cache, and the metadata snapshot is obtained when training operation is started, and the other nodes are at least one leaf node except the first node in a tree structure.
The application also provides a data block processing device comprising a second transceiver module and a second processing module. The second transceiver module is used to load, after a training job is started, the metadata snapshot corresponding to a plurality of data blocks from a local cache, the metadata snapshot being generated based on a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores the first node identifier of a first node; the leaf node storage table comprises the attribute information and node identifier of each leaf node in the tree structure; the data block storage table comprises the data block information, data block identifier, and node identifier of each data block stored in each of the plurality of leaf nodes; the first node is the root node or any leaf node in the tree structure; and the plurality of data blocks are obtained by aggregating a plurality of small files required by the training job.
The second processing module is used for detecting whether the snapshot time stamp of the metadata snapshot is consistent with the latest snapshot time stamp of the metadata management center or not, if so, reading a plurality of data blocks from the back-end aggregate file system based on the metadata snapshot after the metadata snapshot is loaded, and performing depolymerization operation on the plurality of data blocks to obtain a plurality of small files.
The application also provides a data block processing system which comprises a metadata management center and a client. The metadata management center is used for acquiring a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores a first node identifier of a first node, the leaf node storage table comprises attribute information and node identifiers of all leaf nodes in a tree structure corresponding to a plurality of data blocks, the data block storage table comprises data block information, data block identifiers and node identifiers of all the leaf nodes of all the data blocks stored in all the leaf nodes, the first node is a root node or any leaf node in the tree structure, and the data blocks are obtained after a plurality of small files required by training operation are aggregated.
The metadata management center is further configured to search the leaf node storage table and the data block storage table for node identifiers matching the first node identifier; if matching node identifiers are found, to establish a first hierarchical relationship between the first node and its attribute information and data block information; to determine, according to the first data block identifier of the data block corresponding to the first node, the first small file list information included in that data block, and add the first small file list information to the first hierarchical relationship to obtain a second hierarchical relationship; and to generate a metadata snapshot according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes, and send the metadata snapshot to the client, so that the client stores the metadata snapshot in a local cache and obtains it when a training job is started, wherein the other nodes are at least one leaf node other than the first node in the tree structure.
The client is used to load, after a training job is started, the metadata snapshot corresponding to a plurality of data blocks from a local cache, the metadata snapshot being generated based on a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores the first node identifier of a first node; the leaf node storage table comprises the attribute information and node identifier of each leaf node in the tree structure; the data block storage table comprises the data block information, data block identifier, and node identifier of each data block stored in each of the plurality of leaf nodes; the first node is the root node or any leaf node in the tree structure; and the plurality of data blocks are obtained by aggregating the plurality of small files required by the training job. The client detects whether the snapshot timestamp of the metadata snapshot is consistent with the latest snapshot timestamp of the metadata management center; if so, after loading of the metadata snapshot is completed, the client reads the plurality of data blocks from the back-end aggregate file system based on the metadata snapshot and performs a deaggregation operation on the plurality of data blocks to obtain the plurality of small files.
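The client-side flow above — validate the cached snapshot's timestamp, then deaggregate the data blocks back into small files — might be sketched as follows. The byte-range index layout and all names here are illustrative assumptions, not the patent's actual format:

```python
def snapshot_is_fresh(local_snapshot, server_latest_ts):
    # The cached snapshot may only be used if its timestamp matches the
    # metadata management center's latest snapshot timestamp.
    return local_snapshot["timestamp"] == server_latest_ts

def deaggregate(chunk_bytes, file_index):
    """Deaggregation: slice each small file back out of the data block.

    file_index maps file name -> (offset, length), as assumed to be
    recorded in the metadata snapshot."""
    return {name: chunk_bytes[off:off + ln]
            for name, (off, ln) in file_index.items()}

# Illustrative data block holding three aggregated small files.
chunk = b"catdogbird"
index = {"cat.txt": (0, 3), "dog.txt": (3, 3), "bird.txt": (6, 4)}
files = deaggregate(chunk, index)
```

In the real system the block bytes would come from the back-end aggregate file system rather than a literal, and a stale snapshot would trigger a refresh from the metadata management center before any blocks are read.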
The application also provides an electronic device comprising a memory for storing a computer program and a processor for implementing the steps of any one of the data block processing methods when executing the computer program.
The application also provides a computer readable storage medium having a computer program stored therein, wherein the computer program when executed by a processor implements the steps of any of the above described data block processing methods.
The application also provides a computer program product comprising a computer program which when executed by a processor implements the steps of any of the above described data block processing methods.
According to the application, the metadata management center can send the metadata snapshot of the training data set (a plurality of data blocks) to the computing node where the client (i.e., the application) is located, so that the client can read the metadata snapshot directly from the local cache, reducing the access pressure on the metadata management center and accelerating metadata access. The client can quickly acquire the metadata snapshot from the local cache and read the required metadata information from it without frequently accessing the metadata management center, which reduces network overhead, improves metadata access efficiency, and improves the supply efficiency of the small file data set.
Furthermore, the data set is typically read-only during execution of the training job, so no metadata snapshot updates triggered by training data set changes occur during that period; the localized metadata snapshot cache is therefore well suited to deep learning training jobs. Through the metadata snapshot mechanism, efficient metadata access performance can be maintained when training on large-scale small file data sets, while the consumption of storage and computing resources is reduced and the parsing and supply of the data-block-based training data set are accelerated.
Drawings
For a clearer description of embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a topology diagram of a data block processing system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a high performance data read pipeline according to an embodiment of the present application;
FIG. 3 is a flow chart of a data block processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of metadata construction for data blocks according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating another method for processing a data block according to an embodiment of the present application;
FIG. 6 is a schematic diagram of reading the mini-batches corresponding to data blocks according to an embodiment of the present application;
FIG. 7 is a block diagram of an embodiment of the present application;
FIG. 8 is a block diagram of a data block processing apparatus according to an embodiment of the present application;
FIG. 9 is a block diagram illustrating a further embodiment of a data block processing apparatus according to the present application;
FIG. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.
It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.
The specific application environment architecture or specific hardware architecture upon which execution of a data block processing method depends is described herein.
The embodiment of the application is applied to a scenario in which a storage system needs to provide a small file data set as a training set for a distributed deep learning job system to perform model training.
For large-scale small file data sets, the mainstream processing approach is to re-aggregate a large number of small files into a new data format to improve data reading performance. However, such a format is supported only by a specific programming framework and therefore lacks universality: the original data must be converted before the training job starts and converted back to the original data for computation after training starts, which limits the applicable scenarios of the data format.
A deep learning training job typically downloads or collects private data from a public library, converts it into a data set format usable for deep learning training, and then feeds the prepared small file data set to the computing processes of the computing nodes in batches for computation. The file access characteristics of such training jobs include: each training task operates on only one small file data set; the small file data set is accessed read-only during training and remains fixed throughout; the small files are read sequentially in epochs, with all small files read once per epoch; and the small files are randomly shuffled between different epochs to avoid model overfitting.
Specifically, at the beginning of a typical deep learning training job, all file names of the small file data set are loaded into memory, and the training framework (e.g., PyTorch) randomly shuffles the index of the file names before each epoch to generate a random file read order. Thereafter, mini-batches are read iteratively; each mini-batch is fed to a computing node (e.g., a graphics processing unit (GPU)) to perform the parameter update computations of the deep neural network, and the next batch is read after the computation finishes, until all small files have been read once. Before the model parameters converge, training often requires multiple epochs, and the overall training time depends on the training data supply efficiency and the GPU computing efficiency. In a large computing cluster, many training jobs run concurrently, each possibly using a different data set, and together they may send a large number of small file read requests to the storage system where the data sets reside. If the storage system is inefficient at retrieving small files, training data supply is slow and the GPUs wait frequently, slowing the training process.
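The per-epoch shuffle-and-batch loop described above can be sketched in a few lines. This is a generic illustration of the access pattern, not code from the patent; the file names and batch size are made up:

```python
import random

def epoch_batches(filenames, batch_size, seed):
    """Yield mini-batches of file names in a fresh random order each epoch.

    Shuffling the *index* of the file names (as the training framework does)
    produces a new random read order per epoch without moving the data."""
    idx = list(range(len(filenames)))
    random.Random(seed).shuffle(idx)  # seed stands in for the epoch number
    for i in range(0, len(idx), batch_size):
        yield [filenames[j] for j in idx[i:i + batch_size]]

names = [f"img_{i}.jpg" for i in range(10)]
epoch0 = list(epoch_batches(names, batch_size=4, seed=0))
epoch1 = list(epoch_batches(names, batch_size=4, seed=1))
```

Every file appears exactly once per epoch, while the order differs between epochs — exactly the read-only, shuffled, epoch-sequential pattern that makes a localized metadata snapshot cache safe.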
In order to solve the above technical problems, an embodiment of the present application provides a data block processing method, including: obtaining a tree structure storage table, a leaf node storage table and a data block storage table; when node identifiers matching the first node identifier are found in both the leaf node storage table and the data block storage table, establishing a first hierarchical relationship between the first node and its attribute information and data block information; determining, according to the first data block identifier of the data block corresponding to the first node, the first small file list information included in that data block, and adding the first small file list information to the first hierarchical relationship to obtain a second hierarchical relationship; generating a metadata snapshot according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes; and sending the metadata snapshot to a client, so that the client stores the metadata snapshot in a local cache and acquires it when a training job is started. Based on the metadata snapshot, the client can rapidly acquire from the file system the plurality of data blocks required by the training job, and thus the required set of small files, improving the supply efficiency of the small file data set for the training job.
The method provided by the embodiment of the present application is described below by taking the data block processing system shown in fig. 1 as an example.
Referring to fig. 1, fig. 1 is a topology structure diagram of a data block processing system according to an embodiment of the present application. In fig. 1, a data block processing system 100 includes a metadata management center 101, a client 102, and a back-end aggregate file system 103. Optionally, the data block processing system 100 also includes a global shared file system 104.
The metadata management center 101 (also referred to as CacheFS-Server) in the embodiment of the present application is used to manage and store the metadata information of the entire small file dataset. During the aggregation of small file datasets and their writing to the back-end aggregate file system 103, the metadata management center 101 receives metadata information extracted by the clients 102 and stores it in a distributed database. To increase access speed, the metadata management center 101 also caches a copy of the metadata information in memory. The metadata management center 101 supports a metadata snapshot function: it constructs a metadata snapshot in a preset format and issues the snapshot to the client 102.
The metadata management center 101 has backup and fault tolerance functions. In a distributed storage system, the availability and integrity of metadata directly affect the operation of the overall system. Therefore, through measures such as regular backups, transaction logging, data replication, and multi-node deployment, the metadata management center ensures its own stability and reliability, minimizes the impact of system faults on the service, guarantees the safety of metadata, and maintains the continuous service capability of the file system.
The metadata management center 101 is configured with full backup and incremental backup policies to meet different restore requirements. A full backup copies all metadata in the database; it is usually executed during periods of low system load and at a low frequency, to reduce the impact on normal business. An incremental backup records only the data changed since the last backup and is executed more frequently, saving storage space and improving backup efficiency.
In the fault tolerance aspect, the metadata management center relies on the copy function of the database and the master-slave switching mechanism thereof, so that the synchronization and fault tolerance processing of metadata are realized. The replication function of the database allows metadata to be synchronized between multiple database instances, providing data redundancy and load balancing. When the main database instance fails, the system can automatically switch to the standby database instance, so that the continuity of the service is ensured. In the master-slave replication mode, the master database synchronizes the updated metadata to the slave database in real time, and when the master database is unavailable, the slave database can immediately take over, so that service interruption is avoided.
It will be appreciated that the backup and fault tolerance mechanism enables a quick start of the restore process when the metadata management center encounters data corruption or other unpredictable system failure. By using the backup data, the metadata is restored in a short time, ensuring that the service is not interrupted.
The Client 102 (also referred to as CacheFS-Client) in the embodiment of the present application is configured to perform a read-write operation on a small file dataset. The client 102 is composed of a write operation module, a read operation module and a general operation module, and each bears different functions.
The write operation module is configured to scan the small files stored in the global shared file system, aggregate the small files into larger data blocks according to a specific aggregation algorithm, store the aggregated data blocks in the back-end aggregate file system, and report metadata information of each data block to the metadata management center 101 in real time.
The read operation module is used for directly reading the aggregated data blocks from the back-end aggregate file system after the training task starts, using the constructed high-performance data reading pipeline, and performing a deaggregation operation on the aggregated data blocks. In addition, the read operation module integrates a data randomization function, so that the data can meet the requirement of deep learning training on data randomization during deaggregation. The module integrates data reading, deaggregation and randomization into a continuous and efficient operation flow, improves the efficiency of accessing the small file dataset, and ensures an efficient data supply during model training.
Optionally, a high-performance data read pipeline integrates a data chunk parser (Data Reader), a metadata snapshot parser (Meta Interpreter), a data chunk address parser (Chunk Path Parser), and a chunk-based shuffle mechanism (Chunk-Based Shuffle), enabling efficient reading of training data. Fig. 2 is a schematic diagram of a high-performance data reading pipeline according to an embodiment of the present application. As shown in fig. 2, the data reading pipeline acquires a metadata snapshot from the CacheFS metadata management center, parses out a plurality of original small file data from the stored data blocks in real time based on the metadata snapshot, forms mini-batch data, and feeds the mini-batch data to a training job through a POSIX interface. The data read pipeline also integrates a data block based shuffle mechanism that performs a random out-of-order process (shuffle) on the order of all small files at the beginning of each round (epoch) of training.
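As an illustrative sketch of the chunk-based shuffle mechanism (the function names and the per-epoch seeding are assumptions, not part of the original design), the following snippet shuffles the global small-file order at the start of an epoch and then groups a mini-batch's files by the chunk that stores them:

```python
import random

def chunk_based_shuffle(file_to_chunk, epoch_seed):
    """Shuffle the global small-file order at the start of an epoch.
    file_to_chunk maps each file name to the id of the chunk storing it."""
    files = sorted(file_to_chunk)        # deterministic base order
    rng = random.Random(epoch_seed)      # per-epoch seed (assumed) keeps order reproducible
    rng.shuffle(files)                   # random out-of-order processing
    return files

def group_by_chunk(mini_batch_files, file_to_chunk):
    """Group a mini-batch's files by owning chunk so that each chunk
    only needs to be opened once when the batch is read."""
    groups = {}
    for name in mini_batch_files:
        groups.setdefault(file_to_chunk[name], []).append(name)
    return groups

# Hypothetical mapping of small files to chunks for illustration.
mapping = {"a.jpeg": 1, "b.jpeg": 1, "c.jpeg": 2, "d.jpeg": 2}
order = chunk_based_shuffle(mapping, epoch_seed=0)
batches = group_by_chunk(order[:2], mapping)
```

Seeding per epoch is one simple way to keep the shuffled order consistent across worker processes while still re-randomizing every epoch.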
The general operation module is used for executing general file system operations such as data updating, cache management and directory structure reconstruction. The general operation module can periodically evict expired small file datasets and provide a transition from the data block storage view to a user-friendly tree structure view. After the training job is started, the general operation module reads data from the back-end aggregate file system 103 with the data block as the minimum granularity, that is, pre-reads small file datasets, and performs cache management on the pre-read small file datasets.
The backend aggregate file system 103 in the embodiment of the present application may be a parallel file system that aggregates the high-speed storage resources scattered across the computing nodes into a single high-performance storage pool. For example, the parallel file system may be a BeeOND system.
The global shared file system 104 in the embodiment of the present application may be any storage system. The global shared file system 104 is used to store a large number of small file datasets.
In the embodiment of the application, the metadata management center 101 can be deployed as a container service on a Master node in a cluster, and can fully utilize the full life cycle management capability of the Master node on the service. The client 102 may be deployed directly on a computing node of the cluster, interacting with the metadata management center through a Restful API. The back-end aggregate file system 103 may also be deployed on computing nodes of a cluster, with data block storage functionality provided by clients.
When the training job is dispatched to the target computing node, before the container corresponding to the training job is started, the job management system of the container cloud platform aggregates the small file data required by the training job from the remote global shared file system 104 into large data blocks by calling the client 102, and writes the large data blocks into the back-end aggregation file system 103. When the container corresponding to the training job is started, the data directory corresponding to the large data blocks is mounted in the container; the training program in the container can then access the data blocks stored in the back-end aggregation file system 103 as if accessing a local directory, and data is efficiently and quickly supplied to the training job through the high-performance data reading pipeline constructed by the read operation of the client.
In the embodiment of the present application, the metadata management center 101, the client 102, and the back-end aggregate file system 103 together form a distributed cache system (CacheFS).
The data block processing system shown in fig. 1 is only for example and is not intended to limit the technical solution of the present application. Those skilled in the art will appreciate that the data block processing system may include additional devices in the specific implementation, and is not limited.
The method is described in detail below in connection with the execution flow of the data block processing method.
The embodiment of the application provides a data block processing method, which is applied to a metadata management center shown in fig. 1, as shown in fig. 3, and fig. 3 is a flow diagram of the data block processing method provided by the embodiment of the application, and the data block processing method comprises the following steps:
S301, acquiring a tree structure storage table, a leaf node storage table and a data block storage table.
The tree structure storage table is also called a Leaf table. The Leaf table is used for storing leaf node information in the tree structure corresponding to the dataset directory of a plurality of data blocks in the back-end aggregation file system. The leaf nodes include data block files and directories. The Leaf table stores the first node identification of the first node, the node identifications of other nodes (also referred to as inodes), the leaf name of each leaf node, the node identification of its parent directory node, and its node type. The inode in the Leaf table is a unique identifier for each data block file or directory, and the inode of the parent directory is used to represent the hierarchical relationship in the tree structure. The node type is used to indicate whether each leaf node is a data block file or a directory. Each data block file and directory node is associated with its parent directory node by the node identification.
The leaf node storage table is also referred to as LeafAttributes table. The LeafAttributes table is an extension of the Leaf table, and stores attribute information of each Leaf node and each node identifier in the tree structure corresponding to a plurality of data blocks. The attribute information includes an inode of a data block file or directory, an inode of a parent directory, a node type, a modification time, a creation time, a size, and a read-write authority.
It will be appreciated that LeafAttributes tables are associated with Leaf tables by inode to ensure that all relevant metadata information for a file or directory can be queried.
The data block storage table is also referred to as a Chunk table. The Chunk table is used to store metadata information for a plurality of data blocks. The Chunk table stores the data block information of each data block stored by each of the plurality of leaf nodes and the node identification of each of the plurality of leaf nodes. The data block information includes a data block identification (chunkid), a data block name, and a data block size. One data block corresponds to the node identification of one leaf node. The data block identification is used for fast positioning to a specific data block.
It will be appreciated that the Chunk table is associated with the Leaf table and LeafAttributes tables by inode.
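The three tables and their inode association can be sketched with an in-memory SQLite database; the column names and SQL layout here are assumptions for illustration, since the embodiment only specifies the fields each table stores:

```python
import sqlite3

# Illustrative schema for the Leaf, LeafAttributes and Chunk tables;
# column names are assumptions based on the fields described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Leaf (inode INTEGER PRIMARY KEY, name TEXT,
                   parent_inode INTEGER, node_type TEXT);
CREATE TABLE LeafAttributes (inode INTEGER PRIMARY KEY, parent_inode INTEGER,
                   node_type TEXT, mtime INTEGER, ctime INTEGER,
                   size INTEGER, mode INTEGER);
CREATE TABLE Chunk (chunkid INTEGER PRIMARY KEY, inode INTEGER,
                   chunk_name TEXT, chunk_size INTEGER);
""")
# Sample rows reusing the inode/chunkid values shown in the snapshot example.
conn.execute("INSERT INTO Leaf VALUES (39152, 'imagenet-train_100', 39051, 'regular')")
conn.execute("INSERT INTO LeafAttributes VALUES (39152, 39051, 'regular', 0, 0, 4054528, 420)")
conn.execute("INSERT INTO Chunk VALUES (49283, 39152, 'imagenet-train_100', 4054528)")

# Because all three tables share the inode, one join recovers a leaf's
# attributes together with its data block information.
row = conn.execute("""
SELECT l.name, a.size, c.chunkid FROM Leaf l
JOIN LeafAttributes a ON a.inode = l.inode
JOIN Chunk c ON c.inode = l.inode
WHERE l.inode = ?""", (39152,)).fetchone()
```

The single-join lookup is what makes the inode-keyed split into three tables cheap to query despite the separation of concerns.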
The first node is a root node or any one of leaf nodes in the tree structure.
The plurality of data blocks are obtained by aggregating a plurality of small files required for training operation.
It will be appreciated that the metadata snapshot is generated based on an update timestamp maintained by the metadata management center, which is automatically refreshed whenever the metadata of the dataset changes. When the update timestamp is detected to have remained unchanged for a time window (for example, 1 hour), the metadata of the data blocks is considered to be fully constructed, that is, the tree structure storage table, the leaf node storage table and the data block storage table have been built, and the metadata management center can be triggered to acquire the tree structure storage table, the leaf node storage table and the data block storage table.
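A minimal sketch of this trigger condition, assuming Unix timestamps and the one-hour quiet window given as an example (both are illustrative values, not fixed by the design):

```python
import time

SNAPSHOT_WINDOW_SECONDS = 3600  # assumed 1-hour quiet window from the example

def snapshot_build_due(last_update_ts, now=None):
    """Return True when the dataset's update timestamp has been stable
    for the whole window, i.e. metadata construction has settled and
    the snapshot build can be triggered."""
    now = time.time() if now is None else now
    return (now - last_update_ts) >= SNAPSHOT_WINDOW_SECONDS
```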
S302, searching the leaf node storage table and the data block storage table, respectively, for a node identifier matching the first node identifier.
S303, if matching node identifiers are found, establishing a first hierarchical relationship between the attribute information of the first node and the data block information of the first node.
In an example, the metadata management center writes attribute information of the first node and data block information of the first node into a position corresponding to a preset first key field in the initial hierarchical structure based on the initial hierarchical structure in a preset format, and establishes a first hierarchical relationship.
The preset format may be JSON format.
The preset first key field (also referred to as a Key field) may be the attr field.
Illustratively, the metadata management center writes the first node identifier, the leaf name, the inode of the parent directory, the node type, the modification time, the creation time, the file size, and the read-write permission to corresponding positions in an attr field of the initial hierarchy of the JSON format based on the initial hierarchy of the JSON format, and establishes a first hierarchical relationship.
It can be appreciated that, since the Chunk table is associated with the Leaf table and LeafAttributes tables through the inode, the attribute information of the first node and the data block information of the first node can be associated based on the first node identifier of the first node, so as to establish the first hierarchical relationship.
S304, according to a first data block identifier of the data block corresponding to the first node, determining first small file list information included in the data block corresponding to the first node, and adding the first small file list information to the first hierarchical relationship to obtain a second hierarchical relationship.
In one example, a metadata management center obtains a data block attribute storage table, searches for whether a data block identifier matched with a first data block identifier exists in the data block attribute storage table, and obtains first small file list information included in a data block corresponding to the first data block identifier in the data block attribute storage table if the matched data block identifier is found.
The data block attribute storage table is an extension of the Chunk table contents. The data block attribute storage table includes the data block identification of each data block and the small file list information included in each data block. The small file list information comprises the name, size and modification time of each small file in each data block. The small file list information is a series of character strings. If the number of small files contained in one data block is large, the metadata management center compresses the character string of the small file list information into large object (Blob) binary format data through the zlib compression algorithm.
It can be appreciated that the compressed small file list information can reduce the occupied storage space, and is also beneficial to improving the metadata query efficiency.
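A small sketch of the compression step, assuming the small file list is serialized as a JSON string before zlib compression (the serialization format is an assumption; only the use of zlib and Blob storage is given by the description):

```python
import json
import zlib

def pack_file_list(entries):
    """Compress a chunk's small-file list into Blob-ready binary data
    with zlib. Each entry carries the file name, size and modification
    time, matching the small file list information described above."""
    raw = json.dumps(entries, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)

def unpack_file_list(blob):
    """Inverse operation used when the list is read back from the Blob."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Hypothetical file list; repetitive path prefixes compress very well.
entries = [{"name": f"imagenet/train/n10148035/n10148035_{i}.JPEG",
            "size": 4054, "mtime": 1700000000} for i in range(1000)]
blob = pack_file_list(entries)
```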
In one example, the metadata management center writes the first doclet list information into a position corresponding to a preset second key field in the first hierarchical relationship to obtain a second hierarchical relationship.
The preset second key field may be the chunks field.
Illustratively, the metadata management center writes the inode, the data block identifier, the data block name, the data block size, and the small file list information of the data block into the corresponding positions in the chunks field, and establishes a second hierarchical relationship.
It will be appreciated that the ChunkAttributes table is associated with the Chunk table by chunkid, enabling precise tracking of the metadata information of each small file while managing the data blocks.
In an example, as shown in fig. 4, fig. 4 is a schematic diagram of metadata construction for a data block according to an embodiment of the present application. In fig. 4, metadata information of a data block is stored in a tree structure storage table (table 1), a leaf node storage table (table 2), a data block storage table (table 3), and a data block attribute storage table (table 4). The tree structure storage table, the leaf node storage table, the data block storage table, and the data block attribute storage table correspond to tree directory structures (also referred to as tree structures) in the file system under Linux.
All metadata stored in the tree structure storage table, the leaf node storage table, the data block storage table and the data block attribute storage table is stored in a relational database, and the database can be deployed on a high-speed storage medium of a single server node or distributed across the high-speed storage media of a plurality of server nodes.
It can be appreciated that the database is deployed on a high-speed storage medium of the multi-server node, so that metadata loss caused by single-point failure can be avoided, and high concurrency performance and expandability of metadata query are ensured. The metadata management center adopts a database stored in sub-tables, and the query efficiency and the system expandability of the metadata are improved by dispersing different types of metadata into a plurality of storage tables. Each storage table is focused on storing specific metadata information, so that unnecessary data access is reduced, and the query speed is increased. Meanwhile, the structure of sub-table storage enables flexible expansion when the data volume is increased, and complexity of reconstructing the whole database is avoided. In addition, sub-table storage also enhances maintainability and ease of use, helping to reduce the risk of error.
S305, generating a metadata snapshot according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes, and sending the metadata snapshot to the client, so that the client stores the metadata snapshot in a local cache and acquires the metadata snapshot when the training job is started.
Wherein the other nodes are at least one leaf node in the tree structure except the first node.
In one example, the tree directory structure of a metadata snapshot in JSON format is as follows:
cachefs
├── attr (inode: 2, type: "directory", mode: 493, ...)
└── entries
    ├── imagenet
    │   ├── attr (inode: 39051, type: "directory", mode: 493, ...)
    │   └── entries
    │       ├── imagenet-train_100
    │       │   ├── attr (inode: 39152, type: "regular", mode: 420, ...)
    │       │   └── chunks
    │       │       ├── chunkid: 49283
    │       │       └── slices
    │       │           ├── size: 4054528
    │       │           └── files
    │       │               ├── "imagenet/train/n10148035/n10148035_11304.JPEG"
    │       │               ├── "imagenet/train/n10148035/n10148035_11195.JPEG"
    │       │               ├── "imagenet/train/n10148035/n10148035_11268.JPEG"
The tree directory structure is used to represent the hierarchical relationship in the metadata snapshot. In the tree directory structure, starting from the root node (cachefs), the tree directory structure is sequentially expanded to the directory imagenet, then to the aggregated data block chunk imagenet-train_100, and finally, the small file list information of the data block is displayed, for example, "imagenet/train/n10148035/n10148035_11304.JPEG".
In some alternative embodiments, the metadata management center obtains the accessed times of the metadata snapshot in the first time period and the latest updating time of the metadata snapshot, and clears the metadata snapshot when the second time period between the latest updating time and the current time is greater than or equal to a first threshold value and the accessed times are less than or equal to a second threshold value.
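The eviction condition can be sketched as follows; the parameter names and time units are illustrative, the two-threshold logic is as described above:

```python
def should_evict(last_update_ts, access_count, now,
                 max_idle_seconds, min_access_count):
    """Clear a cached metadata snapshot when it has been idle at least
    as long as the first threshold AND was accessed no more often than
    the second threshold during the observation period."""
    idle = now - last_update_ts          # second duration since last update
    return idle >= max_idle_seconds and access_count <= min_access_count
```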
Based on the method shown in fig. 3, the metadata management center can acquire a tree structure storage table, a leaf node storage table and a data block storage table; establish a first hierarchical relationship between the attribute information of a first node and the data block information of the first node when node identifiers matching the first node identifier exist in the leaf node storage table and the data block storage table; determine, based on the first data block identifier of the data block corresponding to the first node, first small file list information included in the data block corresponding to the first node, and add the first small file list information to the first hierarchical relationship to obtain a second hierarchical relationship; generate a metadata snapshot according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes; and send the metadata snapshot to a client.
Because the metadata management center can issue the metadata snapshot of the training data set (a plurality of data blocks) to the computing node where the client is located (the computing node where the application is located), the application can directly read the metadata snapshot from the local cache, the access pressure to the metadata management center is reduced, and the metadata access efficiency is accelerated.
When the training job is started, the client downloads the metadata snapshot of the latest timestamp from the metadata management center to the local storage. The metadata information query operation after the model training operation is started is queried from the locally cached metadata snapshot. Through a metadata snapshot mechanism, high-efficiency metadata access performance can be maintained during large-scale small file data set training, meanwhile, consumption of storage and calculation resources is reduced, and analysis and supply efficiency of a training data set based on data blocks is quickened.
An embodiment of the present application provides a data block processing method applied to a client shown in fig. 1, as shown in fig. 5, fig. 5 is a flow chart of another data block processing method provided in an embodiment of the present application, where the data block processing method includes the following steps:
S501, after the training operation is started, metadata snapshots corresponding to a plurality of data blocks are loaded from a local cache.
In one example, after the training job is started, the client loads metadata snapshots corresponding to the plurality of data blocks from the local cache, and performs preprocessing operations during the metadata snapshot loading process.
The preprocessing operation comprises file integrity checking and format verification. It will be appreciated that the preprocessing operation may ensure that the metadata snapshot is uncorrupted and conforms to a predefined JSON format.
S502, detecting whether the snapshot time stamp of the metadata snapshot is consistent with the latest snapshot time stamp of the metadata management center.
In one example, a metadata snapshot interpreter of a client detects whether a snapshot time stamp of a metadata snapshot is consistent with a latest snapshot time stamp of a metadata management center.
It can be appreciated that, to ensure real-time performance of the metadata snapshot, a timestamp checking mechanism is built in the metadata snapshot interpreter. The metadata snapshot interpreter may compare the snapshot time stamp of the metadata snapshot with the latest snapshot time stamp of the metadata management center.
If the snapshot timestamp of the metadata snapshot is inconsistent with the latest snapshot timestamp, the client downloads the latest metadata snapshot from the metadata management center and updates the local cache, ensuring that the training job acquires the plurality of small files based on the latest metadata snapshot.
If the snapshot timestamp of the metadata snapshot is consistent with the latest snapshot timestamp, the client performs S503.
S503, if yes, reading a plurality of data blocks from the back-end aggregate file system based on the metadata snapshot after the metadata snapshot is loaded.
S504, performing a deaggregation operation on the plurality of data blocks to obtain a plurality of small files.
In some alternative embodiments, the client parses the metadata snapshot to obtain the small file list information and the data block identification of each data block, and reads the plurality of data blocks from the back-end aggregate file system based on the small file list information and the data block identifications.
In an example, a client analyzes a position corresponding to a first key field in a metadata snapshot to obtain a data block identifier and a data block name of each data block in a plurality of data blocks, and stores the plurality of data block identifiers corresponding to the plurality of data blocks in a global data block list, and analyzes a position corresponding to a second key field in the metadata snapshot to obtain small file list information included in each data block in the plurality of data blocks, wherein the small file list information includes the data block identifier of each data block and a small file list included in each data block.
It can be appreciated that in the process of parsing the metadata snapshot, the metadata snapshot interpreter also has a perfect error handling mechanism. When an exception condition such as parsing failure or data inconsistency is encountered, the interpreter triggers error handling logic (e.g., logging or warning prompt) and re-acquires the latest metadata snapshot from the metadata management center for parsing again if necessary.
In one example, a client may establish a first hash table between a data block identification and a plurality of small file names, and a second hash table between a small file name and at least one data block, and a data block name for each of the at least one data block.
The first hash table is also called a hash_files_to_chunk table. The second hash table is also referred to as a hash_files_to_chunkid table.
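A sketch of building both lookup tables from parsed chunk metadata; the input structure is an assumed simplification of the parsed snapshot, and the variable names are illustrative:

```python
def build_hash_tables(chunks):
    """Build the two lookup tables kept by the client:
    chunk id -> list of small-file names (cf. hash_files_to_chunk), and
    small-file name -> (chunk id, chunk name) (cf. hash_files_to_chunkid)."""
    files_in_chunk = {}
    chunk_of_file = {}
    for chunk in chunks:
        files_in_chunk[chunk["chunkid"]] = list(chunk["files"])
        for name in chunk["files"]:
            chunk_of_file[name] = (chunk["chunkid"], chunk["name"])
    return files_in_chunk, chunk_of_file

# Hypothetical parsed snapshot fragment reusing the example chunkid.
chunks = [{"chunkid": 49283, "name": "imagenet-train_100",
           "files": ["a.JPEG", "b.JPEG"]}]
files_in_chunk, chunk_of_file = build_hash_tables(chunks)
```

The second table is the one consulted per mini-batch to map randomized file indexes back to the data blocks that must be read.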
In some alternative embodiments, the client acquires a plurality of file indexes and generates a sub-file list belonging to a first data block based on the plurality of file indexes. The client acquires the total file amount of the plurality of first small files; if the total file amount is smaller than a first threshold, the client creates a data block object of the first data block and acquires the small file header information and the small file header information length of each first small file. The client then obtains the address offset of each first small file in the first data block according to the small file size and the small file header information length of each first small file, acquires the plurality of first small files according to the address offsets, and stores the data block identifier of the first data block and the small file name and address offset of each first small file in a global hash table corresponding to the data block object.
Wherein the plurality of file indexes are indexes of each first small file included in the small file list in a small batch data set required for executing a training program of a training job. A small batch of data set is the training data of a mini-batch.
The sub-file list corresponds to a plurality of first small files. The small file header information includes the size and name of each first small file.
As shown in fig. 6, fig. 6 is a schematic diagram illustrating the reading of mini-batch data from data blocks according to an embodiment of the present application. In fig. 6, the training data parser of the client parses the data of the data block (chunk) into the original small file data after the shuffle, forming mini-batch data, and feeding the mini-batch data to the model training program.
In the initialization stage of small file reading, the metadata snapshot interpreter generates a second hash table for mapping the small file and the data block where the small file is located, wherein the Key Value is a small file name, and the Value is a data block identifier and a data block name. When training program needs training data of a mini-batch, the client side takes out a plurality of file indexes of a new mini-batch from the randomized small file list, maps the file indexes into corresponding data blocks, generates a sub-file list belonging to the same data block, then creates a data block object (chunk object) for each hit data block, and traverses all small file header information in the data block when initializing the data block object.
It will be appreciated that the client records the addressing information (address offset) in a global hash table, where key is the file name and value is the object containing the addressing information. Thus, when the file is accessed later, the position of the file data can be positioned quickly without resolving the header information again. The global addressing operation is performed only once for each data block and the result is stored in a global hash table. Other processes of model training can directly extract the address offset of the small file from the cached chunk when accessing the data block, thereby avoiding repeated addressing operations. This may increase addressing efficiency and reduce unnecessary I/O operations.
Optionally, the training data parser of the client supports batch addressing. When the client creates the chunk object, the client can judge whether the current file name is in the sub-list or not when traversing the data block header information by transmitting the sub-file list parameter. If hit, its addressing information is obtained and an addressing object is generated. When all files in the subfolder list are addressed, the chunk object is initialized without triggering the actual data read operation.
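The addressing pass can be sketched as a single walk over a chunk's layout, assuming each small file is stored as a fixed-length 512-byte header followed immediately by its data (the 512-byte header length comes from the packing description; the absence of data padding is an assumption). The optional `wanted` set mirrors batch addressing over a sub-file list:

```python
HEADER_LEN = 512  # fixed header length per small file (see the packing stage)

def address_chunk(headers, wanted=None):
    """Walk the chunk layout (header + data per file) once and record
    each file's data offset; when `wanted` is given, only files in the
    sub-file list are recorded, mirroring batch addressing."""
    offsets, pos = {}, 0
    for name, size in headers:          # header info: file name and size
        data_off = pos + HEADER_LEN     # data starts right after the header
        if wanted is None or name in wanted:
            offsets[name] = (data_off, size)
        pos = data_off + size           # next header follows this file's data
    return offsets

# Hypothetical chunk layout: three small files with their sizes.
layout = [("a.JPEG", 1000), ("b.JPEG", 2000), ("c.JPEG", 500)]
all_offsets = address_chunk(layout)
batch_offsets = address_chunk(layout, wanted={"b.JPEG"})
```

Caching the returned offsets in the global hash table is what lets later accesses skip header parsing entirely.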
In some optional embodiments, when the client needs to read the first target small file, the client obtains the first target name of the first target small file, reads a first address offset of the first target small file corresponding to the first target name from the global hash table, and loads the target data block including the first target small file into the local cache based on the first address offset.
In some optional embodiments, when the client needs to read a second target small file in the target data block, the client obtains the second target name of the second target small file, reads a second address offset of the second target small file corresponding to the second target name from the global hash table, and reads the second target small file from the target data block stored in the local cache based on the second address offset.
It can be understood that when any small file included in the data block is triggered to be read, the client loads the data of the whole data block into the local cache. If other small files in the data block are hit again in the current mini-batch or the subsequent mini-batch, the corresponding file data is directly read from the local cache without repeated reading from the back-end aggregate file system again. The method is equivalent to introducing a data prefetching mechanism, improves the reading efficiency of small files and reduces the I/O burden of the system.
Throughout the training period, as long as a block of data is loaded into the local cache, subsequent accesses to files within the block of data no longer need to interact with the backend aggregate file system unless the cached block of data is purged or invalidated. In the subsequent epoch, if the local cache hits the target file, the system directly loads data from the local cache, so that the training data supply efficiency is improved.
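A minimal sketch of the whole-chunk caching behavior described above; the class shape and the byte-slicing read interface are illustrative assumptions:

```python
class ChunkCache:
    """Sketch of the whole-chunk prefetch cache: the first read of any
    file in a chunk loads the full chunk into the local cache; later
    reads of sibling files are served from memory."""
    def __init__(self, backend_read):
        self.backend_read = backend_read   # reads one whole chunk from the back end
        self.cache = {}
        self.backend_hits = 0              # counts trips to the back-end file system

    def read_file(self, chunkid, offset, size):
        if chunkid not in self.cache:      # first touch loads the whole chunk
            self.cache[chunkid] = self.backend_read(chunkid)
            self.backend_hits += 1
        return self.cache[chunkid][offset:offset + size]

# Hypothetical chunk: two files, each behind a 512-byte header.
store = {49283: b"H" * 512 + b"file-a" + b"H" * 512 + b"file-b"}
cache = ChunkCache(lambda cid: store[cid])
a = cache.read_file(49283, 512, 6)
b = cache.read_file(49283, 512 + 6 + 512, 6)
```

Both reads come back correct while the back end is touched only once, which is exactly the prefetching effect described above.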
Based on the method shown in fig. 5, after the training job is started, the client loads the metadata snapshots corresponding to a plurality of data blocks from the local cache and detects whether the snapshot timestamp of the metadata snapshot is consistent with the latest snapshot timestamp of the metadata management center; if so, after the metadata snapshot is loaded, the client reads the plurality of data blocks from the back-end aggregate file system based on the metadata snapshot and performs a deaggregation operation on the plurality of data blocks to obtain a plurality of small files.
Localized metadata snapshot caching is suitable for deep learning training jobs because the data set is typically read-only during execution of the training job, during which no metadata snapshot updates due to training data set updates occur. The client can quickly acquire the metadata snapshot from the local cache and read the required metadata information from the metadata snapshot without frequently accessing the metadata management center, so that network overhead is reduced, metadata access efficiency is improved, and the supply efficiency of the small file data set is improved.
Optionally, the client may further aggregate a plurality of small files stored in the global shared file system into a data block for storage in the backend aggregate file system. The method specifically comprises the following stages:
Stage 1: the client scans the dataset directory stored in the remote global shared file system and collects the metadata information of each data block in the dataset directory.
It can be appreciated that the scanning process is performed in units of subdirectories of the dataset directory, avoiding files under different directories being aggregated into the same data block, to reduce the complexity of system implementation. In order to improve the scanning efficiency, a multi-worker concurrent processing strategy can be adopted, and a plurality of scanning workers can be started to work in parallel according to factors such as the scale of a data set, the number of files, the directory structure, the size of a preset data block and the like. Each scanning worker is responsible for scanning a part of files under the sub-directory and sending the collected file information to a scanning channel.
Stage 2: a dynamic channel allocation algorithm is adopted so that each scanning worker sends file information to a plurality of channels in a staggered manner, avoiding channel congestion and data skew.
Specifically, the client maintains a channel list for each scanning worker, and records the use state of each channel. When the scanning worker needs to send file information, a channel which is not used currently is selected from the channel list, and the file information is sent to the channel. After the transmission is completed, the scanning worker updates the use state of the channel and moves it to the end of the channel list. In addition, each channel has a buffer queue to cope with transient data traffic peaks.
It will be appreciated that such dynamic channel allocation algorithms may improve scan efficiency and scan channel utilization.
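The channel selection and rotation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the class and method names are hypothetical, and real channels would carry file information rather than integer ids.

```python
from collections import deque

class ChannelAllocator:
    """Per-worker channel list: pick the first idle channel, then rotate it to the back."""
    def __init__(self, num_channels):
        self.channels = deque(range(num_channels))   # channel ids in preference order
        self.busy = set()                            # channels currently in use

    def acquire(self):
        # Scan in list order for a channel that is not currently in use.
        for ch in self.channels:
            if ch not in self.busy:
                self.busy.add(ch)
                return ch
        raise RuntimeError("all channels busy")

    def release(self, ch):
        # Mark the channel idle and move it to the end of the list,
        # so other channels are preferred on the next send.
        self.busy.discard(ch)
        self.channels.remove(ch)
        self.channels.append(ch)

alloc = ChannelAllocator(3)
first = alloc.acquire()      # channel 0
second = alloc.acquire()     # channel 1 (0 is busy)
alloc.release(first)         # 0 becomes idle and rotates to the back: order is 1, 2, 0
third = alloc.acquire()      # channel 2 is preferred over the just-used channel 0
```

Rotating a released channel to the back is what spreads successive sends evenly across channels and avoids the skew the text describes.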
Stage 3: the aggregation workers receive file information from the scanning channels and aggregate small files into a data block list according to the directory and the preset data block size.
It is understood that the aggregation process only involves metadata information and performs no actual data read or write. Each aggregation worker is responsible for the file information under a portion of the directories and sends the aggregated data block list to a packing channel. Like the scanning channels, the packing channels also use the dynamic channel allocation algorithm to ensure load balancing and efficient channel utilization.
Stage 4: the packing worker receives a data block list from the packing channel, reads the data of the corresponding small files from the remote global shared file system according to the file information in the list, and generates corresponding header information (header) for each small file. The header information is then packed into a data block (chunk) together with the small file data.

In fig. 7, the size of the data block is a pre-configured value (e.g. 4 MB), and the header information (header) of each small file includes the file name (name), file size (size), file modification time (mtime), and file read permission (mode). The header is fixed at 512 bytes in length, and any unused remainder is automatically zero-padded.
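The packing layout above can be sketched as follows. The 512-byte header and its name/size/mtime/mode fields come from the text; the JSON encoding of the header and the overflow handling are assumptions for illustration only.

```python
import io
import json

HEADER_LEN = 512  # fixed header length; the unused remainder is zero-padded

def pack_chunk(files, chunk_limit=4 * 1024 * 1024):
    """Pack (name, data, mtime, mode) small files into one chunk of at most chunk_limit bytes."""
    buf = io.BytesIO()
    offsets = {}
    for name, data, mtime, mode in files:
        header = json.dumps(
            {"name": name, "size": len(data), "mtime": mtime, "mode": mode}
        ).encode()
        assert len(header) <= HEADER_LEN, "header must fit in 512 bytes"
        entry = header.ljust(HEADER_LEN, b"\0") + data   # zero-pad header, then payload
        if buf.tell() + len(entry) > chunk_limit:
            break  # a real packer would start a new chunk here
        offsets[name] = buf.tell()                        # entry's offset inside the chunk
        buf.write(entry)
    return buf.getvalue(), offsets

chunk, offsets = pack_chunk([
    ("a.jpg", b"x" * 100, 1700000000, 0o644),
    ("b.jpg", b"y" * 200, 1700000000, 0o644),
])
```

Each entry occupies 512 header bytes plus its payload, so a file's data always starts 512 bytes past its recorded offset — the property the later read path relies on.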
Stage 5: the packing worker writes the generated data blocks, through the client, into a new data set directory of the back-end aggregate file system deployed on the computing node. The client uses the FUSE file system interface to intercept the write operations and synchronously generate the metadata information of the data blocks, and reports the metadata information to a synchronization channel of the metadata management center in batches.
Stage 6: the synchronization worker receives the metadata information of the data blocks from the synchronization channel and writes it into the database of the metadata management center in batches. The aggregated file list of each data block is compressed before being written to the database, to improve efficiency and reduce storage space usage.
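The batch-plus-compression step can be sketched as follows. The patent does not name a codec or record layout, so zlib over a JSON-encoded file list and the list-backed `db` stand-in are assumptions for illustration.

```python
import json
import zlib

def compress_file_list(file_list):
    """Compress a chunk's aggregated file list before it is written to the database."""
    return zlib.compress(json.dumps(file_list).encode())

def decompress_file_list(blob):
    return json.loads(zlib.decompress(blob).decode())

def write_batch(db, records, batch_size=100):
    """Write metadata records in batches to reduce round trips to the database."""
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        db.extend(batch)   # stand-in for one bulk INSERT per batch
    return len(records)

db = []
file_list = [{"name": f"img_{i}.jpg", "size": 1024} for i in range(1000)]
blob = compress_file_list(file_list)
written = write_batch(db, [{"chunk_id": 1, "files": blob}])
```

File lists are highly repetitive (shared prefixes, similar sizes), so they compress well; the trade is a small CPU cost on write for less storage and less data moved on every later metadata read.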
It can be understood that, first, the client uses multiple workers to concurrently process the tasks of each stage, improving processing efficiency. Second, the dynamic channel allocation algorithm lets each worker use the channels evenly and in rotation, avoiding channel congestion and data skew and improving channel utilization. Third, the batch processing mechanism reduces the number of interactions with the metadata center database and lowers its load. Finally, metadata compression reduces metadata storage space and the volume of transmitted data, improving metadata read efficiency.
Alternatively, through the small file aggregation operation, the client may convert the original small file data set stored in a tree directory structure into a flat, data-block-based layout in the back-end aggregate file system. Correspondingly, the client can also reconstruct the directory structure. The directory structure rebuild function re-parses the data blocks stored in the back-end aggregate file system into the same small file data set directory structure as originally stored in the global shared file system, providing the application with a consistent view of the data set directory structure.
Specifically, the directory structure reconstruction is dynamically generated based on the metadata of the data blocks. The client acquires metadata information from the Leaf table, LeafAttributes table, Chunk table, and ChunkAttributes table of the metadata management center, and dynamically restores the tree directory structure of the small files. For example, the tree hierarchy of the directory is first restored from the Leaf table, and the attribute information of each leaf node is acquired from the LeafAttributes table. For a directory containing data blocks, the system uses the inode in an association query over the Leaf table and the Chunk table to obtain the data block identifiers of all aggregated data blocks under the directory, and then looks up the aggregated small file list from the ChunkAttributes table by data block identifier. Using the relative path information preserved in the data block, the small file list restores the directory hierarchy level by level; each level of directory is added to the corresponding leaf node of the tree linked list, constructing a tree directory structure, and the tree linked list structure is returned in the standard format of the Linux file system, completing the reconstruction of the directory structure.
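The level-by-level restoration from preserved relative paths can be sketched as follows. This is a simplified stand-in for the Leaf/Chunk table queries: a nested dict plays the role of the tree linked list, and the chunk ids are hypothetical.

```python
def rebuild_tree(chunk_file_lists):
    """Rebuild a tree directory structure from the relative paths preserved in each chunk.

    chunk_file_lists maps a chunk id to its aggregated small file list;
    directories become nested dicts, files become leaves holding the owning chunk id.
    """
    root = {}
    for chunk_id, paths in chunk_file_lists.items():
        for rel_path in paths:
            parts = rel_path.split("/")
            node = root
            for d in parts[:-1]:          # walk/create each directory level
                node = node.setdefault(d, {})
            node[parts[-1]] = chunk_id    # leaf: file -> owning chunk
    return root

tree = rebuild_tree({
    "chunk-001": ["train/cat/1.jpg", "train/cat/2.jpg"],
    "chunk-002": ["train/dog/1.jpg"],
})
```

Because only paths and chunk ids are consulted, the view can be rebuilt entirely from metadata, without touching the packed data blocks themselves.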
To effectively avoid data inconsistency between the original small file data set in the global shared file system and the data blocks stored in the back-end aggregate file system, the client can execute a full small file aggregation operation on the data set used for training before a new training job is started. This full-update method is simple and reliable and ensures that every training job uses the latest data set.
However, if the new data set differs only partially from the old one, a full update would repeat a large number of small file aggregation operations and waste time. In this case, before a new training job starts, the client can first detect whether the data set used for training has already been cached in the back-end aggregate file system. If a data set with the same name is cached, modification times are compared with the directory as the minimum granularity. If the modification time of a directory has changed, i.e. the data in that directory has been modified, the small file aggregation operation is re-executed on that directory. Conversely, if the modification time of a directory is unchanged, the directory does not need to be re-aggregated. By updating only the changed portions, a full update of the entire data set is avoided, significantly improving efficiency.
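The per-directory comparison can be sketched as follows; the dict-of-mtimes representation is an assumption (in practice the cached mtimes would come from the metadata management center and the current ones from the global shared file system).

```python
def dirs_to_reaggregate(old_mtimes, new_mtimes):
    """Compare per-directory modification times and return only the directories
    that must be re-aggregated (changed or newly added)."""
    changed = []
    for d, mtime in new_mtimes.items():
        if old_mtimes.get(d) != mtime:   # new directory, or its mtime changed
            changed.append(d)
    return sorted(changed)

cached = {"train/cat": 100, "train/dog": 100}
current = {"train/cat": 100, "train/dog": 120, "train/bird": 130}
todo = dirs_to_reaggregate(cached, current)   # only dog (modified) and bird (new)
```

Directories absent from `new_mtimes` (deleted from the source data set) are not re-aggregated here; their stale blocks would instead be handled by the cache cleanup described below.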
Furthermore, the client can also perform cache data management. Cache data management includes cleanup of obsolete data, garbage data, and invalid data, ensuring that the data in the metadata management center and the back-end data block store remains valid at all times. First, the target small file or small file data set is marked "deleted" in the metadata. If all files within a data block are marked "deleted", the data block is also marked "deleted". When an entire data set is marked "deleted", all data blocks under the data set are uniformly marked "deleted". In this way, data blocks that are no longer referenced can be effectively tracked, providing a basis for the subsequent garbage collection operation.
The garbage data cleanup mechanism relies on periodically running garbage collection tasks. The client periodically scans the metadata to identify data blocks that are not referenced by any file and have been marked "deleted", marks these data blocks as recyclable, and deletes the corresponding records from the metadata management center once it confirms they are no longer needed. To ensure the consistency of delete operations, the related metadata and data blocks are locked during deletion, preventing other operations from interfering with the deletion flow. The locking mechanism keeps delete operations safe and reliable in a concurrent environment and avoids the risks of inconsistent data or accidental loss.
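The mark-then-collect flow can be sketched as follows. The in-memory `meta` dict is a stand-in for the metadata database, and the locking described in the text is elided; all names are illustrative.

```python
def mark_deleted(meta, deleted_files):
    """Mark files as deleted; a chunk whose files are all deleted is itself deleted."""
    for f in deleted_files:
        meta["files"][f]["deleted"] = True
    for chunk_id, files in meta["chunks"].items():
        if all(meta["files"][f]["deleted"] for f in files):
            meta["deleted_chunks"].add(chunk_id)

def collect_garbage(meta):
    """Periodic GC task: reclaim chunks marked deleted and drop their metadata records."""
    reclaimed = sorted(meta["deleted_chunks"])
    for chunk_id in reclaimed:
        for f in meta["chunks"].pop(chunk_id):   # remove the chunk's file records
            meta["files"].pop(f)
    meta["deleted_chunks"].clear()
    return reclaimed

meta = {
    "files": {"a": {"deleted": False}, "b": {"deleted": False}, "c": {"deleted": False}},
    "chunks": {"chunk-1": ["a", "b"], "chunk-2": ["c"]},
    "deleted_chunks": set(),
}
mark_deleted(meta, ["a", "b"])      # every file in chunk-1 is now deleted
reclaimed = collect_garbage(meta)   # chunk-1 is reclaimed; chunk-2 survives
```

Splitting marking from collection is what lets deletion stay cheap on the request path while the heavier reclamation runs as a background task.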
From the description of the above embodiments, it will be clear to a person skilled in the art that the method of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware; in many cases, the former is the preferred embodiment.
The embodiment of the application also provides a data block processing system which comprises a metadata management center and a client.
The metadata management center is used for acquiring a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores a first node identifier of a first node, the leaf node storage table comprises attribute information and node identifiers of all leaf nodes in a tree structure corresponding to a plurality of data blocks, the data block storage table comprises data block information, data block identifiers and node identifiers of all the leaf nodes of all the data blocks stored in all the leaf nodes, the first node is a root node or any leaf node in the tree structure, and the data blocks are obtained after a plurality of small files required by training operation are aggregated.
The metadata management center is further configured to search for node identifiers that match the first node identifier in the leaf node storage table and the data block storage table, if the node identifiers that match the first node identifier are found, establish a first hierarchical relationship between the first node and attribute information of the first node and data block information of the first node, determine, according to the first data block identifier of the data block corresponding to the first node, first doclet list information included in the data block corresponding to the first node, and add the first doclet list information to the first hierarchical relationship to obtain a second hierarchical relationship, generate a metadata snapshot according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes, and send the metadata snapshot to the client, so that the client stores the metadata snapshot in a local cache and obtains the metadata snapshot when training operation is started, and the other nodes are at least one leaf node except the first node in a tree structure.
The client is used for loading metadata snapshots corresponding to a plurality of data blocks from a local cache after a training operation is started, wherein the metadata snapshots are generated based on a tree structure storage table, a leaf node storage table and a data block storage table, the tree structure storage table stores a first node identifier of a first node, the leaf node storage table comprises attribute information and node identifiers of each leaf node in the tree structure, the data block storage table comprises data block information, data block identifiers and node identifiers of each data block stored by each of the plurality of leaf nodes, the first node is a root node or any leaf node in the tree structure, and the plurality of data blocks are obtained after the plurality of small files required by the training operation are aggregated; the client detects whether the snapshot time stamp of the metadata snapshot is consistent with the latest snapshot time stamp of the metadata management center, and if so, after the loading of the metadata snapshot is completed, reads the plurality of data blocks from a back-end aggregate file system based on the metadata snapshot and performs a deaggregation operation on the plurality of data blocks to obtain the plurality of small files.
The description of the features in the embodiments corresponding to the data block processing system may refer to the related description of the embodiments corresponding to the data block processing method, which is not described in detail herein.
The embodiment of the application also provides a data block processing device. As shown in fig. 8, fig. 8 is a structural block diagram of the data block processing device provided by the embodiment of the application. The device comprises a first transceiver module 801, configured to acquire a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores a first node identifier of a first node, the leaf node storage table comprises attribute information and node identifiers of each leaf node in a tree structure corresponding to a plurality of data blocks, the data block storage table comprises data block information, data block identifiers and node identifiers of each data block stored by each of a plurality of leaf nodes, the plurality of data blocks are obtained by aggregating a plurality of small files required by a training operation, and the first node is a root node or any leaf node in the tree structure.
The first processing module 802 is configured to search for node identifiers that match the first node identifier in the leaf node storage table and the data block storage table, if the node identifiers that match the first node identifier are found, establish a first hierarchical relationship between the first node and attribute information of the first node and data block information of the first node, determine, according to the first data block identifier of the data block corresponding to the first node, first doclet list information included in the data block corresponding to the first node, and add the first doclet list information to the first hierarchical relationship to obtain a second hierarchical relationship, generate a metadata snapshot according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes, and send the metadata snapshot to the client, so that the client stores the metadata snapshot in a local cache and obtains the metadata snapshot when a training operation is started, where the other nodes are at least one leaf node except the first node in a tree structure.
In some optional embodiments, the first processing module 802 is specifically configured to obtain a data block attribute storage table, where the data block attribute storage table includes a data block identifier of each data block and doclet list information included in each data block, search for whether a data block identifier matching the first data block identifier exists in the data block attribute storage table, and if the data block identifier matching the first data block identifier is found, obtain first doclet list information included in the data block corresponding to the first data block identifier in the data block attribute storage table.
In some optional embodiments, the first processing module 802 is further specifically configured to write, based on the initial hierarchy in the preset format, attribute information of the first node and data block information of the first node into a position corresponding to a preset first key field in the initial hierarchy, so as to establish a first hierarchical relationship.
In some optional embodiments, the first processing module 802 is further specifically configured to write the first doclet list information into a position corresponding to the preset second key field in the first hierarchical relationship, so as to obtain the second hierarchical relationship.
In some optional embodiments, the first transceiver module 801 is further configured to obtain the number of times the metadata snapshot is accessed in the first period of time and the latest update time of the metadata snapshot, and the first processing module 802 is further configured to clear the metadata snapshot if the second period of time between the latest update time and the current time is equal to or greater than a first threshold and the number of times the metadata snapshot is accessed is equal to or less than a second threshold.
The embodiment of the application also provides a data block processing device, as shown in fig. 9, fig. 9 is a structural block diagram of the data block processing device provided by the embodiment of the application, where the device includes:
The second transceiver module 901 is configured to load metadata snapshots corresponding to a plurality of data blocks from a local cache after a training operation is started, where the metadata snapshots are generated based on a tree structure storage table, a leaf node storage table and a data block storage table, the tree structure storage table stores a first node identifier of a first node, the leaf node storage table includes attribute information of each leaf node in the tree structure and each node identifier, the data block storage table includes data block information of each data block stored in each leaf node in the plurality of leaf nodes, a data block identifier, and a node identifier of each leaf node in the plurality of leaf nodes, the first node is a root node or any leaf node in the tree structure, and the plurality of data blocks are obtained after a plurality of small files required by the training operation are aggregated.
And the second processing module 902 is configured to detect whether a snapshot time stamp of the metadata snapshot is consistent with a latest snapshot time stamp of the metadata management center, if so, read a plurality of data blocks from the back-end aggregate file system based on the metadata snapshot after the metadata snapshot is loaded, and perform a deaggregation operation on the plurality of data blocks to obtain a plurality of small files.
In some alternative embodiments, the second processing module 902 is specifically configured to parse the metadata snapshot to obtain the doclet list information and the data block identifier of each data block, and read the plurality of data blocks from the back-end aggregate file system based on the doclet list information and the data block identifier.
In some optional embodiments, the second processing module 902 is further specifically configured to parse from a position corresponding to the preset first key field in the metadata snapshot to obtain a data block identifier and a data block name of each of the plurality of data blocks, and store the plurality of data block identifiers corresponding to the plurality of data blocks in the global data block list, and parse from a position corresponding to the preset second key field in the metadata snapshot to obtain small file list information included in each of the plurality of data blocks, where the small file list information includes the data block identifier of each of the plurality of data blocks and a small file list included in each of the plurality of data blocks.
In some optional embodiments, the second transceiver module 901 is further configured to obtain a plurality of file indexes, where the plurality of file indexes are indexes of each first small file in a small file list included in a small batch data set required by a training program of the training job; the second processing module 902 is further configured to generate a sub-file list belonging to a first data block based on the plurality of file indexes, where the sub-file list corresponds to a plurality of first small files; the second transceiver module 901 is further configured to obtain the total file amount of the plurality of first small files; the second processing module 902 is further configured to, if the total file amount is less than a first threshold, create a data block object of the first data block and obtain the small file header information and small file header information length of each first small file, where the small file header information includes the small file size and small file name of each first small file, obtain the address offset of each first small file in the first data block according to the small file size and small file header information length of each first small file, acquire the plurality of first small files according to the address offsets, and store the data block identifier of the first data block, the small file name of each first small file, and the address offset in a global hash table corresponding to the data block object.
In some optional embodiments, the second transceiver module 901 is further configured to, when the first target small file needs to be read, obtain a first target name of the first target small file, read a first address offset of the first target small file corresponding to the first target name from the global hash table, and the second processing module 902 is further configured to load, based on the first address offset, a target data block including the first target small file into the local cache.
In some optional embodiments, the second transceiver module 901 is further configured to obtain a second target name of the second target small file when the second target small file in the target data block needs to be read, and read a second address offset of the second target small file corresponding to the second target name from the global hash table; the second processing module 902 is further configured to read the second target small file from the target data block stored in the local cache based on the second address offset.
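The address offset bookkeeping used by these modules can be sketched as follows, assuming the fixed 512-byte per-file header from fig. 7; the table layout keyed by (chunk id, file name) is an assumption for illustration.

```python
HEADER_LEN = 512  # fixed small-file header length inside a data block

def build_offset_table(chunk_id, file_sizes):
    """Compute each small file's address offset inside the data block:
    entries are laid out as [512-byte header][data] back to back."""
    table, offset = {}, 0
    for name, size in file_sizes:
        table[(chunk_id, name)] = offset
        offset += HEADER_LEN + size   # next entry starts after header + payload
    return table

global_hash = build_offset_table("chunk-1", [("a.jpg", 100), ("b.jpg", 200)])

def read_small_file(chunk_bytes, chunk_id, name, size):
    """Read one small file out of a cached data block using the global hash table."""
    off = global_hash[(chunk_id, name)]
    return chunk_bytes[off + HEADER_LEN : off + HEADER_LEN + size]

# A toy chunk matching the layout above: zeroed headers, then the payloads.
chunk = b"\0" * HEADER_LEN + b"x" * 100 + b"\0" * HEADER_LEN + b"y" * 200
data = read_small_file(chunk, "chunk-1", "b.jpg", 200)
```

Because offsets are precomputed from header lengths and file sizes, a later read is a single slice of the cached block, with no scan over the other small files.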
The description of the features in the embodiment corresponding to the data block processing apparatus may refer to the related description of the embodiment corresponding to the data block processing method, which is not described in detail herein.
The embodiment of the application also provides an electronic device, as shown in fig. 10, fig. 10 is a schematic hardware structure of the electronic device provided by the embodiment of the application, where the electronic device includes a processor 10 and a memory 20, and the memory 20 stores a computer program, and the processor 10 is configured to execute the computer program to perform steps in any of the embodiments of the data block processing method.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the data block processing method embodiments described above when run.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium in which a computer program can be stored.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the data block processing method embodiments described above.
Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the data block processing method embodiments described above.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method, the system, the equipment, the storage medium and the product for processing the data block provided by the application are described in detail. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

Claims (15)

1. A data block processing method, applied to a metadata management center, the method comprising:
Acquiring a tree structure storage table, a leaf node storage table and a data block storage table, wherein the tree structure storage table stores a first node identifier of a first node, the leaf node storage table comprises attribute information of each leaf node and each node identifier in a tree structure corresponding to a plurality of data blocks, the data block storage table comprises data block information, data block identifiers and node identifiers of each leaf node, stored by each leaf node, of the plurality of leaf nodes, the first node is a root node or any leaf node in the tree structure, and the plurality of data blocks are obtained by aggregating a plurality of small files required by training operation;
searching whether node identifiers matched with the first node identifier exist in the leaf node storage table and the data block storage table respectively;
If the matched node identifiers are found, a first hierarchical relationship between the first node and attribute information of the first node and data block information of the first node is established;
Determining first small file list information included in the data block corresponding to the first node according to a first data block identifier of the data block corresponding to the first node, and adding the first small file list information to the first hierarchical relationship to obtain a second hierarchical relationship;
generating a metadata snapshot according to the second hierarchical relationship and at least one third hierarchical relationship of other nodes, and sending the metadata snapshot to a client, so that the client can store the metadata snapshot in a local cache and acquire the metadata snapshot when the training operation is started, wherein the other nodes are at least one leaf node except the first node in the tree structure.
2. The method according to claim 1, wherein the determining, according to the first data block identifier of the data block corresponding to the first node, the doclet list information included in the data block corresponding to the first node includes:
Acquiring a data block attribute storage table, wherein the data block attribute storage table comprises a data block identifier of each data block and small file list information included in each data block;
Searching whether a data block identifier matched with the first data block identifier exists in the data block attribute storage table;
And if the matched data block identifier is found, acquiring the first small file list information included in the data block corresponding to the first data block identifier from the data block attribute storage table.
3. The method according to claim 2, wherein the establishing a first hierarchical relationship between the first node and attribute information of the first node, and data block information of the first node includes:
based on an initial hierarchical structure in a preset format, writing the attribute information of the first node and the data block information of the first node into a position corresponding to a preset first key field in the initial hierarchical structure, and establishing the first hierarchical relationship.
4. The method of claim 3, wherein adding the first doclet list information to the first hierarchical relationship results in a second hierarchical relationship, comprising:
And writing the first small file list information into a position corresponding to a preset second key field in the first hierarchical relationship to obtain the second hierarchical relationship.
5. The method according to claim 4, wherein the method further comprises:
acquiring the accessed times of the metadata snapshot in a first time period and the latest updating time of the metadata snapshot;
And clearing the metadata snapshot under the condition that a second time period between the latest updating time and the current time is larger than or equal to a first threshold value and the accessed times are smaller than or equal to a second threshold value.
6. A method for processing a data block, the method being applied to a client, the method comprising:
When a training operation is started, loading metadata snapshots corresponding to a plurality of data blocks from a local cache, wherein the metadata snapshots are generated based on a tree structure storage table, a leaf node storage table and a data block storage table, the tree structure storage table stores a first node identifier of a first node, the leaf node storage table comprises attribute information and node identifiers of each leaf node in the tree structure, the data block storage table comprises data block information, data block identifiers and node identifiers of each data block stored by each leaf node in the plurality of leaf nodes, the first node is a root node or any leaf node in the tree structure, and the plurality of data blocks are obtained by aggregating a plurality of small files required by the training operation;
Detecting whether the snapshot time stamp of the metadata snapshot is consistent with the latest snapshot time stamp of the metadata management center;
If yes, reading the plurality of data blocks from a back-end aggregate file system based on the metadata snapshot after the metadata snapshot is loaded;
And carrying out depolymerization operation on the plurality of data blocks to obtain the plurality of small files.
7. The method of claim 6, wherein the reading the plurality of data blocks from the back-end aggregate file system based on the metadata snapshot after the metadata snapshot loading is completed comprises:
analyzing the metadata snapshot to obtain small file list information and the data block identification of each data block;
and reading the plurality of data blocks from a back-end aggregate file system based on the doclet list information and the data block identification.
8. The method of claim 7, wherein said parsing said metadata snapshot to obtain doclet list information and said data block identification that make up each of said data blocks, comprises:
Parsing, from a position corresponding to a preset first key field in the metadata snapshot, a data block identifier and a data block name of each of the plurality of data blocks, and storing the plurality of data block identifiers corresponding to the plurality of data blocks in a global data block list;
And parsing, from a position corresponding to a preset second key field in the metadata snapshot, the small file list information included in each of the plurality of data blocks, wherein the small file list information comprises the data block identifier of each data block and a small file list included in each data block.
9. The method of claim 8, wherein the method further comprises:
Acquiring a plurality of file indexes, wherein the file indexes are indexes, in the small file list, of the first small files included in a mini-batch data set required by a training program for executing the training operation;
generating, based on the plurality of file indexes, a sub-list of small files belonging to a first data block, wherein the sub-list corresponds to a plurality of first small files;
acquiring the total file amount of the plurality of first small files;
If the total file amount is smaller than a first threshold value, creating a data block object of the first data block, and acquiring small file header information and small file header information length of each first small file, wherein the small file header information comprises small file size and small file name of each first small file;
obtaining the address offset of each first small file in the first data block according to the small file size and the small file header information length of each first small file;
And acquiring the plurality of first small files according to the address offset, and storing the data block identification of the first data block, the small file name of each first small file and the address offset in a global hash table corresponding to the data block object.
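A minimal sketch of the address-offset computation in claim 9. The storage layout assumed here — each small file stored back to back in the data block as a [header][payload] record, with the recorded offset pointing at the payload — is an assumption for illustration; the claim only requires that offsets be derived from the small file sizes and header information lengths.

```python
from typing import NamedTuple

class FileMeta(NamedTuple):
    name: str
    size: int        # small file (payload) size in bytes
    header_len: int  # length of the per-file header information

def build_offset_table(block_id: str, metas: list[FileMeta]) -> dict:
    """Walk the assumed [header][payload] records in order and record,
    per (block id, file name), the byte offset of each file's payload."""
    table = {}
    cursor = 0
    for m in metas:
        payload_offset = cursor + m.header_len   # skip this file's header
        table[(block_id, m.name)] = payload_offset
        cursor = payload_offset + m.size         # advance past the payload
    return table

metas = [FileMeta("a.jpg", 100, 16), FileMeta("b.jpg", 250, 16)]
offsets = build_offset_table("blk-001", metas)
print(offsets[("blk-001", "a.jpg")])  # 16
print(offsets[("blk-001", "b.jpg")])  # 132
```

The resulting table corresponds to the global hash table of claim 9, keyed by data block identification and small file name.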
10. The method according to claim 9, wherein the method further comprises:
when a first target small file needs to be read, acquiring a first target name of the first target small file, and reading a first address offset of the first target small file corresponding to the first target name from the global hash table;
And loading a target data block comprising the first target small file into a local cache based on the first address offset.
11. The method according to claim 10, wherein the method further comprises:
when a second target small file in the target data block needs to be read, acquiring a second target name of the second target small file, and reading a second address offset of the second target small file corresponding to the second target name from the global hash table;
and reading the second target small file from the target data block stored in the local cache based on the second address offset.
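The read paths of claims 10 and 11 can be sketched together. The in-memory stand-ins below for the global hash table, the back-end block store, and the local cache are illustrative assumptions, as is storing the payload size alongside the offset: the first read of a block loads the whole block into the local cache, and subsequent reads of files in that block slice the cached bytes at their stored offsets.

```python
# Hypothetical block contents: 16 header bytes, payload b"AAAA",
# 16 header bytes, payload b"BB".
BLOCK_STORE = {"blk-001": b"\x00" * 16 + b"AAAA" + b"\x00" * 16 + b"BB"}
# Global hash table: small file name -> (block id, address offset, size).
HASH_TABLE = {"a.jpg": ("blk-001", 16, 4), "b.jpg": ("blk-001", 36, 2)}
LOCAL_CACHE: dict[str, bytes] = {}

def read_small_file(name: str) -> bytes:
    block_id, offset, size = HASH_TABLE[name]
    if block_id not in LOCAL_CACHE:
        # Claim 10 path: first access loads the whole target data block
        # into the local cache.
        LOCAL_CACHE[block_id] = BLOCK_STORE[block_id]
    # Claim 11 path: later accesses hit the cached block and slice
    # the small file out at its address offset.
    return LOCAL_CACHE[block_id][offset:offset + size]

print(read_small_file("a.jpg"))  # b'AAAA'
print(read_small_file("b.jpg"))  # b'BB'
```

This caching pattern is what lets many small-file reads in a mini-batch be served by one block fetch from the back-end aggregate file system.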
12. A data block processing system, wherein the data block processing system comprises a metadata management center and a client;
the metadata management center is configured to perform the data block processing method of any one of claims 1 to 5;
the client is configured to perform the data block processing method of any one of claims 6 to 11.
13. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the data block processing method according to any one of claims 1 to 11 when executing the computer program.
14. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, wherein the computer program, when executed by a processor, implements the steps of the data block processing method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the data block processing method according to any one of claims 1 to 11.
CN202510332559.2A 2025-03-20 2025-03-20 Data block processing method, system, equipment, storage medium and product Active CN119848050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510332559.2A CN119848050B (en) 2025-03-20 2025-03-20 Data block processing method, system, equipment, storage medium and product

Publications (2)

Publication Number Publication Date
CN119848050A CN119848050A (en) 2025-04-18
CN119848050B true CN119848050B (en) 2025-06-20

Family

ID=95358453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510332559.2A Active CN119848050B (en) 2025-03-20 2025-03-20 Data block processing method, system, equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN119848050B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8818951B1 (en) * 2011-12-29 2014-08-26 Emc Corporation Distributed file system having separate data and metadata and providing a consistent snapshot thereof
CN113377292A (en) * 2021-07-02 2021-09-10 北京青云科技股份有限公司 Single machine storage engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant