
CN113656363A - A data deduplication method, system, device and storage medium based on HDFS - Google Patents

A data deduplication method, system, device and storage medium based on HDFS Download PDF

Info

Publication number
CN113656363A
CN113656363A (application CN202110805256.XA)
Authority
CN
China
Prior art keywords
file
array
key
block
hdfs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110805256.XA
Other languages
Chinese (zh)
Other versions
CN113656363B (en)
Inventor
尹明俊
常洪耀
潘利杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Jinan data Technology Co ltd
Original Assignee
Inspur Jinan data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Jinan data Technology Co ltd filed Critical Inspur Jinan data Technology Co ltd
Priority to CN202110805256.XA priority Critical patent/CN113656363B/en
Publication of CN113656363A publication Critical patent/CN113656363A/en
Application granted granted Critical
Publication of CN113656363B publication Critical patent/CN113656363B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10: File systems; File servers
    • G06F 16/17: Details of further file system functions
    • G06F 16/174: Redundancy elimination performed by the file system
    • G06F 16/1748: De-duplication implemented within the file system, e.g. based on file segments
    • G06F 16/1752: De-duplication implemented within the file system based on file chunks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10: File systems; File servers
    • G06F 16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F 16/162: Delete operations
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10: File systems; File servers
    • G06F 16/18: File system types
    • G06F 16/182: Distributed file systems
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract



The present invention provides a data deduplication method, system, device and storage medium based on HDFS. The method includes: calculating the identification value of each file block under a working node to be deduplicated, forming a key-value pair from the file's full path and the identification value, aggregating the key-value pairs into a first mapping array, and sending it to the namespace; in the namespace, aggregating the mapping arrays a second time to obtain a second mapping array; and, if the identification value of a key-value pair in the second mapping array already exists in the mapping table, obtaining the block index information corresponding to that key-value pair and, according to the block index information, replacing the information of the duplicate block with the block information registered for that identification value. Based on this method, an HDFS-based data deduplication system, device and storage medium are also proposed. The invention implements data deduplication as a service inside HDFS, eliminates external dependencies, and can effectively reduce storage cost.


Description

Data deduplication method, system, equipment and storage medium based on HDFS
Technical Field
The invention belongs to the technical field of data deduplication, and particularly relates to a data deduplication method, system, device and storage medium based on an HDFS.
Background
HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware. It shares many similarities with existing distributed file systems, but also differs from them in clear ways: HDFS is highly fault tolerant, is designed to be deployed on inexpensive machines, and provides high-throughput data access, which makes it well suited to applications with very large data sets. HDFS relaxes some POSIX constraints to enable streaming access to file system data. Originally developed as infrastructure for the Apache Nutch search engine project, HDFS is now part of the Apache Hadoop Core project. In a large-scale production environment, storage cost is one of the key metrics when sizing an HDFS cluster, and it directly determines the upper limit of the cluster's total storage. To reduce storage cost, the prior art has made many efforts in data optimization and storage strategy. For data optimization, existing data can be compressed with a compression algorithm to reduce file size and thereby storage cost. For storage strategy, converting the conventional three-replica policy into an erasure-coding policy greatly reduces the space occupied by redundant data while still guaranteeing file availability, again reducing storage cost.
However, the above methods all need to change the original file structure, consume a lot of CPU resources of the cluster, and have poor support for upper layer applications.
Data deduplication essentially deletes duplicate data; it focuses on duplication within files and is, in a certain sense, also a data compression technique. In HDFS, a file is composed of blocks, so deduplication can be achieved by comparing file blocks for similarities and differences. In a production environment, besides raw data, HDFS also stores a large amount of multi-level data derived from that raw data; this derived data contains much repetition and can be optimized through deduplication. The existing HDFS deduplication technology mainly uses a MapReduce task to store duplicate-block information, in the form of data fingerprints, in an external database; it deletes duplicate file blocks according to the fingerprint information and uses the external database to obtain the indexes of deleted blocks so that they can still be read normally. This approach has the following defects: the deduplication-checking stage depends on external computing resources such as MapReduce, which is unfavorable for internal storage management and occupies computing resources; after duplicate blocks are deleted, the block index is maintained by an external database, which affects data-reading performance; and only existing data is processed, with no reasonable logic or method for handling data that is later added or deleted.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an HDFS-based data deduplication method, system, device and storage medium that implement data deduplication as a service inside HDFS, avoid external dependence, and effectively reduce storage cost.
In order to achieve the purpose, the invention adopts the following technical scheme:
a data deduplication method based on an HDFS comprises the following steps:
calculating an identification value of a file block under a to-be-deduplicated working node, forming a key value pair by a file full path and the identification value, then aggregating the key value pair into a first mapping array and sending the first mapping array to a namespace of the HDFS;
in the name space of the HDFS, secondarily aggregating the mapping array to obtain a second mapping array;
and if the identification value of the key value pair in the second mapping array exists in the mapping table, acquiring block index information corresponding to the key value pair in the second mapping array, and replacing the block information corresponding to the repeated identification value block with the block information corresponding to the identification value of the key value pair in the second mapping array according to the block index information.
Further, the method for calculating the identification value of a file block under the working node to be deduplicated is as follows: a file block of any size generates a 256-bit hash value, which is then represented as a 64-character hexadecimal string.
Further, the calculating an identification value of a file block under a node to be deduplicated, aggregating a full path of the file and the identification value into a mapping array, and sending the mapping array to the namespace of the HDFS further includes: the work node periodically receives the deduplication tasks distributed in the form of heartbeats and sends the mapping array to the namespace of the HDFS in the form of heartbeats.
Further, the process of secondarily aggregating the mapping arrays is as follows:
the key-value pair array reported by each node serves as an initial array; if the aggregated array is empty, the key-value pairs of any one initial array are designated as the aggregated array; otherwise, the key-value pairs of each initial array are fed into the aggregated array one by one;
for each pair, the aggregated array is queried for an identical key-value pair: if one exists, the insertion fails and the pair is removed from the initial array; if not, the pair is inserted into the aggregated array and removed from the initial array.
Further, the method further comprises: and if the identification value of the key-value pair in the second mapping array does not exist in the mapping table, directly inserting the key-value pair into the mapping table.
Further, the method further comprises:
receiving an instruction for modifying a file;
querying the mapping table in the key-value pair database for all block identification values of the file to be modified, and then checking in the mapping table, one by one, whether the files corresponding to those block identification values are the file to be modified; if all queried files are the file to be modified, the modification operation is executed;
if any queried file is not the file to be modified, first copying the blocks whose identification values are duplicated by the file to be modified and updating the copied block index information to the namespace of the HDFS; then continuing to modify the file to be modified until the modification is complete, and regenerating the file's key-value pairs of path and block identification value and inserting them into the mapping table.
Further, the method further comprises:
receiving an instruction for deleting a file;
querying the mapping table in the key-value pair database for all block identification values of the file to be deleted, and then checking in the mapping table, one by one, whether the files corresponding to those block identification values are the file to be deleted; if all queried files are the file to be deleted, the deletion operation is executed;
if any queried file is not the file to be deleted, deleting only the index information of the file to be deleted in the namespace of the HDFS and its key-value pair information in the mapping table.
The invention also provides a data deduplication system based on the HDFS, which comprises a file identification module, a deduplication task module, a duplicate checking module and an index updating module;
the file identification module is used for calculating an identification value of a file block under a to-be-deduplicated working node, forming a key value pair by a file full path and the identification value, then aggregating the key value pair into a first mapping array and sending the first mapping array to a namespace of the HDFS;
the duplication removal task module is used for aggregating the mapping array for the second time in the name space of the HDFS to obtain a second mapping array;
the duplicate-checking module is used for querying whether the identification value of a key-value pair in the second mapping array exists in the mapping table; if not, the key-value pair is directly inserted into the mapping table; if so, file deduplication is performed;
the index updating module is used for executing file deduplication, acquiring block index information corresponding to key values in the second mapping array, and replacing the block information corresponding to repeated identification value blocks with the block information corresponding to the identification values of the key values in the second mapping array according to the block index information.
The invention also proposes a device comprising:
a memory for storing a computer program;
a processor for implementing the method steps when executing the computer program.
The invention also proposes a readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the method steps.
The effect provided in the summary of the invention is only the effect of the embodiment, not all the effects of the invention, and one of the above technical solutions has the following advantages or beneficial effects:
the invention provides a data deduplication method, a system, equipment and a storage medium based on an HDFS (Hadoop distributed File System), wherein the method comprises the steps of calculating an identification value of a file block under a work node to be deduplicated, forming a key value pair by a file full path and the identification value, then aggregating the key value pair into a first mapping array and sending the first mapping array to a namespace of the HDFS; in the name space of the HDFS, secondarily aggregating the mapping array to obtain a second mapping array; and if the identification value of the key value pair in the second mapping array exists in the mapping table, acquiring block index information corresponding to the key value pair in the second mapping array, and replacing the block information corresponding to the repeated identification value block with the block information corresponding to the identification value of the key value pair in the second mapping array according to the block index information. The method also provides a method for modifying and deleting the data when the data deduplication operation is carried out. The invention carries out duplicate removal on the files written with the HDFS, sets and searches the existence of the same file blocks of different files through identification values, and reconstructs the file index by deleting repeated blocks to realize duplicate removal. Meanwhile, in order to not influence the deletion and modification operations of the deduplicated files, a complete compatible logical link is constructed. 
For deletion of a deduplicated file, if the same file block corresponds to multiple files, only the index-related metadata is deleted and the file block itself is kept; for modification of a deduplicated file, if the same file block corresponds to multiple files, the block is copied and the copy replaces the original index so that modification is supported. The invention provides HDFS-based data deduplication, implements it as a service inside HDFS, avoids external dependence, and can effectively reduce storage cost. File block indexes are updated automatically after deduplication, file read/write performance is unaffected, deduplication is transparent and imperceptible to the application layer, key interfaces are overridden, and interface compatibility for deduplicated files is improved.
Based on the HDFS-based data deduplication method, an HDFS-based data deduplication system, device and storage medium are also provided; they have the same functions as the method and are not described again here.
Drawings
Fig. 1 is a flowchart of a data deduplication method based on HDFS according to embodiment 1 of the present invention;
fig. 2 is a flowchart of the secondary aggregation of mapping arrays in embodiment 1 of the present invention;
fig. 3 is a schematic diagram of the block-information replacement process during deduplication in the HDFS-based data deduplication method according to embodiment 1 of the present invention;
fig. 4 is a schematic diagram of the block-information modification process in the HDFS-based data deduplication method according to embodiment 1 of the present invention;
fig. 5 is a schematic diagram of a HDFS-based data deduplication system according to embodiment 2 of the present invention.
Detailed Description
In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.
Embodiment 1
Embodiment 1 of the invention provides an HDFS-based data deduplication method that implements data deduplication as a service inside HDFS, avoids external dependence, and effectively reduces storage cost; file block indexes are updated automatically after deduplication, file read/write performance is unaffected, and deduplication is transparent and imperceptible to the application layer.
Fig. 1 shows a flowchart of a data deduplication method based on HDFS according to embodiment 1 of the present invention.
In step S101, an identification value of a file block under a work node to be deduplicated is calculated, a key-value pair is formed by a file full path and the identification value, and then the key-value pair is aggregated into a first mapping array and sent to a namespace of the HDFS. After each DataNode receives the duplicate removal task through heartbeat, all file blocks under the node are calculated, the calculated identification value is constructed into a key value pair through a mapping aggregation module, and then the key value pair is sent to the NameNode in the form of heartbeat.
The NameNode is the namespace of HDFS: it maintains the file system tree and all files and directories in the tree, including the mapping between files and DataNodes, the mapping between files and file blocks, and so on.
A DataNode is a working node of HDFS: it stores and retrieves data blocks as needed and periodically reports its stored block list to the NameNode via heartbeat.
The identification value of each file block is calculated with the SHA-256 algorithm, which generates a 256-bit hash value for a file block of any size, represented as a 64-character hexadecimal string. Compared with the MD5 and SHA-1 algorithms, SHA-256's collision probability is far lower, making it better suited as the identification value for file blocks numbering in the billions or more.
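As an illustration, the identification-value computation described above can be sketched in Python. The helper name is hypothetical, not part of HDFS; the calculation itself is plain SHA-256 over the block's bytes, rendered as a 64-character hexadecimal string.

```python
import hashlib

def block_identification_value(block_bytes: bytes) -> str:
    """Return the SHA-256 digest of a file block as a 64-character
    hexadecimal string (256 bits = 32 bytes = 64 hex digits)."""
    return hashlib.sha256(block_bytes).hexdigest()

# A block of any size maps to a fixed-length identification value.
ident = block_identification_value(b"example file block contents")
assert len(ident) == 64
```

Because the digest length is fixed, blocks of any size yield comparable identifiers, and two blocks with the same identifier can be treated as duplicates with negligible collision probability.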
And forming key value pairs by the node file full path character string and the file block identification value to which the node file belongs, then aggregating all the key value pairs into an array, and reporting the array to the NameNode through heartbeat.
In step S102, in the namespace of the HDFS, the mapping arrays are aggregated a second time to obtain the second mapping array. In this step, a trigger period for distributing deduplication tasks is set; by default a deduplication task is triggered every six hours. After a task is triggered, it is sent to each DataNode in heartbeat form, and task distribution is completed by remotely invoking the identification calculation module. After step S101 returns the aggregated mapping results, this step performs secondary aggregation to remove the mapping pairs contributed by different replicas of the same file.
Fig. 2 is a flowchart of the secondary aggregation of mapping arrays in embodiment 1 of the present invention;
the key-value pair array reported by each node serves as an initial array; if the aggregated array is empty, the key-value pairs of any one initial array are designated as the aggregated array; otherwise, the key-value pairs of each initial array are fed into the aggregated array one by one;
for each pair, the aggregated array is queried for an identical key-value pair: if one exists, the insertion fails and the pair is removed from the initial array; if not, the pair is inserted into the aggregated array and removed from the initial array.
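The secondary-aggregation steps above can be sketched as follows. This is a simplified model in which each per-node array is a list of (file path, identification value) pairs; the function name and data shapes are illustrative assumptions, not taken from the patent.

```python
def secondary_aggregate(node_arrays):
    """Merge the per-node initial arrays into one aggregated array,
    dropping any (path, identification value) pair that is already
    present -- such pairs come from different replicas of the same file."""
    aggregated = []
    seen = set()
    for initial in node_arrays:
        for pair in list(initial):
            if pair in seen:
                initial.remove(pair)   # insertion fails; remove from initial array
            else:
                seen.add(pair)
                aggregated.append(pair)
                initial.remove(pair)   # inserted; also removed from initial array
    return aggregated
```

For example, three DataNodes holding replicas of the same file report the same (path, identification value) pair three times, but only one copy survives aggregation, so replicas are never mistaken for duplicate data.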
In step S103, querying whether an identification value of a key-value pair in the second mapping array exists in the mapping table, and if not, directly inserting the key-value pair into the mapping table; if so, file deduplication is performed.
After the aggregated array is received, its key-value pairs are inserted into the database mapping table one by one. Before each pair is inserted, the mapping table is queried for the same identification value: if none exists, the pair is inserted directly; if one exists, file deduplication is performed.
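The duplicate check can be sketched as follows, with the mapping table modeled as a plain dict from identification value to file path. Names and the return convention are illustrative assumptions.

```python
def register_pairs(aggregated, mapping_table):
    """Insert (path, identification value) pairs into the mapping table.
    A value seen for the first time is registered directly; a value that
    already exists marks the pair for file deduplication."""
    to_deduplicate = []
    for path, ident in aggregated:
        if ident in mapping_table:
            to_deduplicate.append((path, ident))  # duplicate block found
        else:
            mapping_table[ident] = path           # first occurrence: register
    return to_deduplicate
```

The pairs returned in `to_deduplicate` are exactly those that proceed to the block-information replacement of step S104.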
In step S104, if the identification value of the key-value pair in the second mapping array already exists in the mapping table, the block index information corresponding to the key-value pair in the second mapping array is obtained, and according to the block index information, the block information corresponding to the repeated identification value block is replaced with the block information corresponding to the identification value of the key-value pair in the second mapping array.
Fig. 3 is a schematic diagram illustrating a process of replacing deduplication unit block information in a data deduplication method based on HDFS according to embodiment 1 of the present invention.
First, the block index information blockinfo-i of the file corresponding to the not-yet-registered key-value pair is obtained; then the block information blockinfo-j of the already-registered block with the same identification value is obtained; the duplicate block's information blockinfo-i2 is replaced with blockinfo-j2; and finally the key-value pair information Map1 is registered in the mapping table.
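A minimal sketch of this replacement, under the simplifying assumption of one block per file: `block_index` stands in for the NameNode's file-to-block metadata, and `mapping_table` maps an identification value to the list of paths sharing the block. All names are illustrative.

```python
def deduplicate_block(path, ident, block_index, mapping_table):
    """Replace the duplicate file's block info (blockinfo-i) with the
    block info already registered for the same identification value
    (blockinfo-j), then register the key-value pair (Map1).

    block_index: file path -> block info (stand-in for NameNode metadata)
    mapping_table: identification value -> list of paths sharing the block
    """
    registered_path = mapping_table[ident][0]
    block_index[path] = block_index[registered_path]  # blockinfo-i -> blockinfo-j
    mapping_table[ident].append(path)                 # register Map1
```

After replacement, both files' indexes point at the same physical block, so the duplicate block's storage can be reclaimed while reads through either path continue to work.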
The HDFS-based data deduplication method further provides methods for modifying and deleting data. Fig. 4 is a schematic diagram of the block-information modification process in the HDFS-based data deduplication method according to embodiment 1 of the present invention.
Receiving an instruction for modifying a file;
querying the mapping table in the key-value pair database for all block identification values of the file to be modified, and then checking in the mapping table, one by one, whether the files corresponding to those block identification values are the file to be modified; if all queried files are the file to be modified, the modification operation is executed;
if any queried file is not the file to be modified, first copying the blocks whose identification values are duplicated by the file to be modified and updating the copied block index information blockinfo-copy to the NameNode of the HDFS; then continuing to modify the file to be modified until the modification is complete, and regenerating the file's key-value pairs of path and block identification value and inserting them into the mapping table.
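This copy-before-modify logic can be sketched as a copy-on-write check. `mapping_table` maps an identification value to the set of owning file paths, and `copy_shared_block` is a caller-supplied hook standing in for the block copy and blockinfo-copy index update; both names are illustrative assumptions.

```python
def modify_deduplicated_file(path, mapping_table, copy_shared_block):
    """Before modifying `path`, copy every block it shares with another
    file (copy-on-write) so the other files keep the original block.

    mapping_table: identification value -> set of owning file paths
    copy_shared_block(path, ident): copies the block and updates the
        copied index (blockinfo-copy) in the NameNode
    """
    for ident, owners in mapping_table.items():
        if path in owners and len(owners) > 1:
            copy_shared_block(path, ident)  # private copy for this file
            owners.discard(path)            # original block keeps its other owners
    # the caller then applies the modification and re-inserts the file's
    # regenerated (path, identification value) pairs into the mapping table
```

Blocks owned only by the file being modified are left alone, since modifying them in place cannot affect any other file.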
The process of deleting data is similar to that of modifying data:
receiving an instruction for deleting a file;
querying the mapping table in the key-value pair database for all block identification values of the file to be deleted, and then checking in the mapping table, one by one, whether the files corresponding to those block identification values are the file to be deleted; if all queried files are the file to be deleted, the deletion operation is executed;
if any queried file is not the file to be deleted, deleting only the index information of the file to be deleted in the NameNode of the HDFS and its key-value pair information in the mapping table.
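The deletion logic above amounts to reference counting per block, which can be sketched as follows. As before, `mapping_table` maps an identification value to the set of owning paths, and `delete_block_data` is a caller-supplied hook for physical deletion; both names are assumptions for illustration.

```python
def delete_deduplicated_file(path, mapping_table, delete_block_data):
    """Remove `path` from the mapping table. A block still referenced by
    another file keeps its data; only when the last reference disappears
    is the block data itself deleted.

    mapping_table: identification value -> set of owning file paths
    delete_block_data(ident): physically deletes the block
    """
    for ident in list(mapping_table):
        owners = mapping_table[ident]
        if path in owners:
            owners.discard(path)
            if not owners:
                delete_block_data(ident)  # no other file references the block
                del mapping_table[ident]
    # the file's index entry in the NameNode is removed separately
```

This preserves shared blocks for their remaining owners while still reclaiming storage for blocks the deleted file owned exclusively.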
With these operations in place, the original client's file-modification and file-deletion logic is overridden, and index access is added to the modification and deletion flow, so that normal create, delete, update and read operations are unaffected after data deduplication.
The HDFS-based data deduplication method provided by the invention implements data deduplication as a service inside HDFS, avoids external dependence, and can effectively reduce storage cost. File block indexes are updated automatically after deduplication, file read/write performance is unaffected, deduplication is transparent and imperceptible to the application layer, key interfaces are overridden, and interface compatibility for deduplicated files is improved.
Embodiment 2
Based on the HDFS-based data deduplication method provided in embodiment 1 of the present invention, embodiment 2 of the present invention also provides an HDFS-based data deduplication system; fig. 5 is a schematic diagram of the system. The system comprises: a file identification module, a deduplication task module, a duplicate-checking module and an index updating module;
the file identification module is used for calculating an identification value of a file block under a to-be-deduplicated working node, forming a key value pair by a file full path and the identification value, then aggregating the key value pair into a first mapping array and sending the first mapping array to a namespace of the HDFS;
the duplication removal task module is used for aggregating the mapping array for the second time in the name space of the HDFS to obtain a second mapping array;
the duplicate-checking module is used for querying whether the identification value of a key-value pair in the second mapping array exists in the mapping table; if not, the key-value pair is directly inserted into the mapping table; if so, file deduplication is performed;
the index updating module is used for executing file deduplication, acquiring block index information corresponding to key values in the second mapping array, and replacing the block information corresponding to repeated identification value blocks with the block information corresponding to the identification values of the key values in the second mapping array according to the block index information.
The detailed process executed by the file identification module in embodiment 2 of the present invention is as follows:
after each DataNode receives the duplicate removal task through heartbeat, all file blocks under the node are calculated, the calculated identification value is constructed into a key value pair through a mapping aggregation module, and then the key value pair is sent to the NameNode in the form of heartbeat.
The NameNode is the namespace of HDFS: it maintains the file system tree and all files and directories in the tree, including the mapping between files and DataNodes, the mapping between files and file blocks, and so on.
A DataNode is a working node of HDFS: it stores and retrieves data blocks as needed and periodically reports its stored block list to the NameNode via heartbeat.
The identification value of each file block is calculated with the SHA-256 algorithm, which generates a 256-bit hash value for a file block of any size, represented as a 64-character hexadecimal string. Compared with the MD5 and SHA-1 algorithms, SHA-256's collision probability is far lower, making it better suited as the identification value for file blocks numbering in the billions or more.
Each node forms key-value pairs from the full-path string of a file and the identification values of the file blocks belonging to it, aggregates all the key-value pairs into an array, and reports the array to the NameNode via heartbeat.
The process of the deduplication task module is as follows: a trigger period for distributing deduplication tasks is set; by default, a deduplication task is triggered every six hours. Once triggered, the task is sent to each DataNode node via heartbeat, and task distribution is completed by remotely invoking the identification calculation module via remote procedure call. After the file identification module returns the aggregated mapping results, the deduplication task module performs a secondary aggregation to remove mapping pairs contributed by different replicas of the same file.
Fig. 2 is a flowchart of the method for secondary aggregation of the mapping arrays in embodiment 1 of the present invention;
the key-value pair array reported by each node serves as an initial array; if the aggregated array is empty, the key-value pairs of any one initial array are designated as the aggregated array; otherwise, the key-value pairs of each initial array are inserted into the aggregated array one by one;
before each insertion, the aggregated array is queried for an identical key-value pair; if one exists, the insertion fails and the key-value pair is removed from the initial array; if not, the key-value pair is inserted into the aggregated array and then removed from the initial array.
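The two-step flow above can be sketched as follows; the structures used (lists of (path, identifier) tuples for the initial arrays, a dict for the aggregated array) are illustrative assumptions, not the patent's actual data types:

```python
def secondary_aggregate(initial_arrays):
    """Merge the per-node initial arrays into one aggregated array,
    dropping key-value pairs that are already present (these arise from
    different replicas of the same file on different DataNodes)."""
    aggregated = {}
    for initial in initial_arrays:
        for path, ident in initial:
            if path in aggregated:   # identical pair from another replica
                continue             # insertion fails; pair is dropped
            aggregated[path] = ident
    return aggregated

node_a = [("/data/f1", "h1"), ("/data/f2", "h2")]
node_b = [("/data/f1", "h1"), ("/data/f3", "h3")]  # f1 replicated on both nodes
merged = secondary_aggregate([node_a, node_b])
print(sorted(merged))  # ['/data/f1', '/data/f2', '/data/f3']
```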
The process of the duplicate checking module is implemented as follows: after the aggregated array is received, its key-value pairs are inserted into the database mapping table one by one. Before each key-value pair is inserted, the mapping table is queried for the same identification value; if none exists, the key-value pair is inserted directly into the mapping table; if one exists, file deduplication is performed.
The index updating module realizes the following processes: if the identification value of the key value pair in the second mapping array exists in the mapping table, the block index information corresponding to the key value pair in the second mapping array is obtained, and the block information corresponding to the repeated identification value block is replaced by the block information corresponding to the identification value of the key value pair in the second mapping array according to the block index information.
Fig. 3 is a schematic diagram illustrating a process of replacing deduplication unit block information in a data deduplication method based on HDFS according to embodiment 1 of the present invention.
First, the block index information blockinfo-i of the file corresponding to the key-value pair not yet registered in the mapping table is obtained; then the block information blockinfo-j of the block with the repeated identification value already registered in the mapping table is obtained; the block information blockinfo-i2 corresponding to the block with the repeated identification value is replaced with the block information blockinfo-j2; finally, the key-value pair information Map1 is registered in the mapping table.
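A toy sketch of the replacement step, with plain dicts standing in for the mapping table and the NameNode's block index (all names here, including `mapping_table` and `block_index`, are illustrative assumptions, not the patent's actual structures; in the real flow the new key-value pair would also be registered afterward):

```python
def register_pair(mapping_table, block_index, path, ident):
    """Insert one (path, ident) pair. If the identification value is already
    registered, point this file's block at the registered block's info
    (deduplication); otherwise register the pair as new."""
    if ident in mapping_table:
        existing_path = mapping_table[ident]
        # replace the duplicate block's info with the registered block's info
        block_index[path] = block_index[existing_path]
    else:
        mapping_table[ident] = path

mapping_table = {"h-j": "/data/old"}  # identification value -> registered file
block_index = {"/data/old": "blockinfo-j2", "/data/new": "blockinfo-i2"}
register_pair(mapping_table, block_index, "/data/new", "h-j")
print(block_index["/data/new"])  # blockinfo-j2
```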
The index updating module also provides processes for modifying data and deleting data. Fig. 4 is a schematic diagram of the process of modifying unit block information in the HDFS-based data deduplication method according to embodiment 1 of the present invention.
Receiving an instruction for modifying a file;
querying all block identification values of the file to be modified through the mapping table in the key-value database, and then querying in the mapping table, one by one, whether the file corresponding to each block identification value is the file to be modified; if every file found is the file to be modified, the modification operation is executed directly;
if a file other than the file to be modified is found, the block with the repeated identification value in the file to be modified is first copied, and the copied block index information blockinfo-copy is updated to the NameNode of the HDFS; the modification of the file to be modified then proceeds until it is complete, after which the key-value pairs of the file to be modified and its block identification values are regenerated and inserted into the mapping table.
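The copy-on-write check before a modification can be sketched as below; `shared_by` (identification value -> set of file paths) is a hypothetical stand-in for the mapping-table queries described above:

```python
def blocks_needing_copy(shared_by, path):
    """Return the block identification values of `path` that are also
    referenced by another file; those blocks must be copied (and the copy's
    index updated in the NameNode) before `path` may be modified in place."""
    return [ident for ident, owners in shared_by.items()
            if path in owners and len(owners) > 1]

# h1 is shared by two files, so modifying /a requires copying h1 first;
# h2 belongs only to /a and can be modified in place.
shared_by = {"h1": {"/a", "/b"}, "h2": {"/a"}}
needs_copy = blocks_needing_copy(shared_by, "/a")
print(needs_copy)  # ['h1']
```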
The process of deleting data is similar to that of modifying data:
receiving an instruction for deleting a file;
querying all block identification values of the file to be deleted through the mapping table in the key-value database, and then querying in the mapping table, one by one, whether the file corresponding to each block identification value is the file to be deleted; if every file found is the file to be deleted, the deletion operation is executed directly;
if a file other than the file to be deleted is found, only the index information of the file to be deleted in the NameNode of the HDFS and its key-value pair information in the mapping table are deleted.
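Similarly, the deletion rule can be sketched as follows: a block's data is physically removed only when no other file still references it; otherwise only the index and mapping entries for the deleted file are dropped (structures are illustrative, as above):

```python
def delete_file(shared_by, path):
    """Drop `path` from every block it references. Returns the
    identification values of blocks no other file shares, i.e. the only
    blocks whose data may actually be deleted; shared blocks lose just
    their index/mapping entry for `path`."""
    removable = []
    for ident in list(shared_by):
        owners = shared_by[ident]
        if path in owners:
            owners.discard(path)
            if not owners:           # no remaining owner: block can go
                del shared_by[ident]
                removable.append(ident)
    return removable

table = {"h1": {"/a", "/b"}, "h2": {"/a"}}
removed = delete_file(table, "/a")
print(removed)  # ['h2']
```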
The main function of the client module is to override the file-modification and file-deletion logic of the original client: a step that accesses the corresponding unit of the index updating module is added to the modify and delete logic, so that the client's original functions of creating, deleting, modifying, and querying data work normally after the data has been deduplicated.
Example 3
The invention also proposes a device comprising:
a memory for storing a computer program;
a processor for implementing the following method steps when executing the computer program:
in step S101, the identification value of each file block under a work node to be deduplicated is calculated, a key-value pair is formed from the file's full path and the identification value, and the key-value pairs are then aggregated into a first mapping array and sent to the namespace of the HDFS. Specifically, after each DataNode receives the deduplication task via heartbeat, it calculates identification values for all file blocks on the node, builds the calculated identification values into key-value pairs through the mapping aggregation module, and sends them to the NameNode via heartbeat.
Here, NameNode: the NameNode manages the namespace of the HDFS; it maintains the file system tree and all files and directories in the tree, including the mapping between files and DataNode nodes and the mapping between files and file blocks;
DataNode: a DataNode is a worker node of the HDFS; it stores and retrieves data blocks as needed and periodically reports the list of blocks it stores to the NameNode via heartbeat.
The identification value of each file block is calculated with the SHA-256 algorithm, which generates a 256-bit hash value for a file block of any size, represented as a 64-character hexadecimal string. Compared with the MD5 and SHA-1 algorithms, the collision probability of SHA-256 is far lower, making it better suited as the identification value when the number of file blocks runs into the billions and beyond.
Each node forms key-value pairs from the full-path string of a file and the identification values of the file blocks belonging to it, aggregates all the key-value pairs into an array, and reports the array to the NameNode via heartbeat.
In step S102, in the namespace of the HDFS, the mapping arrays are aggregated a second time to obtain a second mapping array. In this step, a trigger period for distributing deduplication tasks is set; by default, a deduplication task is triggered every six hours. Once triggered, the task is sent to each DataNode node via heartbeat, and task distribution is completed by remotely invoking the identification calculation module via remote procedure call. After step S101 returns the aggregated mapping results, this step performs the secondary aggregation to remove mapping pairs contributed by different replicas of the same file.
Fig. 2 is a flowchart of the method for secondary aggregation of the mapping arrays in embodiment 1 of the present invention;
the key-value pair array reported by each node serves as an initial array; if the aggregated array is empty, the key-value pairs of any one initial array are designated as the aggregated array; otherwise, the key-value pairs of each initial array are inserted into the aggregated array one by one;
before each insertion, the aggregated array is queried for an identical key-value pair; if one exists, the insertion fails and the key-value pair is removed from the initial array; if not, the key-value pair is inserted into the aggregated array and then removed from the initial array.
In step S103, whether the identification value of a key-value pair in the second mapping array exists in the mapping table is queried; if not, the key-value pair is inserted directly into the mapping table; if so, file deduplication is performed.
After the aggregated array is received, its key-value pairs are inserted into the database mapping table one by one. Before each key-value pair is inserted, the mapping table is queried for the same identification value; if none exists, the key-value pair is inserted directly into the mapping table; if one exists, file deduplication is performed.
In step S104, if the identification value of the key-value pair in the second mapping array already exists in the mapping table, the block index information corresponding to the key-value pair in the second mapping array is obtained, and according to the block index information, the block information corresponding to the repeated identification value block is replaced with the block information corresponding to the identification value of the key-value pair in the second mapping array.
Fig. 3 is a schematic diagram illustrating a process of replacing deduplication unit block information in a data deduplication method based on HDFS according to embodiment 1 of the present invention.
First, the block index information blockinfo-i of the file corresponding to the key-value pair not yet registered in the mapping table is obtained; then the block information blockinfo-j of the block with the repeated identification value already registered in the mapping table is obtained; the block information blockinfo-i2 corresponding to the block with the repeated identification value is replaced with the block information blockinfo-j2; finally, the key-value pair information Map1 is registered in the mapping table.
The HDFS-based data deduplication method further provides methods for modifying data and deleting data. Fig. 4 is a schematic diagram of the process of modifying unit block information in the HDFS-based data deduplication method according to embodiment 1 of the present invention.
Receiving an instruction for modifying a file;
querying all block identification values of the file to be modified through the mapping table in the key-value database, and then querying in the mapping table, one by one, whether the file corresponding to each block identification value is the file to be modified; if every file found is the file to be modified, the modification operation is executed directly;
if a file other than the file to be modified is found, the block with the repeated identification value in the file to be modified is first copied, and the copied block index information blockinfo-copy is updated to the NameNode of the HDFS; the modification of the file to be modified then proceeds until it is complete, after which the key-value pairs of the file to be modified and its block identification values are regenerated and inserted into the mapping table.
The process of deleting data is similar to that of modifying data:
receiving an instruction for deleting a file;
querying all block identification values of the file to be deleted through the mapping table in the key-value database, and then querying in the mapping table, one by one, whether the file corresponding to each block identification value is the file to be deleted; if every file found is the file to be deleted, the deletion operation is executed directly;
if a file other than the file to be deleted is found, only the index information of the file to be deleted in the NameNode of the HDFS and its key-value pair information in the mapping table are deleted.
After the operations on the data are completed, the file-modification and file-deletion logic of the original client is overridden, and an index-access step is added to the modify and delete logic, so that normal creation, deletion, modification, and query of the data remain unaffected after deduplication.
It should be noted that the technical solution of the present invention also provides an electronic device, including: a communication interface, capable of exchanging information with other devices such as network devices; and a processor, connected to the communication interface to exchange information with other devices, which, when running a computer program stored in the memory, executes the HDFS-based data deduplication method provided by one or more of the technical solutions. In practice, of course, the various components of the electronic device are coupled together by a bus system. It will be appreciated that the bus system is used to enable communication among these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus. The memory in the embodiments of the present application is used to store various types of data to support the operation of the electronic device; examples of such data include any computer program intended to run on the electronic device. It will be appreciated that the memory can be volatile memory, non-volatile memory, or both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory can be Random Access Memory (RAM), which acts as an external cache.
By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to comprise, without being limited to, these and any other suitable types of memory. The method disclosed in the embodiments of the present application may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor.
The software modules may be located in a storage medium, the storage medium being located in the memory; the processor reads the program in the memory and, in combination with its hardware, performs the steps of the method described above. When the processor executes the program, the corresponding processes in the methods of the embodiments of the present application are implemented, which, for brevity, are not described here again.
Example 4
The invention also proposes a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the following method steps:
in step S101, the identification value of each file block under a work node to be deduplicated is calculated, a key-value pair is formed from the file's full path and the identification value, and the key-value pairs are then aggregated into a first mapping array and sent to the namespace of the HDFS. Specifically, after each DataNode receives the deduplication task via heartbeat, it calculates identification values for all file blocks on the node, builds the calculated identification values into key-value pairs through the mapping aggregation module, and sends them to the NameNode via heartbeat.
Here, NameNode: the NameNode manages the namespace of the HDFS; it maintains the file system tree and all files and directories in the tree, including the mapping between files and DataNode nodes and the mapping between files and file blocks;
DataNode: a DataNode is a worker node of the HDFS; it stores and retrieves data blocks as needed and periodically reports the list of blocks it stores to the NameNode via heartbeat.
The identification value of each file block is calculated with the SHA-256 algorithm, which generates a 256-bit hash value for a file block of any size, represented as a 64-character hexadecimal string. Compared with the MD5 and SHA-1 algorithms, the collision probability of SHA-256 is far lower, making it better suited as the identification value when the number of file blocks runs into the billions and beyond.
Each node forms key-value pairs from the full-path string of a file and the identification values of the file blocks belonging to it, aggregates all the key-value pairs into an array, and reports the array to the NameNode via heartbeat.
In step S102, in the namespace of the HDFS, the mapping arrays are aggregated a second time to obtain a second mapping array. In this step, a trigger period for distributing deduplication tasks is set; by default, a deduplication task is triggered every six hours. Once triggered, the task is sent to each DataNode node via heartbeat, and task distribution is completed by remotely invoking the identification calculation module via remote procedure call. After step S101 returns the aggregated mapping results, this step performs the secondary aggregation to remove mapping pairs contributed by different replicas of the same file.
Fig. 2 is a flowchart of the method for secondary aggregation of the mapping arrays in embodiment 1 of the present invention;
the key-value pair array reported by each node serves as an initial array; if the aggregated array is empty, the key-value pairs of any one initial array are designated as the aggregated array; otherwise, the key-value pairs of each initial array are inserted into the aggregated array one by one;
before each insertion, the aggregated array is queried for an identical key-value pair; if one exists, the insertion fails and the key-value pair is removed from the initial array; if not, the key-value pair is inserted into the aggregated array and then removed from the initial array.
In step S103, whether the identification value of a key-value pair in the second mapping array exists in the mapping table is queried; if not, the key-value pair is inserted directly into the mapping table; if so, file deduplication is performed.
After the aggregated array is received, its key-value pairs are inserted into the database mapping table one by one. Before each key-value pair is inserted, the mapping table is queried for the same identification value; if none exists, the key-value pair is inserted directly into the mapping table; if one exists, file deduplication is performed.
In step S104, if the identification value of the key-value pair in the second mapping array already exists in the mapping table, the block index information corresponding to the key-value pair in the second mapping array is obtained, and according to the block index information, the block information corresponding to the repeated identification value block is replaced with the block information corresponding to the identification value of the key-value pair in the second mapping array.
Fig. 3 is a schematic diagram illustrating a process of replacing deduplication unit block information in a data deduplication method based on HDFS according to embodiment 1 of the present invention.
First, the block index information blockinfo-i of the file corresponding to the key-value pair not yet registered in the mapping table is obtained; then the block information blockinfo-j of the block with the repeated identification value already registered in the mapping table is obtained; the block information blockinfo-i2 corresponding to the block with the repeated identification value is replaced with the block information blockinfo-j2; finally, the key-value pair information Map1 is registered in the mapping table.
The HDFS-based data deduplication method further provides methods for modifying data and deleting data. Fig. 4 is a schematic diagram of the process of modifying unit block information in the HDFS-based data deduplication method according to embodiment 1 of the present invention.
Receiving an instruction for modifying a file;
querying all block identification values of the file to be modified through the mapping table in the key-value database, and then querying in the mapping table, one by one, whether the file corresponding to each block identification value is the file to be modified; if every file found is the file to be modified, the modification operation is executed directly;
if a file other than the file to be modified is found, the block with the repeated identification value in the file to be modified is first copied, and the copied block index information blockinfo-copy is updated to the NameNode of the HDFS; the modification of the file to be modified then proceeds until it is complete, after which the key-value pairs of the file to be modified and its block identification values are regenerated and inserted into the mapping table.
The process of deleting data is similar to that of modifying data:
receiving an instruction for deleting a file;
querying all block identification values of the file to be deleted through the mapping table in the key-value database, and then querying in the mapping table, one by one, whether the file corresponding to each block identification value is the file to be deleted; if every file found is the file to be deleted, the deletion operation is executed directly;
if a file other than the file to be deleted is found, only the index information of the file to be deleted in the NameNode of the HDFS and its key-value pair information in the mapping table are deleted.
After the operations on the data are completed, the file-modification and file-deletion logic of the original client is overridden, and an index-access step is added to the modify and delete logic, so that normal creation, deletion, modification, and query of the data remain unaffected after deduplication.
For a description of a device and a related part in a storage medium for data deduplication based on an HDFS provided in the embodiment of the present application, reference may be made to a detailed description of a corresponding part in a data deduplication method based on an HDFS provided in embodiment 1 of the present application, and details are not repeated here.
It should be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between such entities or operations. Furthermore, the terms "comprises," "comprising," and any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising an … …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element. In addition, the parts of the technical solutions provided in the embodiments of the present application that are consistent with the implementation principles of corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, the scope of the present invention is not limited thereto. Various modifications and alterations will occur to those skilled in the art based on the foregoing description, and it is neither necessary nor possible to exhaust all embodiments here. Any modification or variation that a person skilled in the art can make, without creative effort, on the basis of the technical solution of the present invention still falls within the protection scope of the present invention.

Claims (10)

1. A data deduplication method based on an HDFS (Hadoop distributed File System) is characterized by comprising the following steps of:
calculating an identification value of a file block under a to-be-deduplicated working node, forming a key value pair by a file full path and the identification value, then aggregating the key value pair into a first mapping array and sending the first mapping array to a namespace of the HDFS;
in the name space of the HDFS, secondarily aggregating the mapping array to obtain a second mapping array;
and if the identification value of the key value pair in the second mapping array exists in the mapping table, acquiring block index information corresponding to the key value pair in the second mapping array, and replacing the block information corresponding to the repeated identification value block with the block information corresponding to the identification value of the key value pair in the second mapping array according to the block index information.
2. The HDFS-based data deduplication method according to claim 1, wherein the method for calculating the identification value of a file block under the work node to be deduplicated is: a 256-bit hash value is generated for a file block of any size and is then represented as a 64-character hexadecimal string.
3. The HDFS-based data deduplication method according to claim 1, wherein the calculating an identification value of a file block under a work node to be deduplicated, aggregating a file full path and the identification value into a mapping array, and sending the mapping array to a namespace of the HDFS further comprises: the work node periodically receives the deduplication tasks distributed in the form of heartbeats and sends the mapping array to the namespace of the HDFS in the form of heartbeats.
4. The HDFS-based data deduplication method according to claim 1, wherein the second aggregating the mapping array comprises:
the key-value pair array reported by each node serves as an initial array; if the aggregated array is empty, the key-value pairs of any one initial array are designated as the aggregated array; otherwise, the key-value pairs of each initial array are inserted into the aggregated array one by one;
before each insertion, the aggregated array is queried for an identical key-value pair; if one exists, the insertion fails and the key-value pair is removed from the initial array; if not, the key-value pair is inserted into the aggregated array and then removed from the initial array.
5. The HDFS-based data deduplication method of claim 1, further comprising: and if the identification value of the key-value pair in the second mapping array does not exist in the mapping table, directly inserting the key-value pair into the mapping table.
6. The HDFS-based data deduplication method of claim 1, further comprising:
receiving an instruction for modifying a file;
querying all block identification values of the file to be modified through the mapping table in the key-value database, and then querying in the mapping table, one by one, whether the file corresponding to each block identification value is the file to be modified; if every file found is the file to be modified, the modification operation is executed directly;
if a file other than the file to be modified is found, the block with the repeated identification value in the file to be modified is first copied, and the copied block index information is updated to the namespace of the HDFS; the modification of the file to be modified then proceeds until it is complete, after which the key-value pairs of the file to be modified and its block identification values are regenerated and inserted into the mapping table.
7. The HDFS-based data deduplication method of claim 1, further comprising:
receiving an instruction for deleting a file;
querying all block identification values of the file to be deleted through the mapping table in the key-value database, and then querying in the mapping table, one by one, whether the file corresponding to each block identification value is the file to be deleted; if every file found is the file to be deleted, the deletion operation is executed directly;
and if a file other than the file to be deleted is found, only deleting the index information of the file to be deleted in the namespace of the HDFS and its key-value pair information in the mapping table.
8. A data deduplication system based on an HDFS (Hadoop distributed File System) is characterized by comprising a file identification module, a deduplication task module, a duplicate checking module and an index updating module;
the file identification module is used for calculating an identification value of a file block under a to-be-deduplicated working node, forming a key value pair by a file full path and the identification value, then aggregating the key value pair into a first mapping array and sending the first mapping array to a namespace of the HDFS;
the duplication removal task module is used for aggregating the mapping array for the second time in the name space of the HDFS to obtain a second mapping array;
the duplicate query module is used for querying whether the identification value of a key-value pair in the second mapping array already exists in the mapping table; if not, the key-value pair is inserted directly into the mapping table; if so, file deduplication is performed;
the index updating module is used for performing file deduplication: it acquires the block index information corresponding to the key-value pair in the second mapping array and, according to that block index information, replaces the block information of the block with the repeated identification value with the block information corresponding to the identification value of the key-value pair in the second mapping array.
9. An apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method steps of any one of claims 1 to 7.
CN202110805256.XA 2021-07-16 2021-07-16 A data deduplication method, system, device and storage medium based on HDFS Active CN113656363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110805256.XA CN113656363B (en) 2021-07-16 2021-07-16 A data deduplication method, system, device and storage medium based on HDFS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110805256.XA CN113656363B (en) 2021-07-16 2021-07-16 A data deduplication method, system, device and storage medium based on HDFS

Publications (2)

Publication Number Publication Date
CN113656363A true CN113656363A (en) 2021-11-16
CN113656363B CN113656363B (en) 2025-01-07

Family

ID=78478042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110805256.XA Active CN113656363B (en) 2021-07-16 2021-07-16 A data deduplication method, system, device and storage medium based on HDFS

Country Status (1)

Country Link
CN (1) CN113656363B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115328677A (en) * 2022-08-19 2022-11-11 济南浪潮数据技术有限公司 An interface adaptation method, apparatus, device and readable storage medium
CN119377265A (en) * 2024-12-31 2025-01-28 成都卓拙科技有限公司 Data structure optimization method and device for querying IP address location information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130091102A1 (en) * 2011-10-11 2013-04-11 Netapp, Inc. Deduplication aware scheduling of requests to access data blocks
CN106649676A (en) * 2016-12-15 2017-05-10 北京锐安科技有限公司 Duplication eliminating method and device based on HDFS storage file
CN108369487A (en) * 2015-11-25 2018-08-03 华睿泰科技有限责任公司 System and method for shooting snapshot in duplicate removal Virtual File System
CN111078704A (en) * 2019-12-18 2020-04-28 鹏城实验室 Method, device and medium for constructing overtravel package warehouse based on hierarchical incremental storage
CN112328435A (en) * 2020-12-07 2021-02-05 武汉绿色网络信息服务有限责任公司 Method, device, equipment and storage medium for backing up and recovering target data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130091102A1 (en) * 2011-10-11 2013-04-11 Netapp, Inc. Deduplication aware scheduling of requests to access data blocks
CN108369487A (en) * 2015-11-25 2018-08-03 华睿泰科技有限责任公司 System and method for shooting snapshot in duplicate removal Virtual File System
CN106649676A (en) * 2016-12-15 2017-05-10 北京锐安科技有限公司 Duplication eliminating method and device based on HDFS storage file
CN111078704A (en) * 2019-12-18 2020-04-28 鹏城实验室 Method, device and medium for constructing overtravel package warehouse based on hierarchical incremental storage
CN112328435A (en) * 2020-12-07 2021-02-05 武汉绿色网络信息服务有限责任公司 Method, device, equipment and storage medium for backing up and recovering target data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU JUNLONG; LIU GUANGMING; ZHANG DAI; YU JIE: "Research on Real-Time Storage and Indexing Strategies for Massive Small Internet Files Based on Redis", Journal of Computer Research and Development, no. 2, 31 December 2015 (2015-12-31) *


Also Published As

Publication number Publication date
CN113656363B (en) 2025-01-07

Similar Documents

Publication Publication Date Title
US12038906B2 (en) Database system with database engine and separate distributed storage service
US20230401150A1 (en) Management of repeatedly seen data
US11755415B2 (en) Variable data replication for storage implementing data backup
US10534768B2 (en) Optimized log storage for asynchronous log updates
US10437721B2 (en) Efficient garbage collection for a log-structured data store
US10229011B2 (en) Log-structured distributed storage using a single log sequence number space
US11256720B1 (en) Hierarchical data structure having tiered probabilistic membership query filters
US9507843B1 (en) Efficient replication of distributed storage changes for read-only nodes of a distributed database
US20190188406A1 (en) Dynamic quorum membership changes
US10725666B2 (en) Memory-based on-demand data page generation
US7487138B2 (en) System and method for chunk-based indexing of file system content
US9367448B1 (en) Method and system for determining data integrity for garbage collection of data storage systems
US10909091B1 (en) On-demand data schema modifications
CN106649676B (en) HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files
JP2016511495A (en) Log record management
US10223184B1 (en) Individual write quorums for a log-structured distributed storage system
CN113656363B (en) A data deduplication method, system, device and storage medium based on HDFS
US12216622B2 (en) Supporting multiple fingerprint formats for data file segment
CN118550477B (en) Data deduplication methods, products, computer devices and storage media
CN114625751A (en) Blockchain-based data traceability query method and device
CN113495807A (en) Data backup method, data recovery method and device
US12050549B2 (en) Client support of multiple fingerprint formats for data file segments
CN118227565A (en) Data management method and device and electronic equipment
CN118277357A (en) A data management method, corresponding device and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant