Disclosure of Invention
In order to solve the above technical problems, the invention provides a data deduplication method, system, device, and storage medium based on the HDFS (Hadoop Distributed File System), which realize data deduplication as a service of the HDFS itself, avoid external dependencies, and effectively reduce storage cost.
To achieve this purpose, the invention adopts the following technical solution:
an HDFS-based data deduplication method comprises the following steps:
calculating an identification value for each file block under a work node to be deduplicated, forming a key-value pair from the file's full path and the identification value, then aggregating the key-value pairs into a first mapping array and sending the first mapping array to the namespace of the HDFS;
in the namespace of the HDFS, aggregating the mapping arrays a second time to obtain a second mapping array;
and if the identification value of a key-value pair in the second mapping array already exists in the mapping table, obtaining the block index information corresponding to that key-value pair and, according to the block index information, replacing the duplicate block's information with the block information already registered under the same identification value.
Further, the identification value of a file block under the work node to be deduplicated is calculated as follows: for a file block of any size, a 256-bit hash value is generated and represented as a hexadecimal string of length 64.
Further, calculating the identification value of a file block under the node to be deduplicated, aggregating the file's full path and the identification value into a mapping array, and sending the mapping array to the namespace of the HDFS further includes: the work node periodically receives deduplication tasks distributed in the form of heartbeats and sends the mapping array to the namespace of the HDFS in the form of heartbeats.
Further, the process of secondarily aggregating the mapping arrays is as follows:
reporting the key-value pair array of each node as an initial array; if the aggregated array is empty, designating any one initial array as the aggregated array; otherwise, inserting the key-value pairs of each initial array into the aggregated array one by one;
querying whether the aggregated array already contains an identical key-value pair: if so, the insertion fails and the key-value pair is removed from the initial array; if not, the key-value pair is inserted into the aggregated array and removed from the initial array.
Further, the method further comprises: if the identification value of a key-value pair in the second mapping array does not exist in the mapping table, inserting the key-value pair directly into the mapping table.
Further, the method further comprises:
receiving an instruction for modifying a file;
querying all block identification values of the file to be modified via the mapping table in the key-value database, and then checking in the mapping table, one identification value at a time, whether each file referencing that value is the file to be modified; if all referencing files are the file to be modified, executing the modification;
if any referencing file is not the file to be modified, first copying the blocks whose identification values are shared by the file to be modified, and updating the copied block index information in the namespace of the HDFS; then continuing to modify the file to be modified until the modification is completed, and regenerating the key-value pairs of the file's full path and its new block identification values and inserting them into the mapping table.
Further, the method further comprises:
receiving an instruction for deleting a file;
querying all block identification values of the file to be deleted via the mapping table in the key-value database, and then checking in the mapping table, one identification value at a time, whether each file referencing that value is the file to be deleted; if all referencing files are the file to be deleted, executing the deletion;
and if any referencing file is not the file to be deleted, deleting only the index information of the file to be deleted in the namespace of the HDFS and its key-value pair information in the mapping table.
The invention also provides an HDFS-based data deduplication system, which comprises a file identification module, a deduplication task module, a duplicate-checking module, and an index updating module;
the file identification module is configured to calculate an identification value for each file block under a work node to be deduplicated, form a key-value pair from the file's full path and the identification value, then aggregate the key-value pairs into a first mapping array and send the first mapping array to the namespace of the HDFS;
the deduplication task module is configured to aggregate the mapping arrays a second time in the namespace of the HDFS to obtain a second mapping array;
the duplicate-checking module is configured to query whether the identification value of a key-value pair in the second mapping array exists in the mapping table; if not, the key-value pair is inserted directly into the mapping table; if so, file deduplication is performed;
the index updating module is configured to perform file deduplication: it obtains the block index information corresponding to a key-value pair in the second mapping array and, according to the block index information, replaces the duplicate block's information with the block information already registered under the same identification value.
The invention also proposes a device comprising:
a memory for storing a computer program;
a processor for implementing the above method steps when executing the computer program.
The invention also proposes a readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the above method steps.
The effects described in this summary are only those of the embodiments, not all effects of the invention; any one of the above technical solutions has the following advantages or beneficial effects:
the invention provides a data deduplication method, a system, equipment and a storage medium based on an HDFS (Hadoop distributed File System), wherein the method comprises the steps of calculating an identification value of a file block under a work node to be deduplicated, forming a key value pair by a file full path and the identification value, then aggregating the key value pair into a first mapping array and sending the first mapping array to a namespace of the HDFS; in the name space of the HDFS, secondarily aggregating the mapping array to obtain a second mapping array; and if the identification value of the key value pair in the second mapping array exists in the mapping table, acquiring block index information corresponding to the key value pair in the second mapping array, and replacing the block information corresponding to the repeated identification value block with the block information corresponding to the identification value of the key value pair in the second mapping array according to the block index information. The method also provides a method for modifying and deleting the data when the data deduplication operation is carried out. The invention carries out duplicate removal on the files written with the HDFS, sets and searches the existence of the same file blocks of different files through identification values, and reconstructs the file index by deleting repeated blocks to realize duplicate removal. Meanwhile, in order to not influence the deletion and modification operations of the deduplicated files, a complete compatible logical link is constructed. 
For deleting duplicate-removed files, if the same file block corresponds to a plurality of files, only deleting metadata information related to the index without deleting the file block; and for the operation of modifying the deduplicated file, if the same file block corresponds to a plurality of files, the same file block is copied to replace the original index so as to support modification. The invention provides the data deduplication based on the HDFS, realizes the data deduplication servitization of the HDFS, avoids external dependence and can effectively reduce the storage cost. The file block index is automatically updated after the duplication is removed, the read-write performance of the file is not influenced, the non-perception of an application layer and the data duplication removal transparence are realized, the key interface overwriting is realized, and the interface compatibility of the duplicate removal file is improved.
Based on a data deduplication method based on the HDFS, a data deduplication system based on the HDFS, equipment and a storage medium are also provided, and the data deduplication system based on the HDFS, the equipment and the storage medium also have the functions of the method and are not described in detail herein.
Detailed Description
In order to clearly explain the technical features of the present invention, the invention is described in detail below with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples; this repetition is for simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components, processing techniques, and procedures are omitted so as not to unnecessarily obscure the invention.
Example 1
Embodiment 1 of the invention provides an HDFS (Hadoop Distributed File System)-based data deduplication method that realizes data deduplication as a service of the HDFS, avoids external dependencies, effectively reduces storage cost, updates file block indexes automatically after deduplication without affecting file read/write performance, and makes deduplication transparent and imperceptible to the application layer.
Fig. 1 shows a flowchart of a data deduplication method based on HDFS according to embodiment 1 of the present invention.
In step S101, an identification value is calculated for each file block under a work node to be deduplicated, a key-value pair is formed from the file's full path and the identification value, and the key-value pairs are aggregated into a first mapping array and sent to the namespace of the HDFS. After each DataNode receives the deduplication task via heartbeat, it computes the identification values of all file blocks on the node, builds key-value pairs from the computed identification values through the mapping aggregation module, and sends them to the NameNode via heartbeat.
Here, the NameNode is the namespace of HDFS: it maintains the file system tree and all files and directories in the tree, including the mapping between files and DataNode nodes, the mapping between files and file blocks, and so on.
A DataNode is a working node of HDFS: it stores and retrieves data blocks as needed and periodically reports its stored block list to the NameNode via heartbeat.
The identification value of each file block is calculated with the SHA256 algorithm, which generates a 256-bit hash value for a file block of any size, represented as a hexadecimal string of length 64. The collision probability of SHA256 is far lower than that of the MD5 and SHA1 algorithms, making it better suited as the identification value for file block sets numbering in the billions or more.
Key-value pairs are formed from each file's full path string and the identification values of its file blocks; all key-value pairs are then aggregated into an array and reported to the NameNode via heartbeat.
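The identification step above can be sketched as follows. Python is used here purely as an illustration of the Java-side HDFS logic, and the names `block_id_value` and `node_mapping_array`, as well as the chunked read size, are assumptions rather than identifiers from the implementation:

```python
import hashlib

def block_id_value(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 identification value of a file block: 256 bits, 64 hex chars."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so arbitrarily large HDFS blocks fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def node_mapping_array(block_paths):
    """First mapping array: (file full path, identification value) pairs."""
    return [(p, block_id_value(p)) for p in block_paths]
```

In the patent's flow, the resulting array would then be reported to the NameNode via heartbeat.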
In step S102, the mapping arrays are aggregated a second time in the namespace of the HDFS to obtain a second mapping array. In this step, a trigger period for distributing deduplication tasks is set; by default, a deduplication task is triggered every six hours. Once triggered, the task is sent to each DataNode via heartbeat, and task distribution is completed by starting the identification calculation module through a remote procedure call. After the aggregated mapping results of step S101 are returned, a secondary aggregation is performed in this step to remove key-value pairs contributed by different replicas of the same file.
Fig. 2 is a flowchart of the secondary aggregation of mapping arrays in embodiment 1 of the present invention.
The key-value pair array reported by each node serves as an initial array. If the aggregated array is empty, any one initial array is designated as the aggregated array; otherwise, the key-value pairs of each initial array are inserted into the aggregated array one by one.
For each key-value pair, the aggregated array is queried for an identical pair: if one exists, the insertion fails and the pair is removed from its initial array; if not, the pair is inserted into the aggregated array and removed from its initial array.
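The secondary aggregation above can be sketched as follows; per-node arrays of (full path, identification value) pairs are merged into one aggregated array, and identical pairs (for example, those contributed by different replicas of the same file) are dropped. The function name and the use of a set for the membership check are illustrative assumptions:

```python
def aggregate(initial_arrays):
    """Merge per-node key-value pair arrays, dropping identical pairs."""
    aggregated = []
    seen = set()
    for array in initial_arrays:
        for pair in array:      # pair = (file_full_path, id_value)
            if pair in seen:    # identical pair exists: insertion fails
                continue        # pair is removed from its initial array
            seen.add(pair)
            aggregated.append(pair)
    return aggregated
```

Note that only fully identical pairs are dropped here; two different files sharing the same identification value both survive aggregation and are resolved later by the duplicate check against the mapping table.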
In step S103, the mapping table is queried for the identification value of each key-value pair in the second mapping array: if it is absent, the key-value pair is inserted directly into the mapping table; if it is present, file deduplication is performed.
After the aggregated array is received, its key-value pairs are inserted into the database mapping table one by one. Before each insertion, the mapping table is queried for the same identification value: if none is found, the key-value pair is inserted directly; if one is found, file deduplication is performed.
In step S104, if the identification value of a key-value pair in the second mapping array already exists in the mapping table, the block index information corresponding to that key-value pair is obtained, and, according to the block index information, the duplicate block's information is replaced with the block information already registered under the same identification value.
Fig. 3 is a schematic diagram illustrating a process of replacing deduplication unit block information in a data deduplication method based on HDFS according to embodiment 1 of the present invention.
First, the block index information blockinfo-i of the file corresponding to the key-value pair not yet registered in the mapping table is obtained; then the block information blockinfo-j of the block with the duplicate identification value, already registered in the mapping table, is obtained; the duplicate block's information blockinfo-i2 is replaced with blockinfo-j2; finally, the key-value pair information Map1 is registered in the mapping table.
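The duplicate check and replacement of steps S103–S104 can be sketched as below. Python dicts stand in for the key-value database mapping table and the NameNode block index; `register_pair` and the `blk-…` strings are hypothetical names, not from the implementation:

```python
def register_pair(path, id_value, mapping_table, block_index):
    """Register a (path, id_value) pair; return True if deduplication occurred."""
    if id_value not in mapping_table:
        mapping_table[id_value] = path       # no duplicate: register directly
        return False
    # Duplicate identification value found: point the new file's index at
    # the already-registered block (blockinfo-i is replaced by blockinfo-j).
    existing_path = mapping_table[id_value]
    block_index[path] = block_index[existing_path]
    return True
```

The duplicate block itself can then be freed, since the namespace index of both files now references the registered block.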
The HDFS-based data deduplication method further provides procedures for modifying and deleting data. Fig. 4 is a schematic diagram of the modification process for unit block information in the HDFS-based data deduplication method according to embodiment 1 of the present invention.
Receiving an instruction for modifying a file;
All block identification values of the file to be modified are queried via the mapping table in the key-value database; the mapping table is then checked, one identification value at a time, to see whether each file referencing that value is the file to be modified. If all referencing files are the file to be modified, the modification is executed directly;
if any referencing file is not the file to be modified, the blocks whose identification values are shared are copied first, and the copied block index information blockinfo-copy is updated in the NameNode of the HDFS; modification of the file then continues until complete, after which key-value pairs of the file's full path and its new block identification values are regenerated and inserted into the mapping table.
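This copy-on-write modify path can be sketched under the same illustrative dict-based model; `prepare_modify`, `file_blocks`, and `copy_block` are hypothetical names introduced for the sketch:

```python
def prepare_modify(path, file_blocks, mapping_table, block_index, copy_block):
    """Return True if `path` may be modified in place.

    If one of its blocks is registered to another file (i.e. shared), the
    shared block is copied first and this file's index is pointed at the
    copy, so modifying it cannot corrupt the other file.
    """
    for id_value in file_blocks[path]:
        owner = mapping_table.get(id_value)
        if owner is not None and owner != path:
            # Shared block: copy it and update the index (blockinfo-copy).
            block_index[path] = copy_block(block_index[owner])
            return False  # caller re-hashes and re-registers after modifying
    return True
```

After the modification completes, the file's key-value pairs with its new block identification values would be regenerated and inserted into the mapping table, as described above.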
The process of deleting data is similar to that of modifying data:
receiving an instruction for deleting a file;
All block identification values of the file to be deleted are queried via the mapping table in the key-value database; the mapping table is then checked, one identification value at a time, to see whether each file referencing that value is the file to be deleted. If all referencing files are the file to be deleted, the deletion is executed directly;
if any referencing file is not the file to be deleted, only the index information of the file to be deleted in the NameNode of the HDFS and its key-value pair information in the mapping table are deleted.
After these operations are in place, the original client's file-modify and file-delete logic is overwritten, and index access is added to the modify and delete logic, so that normal create, delete, modify, and query operations are unaffected after data deduplication.
The HDFS-based data deduplication method provided by the invention realizes data deduplication as a service of the HDFS, avoids external dependencies, and can effectively reduce storage cost. File block indexes are updated automatically after deduplication, file read/write performance is unaffected, deduplication is transparent and imperceptible to the application layer, and the key interfaces are overwritten, improving interface compatibility for deduplicated files.
Example 2
Based on the HDFS-based data deduplication method provided in embodiment 1, embodiment 2 of the present invention also provides an HDFS-based data deduplication system; Fig. 5 is a schematic diagram of this system. The system comprises a file identification module, a deduplication task module, a duplicate-checking module, and an index updating module;
the file identification module is configured to calculate an identification value for each file block under a work node to be deduplicated, form a key-value pair from the file's full path and the identification value, then aggregate the key-value pairs into a first mapping array and send the first mapping array to the namespace of the HDFS;
the deduplication task module is configured to aggregate the mapping arrays a second time in the namespace of the HDFS to obtain a second mapping array;
the duplicate-checking module is configured to query whether the identification value of a key-value pair in the second mapping array exists in the mapping table; if not, the key-value pair is inserted directly into the mapping table; if so, file deduplication is performed;
the index updating module is configured to perform file deduplication: it obtains the block index information corresponding to a key-value pair in the second mapping array and, according to the block index information, replaces the duplicate block's information with the block information already registered under the same identification value.
The detailed process executed by the file identification module in embodiment 2 of the present invention is as follows:
After each DataNode receives the deduplication task via heartbeat, it computes the identification values of all file blocks on the node, builds key-value pairs from the computed identification values through the mapping aggregation module, and sends them to the NameNode via heartbeat.
Here, the NameNode is the namespace of HDFS: it maintains the file system tree and all files and directories in the tree, including the mapping between files and DataNode nodes, the mapping between files and file blocks, and so on.
A DataNode is a working node of HDFS: it stores and retrieves data blocks as needed and periodically reports its stored block list to the NameNode via heartbeat.
The identification value of each file block is calculated with the SHA256 algorithm, which generates a 256-bit hash value for a file block of any size, represented as a hexadecimal string of length 64. The collision probability of SHA256 is far lower than that of the MD5 and SHA1 algorithms, making it better suited as the identification value for file block sets numbering in the billions or more.
Key-value pairs are formed from each file's full path string and the identification values of its file blocks; all key-value pairs are then aggregated into an array and reported to the NameNode via heartbeat.
The deduplication task module operates as follows: a trigger period for distributing deduplication tasks is set, and by default a deduplication task is triggered every six hours. Once triggered, the task is sent to each DataNode via heartbeat, and task distribution is completed by starting the identification calculation module through a remote procedure call. After the file identification module returns the aggregated mapping results, the deduplication task module performs a secondary aggregation to remove key-value pairs contributed by different replicas of the same file.
Fig. 2 is a flowchart of the secondary aggregation of mapping arrays in embodiment 1 of the present invention.
The key-value pair array reported by each node serves as an initial array. If the aggregated array is empty, any one initial array is designated as the aggregated array; otherwise, the key-value pairs of each initial array are inserted into the aggregated array one by one.
For each key-value pair, the aggregated array is queried for an identical pair: if one exists, the insertion fails and the pair is removed from its initial array; if not, the pair is inserted into the aggregated array and removed from its initial array.
The duplicate-checking module operates as follows: after the aggregated array is received, its key-value pairs are inserted into the database mapping table one by one. Before each insertion, the mapping table is queried for the same identification value: if none is found, the key-value pair is inserted directly; if one is found, file deduplication is performed.
The index updating module operates as follows: if the identification value of a key-value pair in the second mapping array already exists in the mapping table, the block index information corresponding to that key-value pair is obtained, and, according to the block index information, the duplicate block's information is replaced with the block information already registered under the same identification value.
Fig. 3 is a schematic diagram illustrating a process of replacing deduplication unit block information in a data deduplication method based on HDFS according to embodiment 1 of the present invention.
First, the block index information blockinfo-i of the file corresponding to the key-value pair not yet registered in the mapping table is obtained; then the block information blockinfo-j of the block with the duplicate identification value, already registered in the mapping table, is obtained; the duplicate block's information blockinfo-i2 is replaced with blockinfo-j2; finally, the key-value pair information Map1 is registered in the mapping table.
The index updating module also provides the procedures for modifying and deleting data. Fig. 4 is a schematic diagram of the modification process for unit block information in the HDFS-based data deduplication method according to embodiment 1 of the present invention.
Receiving an instruction for modifying a file;
All block identification values of the file to be modified are queried via the mapping table in the key-value database; the mapping table is then checked, one identification value at a time, to see whether each file referencing that value is the file to be modified. If all referencing files are the file to be modified, the modification is executed directly;
if any referencing file is not the file to be modified, the blocks whose identification values are shared are copied first, and the copied block index information blockinfo-copy is updated in the NameNode of the HDFS; modification of the file then continues until complete, after which key-value pairs of the file's full path and its new block identification values are regenerated and inserted into the mapping table.
The process of deleting data is similar to that of modifying data:
receiving an instruction for deleting a file;
All block identification values of the file to be deleted are queried via the mapping table in the key-value database; the mapping table is then checked, one identification value at a time, to see whether each file referencing that value is the file to be deleted. If all referencing files are the file to be deleted, the deletion is executed directly;
if any referencing file is not the file to be deleted, only the index information of the file to be deleted in the NameNode of the HDFS and its key-value pair information in the mapping table are deleted.
The client module's main function is to overwrite the original client's file-modify and file-delete logic; a step that accesses the corresponding unit of the index updating module is added to the modify and delete logic, so that the client's original create, delete, modify, and query functions work normally after data deduplication.
Example 3
The invention also proposes a device comprising:
a memory for storing a computer program;
a processor for implementing the following method steps when executing the computer program:
In step S101, an identification value is calculated for each file block under a work node to be deduplicated, a key-value pair is formed from the file's full path and the identification value, and the key-value pairs are aggregated into a first mapping array and sent to the namespace of the HDFS. After each DataNode receives the deduplication task via heartbeat, it computes the identification values of all file blocks on the node, builds key-value pairs from the computed identification values through the mapping aggregation module, and sends them to the NameNode via heartbeat.
Here, the NameNode is the namespace of HDFS: it maintains the file system tree and all files and directories in the tree, including the mapping between files and DataNode nodes, the mapping between files and file blocks, and so on.
A DataNode is a working node of HDFS: it stores and retrieves data blocks as needed and periodically reports its stored block list to the NameNode via heartbeat.
The identification value of each file block is calculated with the SHA256 algorithm, which generates a 256-bit hash value for a file block of any size, represented as a hexadecimal string of length 64. The collision probability of SHA256 is far lower than that of the MD5 and SHA1 algorithms, making it better suited as the identification value for file block sets numbering in the billions or more.
Key-value pairs are formed from each file's full path string and the identification values of its file blocks; all key-value pairs are then aggregated into an array and reported to the NameNode via heartbeat.
In step S102, the mapping arrays are aggregated a second time in the namespace of the HDFS to obtain a second mapping array. In this step, a trigger period for distributing deduplication tasks is set; by default, a deduplication task is triggered every six hours. Once triggered, the task is sent to each DataNode via heartbeat, and task distribution is completed by starting the identification calculation module through a remote procedure call. After the aggregated mapping results of step S101 are returned, a secondary aggregation is performed in this step to remove key-value pairs contributed by different replicas of the same file.
Fig. 2 is a flowchart of the secondary aggregation of mapping arrays in embodiment 1 of the present invention.
The key-value pair array reported by each node serves as an initial array. If the aggregated array is empty, any one initial array is designated as the aggregated array; otherwise, the key-value pairs of each initial array are inserted into the aggregated array one by one.
For each key-value pair, the aggregated array is queried for an identical pair: if one exists, the insertion fails and the pair is removed from its initial array; if not, the pair is inserted into the aggregated array and removed from its initial array.
In step S103, the mapping table is queried for the identification value of each key-value pair in the second mapping array: if it is absent, the key-value pair is inserted directly into the mapping table; if it is present, file deduplication is performed.
After the aggregated array is received, its key-value pairs are inserted into the database mapping table one by one. Before each insertion, the mapping table is queried for the same identification value: if none is found, the key-value pair is inserted directly; if one is found, file deduplication is performed.
In step S104, if the identification value of a key-value pair in the second mapping array already exists in the mapping table, the block index information corresponding to that key-value pair is obtained, and, according to the block index information, the duplicate block's information is replaced with the block information already registered under the same identification value.
Fig. 3 is a schematic diagram of the process of replacing block information during deduplication in the HDFS-based data deduplication method according to embodiment 1 of the present invention.
First, the block index information blockinfo-i of the file corresponding to the key-value pair not yet registered in the mapping table is obtained; then the block information blockinfo-j of the block with the repeated identification value is obtained from the registered mapping table; the block information blockinfo-i2 corresponding to the repeated-identification-value block is replaced with the block information blockinfo-j2; finally, the key-value-pair information Map1 is registered in the mapping table.
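The replacement step can be sketched as follows; representing block info as plain values and the registered mapping table as a dictionary from identification value to registered block info (blockinfo-j) is an illustrative assumption:

```python
def replace_duplicate_blocks(block_index, id_values, mapping_table):
    """Walk a file's block index (blockinfo-i); wherever a block's
    identification value is already registered, substitute the registered
    block info (blockinfo-j) so the stored block is shared."""
    released = []
    for pos, (id_value, block_info) in enumerate(zip(id_values, block_index)):
        if id_value in mapping_table:
            released.append(block_info)            # redundant physical block
            block_index[pos] = mapping_table[id_value]
        else:
            mapping_table[id_value] = block_info   # first occurrence: register it
    return released
```

The returned list corresponds to the redundant physical copies whose storage can then be reclaimed, which is where the storage-cost saving comes from.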
The HDFS-based data deduplication method further provides methods for modifying and deleting data. Fig. 4 is a schematic diagram of the process of modifying block information in the HDFS-based data deduplication method according to embodiment 1 of the present invention.
Receiving an instruction for modifying a file;
all block identification values of the file to be modified are queried through the mapping table in the key-value-pair database, and then, for each block identification value, the mapping table is queried to check whether the corresponding file is the file to be modified; if every queried file is the file to be modified, the modification operation is executed;
if any queried file is not the file to be modified, the block with the repeated identification value in the file to be modified is first copied, and the copied block index information blockinfo-copy is updated to the NameNode of the HDFS; modification of the file then continues until it is complete, after which the key-value pairs of the file and its block identification values are regenerated and inserted into the mapping table.
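The check before modification is essentially copy-on-write for shared blocks, sketched below; `copy_block`, the dictionary shapes, and the mapping of identification value to owning file are hypothetical stand-ins for the database mapping table and the NameNode update:

```python
def prepare_modify(mapping_table, file_blocks, target_path, copy_block):
    """Before modifying a file, copy any block that the mapping table
    registers to a different file, so that modifying the target file
    cannot corrupt files sharing the same physical block."""
    for pos, id_value in enumerate(file_blocks[target_path]):
        owner = mapping_table.get(id_value)
        if owner is not None and owner != target_path:
            # block is registered to another file: duplicate it, and the
            # copied block index (blockinfo-copy) is updated on the NameNode
            file_blocks[target_path][pos] = copy_block(id_value)
    # the caller then performs the modification and re-inserts fresh
    # (path, identification value) pairs into the mapping table
```

After this preparation, only private (unshared) blocks remain in the target file's index, so the modification proceeds as in plain HDFS.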
The process of deleting data is similar to that of modifying data:
receiving an instruction for deleting a file;
all block identification values of the file to be deleted are queried through the mapping table in the key-value-pair database, and then, for each block identification value, the mapping table is queried to check whether the corresponding file is the file to be deleted; if every queried file is the file to be deleted, the deletion operation is executed;
if any queried file is not the file to be deleted, only the index information of the file to be deleted in the NameNode of the HDFS and its key-value-pair information in the mapping table are deleted.
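The deletion logic can be sketched similarly; modeling the mapping table as identification value → set of owning files, rather than the database table of the actual scheme, is an assumption made for illustration:

```python
def delete_file(mapping_table, name_node_index, target_path):
    """Delete a deduplicated file: blocks still referenced by other files
    are kept; only the file's NameNode index entry and its key-value
    pairs in the mapping table are removed."""
    shared = False
    for id_value, owners in list(mapping_table.items()):
        if target_path in owners:
            owners.discard(target_path)
            if owners:                          # another file still uses this block
                shared = True
            else:
                del mapping_table[id_value]     # last reference: drop the pair
    del name_node_index[target_path]            # remove the file's index info
    return shared
```

Physical blocks are released only when their last referencing file is deleted, mirroring the "only delete index and mapping-table information" rule above.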
After the operation on the data is completed, the original client's file-modification and file-deletion logic is overwritten, and an access index is added in the modification and deletion logic, so that normal create, delete, update, and query operations are not affected after the data is deduplicated.
It should be noted that the technical solution of the present invention further provides an electronic device, including: a communication interface capable of information interaction with other devices such as network devices; and a processor, connected with the communication interface to realize information interaction with other devices, and configured, when running a computer program stored on the memory, to execute the HDFS-based data deduplication method provided by one or more of the above technical solutions. Of course, in practice, the various components in the electronic device are coupled together by a bus system. It will be appreciated that the bus system is used to enable communication among these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus. The memory in the embodiments of the present application is used to store various types of data to support the operation of the electronic device; examples of such data include any computer program for operating on the electronic device. It will be appreciated that the memory can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferroelectric random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be Random Access Memory (RAM), which acts as an external cache.
By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to comprise, without being limited to, these and any other suitable types of memory. The method disclosed in the embodiments of the present application may be applied to a processor, or may be implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
The software modules may be located in a storage medium in the memory; the processor reads the program in the memory and, in combination with its hardware, performs the steps of the method described above. When the processor executes the program, the corresponding processes of the methods in the embodiments of the present application are implemented; for brevity, they are not described here again.
Example 4
The invention further provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the following method steps:
in step S101, an identification value of each file block under a work node to be deduplicated is calculated, a key-value pair is formed from the file's full path and the identification value, and the key-value pairs are then aggregated into a first mapping array and sent to the namespace of the HDFS. After each DataNode receives the deduplication task through a heartbeat, all file blocks under the node are processed; the calculated identification values are constructed into key-value pairs by the mapping aggregation module, and the key-value pairs are then sent to the NameNode in the form of a heartbeat.
Wherein the NameNode is the namespace of the HDFS: it maintains the file system tree and all files and directories in the tree, including the mapping between files and DataNode nodes, the mapping between files and file blocks, and so on;
a DataNode is a working node of the HDFS: it stores and retrieves data blocks as needed, and periodically sends the list of blocks it stores to the NameNode in the form of a heartbeat.
The identification value of each file block is calculated with the SHA-256 algorithm, which generates a 256-bit hash value for a file block of any size, represented as a hexadecimal string of length 64. Compared with the MD5 and SHA-1 algorithms, its collision probability is far lower, making it better suited as the identification value for file blocks numbering in the billions or more.
Key-value pairs are formed from the full-path string of each file on the node and the identification values of the file's blocks; all key-value pairs are then aggregated into an array and reported to the NameNode through a heartbeat.
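As a concrete illustration of step S101, the sketch below computes SHA-256 identification values with Python's standard `hashlib` and assembles the key-value-pair array; the function names and the dictionary input format are illustrative assumptions, not part of the patented scheme:

```python
import hashlib

def block_identification_value(block_data: bytes) -> str:
    """Identification value of a file block: a SHA-256 digest rendered
    as a 64-character hexadecimal string (256 bits)."""
    return hashlib.sha256(block_data).hexdigest()

def build_key_value_pairs(node_files: dict) -> list:
    """Form (full path, identification value) pairs for every block under
    a work node and aggregate them into one array for the heartbeat
    report to the NameNode."""
    pairs = []
    for full_path, blocks in node_files.items():
        for block in blocks:
            pairs.append((full_path, block_identification_value(block)))
    return pairs
```

The 64-character hexadecimal form keeps the identification value compact and directly comparable as a database key.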
In steps S102 to S104, the secondary aggregation of the mapping arrays, the mapping-table query, the replacement of block information, and the modification and deletion of data are performed as described for the corresponding steps in embodiment 1, and details are not repeated here.
For a description of the device and the related parts of the storage medium for HDFS-based data deduplication provided in the embodiments of the present application, reference may be made to the detailed description of the corresponding parts of the HDFS-based data deduplication method provided in embodiment 1 of the present application, and details are not repeated here.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. In addition, those parts of the above technical solutions provided in the embodiments of the present application that are consistent with the implementation principles of corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, the scope of the present invention is not limited thereto. Various modifications and alterations will occur to those skilled in the art based on the foregoing description, and it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification or variation that a person skilled in the art can make on the basis of the technical scheme of the invention without creative effort still falls within the protection scope of the invention.