CN119441163A - A data compression method, data compression device and related equipment

Info

Publication number: CN119441163A
Application number: CN202310985239.8A
Authority: CN
Inventors: 谭浩良; 王鹏; 万斌朝硕; 陈仁海; 张弓; 邹翔宇; 夏文
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2023-08-04
Filing date: 2023-08-04
Publication date: 2025-02-14

Abstract

The application discloses a data compression method, a data compression device and related equipment. N similar data groups are acquired, where each similar data group comprises two similar data blocks: one is a reference block and the other is a similar block. The N similar data groups do not share any data feature with one another, and N is an integer greater than or equal to 2. The N reference blocks in the N similar data groups are aggregated to obtain a first aggregate block, and the N similar blocks in the N similar data groups are aggregated to obtain a second aggregate block. The first aggregate block and the second aggregate block are then delta compressed. Because the N similar data groups do not share data features, the reference blocks in the first aggregate block do not share data features with one another, and neither do the similar blocks in the second aggregate block. Delta compressing the first aggregate block against the second aggregate block can therefore identify repeated data from a plurality of data blocks at once, which improves the data compression ratio.

Description

Data compression method, data compression device and related equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data compression method, a data compression device, and related devices.

Background

With the rapid development of big data, data volumes are growing explosively, which poses great challenges for data storage management. To improve the utilization of storage space, data compression is an indispensable key technique in storage systems. Data compression is the process of encoding original data in less space: without losing useful information, the data is reduced in volume or reorganized according to a certain algorithm so as to reduce redundancy and storage space and to improve the efficiency of transmitting, storing and processing the data.

Delta compression (also called differential compression) eliminates redundant data between two different but similar pieces of data, so that, while the data common to both is kept once, only one additional piece of differential data needs to be stored. Delta compression comprises two stages: similarity detection and delta encoding. Similarity detection finds a pair of similar data blocks, namely a reference block and a similar block. Delta encoding then encodes the similar block as copy instructions and insert instructions relative to the reference block: data content in the similar block that is identical to content in the reference block is encoded with copy instructions, and data content that differs from the reference block is encoded with insert instructions.

In practical applications, a storage space often contains multiple data blocks that are similar to a given data block. If that data block is delta compressed against only one of its similar data blocks, the resulting reference block or difference block still contains data content identical to that of the other similar data blocks, so the compression ratio achieved by delta compression is low.

Disclosure of Invention

The application provides a data compression method, a data compression device and related equipment, which are used for improving the data compression ratio.

In a first aspect, the present application provides a method of data compression. The compression object of delta compression is two similar data blocks in the storage space. Therefore, it is necessary to perform similarity detection on the data blocks in the storage space, so as to screen out two similar data blocks. Of the two similar data blocks, either one of them may be selected as a reference block and the other one as a similar block. In the similarity detection technique, at least one identical data feature is included between two similar data blocks that are detected, and the data content between the two similar data blocks is not exactly the same. In other words, between two similar data blocks, at least one identical data feature is included between the reference block and the similar block, and the data content between the reference block and the similar block is not exactly the same.

Typically, multiple groups of such similar data block pairs (each pair regarded as one group) are detected in the storage space. In the application, N similar data groups are acquired from these groups, where N is an integer greater than or equal to 2. The N similar data groups do not share any data feature with one another. However, because a similar data group in the present application is a pair of two similar data blocks, the two data blocks within one similar data group necessarily include at least one identical data feature.

It should be understood that the data features mentioned in the present application correspond to fingerprints of data blocks. A data feature may be a feature value obtained by performing a hash operation on a part of the data content of the data block, or may be a super-feature value (Super-feature) obtained by combining a plurality of feature values, or may be another feature used to represent a part of the data content of the data block, which is not limited in the present application.

After the N data blocks are aggregated, an aggregate block in the present application (e.g., a first aggregate block or a second aggregate block) is obtained, and the data blocks within a single aggregate block do not share data features with one another. Specifically, the N reference blocks in the N similar data groups are aggregated to obtain the first aggregate block, and the N similar blocks in the N similar data groups are aggregated to obtain the second aggregate block. Since the N similar data groups do not share data features, the reference blocks in the first aggregate block do not share data features with one another, and similarly the similar blocks in the second aggregate block do not share data features with one another.
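The selection and aggregation described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Block` type, the greedy selection order, and the function names are assumptions introduced here, and a data feature is modeled as an opaque integer.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

class Block(NamedTuple):
    """Hypothetical block record: an identifier plus its data features."""
    block_id: str
    features: FrozenSet[int]

Group = Tuple[Block, Block]  # (reference block, similar block)

def select_disjoint_groups(pairs: List[Group], n: int) -> List[Group]:
    """Greedily keep up to n similar pairs that share no feature with
    any previously kept pair (the greedy order is an assumption)."""
    chosen: List[Group] = []
    seen: set = set()
    for ref, sim in pairs:
        if not (ref.features & sim.features):
            continue  # the two blocks are not similar to each other
        group_features = ref.features | sim.features
        if group_features & seen:
            continue  # shares a feature with an already chosen group
        chosen.append((ref, sim))
        seen |= group_features
        if len(chosen) == n:
            break
    return chosen

def aggregate(groups: List[Group]) -> Tuple[List[str], List[str]]:
    """Logical aggregation: only block identifiers are collected,
    the blocks themselves are neither moved nor copied."""
    first = [ref.block_id for ref, _ in groups]
    second = [sim.block_id for _, sim in groups]
    return first, second
```

Keeping the i-th reference block and the i-th similar block at the same position in the two aggregate blocks preserves the pairing that makes the two aggregate blocks similar to each other.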

In the present application, the process of aggregating a plurality of data blocks into an aggregate block corresponds to a logical division, and does not change the physical properties of the data blocks, such as the data content, the data structure or the storage location of each data block. Aggregating a plurality of data blocks into one aggregate block refers to performing subsequent processing (e.g., performing delta compression) on the plurality of data blocks conforming to the aggregate condition as a whole. The multiple data blocks within a single aggregate block may be from different memory addresses, and the memory addresses of the data blocks may or may not be contiguous, as the application is not limited in this respect.

Since the two data blocks (the reference block and the similar block) in each similar data group are respectively allocated to the first aggregation block and the second aggregation block, the first aggregation block and the second aggregation block are necessarily similar, and therefore, the first aggregation block and the second aggregation block can be used as two similar data blocks. In the application, a first aggregation block is used as a reference block, a second aggregation block is used as a similar block, and differential compression is carried out to obtain an aggregation difference block. Wherein the aggregate difference block includes metadata for pointing to repeated data between the first aggregate block and the second aggregate block, and a difference between the first aggregate block and the second aggregate block.

In the present application, since the N similar data groups do not share data features, the reference blocks in the first aggregate block do not share data features with one another, and neither do the similar blocks in the second aggregate block. However, since the two data blocks in each similar data group are respectively allocated to the first aggregate block and the second aggregate block, the first aggregate block and the second aggregate block are necessarily similar. Repeated data from a plurality of data blocks can therefore be identified when the first aggregate block and the second aggregate block are delta compressed, thereby improving the data compression ratio.

On the other hand, in the process of acquiring a similar data group, similarity detection is performed on two data blocks at a time. When two data blocks are similar and do not share any data feature with the other similar data groups, the two data blocks can be used as one similar data group, aggregated with the other similar data groups, and then delta compressed. Therefore, the data compression method in the application does not need to traverse all data blocks in the storage space. The method is thus suitable for storage spaces in which new data is stored in real time (such as online storage or cloud storage scenarios), which improves the flexibility of the scheme.

Based on the first aspect, in an alternative implementation manner, a first data block is acquired, where the first data block is a new data block stored in the storage space. The first data block is a duplicate of a certain data block (a first similar block) in the second aggregate block; that is, the data content of the first data block is exactly the same as that of the first similar block, the first similar block being one of the similar blocks of the second aggregate block. In this case, in the present application, instead of deduplicating the first data block against the first similar block, delta compression may be performed on the first reference block and the first data block to obtain a first difference block, where the first difference block includes metadata for pointing to repeated data between the first reference block and the first data block, and a difference between the first reference block and the first data block. The first reference block is one of the reference blocks of the first aggregate block, and the first reference block and the first similar block are from the same similar data group. Therefore, when the first data block is decompressed and restored, only the first reference block needs to be read into memory, and the whole second aggregate block does not need to be decompressed, which improves the efficiency of data decompression and restoration.

Based on the first aspect, in an alternative implementation manner, a second data block is acquired, where the second data block is a new data block stored in the storage space. The second data block and a second similar block in the second aggregate block include at least one identical data feature, the data content of the second data block and that of the second similar block are not exactly the same, and the second similar block is one of the similar blocks of the second aggregate block. In other words, the second data block is similar to a certain data block (the second similar block) in the second aggregate block. In this case, when the second reference block is also similar to the second data block, delta compression is performed on the second reference block and the second data block to obtain a second difference block, where the second difference block includes metadata for pointing to the second reference block, and a difference between the second reference block and the second data block. The second reference block is one of the reference blocks of the first aggregate block, and the second reference block and the second similar block are from the same similar data group.

Based on the first aspect, in an optional implementation manner, the third reference block includes metadata of a third data block, data contents between the third data block and the third reference block are identical, the third data block is a data block except for N similar data groups, and the third reference block is one of the reference blocks of the first aggregation block. In other words, when one data block in the aggregate block is repeated with other data blocks except for N similar data groups, the data block in the aggregate block may be metadata of the repeated data block, that is, the data block in the aggregate block does not have to be an original data, thereby improving flexibility of the aggregate block.

Based on the first aspect, in an alternative embodiment, a third aggregate block exists (for example, one obtained by aggregation after new data blocks have accumulated), and the data content of the third aggregate block is identical to that of the second aggregate block. In this case, deduplicating the third aggregate block against the second aggregate block does not cause read amplification. Thus, the third aggregate block may be updated, the updated third aggregate block comprising metadata for pointing to the second aggregate block.

Based on the first aspect, in an optional implementation manner, there is a fourth aggregation block, the fourth aggregation block includes N data blocks, at least one identical data feature is included between the fourth aggregation block and the second aggregation block, and data content between the fourth aggregation block and the second aggregation block is not identical. In other words, the fourth aggregate block is similar to the second aggregate block, at which time the first aggregate block and the fourth aggregate block may be delta compressed to obtain an updated fourth aggregate block, the updated fourth aggregate block including the delta between the first aggregate block and the fourth aggregate block and metadata for pointing to the duplicate data between the first aggregate block and the fourth aggregate block.

Based on the first aspect, in an optional implementation manner, M data blocks to be compressed are acquired, the data contents of the M data blocks to be compressed are not identical, and M is an integer greater than or equal to 2N. And performing similarity detection on the M data blocks to be compressed, and determining N similar data groups in the M data blocks to be compressed.

Based on the first aspect, in an optional implementation manner, X initial data blocks are acquired, where X is an integer greater than or equal to M. And performing de-duplication processing on the initial data blocks with identical data content in the X initial data blocks to obtain M data blocks to be compressed.
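The deduplication step that turns X initial data blocks into M data blocks to be compressed can be sketched as below. The use of SHA-256 as the block fingerprint and the helper names are assumptions made for illustration; the application does not prescribe a particular hash.

```python
import hashlib
from typing import List

def deduplicate(initial_blocks: List[bytes]) -> List[bytes]:
    """Keep the first occurrence of each distinct block content.
    Blocks are compared by SHA-256 digest (an assumed fingerprint)."""
    seen = set()
    unique: List[bytes] = []
    for block in initial_blocks:
        digest = hashlib.sha256(block).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(block)
    return unique

def max_groups(m: int) -> int:
    """Upper bound on N for M blocks, since each similar data group
    consumes two blocks (M >= 2N)."""
    return m // 2
```

With X initial blocks, `deduplicate` returns the M blocks with pairwise different content, so X >= M holds by construction, and `max_groups(M)` gives the largest N for which M >= 2N.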

In a second aspect, the present application provides a data compression apparatus comprising:

An obtaining unit, configured to obtain N similar data sets, where each similar data set includes a reference block and a similar block, at least one identical data feature is included between the reference block and the similar block in each similar data set, the data content between the reference block and the similar block in each similar data set is not identical, the N similar data sets do not include identical data features, and N is an integer greater than or equal to 2;

The processing unit is used for aggregating the N similar data groups to obtain a first aggregation block and a second aggregation block, wherein the first aggregation block comprises N reference blocks of the N similar data groups, and the second aggregation block comprises N similar blocks in the N similar data groups;

And the processing unit is further used for performing differential compression on the first aggregation block and the second aggregation block to obtain an aggregation differential block, wherein the aggregation differential block comprises metadata for pointing to repeated data between the first aggregation block and the second aggregation block and differential between the first aggregation block and the second aggregation block.

Based on the second aspect, in an optional implementation manner, the acquiring unit is further configured to acquire a first data block, where data content between the first data block and a first similar block in the second aggregate block is identical, and the first similar block is one of similar blocks in the second aggregate block;

The processing unit is further configured to perform differential compression on the first reference block and the first data block to obtain a first differential block, where the first differential block includes metadata for pointing to repeated data between the first reference block and the first data block, and differential between the first reference block and the first data block, and the first reference block is one of reference blocks of the first aggregate block, and the first reference block and the first similar block are from a same similar data set.

Based on the second aspect, in an optional implementation manner, the acquiring unit is further configured to acquire a second data block, where at least one same data feature is included between the second data block and a second similar block in the second aggregate block, and data content between the second data block and the second similar block is not completely the same, and the second similar block is one of similar blocks in the second aggregate block;

The processing unit is further configured to perform differential compression on a second reference block and a second data block to obtain a second difference block, where the second difference block includes metadata for pointing to the second reference block, and a differential between the second reference block and the second data block, and the second reference block is one of the reference blocks of the first aggregate block, and the second reference block and the second similar block are from the same similar data set.

Based on the second aspect, in an alternative implementation manner, the third reference block includes metadata of a third data block, the data content between the third data block and the third reference block is identical, the third data block is a data block except for N similar data groups, and the third reference block is one of the reference blocks of the first aggregation block.

Based on the second aspect, in an optional implementation manner, the acquiring unit is further configured to acquire a third aggregate block, where the third aggregate block includes N data blocks, and the data content of the third aggregate block is identical to that of the second aggregate block;

the processing unit is further configured to update a third aggregate block, where the updated third aggregate block includes metadata for pointing to the second aggregate block.

Based on the second aspect, in an optional implementation manner, the acquiring unit is further configured to acquire a fourth aggregate block, where the fourth aggregate block includes N data blocks, at least one identical data feature is included between the fourth aggregate block and the second aggregate block, and data content between the fourth aggregate block and the second aggregate block is not completely identical;

The processing unit is further configured to perform delta compression on the fourth aggregate block according to the first aggregate block, to obtain an updated fourth aggregate block, where the updated fourth aggregate block includes a difference between the first aggregate block and the fourth aggregate block, and metadata for pointing to repeated data between the first aggregate block and the fourth aggregate block.

Based on the second aspect, in an alternative embodiment, the processing unit is specifically configured to:

aggregating the N reference blocks in the N similar data groups to obtain a first aggregate block;

and aggregating N similar blocks in the N similar data groups to obtain a second aggregation block.

Based on the second aspect, in an optional implementation manner, the acquiring unit is specifically configured to:

obtaining M data blocks to be compressed, wherein the data contents among the M data blocks to be compressed are not identical, and M is an integer greater than or equal to 2N;

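performing similarity detection on the M data blocks to be compressed, and determining the N similar data groups among the M data blocks to be compressed.

Based on the second aspect, in an optional implementation manner, the acquiring unit is specifically configured to:

acquiring X initial data blocks, wherein X is an integer greater than or equal to M;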

and performing de-duplication processing on the initial data blocks with identical data content in the X initial data blocks to obtain M data blocks to be compressed.

The content of the information interaction and the execution process of the embodiment shown in the present aspect is based on the same concept as the embodiment shown in the first aspect, so the description of the beneficial effects shown in the present aspect is shown in the above first aspect, and details are not repeated here.

In a third aspect, the application provides a computing device comprising a processor coupled with a memory for storing instructions that, when executed by the processor, cause the computing device to implement the method of the first aspect or any of the possible implementations of the first aspect.

In a fourth aspect, the application provides a computer readable storage medium having stored thereon instructions which, when executed, cause a computer to perform the method of the first aspect or any of the possible implementations of the first aspect.

In a fifth aspect, the application provides a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method of the first aspect, or any of the possible implementation manners of the first aspect.

In a sixth aspect, the application provides a chip comprising a processor coupled to a memory for storing instructions which, when executed by the processor, cause the chip to implement the method of the first aspect or any of the possible implementations of the first aspect.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic view of a scene of similarity detection based on the super-feature method;

FIG. 2 is a schematic diagram of a scenario of a conventional data compression scheme;

FIG. 3 is a schematic diagram of a scenario of another conventional data compression scheme;

FIG. 4 is a flow chart of a data compression method according to an embodiment of the application;

FIG. 5 is a schematic diagram of a scenario of one possible data compression in an embodiment of the present application;

FIG. 6 is a schematic diagram of one possible configuration of a redundant data reduction system;

FIG. 7 is a diagram illustrating a compression flow for existing data according to an embodiment of the present application;

FIG. 8 is a schematic diagram illustrating an experimental method of data compression according to an embodiment of the present application;

FIG. 9 is a schematic diagram of one compression method for new data;

FIG. 10 is a schematic diagram of a compression method after optimization for new data in an embodiment of the present application;

FIG. 11 is a schematic diagram of a compression flow for new data according to an embodiment of the present application;

FIG. 12 is a schematic diagram of an application scenario of a data compression method according to an embodiment of the present application;

FIG. 13 is a schematic diagram of another application scenario of the data compression method in the embodiment of the present application;

FIG. 14 is a schematic diagram of a data compression device according to an embodiment of the present application;

FIG. 15 is a schematic diagram of a logic structure of a computing device according to an embodiment of the present application.

Detailed Description

The embodiment of the application provides a data compression method, a data compression device and related equipment, which are used for improving the data compression ratio.

Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. The terminology used in the description of the embodiments of the application is for the purpose of describing particular embodiments of the application only and is not intended to be limiting of embodiments of the application. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean that A exists alone, that both A and B exist, or that B exists alone, where A and B may each be singular or plural. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b, or c" may represent a, b, c, a-b, a-c, b-c, or a-b-c, where a, b and c may each be single or plural.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The following description is given of some terms or terminology used in connection with the embodiments of the present application, which also form part of the description.

Data compression (Data Compression): with the rapid development of big data, data volumes are growing explosively, which poses great challenges for data storage management. To improve the utilization of storage space, data compression is an essential key technology in storage systems. Data compression is the process of encoding original data in less space: without losing useful information, the data is reduced in volume or reorganized according to a certain algorithm so as to reduce redundancy and storage space and to improve the efficiency of transmitting, storing and processing the data.

Data deduplication is a technology for efficiently eliminating redundant data at large scale and has been a hot topic in storage system research in recent years. Data deduplication can not only greatly save storage space and improve the utilization of storage resources, but also improve the efficiency of network bandwidth usage by avoiding the transmission of redundant data. However, data deduplication also faces many challenges. Because data deduplication uses a secure hash digest to uniquely identify a data block, looking up and managing duplicate data is greatly simplified, but only completely identical data blocks can be identified, not data blocks that are merely very similar. For example, when two data blocks A1 and A2 differ in only a few bytes, their data contents are very close, yet computing the secure hash digest of each yields two different data block fingerprints, so the redundancy between the similar data of A1 and A2 is not eliminated. Because delta compression (Delta Compression) can identify and eliminate this type of redundant data well, it is attracting increasing interest.
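The A1/A2 situation above can be reproduced in a few lines. This is an illustrative sketch only: SHA-256 stands in for the unspecified secure hash, and the 4096-byte block size and the position of the changed byte are arbitrary choices.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    """Secure hash digest used as the block fingerprint (SHA-256 assumed)."""
    return hashlib.sha256(block).hexdigest()

# A1 is a 4096-byte block; A2 differs from A1 in a single byte.
a1 = bytes(4096)
a2_buf = bytearray(a1)
a2_buf[100] = 1
a2 = bytes(a2_buf)

# Number of byte positions at which the two blocks agree.
shared = sum(x == y for x, y in zip(a1, a2))

# The fingerprints differ, so fingerprint-based deduplication treats
# A1 and A2 as unrelated blocks even though they are almost identical.
fingerprints_differ = fingerprint(a1) != fingerprint(a2)
```

Here `shared` is 4095 out of 4096 bytes while `fingerprints_differ` is true, which is the redundancy that deduplication alone cannot remove and that delta compression targets.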

The delta compression technique is a technique of eliminating redundant data between two different pieces of data, so that only one piece of differential data needs to be additionally stored. Data deduplication can only eliminate duplicate data that has the same content, while delta compression can eliminate redundancy for two pieces of data of different content. The compression objects of delta compression are similar files or data blocks. In order to improve the delta compression effect as much as possible and reduce the calculation overhead of delta encoding, a similarity detection technique is often used to select the target of delta compression in the delta compression process.

Most existing similarity feature calculation methods are based on the classical theory of Broder. Broder describes the similarity problem as a set intersection problem and proposes MinHash as the theoretical basis for estimating the similarity of sets. To quickly evaluate the similarity of two sets, Broder proposed the "Broder theorem": assume that A and B are two sets, and that F(A) and F(B) are the corresponding sets obtained by applying a MinHash function F to the elements of A and B. If min(S) denotes the smallest element in a set S, the theorem can be formulated as:

Pr[min(F(A)) = min(F(B))] = |A ∩ B| / |A ∪ B|

Broder's theorem states that the probability that two sets have the same minimum hash element is equal to the Jaccard coefficient of the two sets. If the elements of sets A and B consist of all the shingles of two data blocks C1 and C2 (a file is divided into overlapping fixed-length contiguous strings, called shingles), then according to Broder's theorem the probability that F(A) and F(B) have the same minimum element is equal to the similarity of the two data blocks.
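The relationship between minimum hash elements and the Jaccard coefficient can be checked empirically with a small sketch. The shingle length, the number of hash functions, and the use of salted BLAKE2b as the family of hash functions are assumptions chosen for illustration.

```python
import hashlib

def shingles(data: bytes, k: int = 4) -> set:
    """Overlapping fixed-length contiguous substrings of a block."""
    return {data[i:i + k] for i in range(len(data) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """For each of num_hashes salted hash functions, keep the minimum
    hash value over all items (items must be non-empty)."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            hashlib.blake2b(item, digest_size=8, salt=salt).digest()
            for item in items))
    return signature

def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of hash functions whose minimum element agrees; by
    Broder's theorem this estimates the Jaccard coefficient."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

For two blocks C1 and C2, `estimate_similarity(minhash_signature(shingles(C1)), minhash_signature(shingles(C2)))` approaches `jaccard(shingles(C1), shingles(C2))` as the number of hash functions grows.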

Odess uses a feature calculation method based on Broder's theory, called "NTransform", in combination with the Super-feature method proposed by Broder. NTransform calculates a plurality of hash functions and extracts a plurality of feature values. In practice, computing a plurality of hash functions is too expensive. Therefore, to extract a plurality of independent feature values, NTransform computes a Rabin fingerprint fp over the content of a sliding window and then applies several linear transformations to the fingerprint to generate a plurality of independent hash values, as shown in the following formula:

h_i(fp) = (m_i * fp + a_i) mod 2^32

Where m_i and a_i are a set of random numbers. In a large-scale storage system, computing the similarity between a given data block and all other data blocks and then selecting the most similar one is expensive. In most cases, however, it is sufficient to find a relatively similar data block. The Super-feature method combines multiple feature values into one super-feature value for similarity matching: two data blocks are considered similar as long as they share one identical super-feature value. Combining a plurality of feature values into a super-feature value avoids computing the exact similarity and greatly reduces the cost of the matching stage. Referring to fig. 1, fig. 1 is a schematic view of a scene of similarity detection based on the super-feature method. As shown in fig. 1, the Super-feature method packs a plurality of feature values into super-feature values for matching, which on the one hand reduces the probability that dissimilar data blocks are matched through an identical feature value, and on the other hand raises the similarity threshold of the similar blocks that are found. In the scenario shown in fig. 1, every 4 feature values are combined into one super-feature value, so 12 feature values are combined into 3 super-feature values: SF1, SF2, and SF3. As long as any one of the super-feature values is equal between two data blocks, the similarity between the two data blocks is considered high.
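The feature extraction and grouping described above can be sketched as follows. This is a minimal illustration under several assumptions that do not come from the application: a simple polynomial rolling hash stands in for the Rabin fingerprint, the maximum transformed value over all windows is taken as each feature, the window length is 16 bytes, and SHA-1 is used to pack each group of 4 features into one super-feature. All function names are hypothetical.

```python
import hashlib
import random

MOD = 1 << 32  # the 2^32 modulus used by the linear transformations


def rolling_fingerprints(data: bytes, window: int = 16):
    """Yield one fingerprint per sliding window (stand-in for a Rabin fingerprint)."""
    base = 257
    top = pow(base, window - 1, MOD)
    fp = 0
    for i, byte in enumerate(data):
        if i >= window:
            # Drop the byte that just left the window.
            fp = (fp - data[i - window] * top) % MOD
        fp = (fp * base + byte) % MOD
        if i >= window - 1:
            yield fp


def make_transforms(n: int, seed: int = 0) -> list:
    """Generate n random (m_i, a_i) pairs; m_i is kept odd."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MOD, 2), rng.randrange(MOD)) for _ in range(n)]


def features(data: bytes, transforms: list, window: int = 16) -> list:
    """Feature i is the maximum of h_i(fp) = (m_i * fp + a_i) mod 2^32 over all windows."""
    best = [0] * len(transforms)
    for fp in rolling_fingerprints(data, window):
        for i, (m, a) in enumerate(transforms):
            h = (m * fp + a) % MOD
            if h > best[i]:
                best[i] = h
    return best


def super_features(feats: list, group_size: int = 4) -> list:
    """Pack each group of feature values into one super-feature value."""
    result = []
    for start in range(0, len(feats), group_size):
        packed = b"".join(f.to_bytes(4, "big") for f in feats[start:start + group_size])
        result.append(hashlib.sha1(packed).hexdigest())
    return result


def looks_similar(sf_a: list, sf_b: list) -> bool:
    """Two blocks are treated as similar if any super-feature at the same position matches."""
    return any(x == y for x, y in zip(sf_a, sf_b))
```

With 12 transforms and a group size of 4, `super_features` returns 3 values, mirroring SF1, SF2, and SF3 in the scenario of fig. 1.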

Delta compression (Delta Compression) is a compression technique derived from dictionary coding. Unlike dictionary coding, delta compression does not encode within a compression window of a single object, but between two objects. The compression process of the LZ77 compression algorithm can be understood simply as finding the longest match between the upcoming string and the previously processed strings and, if a match is found, replacing it with a COPY command. The most commonly used delta compression tools at present, such as vcdiff, xdelta, and Zdelta, can be regarded simply as LZ77 compression across two similar documents. Assume that there are two similar data blocks A and B, where data block B is the "similar block" that needs to be compressed and data block A is the "reference block" for data block B. Delta compression finds the content that exists in data block B but not in data block A and writes it into a file; the process of generating this file is called delta encoding. Specifically, delta encoding uses a sliding window technique to detect the content shared by data block A and data block B. For content in data block B, if the same content exists in data block A, it is encoded with a COPY command; otherwise it is encoded with an INSERT command. The COPY command contains the offset and length, within data block A, of the content repeated between data block A and data block B, while the INSERT command contains the length of the non-repeated content in data block B and the content itself. The resulting file containing the COPY and INSERT commands is called a difference block. The difference block is generally smaller than data block B and is stored in place of data block B, thereby reducing the storage overhead.

Both the speed of delta encoding and the size of the difference block are related to the similarity between data block A and data block B: the higher the similarity, the faster the delta encoding and the smaller the difference block. Delta decoding regenerates data block B by executing the commands in the difference block against data block A.
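The COPY/INSERT encoding described above can be illustrated with a small greedy encoder and its decoder. This is a simplified sketch and not the algorithm of vcdiff, xdelta, or Zdelta: it indexes fixed 4-byte substrings of the reference block (an assumed seed length), keeps only the first occurrence of each, and represents commands as Python tuples rather than a binary format.

```python
def delta_encode(reference: bytes, target: bytes, k: int = 4) -> list:
    """Encode target against reference as a list of COPY and INSERT commands."""
    # Index the first position of every k-byte substring of the reference block.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], i)

    ops = []
    literal = bytearray()  # bytes of the target not found in the reference
    pos = 0
    while pos < len(target):
        start = index.get(target[pos:pos + k])
        if start is None:
            literal.append(target[pos])
            pos += 1
            continue
        # Extend the match as far as both blocks agree.
        length = k
        while (start + length < len(reference)
               and pos + length < len(target)
               and reference[start + length] == target[pos + length]):
            length += 1
        if literal:
            ops.append(("INSERT", bytes(literal)))
            literal.clear()
        ops.append(("COPY", start, length))
        pos += length
    if literal:
        ops.append(("INSERT", bytes(literal)))
    return ops


def delta_decode(reference: bytes, ops: list) -> bytes:
    """Rebuild the target block from the reference block and the commands."""
    out = bytearray()
    for op in ops:
        if op[0] == "COPY":
            _, offset, length = op
            out += reference[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

Encoding a block against itself yields a single COPY command covering the whole block, which matches the observation that higher similarity leads to a smaller difference block.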

Next, an existing scheme of data compression will be described.

Referring to fig. 2, fig. 2 is a schematic diagram of a conventional data compression scheme. As shown in fig. 2, it is necessary to traverse all the data blocks in the storage space, find a plurality of different similar blocks (e.g., data block B, data block C, data block D, and data block E shown in fig. 2) of one data block (e.g., data block a shown in fig. 2), then aggregate the data block and the similar blocks of the data block into one whole, and then self-compress the whole.

For the data compression scheme shown in fig. 2, since all data blocks in the storage space need to be traversed, when new data is stored in the storage space, all similar blocks of the new data need to be searched, which causes huge computing resource overhead. Therefore, the data compression scheme is difficult to be applied to a storage space (e.g., an online storage scene or a cloud storage scene) where new data is stored in real time.

Referring to fig. 3, fig. 3 is a schematic diagram of a scenario of another conventional data compression scheme. As shown in fig. 3, it is necessary to traverse all the data blocks in the storage space, find a plurality of data blocks having at least one identical characteristic fingerprint, and combine these data blocks into a joint compression group. One data block in the joint compression group (e.g., data block A shown in fig. 3) is selected as the reference block, and the remaining data blocks (e.g., data block B, data block C, data block D, and data block E shown in fig. 3) are delta-compressed with respect to the reference block (data block A). Finally, the reference block (data block A) is self-compressed.

For the scheme shown in fig. 3, in one storage space there will often be multiple data blocks similar to a given data block. When the data block is delta-compressed against only one of its similar data blocks, the resulting reference block or difference block still contains data content identical to that of the other similar data blocks, so the compression ratio achieved by delta compression is low.

On the other hand, the data compression scheme shown in fig. 3 is the same as the data compression scheme shown in fig. 2, and all data blocks in the storage space need to be traversed to find similar blocks of the new data, so the data compression scheme shown in fig. 3 is also difficult to be applied to the storage space (such as an online storage scene or a cloud storage scene) where the new data is stored in real time.

In view of the above, the embodiment of the application discloses a data compression method, a data compression device and related equipment, which are used for improving the data compression ratio. The data compression method in the embodiment of the application can be applied to a device (including but not limited to a mobile phone, a computer, a network device, a cloud storage device or a server, etc.) which stores a plurality of data blocks and has a data storage function, and is used for compressing data stored by the device. The embodiment of the application is not limited to the execution body for executing the data compression method, for example, the execution body of the data compression method in the embodiment of the application may be a device itself storing a plurality of data blocks, or may also be a chip, a chip system, or a processor supporting the device to implement the data compression method, or may also be a logic node, a logic module, or software capable of implementing all or part of the data compression function. Referring to fig. 4, fig. 4 is a flow chart illustrating a data compression method according to an embodiment of the application. As shown in fig. 4, the method for compressing data in the embodiment of the present application includes, but is not limited to, step 101, step 102, and step 103. Next, steps 101, 102, and 103 will be described in detail.

101. N similar data sets are obtained, each similar data set comprises a reference block and a similar block, the N similar data sets do not comprise the same data characteristics, and N is an integer greater than or equal to 2.

The compression object of delta compression is two similar data blocks in the storage space. Therefore, it is necessary to perform similarity detection on the data blocks in the storage space, so as to screen out two similar data blocks. Of the two similar data blocks, either one of them may be selected as a reference block and the other one as a similar block. In the similarity detection technique, at least one identical data feature is included between two similar data blocks that are detected, and the data content between the two similar data blocks is not exactly the same. In other words, between two similar data blocks, at least one identical data feature is included between the reference block and the similar block, and the data content between the reference block and the similar block is not exactly the same.

Typically, multiple pairs of such similar data blocks (each pair regarded as one group) are detected in the storage space. In the embodiment of the present application, N similar data groups are acquired from these pairs, where N is an integer greater than or equal to 2. The N similar data groups do not share the same data features with one another. Within a single similar data group, however, the two data blocks are similar to each other and therefore necessarily include the same data features.

It should be understood that the data features mentioned in the embodiments of the present application correspond to fingerprints of data blocks. A data feature may be a feature value obtained by performing a hash operation on part of the data content of the data block, or a super-feature value (Super-feature) obtained by combining a plurality of feature values, or another characteristic that represents part of the data content of the data block, which is not limited in the embodiment of the present application.
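One possible way to obtain N similar data groups that do not share data features is sketched below. The application does not prescribe a selection procedure, so the greedy first-fit strategy, the `SimilarGroup` type, and the representation of data features as a set of strings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimilarGroup:
    """One reference block and one similar block, plus the features they carry."""
    reference_id: str
    similar_id: str
    features: frozenset  # data features of the two blocks (hypothetical encoding)


def pick_disjoint_groups(candidates: list, n: int) -> list:
    """Greedily pick up to n groups whose feature sets are pairwise disjoint."""
    chosen = []
    seen = set()  # every feature already used by a chosen group
    for group in candidates:
        if seen.isdisjoint(group.features):
            chosen.append(group)
            seen |= group.features
            if len(chosen) == n:
                break
    return chosen
```

A group that shares any feature with an already chosen group is skipped, so the blocks of the chosen groups carry no common data feature across groups.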

102. Aggregate the N similar data groups to obtain a first aggregation block and a second aggregation block, wherein the first aggregation block comprises the N reference blocks of the N similar data groups, and the second aggregation block comprises the N similar blocks of the N similar data groups.

After the N data blocks are aggregated, an aggregate block (e.g., a first aggregate block and a second aggregate block) in an embodiment of the present application is obtained, where the same data features are not included in a single aggregate block. Specifically, N reference blocks in N similar data groups are aggregated to obtain a first aggregation block, and N similar blocks in N similar data groups are aggregated to obtain a second aggregation block. As can be seen from the above, the N similar data sets do not include the same data features, and therefore, the reference blocks in the first aggregate block should not include the same data features, and similarly, the similar blocks in the second aggregate block should not include the same data features.

In the embodiment of the present application, the process of aggregating a plurality of data blocks into an aggregate block corresponds to a logical division, and does not change the physical properties of the data blocks, such as the data content, the data structure or the storage location of each data block. Aggregating a plurality of data blocks into one aggregate block refers to performing subsequent processing (e.g., performing delta compression) on the plurality of data blocks conforming to the aggregate condition as a whole. The multiple data blocks within a single aggregate block may be from different memory addresses, and the memory addresses of the data blocks may or may not be contiguous, which is not limited by embodiments of the present application.

In one possible implementation, the third reference block includes metadata of a third data block, the data content between the third data block and the third reference block is identical, the third data block is a data block other than the N similar data groups, and the third reference block is one of the reference blocks of the first aggregate block. In other words, when one data block in the aggregate block is repeated with other data blocks other than the N similar data groups, the data block in the aggregate block may be metadata of its repeated data block (e.g., a reference pointer of the repeated data block), that is, the data block in the aggregate block does not have to be an original data, thereby improving flexibility of the aggregate block.

103. Perform differential compression on the first aggregation block and the second aggregation block to obtain an aggregation difference block.

Since the two data blocks (the reference block and the similar block) in each similar data group are respectively allocated to the first aggregation block and the second aggregation block, the first aggregation block and the second aggregation block are necessarily similar, and therefore, the first aggregation block and the second aggregation block can be used as two similar data blocks. In the embodiment of the application, the first aggregation block is used as a reference block, the second aggregation block is used as a similar block, and differential compression is carried out to obtain an aggregation difference block. Wherein the aggregate difference block includes metadata for pointing to repeated data between the first aggregate block and the second aggregate block, and a difference between the first aggregate block and the second aggregate block.

In the embodiment of the present application, since the N similar data groups do not include the same data features, the reference blocks in the first aggregate block should not include the same data features, and similarly, the similar blocks in the second aggregate block should not include the same data features. However, since the two data blocks in each similar data group are respectively allocated to the first aggregation block and the second aggregation block, the first aggregation block and the second aggregation block are necessarily similar. Repeated data from a plurality of data blocks can therefore be found during the differential compression of the first aggregation block and the second aggregation block, thereby improving the data compression ratio.
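The effect of steps 102 and 103 can be tried out with the following minimal sketch. It uses zlib with a preset dictionary (the `zdict` parameter) as a stand-in for a real delta encoder, and it models aggregation as byte concatenation, whereas the application treats aggregation as a logical grouping. The synthetic data, in which one similar block repeats content that exists only in the reference block of another group, is an assumption chosen to make the cross-block redundancy visible.

```python
import random
import zlib


def delta(reference: bytes, target: bytes) -> bytes:
    """Compress target using reference as a preset dictionary (delta stand-in)."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(target) + comp.flush()


def undelta(reference: bytes, diff: bytes) -> bytes:
    """Restore the target from the reference and the compressed difference."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(diff) + decomp.flush()


def aggregate(groups: list) -> tuple:
    """Step 102: join all reference blocks and all similar blocks separately."""
    first = b"".join(ref for ref, _ in groups)
    second = b"".join(sim for _, sim in groups)
    return first, second


def rand_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random (incompressible) test data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# Two similar data groups. The similar block of the second group also
# contains content that exists only in the reference block of the first.
shared = rand_bytes(2000, 1)
c1 = rand_bytes(1000, 2) + shared
c1p = rand_bytes(1000, 2) + rand_bytes(200, 3)
c2 = rand_bytes(1000, 4)
c2p = c2 + shared
groups = [(c1, c1p), (c2, c2p)]

# One-to-one delta compression, as in the conventional scheme.
pairwise_size = sum(len(delta(ref, sim)) for ref, sim in groups)

# Step 103: delta compression between the two aggregation blocks.
first_block, second_block = aggregate(groups)
aggregate_diff = delta(first_block, second_block)
aggregated_size = len(aggregate_diff)
```

In this constructed example the aggregated difference is smaller than the sum of the one-to-one differences, because the shared content in `c2p` can be matched against `c1` inside the first aggregation block. Note that zlib can only reference the most recent 32 KB of the dictionary, so this stand-in is suitable only for small aggregation blocks.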

For a better understanding of the embodiments of the present application, please refer to fig. 5, which is a schematic diagram of a possible data compression scenario in the embodiment of the present application. As shown in fig. 5, 3 similar data groups are acquired: the first group (C1 data block and C1' data block), the second group (C2 data block and C2' data block), and the third group (C3 data block and C3' data block). The data blocks within each similar data group are similar to each other. Assume that the C1 data block, the C2' data block, and the C3' data block are also similar, so that there is compressible redundancy among them. Conventional delta compression performs delta compression only between the C1 data block and the C1' data block, between the C2 data block and the C2' data block, and between the C3 data block and the C3' data block. It ignores the redundant data among the C1 data block, the C2' data block, and the C3' data block, which results in insufficient compression. In the data compression method in the embodiment of the present application, the C1 data block, the C2 data block, and the C3 data block are aggregated into one aggregation block that serves as the reference block (corresponding to the first aggregation block in the embodiment of the present application), and the C1' data block, the C2' data block, and the C3' data block are aggregated into another aggregation block that serves as the similar block (corresponding to the second aggregation block in the embodiment of the present application), so that two aggregation blocks are obtained. Differential compression is then performed on the two aggregation blocks, so that the redundant data among the C1 data block, the C2' data block, and the C3' data block is also eliminated and the compression ratio is improved.

On the other hand, in the process of acquiring a similar data group, similarity detection is performed on only two data blocks. When two data blocks are similar and do not include the same data features as the other similar data groups, the two data blocks can be used as one similar data group, aggregated with the other similar data groups, and then differentially compressed. Therefore, the data compression method in the embodiment of the application does not need to traverse all the data blocks in the storage space. The method is applicable to a storage space in which new data is stored in real time (such as an online storage scene or a cloud storage scene), which improves the flexibility of the scheme.

The flow of the data compression method shown in steps 101 to 103 may be deployed as a functional module (such as the delta compression module shown in fig. 6) for executing the data compression method according to the embodiment of the present application, and the functional module is integrated into a redundant data reduction system. Referring to fig. 6, fig. 6 is a schematic diagram of one possible configuration of a redundant data reduction system. As shown in fig. 6, the redundant data reduction system includes a hash calculation module, a data deduplication module, a delta compression module, and a data persistence module. For the data to be compressed that has been read, the redundant data reduction system can divide the data into blocks by using various chunking algorithms, such as variable-length chunking and fixed-length chunking, so that the data stream to be compressed becomes finer-grained data blocks to be compressed, and the data blocks are then input into the modules for processing. In the example of fig. 6, the modules may execute in parallel in multiple threads, thereby increasing the operating speed of the redundant data reduction system. Next, the respective modules shown in fig. 6 are described.

The hash calculation module is used for processing the data to be compressed after the data stream has been divided into fine-grained data blocks by the data chunking algorithm. Specifically, the hash calculation module calculates a fixed-length secure hash digest for each data block so that the data block is uniquely identified, which is equivalent to assigning a unique fingerprint to the data block and prepares for the subsequent data deduplication process and delta compression process.

The data deduplication module is used for comparing fingerprints among different data blocks so as to quickly detect repeated data blocks. This avoids byte-by-byte comparison of data block contents, improves the efficiency of detecting repeated data blocks, and thereby improves data compression performance. For the detected repeated data blocks, the data deduplication module performs deduplication and keeps only one copy, so that data redundancy and wasted storage space are reduced.

The delta compression module is used for handling similar data blocks: the data deduplication module can only identify completely repeated data blocks and cannot identify similar data blocks. The delta compression module finds a data block that is relatively similar to the current data block by calculating and comparing data features between the data blocks. For the detected similar data blocks, the delta compression module executes the data compression method provided by the embodiment of the application, and the similar blocks among the similar data blocks are replaced with difference blocks, so that redundant data among the similar data blocks is further reduced and storage overhead is further reduced.

The data persistence module is used for persisting the generated difference blocks after the delta compression module has finished processing the similar data blocks.
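The first two modules can be sketched in a few lines. The example below assumes fixed-length chunking and SHA-256 as the secure hash. Both are illustrative choices, since the application also allows variable-length chunking and does not name a specific hash function, and the function names are hypothetical.

```python
import hashlib


def fixed_length_chunks(stream: bytes, size: int = 4096) -> list:
    """Split a data stream into fixed-length data blocks (the last may be shorter)."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]


def fingerprint(block: bytes) -> str:
    """Fixed-length secure hash digest that identifies a data block."""
    return hashlib.sha256(block).hexdigest()


def deduplicate(blocks: list) -> tuple:
    """Keep one copy of each distinct block and a recipe to rebuild the stream.

    Returns (store, recipe): store maps fingerprint -> block content, and
    recipe lists the fingerprints of the blocks in their original order.
    """
    store = {}
    recipe = []
    for block in blocks:
        fp = fingerprint(block)
        store.setdefault(fp, block)  # repeated blocks are stored only once
        recipe.append(fp)
    return store, recipe


def restore(store: dict, recipe: list) -> bytes:
    """Rebuild the original stream from the stored blocks and the recipe."""
    return b"".join(store[fp] for fp in recipe)
```

Comparing fingerprints instead of block contents is what lets the deduplication module avoid byte-by-byte comparison. Blocks that remain in `store` after this step would then be handed to the delta compression module.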

The redundant data reduction system in the embodiment of the application can be used for compressing the existing data in the storage space or compressing the new data stored in the storage space in real time. Next, a compression flow for existing data and a compression flow for new data in the redundant data reduction system will be described with reference to the accompanying drawings, respectively. Referring to fig. 7, fig. 7 is a schematic diagram of a compression flow for existing data in an embodiment of the application. As shown in fig. 7, the compression flow for the existing data includes:

201. A capacity threshold for a single aggregate block is configured.

The capacity threshold of a single aggregation block (for example, the first aggregation block or the second aggregation block in the embodiment of the present application) may be preconfigured according to factors such as the performance of the storage device, the requirements of the actual application scenario, or the data update frequency. The capacity threshold of a single aggregation block indicates the maximum number of data blocks that the aggregation block can contain. By way of example, the capacity of a single aggregate block may be configured to be 70 data blocks, meaning that a single aggregate block can aggregate up to 70 data blocks. If a higher online compression speed is desired, the capacity threshold of a single aggregate block may be lowered appropriately.

202. Perform similarity matching on the data blocks.

As shown in fig. 5, the existing data blocks in the storage space are subjected to data deduplication processing. And performing similarity matching on the rest data blocks after data deduplication processing.

203. Determine whether the data block passes the similarity matching.

If the data block passes, that is, the data block is similar to another data block in the storage space, step 204 is performed. If the data block does not pass, that is, the data block cannot be matched with any data block in the storage space as a similar data block, step 208 is performed.

204. It is determined whether the capacity of the aggregate block is full.

If yes, go to step 205 and step 207. If not, go to step 206.

205. A new aggregate block is created.

Although the data block has been matched to a similar data block, the capacity of the existing aggregate block is full, so a new aggregate block is created and the data block is placed into the new aggregate block.

206. Aggregation into existing aggregation blocks.

The data block is matched to similar other data blocks, and the data block is placed in the existing aggregate block because the capacity of the existing aggregate block is not full. From the above, the same data features are not included in a single aggregate block in the embodiment of the present application. Thus, the existing aggregate block into which the data block is fused at step 206 does not include the same data characteristics as the data block. The specific process of step 206 is referred to step 102 corresponding to fig. 4, and will not be described herein.

207. Differential compression is performed on aggregate blocks that are full in capacity.

Since the capacity of the current aggregate block is full, differential compression can be performed on the aggregate block. The specific process of step 207 is referred to step 103 corresponding to fig. 4, and will not be described herein. After delta compression, the resulting aggregate difference block is subjected to step 208.

It should be understood that in the example of fig. 7, the data blocks within the aggregate block are brought to the upper capacity limit as a condition for triggering the aggregate block to perform delta compression. In practical applications, the flow of differential compression on the aggregate block may be triggered by other conditions, for example, the aggregate block is periodically differential compressed, which is not limited herein.

In the embodiment of the present application, step 205 may be performed first and then step 207 may be performed, or step 207 may be performed first and then step 205 may be performed, or step 205 and step 207 may be performed simultaneously, which is not limited herein.

208. Persistent storage.
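The control flow of steps 201 to 208 can be summarized in the following sketch. It is a simplification under stated assumptions: the result of similarity matching (step 202) is supplied by the caller as a reference block identifier or `None`, a single open aggregate is kept at a time, the caller is assumed to supply only pairs whose data features do not collide with the open aggregate, and delta compression and persistence are reduced to recording what would be stored. The names `OpenAggregate` and `compress_existing` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OpenAggregate:
    """An aggregate that is still accepting (reference, similar) pairs."""
    capacity: int
    pairs: list = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.pairs) >= self.capacity


def compress_existing(matches: list, capacity: int) -> tuple:
    """Run the flow over (block_id, reference_id or None) match results.

    Returns (persisted, current): the records handed to persistent storage
    and the aggregate that is still open when the input ends.
    """
    current = OpenAggregate(capacity)              # step 201: configure threshold
    persisted = []
    for block_id, reference_id in matches:         # step 202 result per block
        if reference_id is None:                   # step 203: no similar block
            persisted.append(("raw", [block_id]))  # step 208
            continue
        if current.is_full():                      # step 204: capacity reached
            # Step 207: delta-compress the full aggregate, then step 208.
            persisted.append(("aggregate_delta", current.pairs))
            current = OpenAggregate(capacity)      # step 205: new aggregate
        current.pairs.append((reference_id, block_id))  # step 205 / step 206
    return persisted, current
```

For example, with a capacity of 2, the third matched pair triggers delta compression of the first two pairs and starts a new aggregate. Triggering on a timer instead of on capacity would only change the condition checked at step 204.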

For the embodiment of fig. 4 (steps 101 to 103) and the embodiment of fig. 7 (steps 201 to 208), the embodiment of the present application provides an experimental manner shown in fig. 8 for checking the data compression method in the embodiment of the present application.

After the compression process for the existing data in the storage space is completed, if new data is stored in the storage space, the compression process for the new data may be performed. Generally, after new data is stored in the storage space, the new data is subjected to deduplication processing. Referring to fig. 9, fig. 9 is a schematic diagram of a compression method for new data. As shown in fig. 9, version 1 and version 2 are existing data in the storage space, and version 3 is new data in the storage space.

In the conventional delta compression scheme, the C1 data block and the C2 data block in version 2 are identical to the C1 data block and the C2 data block in version 1, so when the C1 data block and the C2 data block of version 2 are saved, only metadata pointing to the C1 data block and the C2 data block (e.g., a reference pointer of the C1 data block and a reference pointer of the C2 data block) needs to be saved. The C3' data block and the C4' data block of version 2 are similar to the C3 data block and the C4 data block of version 1, so delta compression is performed between the C3' data block of version 2 and the C3 data block of version 1, and between the C4' data block of version 2 and the C4 data block of version 1, and the resulting difference blocks are stored. Similarly, the C1 data block and the C2 data block in version 3 are identical to the C1 data block and the C2 data block in version 1, so only metadata pointing to C1 and C2 (such as a reference pointer of the C1 data block and a reference pointer of the C2 data block) needs to be saved. The C3' data block and the C4' data block in version 3 duplicate the C3' data block and the C4' data block in version 2, so only metadata pointing to C3' and C4' in version 2 (such as a reference pointer of the C3' data block and a reference pointer of the C4' data block) needs to be saved. When recovering the C3' data block and the C4' data block in version 3, only the C3 data block and the C4 data block of version 1 need to be read into the memory for decompression.

In the data compression method of the embodiment of the application, when the C3' data block and the C4' data block in version 2 are stored, what is stored is the aggregation difference block generated from the aggregation block in which the C3' data block and the C4' data block are located. Since the C3' data block and the C4' data block in version 3 duplicate the C3' data block and the C4' data block in version 2, only pointers to the C3' data block and the C4' data block in version 2 need to be saved. As a result, when recovering the C3' data block and the C4' data block in version 3, the aggregate block in which the C3 data block is located, the aggregate block in which the C4 data block is located, and the corresponding aggregate difference blocks all need to be read into the memory for decompression. This causes read amplification and decompression amplification: the reference read into the memory changes from one data block to one aggregate block, which increases random disk reads, and decompression changes from one data block to one aggregate block, which increases the amount of calculation and reduces the speed of recovering the version 3 data blocks.

In view of the above problems, an optimization scheme is provided in the embodiments of the present application. Specifically, taking the stored new data as a first data block as an example, where the newly stored first data block and a certain data block (a first similar block) in the second aggregate block are repeated. I.e. the data content is exactly the same between a first data block and a first similar block in the second aggregate block, the first similar block being one of the similar blocks of the second aggregate block. In this case, in the embodiment of the present application, instead of performing deduplication on the first data block and the first similar block, delta compression may be performed on the first reference block and the first data block to obtain a first difference block, where the first difference block includes metadata for pointing to repeated data between the first reference block and the first data block, and delta between the first reference block and the first data block, where the first reference block is one of the reference blocks of the first aggregate block, and the first reference block and the first similar block are from the same similar data group. Therefore, when the first data block is decompressed and restored, only the first reference block is read into the memory, and the whole second aggregation block is not needed to be decompressed, so that the data decompression and restoration efficiency is improved.

Referring to fig. 10, fig. 10 is a schematic diagram of an optimized compression method for new data according to an embodiment of the application. The left side of fig. 10 shows the compression scheme for new data before optimization, in which, when the C3' data block and the C4' data block of version 3 are stored, what is actually stored is the difference portion of the aggregate block in which the C3' data block and the C4' data block are located. When recovering the C3' data block and the C4' data block in version 3, the aggregate blocks in which the C3 data block and the C4 data block are located need to be read into the memory for decompression. In the optimized compression method for new data in the embodiment of the present application, when the C3' data block and the C4' data block in version 3 are stored, they are not subjected to deduplication processing. Instead, their reference blocks in version 1 are located directly, one-to-one differential compression is performed (for example, between the C3' data block in version 3 and the C3 data block in version 1), and the difference portion is stored. When the C3' data block and the C4' data block in version 3 are restored, only the C3 data block and the C4 data block of version 1 need to be read into the memory, so that random disk reads are reduced. The method can significantly reduce read amplification during data recovery and improve the speed of data recovery.
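The decision made by the optimized method for a newly stored data block can be sketched as follows. The index layout, the `IndexEntry` type, and the returned action tuples are hypothetical, and treating a duplicate that lies outside any aggregate block with an ordinary deduplication pointer is an assumption of this sketch rather than something stated above.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def fingerprint(block: bytes) -> str:
    """Fixed-length secure hash digest used to detect repeated data blocks."""
    return hashlib.sha256(block).hexdigest()


@dataclass
class IndexEntry:
    """What the store knows about an already stored data block (hypothetical)."""
    block_id: str
    in_aggregate: bool                  # stored as a similar block of an aggregate
    reference_id: Optional[str] = None  # its paired reference block, if any


def plan_new_block(block: bytes, index: dict) -> tuple:
    """Decide how to store a new block, given a fingerprint -> IndexEntry map."""
    entry = index.get(fingerprint(block))
    if entry is None:
        # No repeated block exists; similarity matching would follow.
        return ("no_duplicate",)
    if entry.in_aggregate:
        # The duplicate lives inside an aggregate: delta-compress one-to-one
        # against its paired reference block instead of storing a pointer,
        # so that recovery reads a single reference block.
        return ("delta_against", entry.reference_id)
    # Assumed: ordinary deduplication for duplicates outside aggregates.
    return ("dedup_pointer", entry.block_id)
```

For the scenario of fig. 10, a version 3 block identical to the C3' data block of version 2 would yield `("delta_against", "v1/C3")` if the index records the version 1 C3 data block as its paired reference, where the identifier format is purely illustrative.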

Referring to fig. 11, fig. 11 is a schematic diagram illustrating a compression flow for new data according to an embodiment of the application. The compression flow for new data shown in fig. 11 adopts the above-described optimized compression method for new data. The compression flow for new data specifically includes:

301. A fingerprint of the new data block is calculated.

After storing the new data stream in the memory space, it is divided into a plurality of new data blocks. A fixed-length secure hash digest is computed for the new data blocks, thereby uniquely identifying the data blocks, corresponding to a unique fingerprint being assigned to each new data block.

302. It is determined whether there is a duplicate data block.

If yes, go to step 303. If no, go to step 304.

303. It is determined whether a duplicate data block of the new data block exists in the aggregate block.

If yes, go to step 305, otherwise go to step 306.

304. Store directly.

Specifically, if there is no data block similar to the new data block in the storage space, the original data of the new data block is stored directly in the conventional manner. If there is a data block similar to the new data block in the storage space and that similar data block is not from an aggregate block, differential compression is performed on the new data block and its similar block.

If there is a data block in the storage space that is similar to the new data block and that similar data block is from an aggregate block, differential compression is performed between the new data block and the corresponding reference block of the aggregate block. Specifically, in one possible implementation, the new data block is a second data block, where the second data block and a second similar block in the second aggregate block share at least one identical data feature, the data content of the second data block and the second similar block is not exactly the same, and the second similar block is one of the similar blocks in the second aggregate block. In other words, the second data block is similar to a certain data block (the second similar block) in the second aggregate block. In this case, if the second reference block is also similar to the second data block, differential compression is performed on the second reference block and the second data block to obtain a second difference block. The second difference block includes metadata for pointing to the second reference block and the difference between the second reference block and the second data block, where the second reference block is one of the reference blocks of the first aggregate block, and the second reference block and the second similar block are from the same similar data group.

305. Perform differential compression on the new data block and the reference block corresponding to the duplicate data block.

The description of step 305 is similar to the optimized compression scheme for new data shown on the right side of fig. 10, and detailed description thereof will be omitted here.

306. Perform deduplication directly: for the new data block, retain only metadata that points to its duplicate data block (e.g., a reference pointer to the duplicate data block).
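The branching in steps 301 to 306 can be sketched as a small decision function. This is an illustrative reading of the flow, not the application's implementation: `StoredBlock`, `choose_action`, the fingerprint index and the `find_similar` callback are all hypothetical names, and SHA-256 is assumed as the fingerprint.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StoredBlock:
    block_id: str
    in_aggregate: bool = False          # stored as a similar block inside an aggregate block
    reference_id: Optional[str] = None  # its paired reference block, if in_aggregate


def choose_action(
    new_block: bytes,
    by_fingerprint: dict[str, StoredBlock],
    find_similar: Callable[[bytes], Optional[StoredBlock]],
) -> tuple[str, Optional[str]]:
    fp = hashlib.sha256(new_block).hexdigest()           # step 301
    duplicate = by_fingerprint.get(fp)                   # step 302
    if duplicate is not None:
        if duplicate.in_aggregate:                       # step 303
            return "delta_vs_reference", duplicate.reference_id  # step 305
        return "dedup", duplicate.block_id               # step 306
    similar = find_similar(new_block)                    # step 304
    if similar is None:
        return "store", None
    if similar.in_aggregate:
        return "delta_vs_reference", similar.reference_id
    return "delta_vs_similar", similar.block_id


c1, c3_prime = b"C1 payload", b"C3' payload"
index = {
    hashlib.sha256(c1).hexdigest(): StoredBlock("v1/C1"),
    hashlib.sha256(c3_prime).hexdigest(): StoredBlock("v2/C3'", True, "v1/C3"),
}
no_similar = lambda block: None  # stand-in for a similarity index
print(choose_action(c1, index, no_similar))
print(choose_action(c3_prime, index, no_similar))
print(choose_action(b"brand new data", index, no_similar))
```

The key design point is that a duplicate found inside an aggregate block is redirected to that block's reference block rather than deduplicated.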

In one possible implementation, a third aggregate block exists (e.g., the third aggregate block is obtained by aggregation after new data blocks accumulate), and the data content of the third aggregate block and the second aggregate block is identical. In this case, deduplicating the third aggregate block against the second aggregate block does not cause read amplification. Thus, the third aggregate block may be updated, and the updated third aggregate block includes metadata for pointing to the second aggregate block.

In one possible implementation, a fourth aggregate block exists. The fourth aggregate block includes N data blocks, the fourth aggregate block and the second aggregate block share at least one identical data feature, and the data content of the fourth aggregate block and the second aggregate block is not exactly the same. In other words, the fourth aggregate block is similar to the second aggregate block. In this case, delta compression may be performed on the first aggregate block and the fourth aggregate block to obtain an updated fourth aggregate block, where the updated fourth aggregate block includes the difference between the first aggregate block and the fourth aggregate block and metadata for pointing to the duplicate data between the first aggregate block and the fourth aggregate block.
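One way to picture "the difference plus metadata pointing to the duplicate data" is a list of copy and insert operations. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in matcher; the application does not specify a delta encoding, so the operation format and the function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def delta_encode(reference: bytes, target: bytes) -> list[tuple]:
    """Encode target against reference as copy/insert operations.

    ("copy", start, length) is metadata pointing at duplicate data in the
    reference; ("insert", data) carries the bytes that differ.
    """
    ops: list[tuple] = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no operation: those reference bytes are simply skipped
    return ops


def delta_decode(reference: bytes, ops: list[tuple]) -> bytes:
    """Rebuild the target from the reference and the operation list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += reference[start:start + length]
        else:
            out += op[1]
    return bytes(out)


first_aggregate = b"the quick brown fox jumps over the lazy dog"
fourth_aggregate = b"the quick brown cat jumps over the lazy dog!"
ops = delta_encode(first_aggregate, fourth_aggregate)
restored = delta_decode(first_aggregate, ops)
```

Restoring the fourth aggregate block in this sketch needs only the first aggregate block and the operation list.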

For ease of understanding, the data compression method in the embodiment of the present application will be described below in conjunction with an exemplary scenario. Referring to fig. 12, fig. 12 is a schematic view of an application scenario of a data compression method according to an embodiment of the application. As shown in fig. 12, it is assumed that the capacity of the aggregation block is 2 data blocks, that is, 2 data blocks are aggregated into an aggregation block, and delta compression can be performed. In the scenario of fig. 12, there are 4 versions of data to save.

For version 1, the C1, C2, C3, C4 and C5 data blocks are stored directly as independent data.

In version 2, the C1 data block and the C2 data block duplicate the C1 data block and the C2 data block in version 1, respectively, so deduplication is performed and only metadata pointing to the C1 data block and the C2 data block in version 1 is stored. The C3' data block is similar to the C3 data block in version 1, so a first aggregate block is created to store the C3 data block, and a second aggregate block is created to store the C3' data block. The C4' data block is similar to the C4 data block in version 1, so the C4 data block is stored in the previously created first aggregate block and the C4' data block is stored in the previously created second aggregate block. At this time, both aggregate blocks (the first aggregate block and the second aggregate block) reach capacity, so differential compression is performed between the first aggregate block composed of the C3 data block and the C4 data block and the second aggregate block composed of the C3' data block and the C4' data block. The C5 data block duplicates the C5 data block in version 1, so deduplication is performed.

In version 3, the C1 data block and the C2 data block duplicate the C1 data block and the C2 data block in version 1, so deduplication is performed. The C3' data block and the C4' data block duplicate the C3' data block and the C4' data block in version 2, so deduplication could be performed. However, if deduplication were used, restoring the C3' data block and the C4' data block in version 3 would require reading the aggregate block into the memory for decompression, resulting in a slow decompression speed. Therefore, in the embodiment of the application, the C3' data block and the C4' data block are subjected to one-to-one differential compression with the C3 data block and the C4 data block in version 1, respectively, and the difference blocks are stored. The C5' data block, which has a similar block in version 2, forms an aggregate block together with the next data block that has a similar block, and aggregate delta compression is then performed.

In version 4, the C3' data block and the C4' data block duplicate the C3' data block and the C4' data block in version 3, so deduplication is performed. The C6 data block, the C7 data block and the C8 data block are independent data blocks and are stored directly.

In summary, in version 2, delta compression is performed between the aggregate block composed of the C3 data block and the C4 data block and the aggregate block composed of the C3' data block and the C4' data block. The C3' and C4' data blocks in version 3 duplicate the C3' and C4' data blocks in version 2 and could be deduplicated, but restoring the C3' data block and the C4' data block in version 3 would then require reading the aggregate block into the memory for decompression, which makes decompression slow. Since the C3' data block and the C4' data block in version 3 duplicate blocks inside an aggregate block, the reference blocks of those duplicate blocks are located, that is, the C3 data block and the C4 data block in version 1, and differential compression is performed against them and the result is stored. As a result, restoring the C3' data block and the C4' data block in version 3 only requires reading the corresponding reference blocks into the memory for decompression, without decompressing the entire first aggregate block (the C3 data block and the C4 data block).
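The way version 2 fills the two aggregate blocks in the fig. 12 scenario can be traced with a toy aggregator. This sketch models blocks by name only and treats a trailing prime mark as "similar to the unprimed block in version 1"; that naming trick, the `PairAggregator` class and its fields are hypothetical simplifications, not part of the application.

```python
from dataclasses import dataclass, field


@dataclass
class PairAggregator:
    capacity: int = 2                                # data blocks per aggregate block
    references: list = field(default_factory=list)   # pending first aggregate block
    similars: list = field(default_factory=list)     # pending second aggregate block
    flushed: list = field(default_factory=list)      # aggregate pairs ready for delta compression

    def add_pair(self, reference_name: str, similar_name: str) -> bool:
        """Add one similar data group; return True when both aggregates reach capacity."""
        self.references.append(reference_name)
        self.similars.append(similar_name)
        if len(self.references) == self.capacity:
            self.flushed.append((tuple(self.references), tuple(self.similars)))
            self.references.clear()
            self.similars.clear()
            return True
        return False


version1 = {"C1", "C2", "C3", "C4", "C5"}
version2 = ["C1", "C2", "C3'", "C4'", "C5"]

aggregator = PairAggregator(capacity=2)
log = []
for name in version2:
    base = name.rstrip("'")
    if name in version1:
        log.append((name, "dedup"))                  # identical block already stored
    elif base in version1:
        full = aggregator.add_pair(base, name)       # similar block: join the aggregates
        log.append((name, "aggregate+delta" if full else "aggregate"))
    else:
        log.append((name, "store"))                  # independent block
print(log)
print(aggregator.flushed)
```

Because pairs are appended as they are discovered, the same loop would also aggregate similar blocks that are not adjacent in the data stream.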

As can be seen from the foregoing, in the embodiment of the present application, the plurality of data blocks in a single aggregate block are from different storage addresses, and the storage addresses of the data blocks may be adjacent or non-adjacent, which is not limited in the embodiment of the present application. In the application scenario shown in fig. 12, a process is illustrated in which two adjacent data blocks are aggregated into an aggregate block, and then delta-compressed. Next, referring to fig. 13, fig. 13 is a schematic diagram of another application scenario of the data compression method in the embodiment of the application. In the application scenario shown in fig. 13, a process is shown in which non-adjacent data blocks are aggregated into an aggregate block, and then delta compression is performed. In the application scenario shown in fig. 13, the capacity of the aggregate block is still 2 data blocks, that is, 2 data blocks are aggregated into an aggregate block, so that delta compression can be performed. In the scenario of fig. 13, there are 4 versions of data to save.

In version 2, the C1 data block and the C2 data block duplicate the C1 data block and the C2 data block in version 1, respectively, so deduplication is performed and only metadata pointing to the C1 data block and the C2 data block in version 1 is stored. The C3' data block is similar to the C3 data block in version 1, so a first aggregate block is created to store the C3 data block in version 1, and a second aggregate block is created to store the C3' data block in version 2. Processing then continues with the subsequent data blocks: the C4 data block duplicates the C4 data block in version 1, so deduplication is performed. The C5' data block is similar to the C5 data block in version 1, so the C5 data block is stored in the previously created first aggregate block and the C5' data block is stored in the previously created second aggregate block. At this time, both aggregate blocks (the first aggregate block and the second aggregate block) reach capacity, so differential compression is performed between the first aggregate block composed of the C3 data block and the C5 data block and the second aggregate block composed of the C3' data block and the C5' data block.

In version 3, the C1 data block and the C2 data block duplicate the C1 data block and the C2 data block in version 1, so deduplication is performed. The C3' data block and the C5' data block duplicate the C3' data block and the C5' data block in version 2, so deduplication could be performed. However, if deduplication were used, restoring the C3' data block and the C5' data block in version 3 would require reading the aggregate block into the memory for decompression, resulting in a slow decompression speed. Therefore, the deduplication optimization described above is applied: the C3' data block and the C5' data block are subjected to one-to-one delta compression with the C3 data block and the C5 data block in version 1, respectively, and the difference blocks are saved. The C4 data block can be directly deduplicated against the C4 data block in version 1.

In version 4, the C3' data block duplicates the C3' data block in version 3, so deduplication is performed. The C6 data block, the C7 data block, the C8 data block and the C9 data block are independent data blocks and can be stored directly.

Compared with the application scenario shown in fig. 12, in the application scenario shown in fig. 13 the C5 data block and the C3 data block are not adjacent to each other, but they are still aggregated into one aggregate block for delta compression.

To better implement the scheme of the embodiment of the application, the embodiment of the application correspondingly also provides a related device for implementing the scheme. Specifically, referring to fig. 14, fig. 14 is a schematic structural diagram of a data compression device according to an embodiment of the present application. As shown in fig. 14, the data compression device includes an obtaining unit 401 and a processing unit 402.

An obtaining unit 401, configured to obtain N similar data groups, where each similar data group includes a reference block and a similar block, at least one identical data feature is included between the reference block and the similar block in each similar data group, the data content between the reference block and the similar block in each similar data group is not identical, the N similar data groups do not include identical data features, and N is an integer greater than or equal to 2;

A processing unit 402, configured to aggregate the N similar data groups to obtain a first aggregate block and a second aggregate block, where the first aggregate block includes N reference blocks of the N similar data groups, and the second aggregate block includes N similar blocks of the N similar data groups;

the processing unit 402 is further configured to perform delta compression on the first aggregate block and the second aggregate block to obtain an aggregate difference block, where the aggregate difference block includes metadata for pointing to repeated data between the first aggregate block and the second aggregate block, and a difference between the first aggregate block and the second aggregate block.
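The aggregate-then-delta-compress step performed by the processing unit can be approximated with a standard library stand-in: zlib's preset dictionary. In the sketch below the first aggregate block is supplied as the dictionary when compressing the second aggregate block, so repeated data is encoded as back-references. This is a substitute technique chosen for illustration, not the application's delta algorithm, and the helper names and test data are hypothetical.

```python
import hashlib
import zlib


def pseudo_random_block(seed: bytes, size: int = 1024) -> bytes:
    """Deterministic, poorly compressible sample data (illustrative only)."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:size])


def aggregate(blocks: list[bytes]) -> bytes:
    """Concatenate blocks into one aggregate block."""
    return b"".join(blocks)


def delta_compress(first_aggregate: bytes, second_aggregate: bytes) -> bytes:
    """Compress the second aggregate using the first as a preset dictionary."""
    compressor = zlib.compressobj(zdict=first_aggregate)
    return compressor.compress(second_aggregate) + compressor.flush()


def delta_decompress(first_aggregate: bytes, delta: bytes) -> bytes:
    """Restore the second aggregate; the same first aggregate must be supplied."""
    decompressor = zlib.decompressobj(zdict=first_aggregate)
    return decompressor.decompress(delta) + decompressor.flush()


# N = 2 similar data groups, each a (reference block, similar block) pair.
ref_a, ref_b = pseudo_random_block(b"a"), pseudo_random_block(b"b")
sim_a = ref_a[:500] + b"EDIT" + ref_a[504:]
sim_b = ref_b[:100] + b"EDIT" + ref_b[104:]

first = aggregate([ref_a, ref_b])    # N reference blocks
second = aggregate([sim_a, sim_b])   # N similar blocks
delta = delta_compress(first, second)
print(len(second), len(zlib.compress(second)), len(delta))
```

Note that zlib only uses the last 32 KiB of a preset dictionary, so this stand-in is meaningful only for small aggregate blocks.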

In one possible design, the obtaining unit 401 is further configured to obtain a first data block, where the data content of the first data block and a first similar block in the second aggregate block is identical, and the first similar block is one of the similar blocks in the second aggregate block;

The processing unit 402 is further configured to perform differential compression on a first reference block and a first data block to obtain a first differential block, where the first differential block includes metadata for pointing to repeated data between the first reference block and the first data block, and a differential between the first reference block and the first data block, where the first reference block is one of reference blocks of the first aggregate block, and the first reference block and the first similar block are from a same similar data set.

In one possible design, the obtaining unit 401 is further configured to obtain a second data block, where at least one identical data feature is included between the second data block and a second similar block in the second aggregate block, and the data content between the second data block and the second similar block is not identical, and the second similar block is one of the similar blocks in the second aggregate block;

The processing unit 402 is further configured to perform differential compression on a second reference block and a second data block to obtain a second difference block, where the second difference block includes metadata for pointing to the second reference block, and a differential between the second reference block and the second data block, and the second reference block is one of the reference blocks of the first aggregate block, and the second reference block and the second similar block are from the same similar data set.

In one possible design, the third reference block includes metadata of a third data block, the data content between the third data block and the third reference block is identical, the third data block is a data block other than the N similar data groups, and the third reference block is one of the reference blocks of the first aggregate block.

In one possible design, the obtaining unit 401 is further configured to obtain a third aggregate block, where the third aggregate block includes N data blocks, and the data content of the third aggregate block and the second aggregate block is identical;

The processing unit 402 is further configured to update a third aggregate block, where the updated third aggregate block includes metadata for pointing to the second aggregate block.

In one possible design, the obtaining unit 401 is further configured to obtain a fourth aggregate block, where the fourth aggregate block includes N data blocks, at least one identical data feature is included between the fourth aggregate block and the second aggregate block, and the data content between the fourth aggregate block and the second aggregate block is not completely identical;

the processing unit 402 is further configured to perform delta compression on the fourth aggregate block according to the first aggregate block to obtain an updated fourth aggregate block, where the updated fourth aggregate block includes a delta between the first aggregate block and the fourth aggregate block, and metadata for pointing to repeated data between the first aggregate block and the fourth aggregate block.

In one possible design, the processing unit 402 is specifically configured to:

In one possible design, the obtaining unit 401 is specifically configured to:

It should be noted that the information interaction, execution processes and other content between the modules/units in the data compression device are based on the same concept as the method embodiments corresponding to fig. 4, fig. 7 and fig. 11 in the present application. For specific content, refer to the description in the foregoing method embodiments of the present application, which is not repeated herein.

Referring to fig. 15, fig. 15 is a schematic diagram of a logic structure of a computing device 50 according to an embodiment of the application. The data compression device described in the embodiment corresponding to fig. 14 may be deployed on the computing device 50 to implement the functions of the embodiments corresponding to fig. 4, fig. 7 and fig. 11. The computing device 50 includes a memory 501, a processor 502, a communication interface 503, and a bus 504. The memory 501, the processor 502, and the communication interface 503 are communicatively connected to each other via the bus 504.

The memory 501 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 501 may store a program. When the program stored in the memory 501 is executed by the processor 502, the processor 502 and the communication interface 503 are used to perform steps 101-103 of the data processing method embodiments described above.

The processor 502 may employ a central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, for executing associated programs to perform one or more of steps 101-103 of the data processing method embodiments of the present application. The steps of the data processing method disclosed in connection with the embodiments of the present application may be implemented by a compiler and an executor, where the compiler and the executor may be embodied as a hardware decoding processor or as a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 501, and the processor 502 reads information in the memory 501 and, in combination with its hardware, performs one or more of steps 101-103 of the data processing method embodiments of the present application.

The communication interface 503 uses a transceiver apparatus, such as, but not limited to, a transceiver, to enable communication between the computing device 50 and other devices or communication networks.

The bus 504 may provide a pathway for transferring information among the various components of the computing device 50 (e.g., the memory 501, the processor 502, and the communication interface 503). The bus 504 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in fig. 15, but this does not mean that there is only one bus or one type of bus.

It should be further noted that the above described embodiments of the apparatus are only schematic, where the units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the device embodiment drawings provided by the embodiment of the application, the connection relation between the modules represents that the modules have communication connection, and the connection relation can be specifically realized as one or more communication buses or signal lines.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments of the present application may be implemented by software plus necessary general-purpose hardware, or by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, functions performed by computer programs can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the embodiments of the present application, a software program implementation is the preferred implementation in most cases. Based on such understanding, the technical solution of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a training device, or a network device, etc.) to perform the method according to the embodiments of the present application.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.

Claims

1. A method of data compression, comprising:

acquiring N similar data groups, wherein each similar data group comprises a reference block and a similar block, at least one identical data feature is included between the reference block and the similar block in each similar data group, the data content between the reference block and the similar block in each similar data group is not identical, the N similar data groups do not include identical data features, and N is an integer greater than or equal to 2;

aggregating the N similar data groups to obtain a first aggregation block and a second aggregation block, wherein the first aggregation block comprises N reference blocks of the N similar data groups, and the second aggregation block comprises N similar blocks of the N similar data groups;

Performing differential compression on the first aggregation block and the second aggregation block to obtain an aggregation difference block, wherein the aggregation difference block comprises metadata for pointing to repeated data between the first aggregation block and the second aggregation block, and the differential between the first aggregation block and the second aggregation block.

2. The method according to claim 1, wherein the method further comprises:

Acquiring a first data block, wherein the data content between the first data block and a first similar block in the second aggregation block is identical, and the first similar block is one of the similar blocks in the second aggregation block;

Performing differential compression on a first reference block and the first data block to obtain a first differential block, wherein the first differential block comprises metadata for pointing to repeated data between the first reference block and the first data block and differential between the first reference block and the first data block, the first reference block is one of the reference blocks of the first aggregation block, and the first reference block and the first similar block come from the same similar data group.

3. The method according to claim 1 or 2, characterized in that the method further comprises:

Acquiring a second data block, wherein at least one same data characteristic is included between the second data block and a second similar block in the second aggregation block, and the data content between the second data block and the second similar block is not completely the same, and the second similar block is one of the similar blocks of the second aggregation block;

and performing differential compression on a second reference block and the second data block to obtain a second differential block, wherein the second differential block comprises metadata for pointing to the second reference block and differential between the second reference block and the second data block, the second reference block is one of the reference blocks of the first aggregation block, and the second reference block and the second similar block are from the same similar data group.

4. A method according to any one of claims 1 to 3, wherein a third reference block comprises metadata of a third data block, the data content between the third data block and the third reference block being identical, the third data block being a data block other than the N similar data groups, the third reference block being one of the reference blocks of the first aggregate block.

5. The method according to any one of claims 1 to 4, further comprising:

Acquiring a third aggregation block, wherein the third aggregation block comprises N data blocks, and the data content between the third aggregation block and the second aggregation block is completely the same;

Updating the third aggregate block, the updated third aggregate block including metadata for pointing to the second aggregate block.

6. The method according to any one of claims 1 to 5, further comprising:

Acquiring a fourth aggregation block, wherein the fourth aggregation block comprises N data blocks, at least one same data characteristic is included between the fourth aggregation block and the second aggregation block, and the data content between the fourth aggregation block and the second aggregation block is not completely the same;

And performing differential compression on the fourth aggregation block according to the first aggregation block to obtain an updated fourth aggregation block, wherein the updated fourth aggregation block comprises the differential between the first aggregation block and the fourth aggregation block and metadata for pointing to repeated data between the first aggregation block and the fourth aggregation block.

7. The method according to any one of claims 1 to 6, wherein the aggregating the N similar data groups to obtain a first aggregated block and a second aggregated block comprises:

the N reference blocks in the N similar data groups are aggregated to obtain a first aggregation block;

and aggregating N similar blocks in the N similar data groups to obtain a second aggregated block.

8. The method according to any one of claims 1 to 7, wherein the acquiring N similar data sets comprises:

9. The method of claim 8, wherein the obtaining M blocks of data to be compressed comprises:

and performing de-duplication processing on the initial data blocks with the identical data content in the X initial data blocks to obtain M data blocks to be compressed.

10. A data compression apparatus, comprising:

The processing unit is further configured to perform delta compression on the first aggregate block and the second aggregate block to obtain an aggregate difference block, where the aggregate difference block includes metadata for pointing to repeated data between the first aggregate block and the second aggregate block, and delta between the first aggregate block and the second aggregate block.

11. The data compression device of claim 10, wherein,

The acquiring unit is further configured to acquire a first data block, where data content between the first data block and a first similar block in the second aggregate block is identical, and the first similar block is one of similar blocks in the second aggregate block;

The processing unit is further configured to perform delta compression on a first reference block and the first data block to obtain a first difference block, where the first difference block includes metadata for pointing to repeated data between the first reference block and the first data block, and delta between the first reference block and the first data block, where the first reference block is one of reference blocks of the first aggregate block, and the first reference block and the first similar block are from a same similar data set.

12. The data compression device according to claim 10 or 11, wherein,

The acquiring unit is further configured to acquire a second data block, where at least one same data feature is included between the second data block and a second similar block in the second aggregate block, and data content between the second data block and the second similar block is not completely the same, and the second similar block is one of similar blocks in the second aggregate block;

The processing unit is further configured to perform delta compression on a second reference block and the second data block to obtain a second difference block, where the second difference block includes metadata for pointing to the second reference block, and a delta between the second reference block and the second data block, where the second reference block is one of the reference blocks of the first aggregate block, and the second reference block and the second similar block are from a same similar data set.

13. The data compression device according to any one of claims 10 to 12, wherein a third reference block comprises metadata of a third data block, the data content of the third data block and the third reference block being identical, the third data block being a data block outside the N similar data groups, and the third reference block being one of the reference blocks of the first aggregate block.

14. The data compression device according to any one of claims 10 to 13, wherein,

The acquisition unit is further configured to acquire a third aggregate block, where the third aggregate block includes N data blocks, and the data content of the third aggregate block and the second aggregate block is completely identical;

the processing unit is further configured to update the third aggregate block, where the updated third aggregate block includes metadata for pointing to the second aggregate block.

15. The data compression device according to any one of claims 10 to 14, wherein,

The acquisition unit is further configured to acquire a fourth aggregate block, where the fourth aggregate block includes N data blocks, at least one identical data feature is included between the fourth aggregate block and the second aggregate block, and data content between the fourth aggregate block and the second aggregate block is not completely identical;

the processing unit is further configured to perform delta compression on the fourth aggregate block according to the first aggregate block to obtain an updated fourth aggregate block, where the updated fourth aggregate block includes a delta between the first aggregate block and the fourth aggregate block, and metadata for pointing to repeated data between the first aggregate block and the fourth aggregate block.

16. The data compression device according to any one of claims 10 to 15, wherein the processing unit is specifically configured to:

17. The data compression device according to any one of claims 10 to 16, wherein the acquisition unit is specifically configured to:

18. The data compression device according to claim 17, wherein the acquisition unit is specifically configured to:

19. A computing device comprising a processor coupled to a memory,

The memory is used for storing instructions;

the processor is configured to execute the instructions in the memory, to cause the computing device to perform the method of any one of claims 1 to 9.

20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.

21. A computer program product having computer-readable instructions stored therein which, when executed by a processor, implement the method of any one of claims 1 to 9.

22. A chip system comprising at least one processor, wherein program instructions, when executed in the at least one processor, cause the method of any one of claims 1 to 9 to be performed.
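
For illustration only, and not as part of the claims, the aggregate delta compression recited above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names (`delta_compress`, `delta_decompress`, `aggregate_and_compress`), the greedy copy/insert encoding, and the 4-byte matching window are hypothetical choices made for this example and are not taken from the application. The `("copy", offset, length)` entries stand in for the metadata pointing to repeated data in the first aggregate block, and the `("insert", bytes)` entries stand in for the delta.

```python
from typing import Dict, List, Sequence, Tuple


def delta_compress(reference: bytes, target: bytes, window: int = 4) -> List[tuple]:
    """Greedy copy/insert delta encoding of `target` against `reference` (illustrative only)."""
    # Index every window-sized substring of the reference by its first offset.
    index: Dict[bytes, int] = {}
    for i in range(len(reference) - window + 1):
        index.setdefault(reference[i:i + window], i)

    ops: List[tuple] = []
    literal = bytearray()
    pos = 0
    while pos < len(target):
        start = index.get(target[pos:pos + window])
        if start is None:
            # No match in the reference: this byte becomes part of the delta.
            literal.append(target[pos])
            pos += 1
            continue
        # Extend the match for as long as both buffers agree.
        length = window
        while (start + length < len(reference)
               and pos + length < len(target)
               and reference[start + length] == target[pos + length]):
            length += 1
        if literal:
            ops.append(("insert", bytes(literal)))
            literal.clear()
        # Metadata pointing to repeated data in the reference.
        ops.append(("copy", start, length))
        pos += length
    if literal:
        ops.append(("insert", bytes(literal)))
    return ops


def delta_decompress(reference: bytes, ops: Sequence[tuple]) -> bytes:
    """Rebuild the target from the reference and the copy/insert operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += reference[start:start + length]
        else:
            out += op[1]
    return bytes(out)


def aggregate_and_compress(groups: Sequence[Tuple[bytes, bytes]]):
    """Aggregate (reference block, similar block) pairs, then delta compress the aggregates."""
    first_aggregate = b"".join(ref for ref, _ in groups)    # N reference blocks
    second_aggregate = b"".join(sim for _, sim in groups)   # N similar blocks
    return first_aggregate, delta_compress(first_aggregate, second_aggregate)
```

Because the second aggregate block is encoded against the whole first aggregate block, a similar block may reuse data from any of the N reference blocks, not only from the reference block of its own similar data set. Decoding requires only the first aggregate block and the list of operations.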