
WO2024030041A1 - Method for data compression and electronic device - Google Patents


Info

Publication number
WO2024030041A1
Authority
WO
WIPO (PCT)
Prior art keywords
incoming
blocks
bytes
block
chunks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/RU2022/000246
Other languages
French (fr)
Inventor
Alexey Evgenevich DMITRIEV
Aleksei Valentinovich ROMANOVSKII
Sergey Alexandrovich CHERNOV
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to PCT/RU2022/000246 priority Critical patent/WO2024030041A1/en
Priority to EP22765251.8A priority patent/EP4508755A1/en
Priority to CN202280095524.6A priority patent/CN119111038B/en
Publication of WO2024030041A1 publication Critical patent/WO2024030041A1/en

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3068Precoding preceding compression, e.g. Burrows-Wheeler transformation
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091Data deduplication
    • H03M7/3095Data deduplication using variable length segments
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60General implementation details not specific to a particular type of compression
    • H03M7/6047Power optimization with respect to the encoder, decoder, storage or transmission

Definitions

  • Embodiments of the present application relate to the field of data similarity in data storage and transmission, and more specifically, to a method for data compression and an electronic device, which improves delta-compression in data storage and data transmission of an electronic device.
  • Similarity detection makes it possible to apply deduplication or delta-compression for similar data, and further improves efficiency of data storage and transmission.
  • Known methods of similarity detection are quite computationally expensive and time-consuming.
  • Embodiments of this application provide a method of data similarity detection, which improves delta-compression in data storage and data transmission of an electronic device.
  • an embodiment of this application provides a method for data similarity detection, including: dividing an incoming block into multiple incoming chunks, where each of the multiple incoming chunks includes X bytes; selecting one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, where the one or multiple sets of bytes include non-consecutive Y bytes in original order selected from the X bytes; calculating one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and matching the incoming block to other blocks according to the calculated hash values for the one or multiple chunks to determine their similarity. Then, one or more of the best similarity candidate blocks, found in this way, are used as a dictionary to delta-compress the incoming block.
  • the incoming block is divided into multiple incoming chunks and a small number of bytes (sets of bytes) are selected from each incoming chunk for hash calculation. Selecting sets of bytes reduces computational costs involved in the hash calculation as compared to using all bytes in a chunk for hash calculation.
  • the calculated hash values can be employed to match the incoming block to other blocks to determine their similarity. Then, one or more of the best similarity candidate blocks, found in this way, are used as a dictionary to delta-compress the incoming block.
  • a quantity of the sets of bytes is multiple for each incoming chunk
  • several sets of bytes are selected from each chunk, and a respective number of hash values is calculated for each chunk.
  • the first byte of each selected byte set has an offset from the start of the chunk equal to 0, 1, 2, 4 ... 2^m, where m is a natural number unique for different sets of bytes.
  • the selected byte sets may shift by 0, 1, 2, 4 ... 2^m bytes, and this may help to detect similar features that are shifted due to insertions and deletions of 2^m bytes, which is quite frequent due to the ubiquitous 8-bit, 16-bit, 32-bit and 64-bit data types in modern SW, thus improving the similarity detection among blocks and, as a result, improving the delta-compression ratio.
  • the preset rule includes: intervals between the non-consecutive Y bytes are determined by a modified Fibonacci sequence, a logarithmic function, or an exponential function.
  • the modified Fibonacci sequence is the Fibonacci sequence without the first 1, e.g. 1, 2, 3, 5 ...
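  • The following Python sketch (illustrative only; the names modified_fibonacci and select_positions are assumptions, not taken from the application) shows how byte positions inside a chunk could be derived from an offset0 position using the modified Fibonacci intervals described above.

      # Sketch: Y byte positions whose distances from the offset0 position follow the
      # modified Fibonacci sequence 0, 1, 2, 3, 5, 8, 13, ... (leading 1 replaced by 0).
      def modified_fibonacci(count):
          fib = [1, 1]
          while len(fib) < count:
              fib.append(fib[-1] + fib[-2])
          return [0] + fib[1:count]              # 0, 1, 2, 3, 5, 8, 13, ...

      def select_positions(offset0, y_bytes, chunk_len):
          # Positions that would fall beyond the end of the chunk are dropped.
          return [offset0 + d for d in modified_fibonacci(y_bytes) if offset0 + d < chunk_len]

      print(select_positions(0, 7, 16))          # [0, 1, 2, 3, 5, 8, 13]
      print(select_positions(1, 7, 16))          # [1, 2, 3, 4, 6, 9, 14]
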
  • the preset rule includes: for any two bytes in each of the sets of bytes, the two bytes are selected at nonadjacent positions from the incoming chunk.
  • the bytes are selected at intervals, which can reduce the number of bytes selected and computational costs involved for hash calculation and similarity detection.
  • the similarity detection of the incoming block to other blocks according to the incoming hash values for the multiple incoming chunks includes: determining one or multiple first blocks according to the incoming hash values, where the one or multiple first blocks are processed blocks and each of the one or multiple first (processed) blocks includes one or multiple first chunks with a hash value the same as one of the incoming hash values. Then, one or more of the best similarity candidate blocks, found in this way, are used as a dictionary to delta-compress the incoming block.
  • incoming hash values calculated for the incoming chunks of the incoming block are employed to look up a key-value store for information about processed blocks that include chunks with the same hash values.
  • Similarity detection of the incoming block to one or more processed data blocks is done by comparing incoming hash values to hash values calculated for chunks of already processed data blocks (further called respectively processed hash values and processed chunks), to find one or more pairs, each pair comprising an incoming chunk and a processed chunk whose incoming and processed hash values are equal, where each chunk is associated with a respective data block identifier, further referred to as incoming blockID and processed blockID.
  • The found processed chunks, with hash values equal to incoming hash values, are sorted by processed blockID, so that the processed blockIDs are ordered by the number of chunks having hash values equal to hash values of the incoming block.
  • One or more processed blocks, matching the sorted blockIDs with the biggest number of chunks whose hash values equal hash values of the incoming block, are selected as candidate dictionaries for delta-compression of the incoming block.
  • The incoming block is then delta-compressed using the selected processed blocks as candidate dictionaries.
  • each candidate dictionary contains byte strings equal to byte strings in the incoming block, which makes delta-compression effective - that is, a byte string in the incoming block is encoded as a reference to an equal byte string in the dictionary block using fewer bits than the respective byte string itself.
  • a subset of the incoming hash values to be stored as processed hash values, associated with the respective blockID, for future similarity detection between processed blocks and future incoming blocks, is selected so that the subset includes only distinct hash values; further, the subset includes one or more minimal hash values and one or more maximal hash values.
  • the selected incoming hash values, associated with the incoming blockID, are stored to be used later for similarity detection between processed blocks and future incoming blocks, in an appropriate data structure that supports fast lookup by hash as a key, to get a processed blockID and to retrieve the respective processed block as a candidate dictionary for delta-compression of the incoming block.
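  • A minimal sketch of this bookkeeping, assuming a plain in-memory dict keyed by hash value; the names register_block, candidate_block_ids and the keep parameter are illustrative, not taken from the application.

      # Sketch: store only a subset of the incoming hash values (distinct values, a few
      # minimal and a few maximal ones) together with the incoming blockID, so that later
      # incoming blocks can find this block as a candidate dictionary for delta-compression.
      from collections import defaultdict

      hash_store = defaultdict(list)                        # processed hash value -> [blockID, ...]

      def register_block(block_id, incoming_hashes, keep=4):
          distinct = sorted(set(incoming_hashes))           # only different hash values
          subset = set(distinct[:keep] + distinct[-keep:])  # minimal and maximal hash values
          for h in subset:
              hash_store[h].append(block_id)

      def candidate_block_ids(incoming_hashes):
          # Fast lookup by hash as a key to get processed blockIDs.
          return {bid for h in incoming_hashes for bid in hash_store.get(h, [])}
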
  • an electronic device including: a dividing module, configured to divide an incoming block into multiple incoming chunks, where each of the multiple incoming chunks includes X bytes; a selecting module, configured to select one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, where the one or multiple sets of bytes include non-consecutive Y bytes in original order selected from the X bytes; a calculating module, configured to calculate one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and a compressing module, configured to compress the incoming block according to the incoming hash values calculated by the calculating module for the multiple incoming chunks.
  • the dividing module is configured to divide the incoming block into one or multiple chunks.
  • the selecting module is configured to select one or multiple sets of bytes for hash calculation by the calculating module.
  • the hash calculation result will be used by the compressing module to compress the incoming block. A small number of bytes rather than the chunk as a whole are selected for hash calculation, which reduces the computational costs involved in the compression process and improves the efficiency of compression.
  • a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m, where m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
  • the preset rule includes: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies a modified Fibonacci sequence or a logarithmic function sequence.
  • the compressing module is specifically configured to: determine one or multiple first blocks according to the incoming hash values, where the one or multiple first blocks are processed blocks and each of the one or multiple first blocks includes one or multiple first chunks with a hash value the same as one of the incoming hash values; and compress the incoming block based on the one or multiple first blocks.
  • the compressing module is configured to: determine the one or multiple first chunks based on the incoming hash values, where each of the one or multiple first chunks is a processed chunk and has a same hash value as one of the incoming hash values; and determine the one or multiple first blocks based on the one or multiple first chunks, where the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
  • the one or multiple first chunks are determined based on a key-value store
  • the key-value store keeps association of hash values and corresponding chunk identifiers and block identifiers of the processed chunks.
  • a quantity of the one or multiple first chunks is greater than a first preset threshold.
  • the electronic device further includes a determining module, configured to determine one or multiple second blocks from the one or multiple first blocks according to a length of matched string between the one or multiple first blocks and the incoming block, where a length of matched strings is greater than a second preset threshold.
  • the result of the compressed incoming block includes the block identifier of the one or multiple second blocks.
  • the electronic device further includes an updating module, configured to update the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
  • an embodiment of this application provides a method for data compression, including: dividing an incoming block into multiple incoming chunks, where each of the multiple incoming chunks includes X bytes; selecting one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, where the one or multiple sets of bytes include non-consecutive Y bytes in original order selected from the X bytes; calculating one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and compressing the incoming block according to the incoming hash values for the multiple incoming chunks.
  • the incoming block is divided into multiple incoming chunks and a small number of bytes (sets of bytes) are selected from the divided incoming chunks for hash calculation.
  • Selecting sets of bytes reduces computational costs involved in the hash calculation.
  • the calculated incoming hash values can be employed to compress the incoming block.
  • the incoming block can be used as a whole to select one or multiple sets of bytes for hash calculation, then the incoming block can be compressed with hash values of processed blocks.
  • a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m, where m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
  • the difference of the offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m
  • the selected bytes may shift by 2^m bytes, and this may help to detect similar features that are shifted due to insertions and deletions of 2^m bytes, which is quite frequent due to the ubiquitous 8-bit, 16-bit, 32-bit and 64-bit data types in modern SW, thus improving the compression ratio.
  • the offset0 position of the first set of bytes is 0, the offset0 position of the second set of bytes is 1, and the offset0 position of the third set of bytes is 2, etc.
  • the difference of the offset0 position between the first set of bytes and the following sets of bytes is 1, 2, 4, 8, 16...
  • the preset rule includes: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies a modified Fibonacci sequence or a logarithmic function sequence.
  • the difference of the byte position within the set of bytes satisfies the modified Fibonacci sequence, a logarithmic function sequence or an exponential function sequence.
  • the bytes are selected at intervals so that only a small number of bytes are selected for hash calculation, thereby reducing the computational costs and improving the compression speed.
  • the preset rule includes: for any two bytes in each of the sets of bytes, the two bytes are selected at nonadjacent positions from the incoming chunk.
  • the bytes are selected at intervals, which can reduce the number of bytes selected and computational costs involved.
  • the compressing the incoming block according to the incoming hash values for the multiple incoming chunks includes: determining one or multiple first blocks according to the incoming hash values, where the one or multiple first blocks are processed blocks and each of the one or multiple first blocks includes one or multiple first chunks with a hash value the same as one of the incoming hash values; and compressing the incoming block based on the one or multiple first blocks.
  • incoming hash values calculated for the incoming chunks of the incoming block are employed to look up a key-value store for blocks that include chunks with the same hash values.
  • first blocks include all the blocks corresponding to the matched records in the key-value store.
  • first blocks can be selected from the blocks corresponding to the records taken from the key-value store.
  • blocks corresponding to the records taken from the key-value store can be given a priority level according to the number of chunks each block has. Then, if two chunks in different blocks are taken from the key-value store by the same hash value, the chunk in the block with a lower priority will be purged, since the chunk in the block with a higher priority can already be used to compress the particular incoming chunk with the same hash value. Then the blocks are re-sorted according to the number of chunks left. The re-sorted blocks can be deemed the first blocks that have the potential to be used for delta compression of the incoming block. This operation improves the quality of the first blocks as candidate dictionary blocks; a sketch of this prioritization is given below.
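  • A sketch of this prioritization, purging and re-sorting, assuming matches are given as (hash value, block ID, chunk ID) tuples; the function name pick_first_blocks and the min_records threshold are illustrative, not from the application.

      # Sketch: group matched records by block, give blocks with more matched chunks a
      # higher priority, purge a match from lower-priority blocks when the same incoming
      # hash is already covered by a higher-priority block, then re-sort by matches left.
      from collections import defaultdict

      def pick_first_blocks(records, min_records=2):
          # records: list of (hash_value, block_id, chunk_id) matches for the incoming block
          groups = defaultdict(list)
          for h, block_id, chunk_id in records:
              groups[block_id].append((h, chunk_id))

          # priority: blocks with more matched chunks come first
          by_priority = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)

          # purge: a hash already covered by a higher-priority block is dropped below it
          covered, remaining = set(), {}
          for block_id, matches in by_priority:
              kept = [(h, c) for h, c in matches if h not in covered]
              covered.update(h for h, _ in kept)
              remaining[block_id] = kept

          # re-sort by the number of matches left and keep blocks above the threshold
          resorted = sorted(remaining.items(), key=lambda kv: len(kv[1]), reverse=True)
          return [block_id for block_id, kept in resorted if len(kept) >= min_records]
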
  • determining one or multiple first blocks according to the incoming hash values includes: determining the one or multiple first chunks based on the incoming hash values, where each of the one or multiple first chunks is a processed chunk and has a same hash value as one of the incoming hash values; determining the one or multiple first blocks based on the one or multiple first chunks, where the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
  • the one or multiple first chunks are determined based on a key-value store
  • the key-value store keeps association of hash values and corresponding chunk identifiers and block identifiers of the processed chunks.
  • a quantity of the one or multiple first chunks is greater than a first preset threshold.
  • the number of chunks included in the first blocks is greater than a first preset threshold, which improves the possibility of the first blocks being useful to compress the incoming block and improves the compression ratio.
  • the method includes: determining one or multiple second blocks from the one or multiple first blocks according to a length of matched string between the one or multiple first blocks and the incoming block, where the length of matched strings is greater than a second preset threshold.
  • the second blocks are selected from the first blocks according to the string matching result, and only first blocks with matched strings longer than a second preset threshold will be considered as the second blocks.
  • the second blocks are blocks that are useful for compressing the incoming block.
  • the result of the compressed incoming block includes the block identifier of the one or multiple second blocks.
  • the compressed incoming block includes the block identifiers of the second blocks, for example, the compressed incoming block is prefixed with the block identifiers of the second blocks, which are useful in the compression of the incoming block.
  • the method further includes: updating the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
  • an embodiment of this application provides a computer-readable storage medium, including instructions. When the instructions run on a computer, the computer is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
  • an electronic device including a processor and a memory.
  • the processor is connected to the memory.
  • the memory is configured to store instructions
  • the processor is configured to execute the instructions.
  • when the processor executes the instructions stored in the memory, the processor is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
  • a chip system includes a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that an electronic device on which the chip system is disposed performs the method in the first aspect or any optional implementation of the first aspect.
  • a computer program product is provided, where when the computer program product runs on an electronic device, the electronic device is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
  • FIG. 1 shows a flowchart of an embodiment of a method 100 for data compression.
  • FIG. 2 shows an embodiment for selecting one set of bytes for each chunk in an incoming block.
  • FIG. 3 shows an embodiment for selecting more than one set of bytes for chunk 0 in FIG. 2.
  • FIG. 4 is a schematic diagram of a key-value store.
  • FIG. 5 is a schematic diagram depicting a delta compression process provided in this application.
  • FIG. 6 is a schematic diagram depicting a delta compression process.
  • FIG. 7 shows a key-value store updating result based on an incoming block.
  • FIG. 8 is a flowchart of an embodiment of a method 700 for data compression.
  • FIG. 9 is a flowchart of an embodiment of a method 800 for data compression.
  • FIG. 10 is a schematic block diagram of an electronic device 900.
  • FIG. 11 is a schematic block diagram of an electronic device 1000.
  • block refers to a sequence of bytes having a predetermined length.
  • the term “chunk” refers to a sequence of bytes with a predetermined length smaller than that of the block. In some cases, when the whole block is deemed as a chunk, the chunk is a sequence of bytes with the same length as that of the block.
  • byte refers to a sequence of eight bits, where a bit is the smallest unit in the computer science field.
  • phrase “hash value” refers to the output that is returned after the application of a hash function algorithm.
  • the phrases “hash function algorithm” and “hash function” refer to an algorithm or subroutine that maps large data sets (optionally of variable length) to smaller data sets that have a fixed size for a particular hash function.
  • the values returned by the algorithm may also be called hash function values, hash codes, hash sums, checksums, or hash values.
  • Data compression, source coding, or bit-rate reduction is the process of encoding information by using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information.
  • an encoder is a device that performs data compression
  • a decoder is a device that performs the reversal of the process (decompression)
  • Compression is useful because it reduces resources required to store and transmit data. Computational resources are consumed in the compression and decompression processes. Data compression is subject to a space-time complexity trade-off. The design of data compression schemes involves trade-offs among various factors, including a degree of compression, an amount of distortion introduced (when using lossy data compression), and computational resources required to compress and decompress data.
  • Delta encoding is one kind of lossless compression method, which is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files; and more generally, this is known as data differencing. Delta encoding is sometimes called delta compression, particularly in a case where archival histories of changes are required (e.g., in revision control software).
  • Data compression ratio, or compression ratio, also known as compression power, is a measurement of the relative reduction in size of data representation produced by a data compression algorithm. It is typically expressed as the division of uncompressed size by compressed size, and is given by: compression ratio = uncompressed size / compressed size.
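  • For example (a worked illustration, not taken from the application): a 4096-byte block that is compressed to 1024 bytes has a compression ratio of 4096 / 1024 = 4.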
  • FIG. 1 is a flowchart of a method 100 for data compression according to an embodiment of the present application.
  • the method 100 includes the following steps.
  • S102, divide an incoming block into multiple incoming chunks, where each of the multiple incoming chunks includes X bytes.
  • the incoming block can be received over a wired or wireless network from a host.
  • the data to be compressed can be transferred via multiple blocks or the blocks can be defined by the device performing the data compression method.
  • a block refers to a sequence of bytes with predetermined length of bytes.
  • the incoming block is the block that is being processed. And before the incoming block, there may be several blocks that have been processed. After the incoming block, there may be several blocks to be processed later.
  • the block size is not fixed. For example, the block can be 4KB (kilobyte) in size.
  • the block can include a header to identify the block.
  • the header may include information such as, but not limited to, a block identifier or block ID associated with the block.
  • the block with the block ID 2 can be denoted as block2.
  • the incoming block can be divided into multiple incoming chunks.
  • the chunk refers to a sequence of bytes with a predetermined length smaller than that of the block.
  • the incoming block of 4KB can be divided, for example, into four incoming chunks of 1024B each.
  • chunk ID used as the chunk unique ID (chunk UID) is introduced to help illustrate the present embodiment.
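  • As a small illustration (the 1024-byte chunk size and the name split_into_chunks are assumptions for the example only), dividing a block into fixed-size chunks with chunk IDs could look like the following Python sketch.

      # Sketch: divide a 4 KB incoming block into fixed-size incoming chunks and number
      # them with chunk IDs 0, 1, 2, ...
      def split_into_chunks(block: bytes, chunk_size: int = 1024):
          return {chunk_id: block[i:i + chunk_size]
                  for chunk_id, i in enumerate(range(0, len(block), chunk_size))}

      chunks = split_into_chunks(bytes(4 * 1024))
      print(len(chunks), len(chunks[0]))             # 4 1024
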
  • S104, select one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, where the one or multiple sets of bytes include non-consecutive Y bytes in original order selected from the X bytes.
  • one set of bytes is selected from each of the chunks, and for all the chunks in the data stream, the bytes are selected according to a preset rule.
  • the set of bytes includes non-consecutive Y bytes from the X bytes. In other words, the selected bytes are not continuous.
  • the set of bytes is selected at incremental steps from an offset0 position, where the selected bytes are spaced by a predefined pattern.
  • the preset rule includes: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies the modified Fibonacci sequence or a logarithmic sequence.
  • the steps may increase logarithmically, exponentially, linearly, by a modified Fibonacci sequence, or even by a self-designed sequence.
  • the difference of the byte position between the non-consecutive Y bytes and the offset0 position is the result of the byte position or byte ID of one of the Y bytes minus the offset0 position.
  • the incoming block is a string of 64 bytes and is divided into four incoming chunks, denoted as chunk 0, chunk 1, chunk 2, and chunk 3, respectively. As shown in FIG. 2, each incoming chunk is a 16-byte string. Each number in a box in FIG. 2 represents a byte ID or byte position.
  • the “offset0 position”, “byte number”, “byte ID” and “byte position” all refer to the position of a byte in the original incoming chunk from which it is selected.
  • Chunk 0 includes byte 0 to byte 15
  • chunk 1 includes byte 16 to byte 31
  • chunk 2 includes byte 32 to byte 47
  • chunk 3 includes byte 48 to byte 63.
  • the Fibonacci sequence is [1, 1, 2, 3, 5, 8, 13, 21...]. We will replace the leading 1 by 0 to avoid repetition. Then the modified Fibonacci sequence will be [0, 1, 2, 3, 5, 8, 13, 21...].
  • if the offset0 position, where the first byte of the Y bytes is selected, is the start position of the incoming chunk (that is, the byte ID of the offset0 position is 0), then, as the position differences satisfy the modified Fibonacci sequence, the byte IDs of the selected bytes for chunk 0 are [0, 1, 2, 3, 5, 8, 13], which are emphasized in FIG. 2.
  • the byte IDs of the selected bytes for chunk 1, chunk 2, chunk 3 are [16, 17, 18, 19, 21, 24, 29], [32, 33, 34, 35, 37, 40, 45] and [48, 49, 50, 51, 53, 56, 61], respectively.
  • if the offset0 position changes to m, then, as the position differences satisfy the modified Fibonacci sequence, the byte IDs of the selected bytes for chunk 0 are [m, 1+m, 2+m, 3+m, 5+m, 8+m, 13+m]; byte IDs that exceed the biggest byte ID in chunk 0 are not considered here.
  • a length of a set of bytes is 7 bytes for each chunk.
  • the selected bytes can follow any part of the modified Fibonacci sequence.
  • the length can be limited by a small constant. For example, if the length of the selected bytes is 5 bytes, the byte IDs of the selected bytes for chunkO can be [0, 1, 2, 3, 5], [1, 2, 3, 5, 8] or [2, 3, 5, 8, 13]. Also, we can delete the leading 1 to get the modified Fibonacci, then the modified Fibonacci sequence will be [1, 2, 3, 5, 8, 13, 21...].
  • the preset rule could be: for any two bytes in each of the sets of bytes, the two bytes are selected at nonadjacent positions from a chunk, or the number of consecutive bytes selected from a chunk should be no more than n bytes, where n is a preset value.
  • the offset0 position, that is, the start position of the set of bytes, can be chosen randomly.
  • the offset0 position doesn’t necessarily mean the first byte position of the set of bytes. For example, if the modified Fibonacci sequence is [0, 1, 2, 3, 5, 8, 13, 21...], the offset0 position is the position of the first byte in one set of bytes. However, if the modified Fibonacci sequence is [1, 2, 3, 5, 8, 13, 21...], the first byte of the set of bytes is the byte following the offset0 position.
  • a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m, where m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
  • the difference follows a base-2 exponential function, that is, the difference of the offset0 position for different sets of bytes of one chunk can be 1, 2, 4, 8, 16, 32... bytes.
  • the value of m can be consecutive or non-consecutive integers.
  • the offset0 positions for different sets of bytes in a chunk are intended to detect similar content that is shifted due to insertions and deletions by 2^m bytes, which is quite frequent due to ubiquitous 8-bit, 16-bit, 32-bit and 64-bit data types in modern SW. For example, if chunk 0 in block 3 and chunk 1 in block 5 both have 16 bytes, where chunk 1 can be obtained by inserting a byte before chunk 0 and deleting the last byte of chunk 0, then the first fifteen bytes of chunk 0 are the same as the last fifteen bytes of chunk 1.
  • if only one set of bytes is selected, the matched fifteen bytes will not be detected. However, if two sets of bytes are selected, and the two sets of bytes shift by one byte, then the matched fifteen bytes will be detected by the selection method proposed in this application. One offset0 position (denoted as a reference offset0 position) may be the first byte of a chunk, and the offset0 positions of the sets of bytes may be spaced by 0, 1, 2, 4, 8, 16... bytes from the 1st byte of the chunk; that is, the offset0 position is the 1st, 2nd, 3rd, 5th, 9th, 17th... byte of the chunk.
  • As shown in FIG. 3, the byte IDs of the offset0 positions are 0, 1, and 2, which are the 1st, 2nd, and 3rd bytes of the chunk.
  • the offset0 position is spaced by 0, 1, 2, 4, 8, 16, 32... bytes from the reference offset0 position.
  • the byte IDs of the three sets of bytes are [0, 1, 2, 3, 5, 8, 13], [1, 2, 3, 4, 6, 9, 14] and [2, 3, 4, 5, 7, 10, 15], respectively.
  • the multiple sets of bytes are selected for hash calculation to detect features of the chunk (discussed later). In this way, several sets of bytes are selected for the following hash calculation.
  • S106, calculate one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks.
  • Hash calculation can be explained as mapping of a set of bytes to a scalar value (one number).
  • the calculated scalar value is used as an index to determine a bucket in a key-value store (discussed later).
  • Hash calculation is based on the set of bytes selected. Therefore, if only one set of bytes is selected for one chunk, one hash value is calculated for one chunk; and if multiple sets of bytes are selected for one chunk, multiple hash values are calculated for one chunk. Yet it is notable that the multiple hash values for one chunk are not necessarily different. Also, the hash values for different chunks may also be the same if they have similar features.
  • hash calculation can be performed by Rabin-Karp, MD, SHA (secure hash algorithm), CRC32, etc.
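  • A sketch of the hash calculation for one chunk, using zlib.crc32 only as a stand-in for any of the hash functions named above; the shift list (0, 1, 2) and the Fibonacci differences are illustrative assumptions.

      # Sketch: one hash value per selected set of bytes of a chunk; the sets of bytes
      # differ only in their offset0 position (shifted by 0, 1, 2 bytes in this example).
      import zlib

      FIB_DIFFS = [0, 1, 2, 3, 5, 8, 13]             # modified Fibonacci differences

      def chunk_hashes(chunk: bytes, shifts=(0, 1, 2)):
          hashes = []
          for offset0 in shifts:
              positions = [offset0 + d for d in FIB_DIFFS if offset0 + d < len(chunk)]
              selected = bytes(chunk[p] for p in positions)
              hashes.append(zlib.crc32(selected))    # one incoming hash value per set of bytes
          return hashes

      print(chunk_hashes(bytes(range(16))))          # three hash values for one 16-byte chunk
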
  • S108, compress the incoming block according to the incoming hash values for the multiple chunks.
  • the compressing the incoming block according to the incoming hash values for the multiple incoming chunks includes: determining one or multiple first blocks according to the one or multiple incoming hash values, where the one or multiple first blocks are processed blocks and each of the one or multiple first blocks includes one or multiple first chunks with a hash value the same as one of the one or multiple incoming hash values; and compressing the incoming block based on the one or multiple first blocks.
  • the determining one or multiple first blocks according to the incoming hash values includes: determining the one or multiple first chunks based on the incoming hash values, wherein each of the one or multiple first chunks is a processed chunk and has a same hash value as one of the incoming hash values; and determining the one or multiple first blocks based on the one or multiple first chunks, wherein the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
  • Incoming hash values are used as keys and (block ID, chunk ID) records are used as values, and the key-value store keeps associations of hash values and corresponding (block ID, chunk ID) records of already processed chunks.
  • the data stream to be compressed consists of multiple blocks, and processed chunks are from processed blocks that come before the incoming block.
  • the processed blocks have been divided into one or multiple X bytes of chunks and sets of bytes of Y bytes have been selected from the chunks for hash calculation in the same way that we select sets of bytes for the incoming chunks.
  • the first block of the data stream to be compressed is denoted as block 0, the incoming block is block 5, then the key-value store keeps associations of records of chunks from block 0, block 1, block 2, block 3 and block 4. It is obvious that the key-value store dynamically updates with blocks being processed.
  • the key-value store may have several buckets keeping records of chunks with different hash values. Each bucket keeps records of chunks with the same hash value. Therefore, if one hash value is calculated per chunk, records of the four chunks of one processed block may be stored in up to 4 buckets of the key-value store.
  • Using hash values of the incoming chunks as keys to look up the key-value store may result in several matched buckets. Then all the records in these buckets are taken for the incoming block. The chunks corresponding to the records determined based on the key-value store are the first chunks mentioned above, and the blocks they belong to are the first blocks that may be useful in the compression process of the incoming block.
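  • A minimal sketch of this lookup, assuming the key-value store is a dict of buckets mapping a hash value to a list of (block ID, chunk ID) records; the store layout, the toy bucket and the placeholder hash value 0x1234 are assumptions for illustration.

      # Sketch: use incoming hash values as keys into a bucketed key-value store and
      # collect the (block ID, chunk ID) records of already processed chunks; the chunks
      # found this way are the first chunks and their blocks are the first blocks.
      from collections import defaultdict

      key_value_store = defaultdict(list)                    # hash value -> bucket of (block ID, chunk ID)
      key_value_store[0x1234] = [(4, 2), (1, 3), (5, 0)]     # toy bucket, in the spirit of bucket 401 in FIG. 4

      def lookup_first_chunks(incoming_hashes):
          records = []
          for h in incoming_hashes:
              records.extend(key_value_store.get(h, []))     # all records of the matched bucket
          return records

      def first_block_ids(incoming_hashes):
          return {block_id for block_id, _chunk_id in lookup_first_chunks(incoming_hashes)}

      print(first_block_ids([0x1234, 0x9999]))               # {1, 4, 5}
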
  • the blocks corresponding to the records are used as first blocks to serve as dictionary blocks for compressing the incoming block.
  • the dictionary blocks are further selected from the first blocks. For example, the quantity of the one or multiple first chunks is greater than a first preset threshold, which means that blocks that have more chunks similar to chunks of the incoming block are qualified to be the first blocks.
  • the method may include the following steps.
  • Blocks corresponding to groups that have more records or more chunks share more features (hash values) with the incoming block. Therefore, after the second sorting, N groups out of M groups could be selected as first blocks for the incoming block, where N ≤ M, M is the number of blocks that include at least one record, and N is the number of blocks determined from the M blocks based on certain conditions (e.g., first blocks include at least 2 records).
  • the first blocks are selected as the dictionary blocks for the incoming block, then the incoming block will be compressed based on the first blocks.
  • the incoming block is compressed independently, and stored or transferred by a network.
  • the following procedure for this technical solution is: updating the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
  • the calculated hash values and corresponding records of (block ID, chunk ID) are inserted into the key-value store for further search of the first blocks for the next incoming blocks.
  • blocks with one of the at least two hash values, rather than all of the at least two hash values, are qualified to be the candidate dictionary blocks, which can reduce the computational overhead of the determination process of the first blocks.
  • after compressing the incoming block based on the one or multiple first blocks, the method includes: determining one or multiple second blocks from the one or multiple first blocks according to a length of matched strings between the one or multiple first blocks and the incoming block, where the length of matched strings is greater than a second preset threshold.
  • the first blocks are blocks that have the potential to be the dictionary blocks and not necessarily the useful dictionary blocks to compress the incoming block. So we evaluate the efficiency of the first blocks for compressing the incoming block and discard the unused first blocks to decrease the storage costs.
  • the number of first blocks is variable. For example, there may be one or more first blocks after hash matching. Then the one or more first blocks are copied into the buffer, which is passed as input for a delta-compressor, followed by the incoming block.
  • the output of the delta-compressor consists of a delta-compressed incoming block and statistics for how many sub-strings (“words”) from the incoming block were found as duplicates in each dictionary block.
  • the input_word_in_dict is a pseudo-code function that searches a word (or substring) from the incoming block in a dictionary block (dictZ, dictY, dictA), with two parameters, w (word or substring) and dictionary block (dictZ, dictY, dictA).
  • w: a word or substring
  • dictionary block: dictZ, dictY, or dictA
  • Statistics for how many sub-strings (“words”) from the incoming block were found as duplicates in each dictionary block are stats[dictZ], stats[dictY] and stats[dictA] according to the pseudo-code.
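  • The following Python sketch mirrors that statistics gathering; count_matches and the toy inputs are assumptions, with the substring test playing the role of the pseudo-code function input_word_in_dict.

      # Sketch: count how many words (sub-strings) of the incoming block are found in
      # each candidate dictionary block, producing stats[dictZ], stats[dictY], stats[dictA].
      def count_matches(incoming_words, dictionaries):
          stats = {name: 0 for name in dictionaries}
          for w in incoming_words:
              for name, block in dictionaries.items():
                  if w in block:                     # plays the role of input_word_in_dict(w, dict)
                      stats[name] += 1
          return stats

      stats = count_matches([b"ZETTA", b"BYTE", b"STORE"],
                            {"dictZ": b"ZETTABYTE", "dictY": b"DATASTORE", "dictA": b"BYTEARRAY"})
      print(stats)                                   # {'dictZ': 2, 'dictY': 1, 'dictA': 1}
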
  • the second blocks selected from the first blocks based on specified conditions are blocks effectively used for performing delta compression on the incoming block.
  • the first dictionary block qualified for the second block may be determined by certain rules. For example, the total length of matching strings in a first block and the incoming block is greater than a threshold.
  • the number of the second blocks for the incoming block may be 0, and the incoming block is then compressed independently.
  • the number of the first blocks is one or more, yet no first block meets the demand to be the second block, that is, the length of matching strings for all the first blocks and the incoming block is zero or less than a threshold. As a result, all the first blocks are discarded.
  • one or multiple second blocks are effectively used for compressing the incoming block.
  • the one or multiple second blocks are selected from the first blocks, which have a length of matching strings with the incoming block greater than a threshold.
  • the incoming block can be compressed using the second blocks, with compressing results prefixed with the IDs of the second blocks (second block IDs).
  • the second blocks are the blocks that are useful in compressing the incoming block.
  • the incoming block has been compressed in previous process.
  • the determination of the second blocks is to evaluate the efficiency of the first blocks, thus discarding the unused first blocks and saving storage costs.
  • the result of the compressed incoming block is prefixed with the block identifier of the one or multiple second blocks.
  • suppose the ID of the incoming block is block 10 and only one hash value is calculated for one chunk; the following case is demonstrated to elaborate the determination process of the first blocks and the second blocks.
  • the incoming block is divided into four incoming chunks with chunk IDs being chunk 0, chunk 1, chunk 2 and chunk 3, respectively.
  • the key-value store keeps the association of hash values and corresponding (block ID, chunk ID) records of chunks divided from block 0 to block 9.
  • the present key-value store includes several buckets corresponding to records of (block ID, chunk ID) with unique hash values, where the block ID is between 0 and 9 and the chunk ID is between 0 and 3.
  • the bucket 401 keeps records of (4, 2), (1, 3), (5, 0), and the corresponding chunks share a common hash value X.
  • the hash values of the bucket 402, the bucket 403, the bucket 404, the bucket 405 and the bucket 406 are U, W, Z, V, and M, respectively.
  • the first blocks can then be selected from the processed blocks according to the hash values calculated for the incoming block. Specifically, for chunk 0, the hash value X is matched with the bucket 401, and all the records in bucket 401 are taken, including (4, 2), (1, 3), (5, 0). For chunk 1, no bucket is matched with the hash value Y (assuming that the hash values of buckets not shown in FIG. 4 are not Y). For chunk 2 and chunk 3, the hash values are Z and W, and records (3, 3), (3, 1), (5, 3) from the bucket 404 and records (3, 2), (1, 1), (3, 0), (5, 2), (6, 3) from the bucket 403 are taken. For these records, block 6, block 4, block 5, block 3, and block 1 can be used as dictionary blocks to compress the incoming block.
  • Group 1, Group 2, Group 3, Group 4, and Group 5 represent block 3, block 5, block 1, block 4, and block 6, respectively, where Group 4 and Group 5 each have one record.
  • block 3 has the potential to be the dictionary block based on the hash values calculated for chunk 2 and chunk 3 of the incoming block.
  • Block 5 may help compress chunk 0, chunk 2 and chunk 3.
  • Block 1 may serve as the dictionary block to compress chunk 0 and chunk 3.
  • Block 4 and block 6 are useful for compressing chunk 0 and chunk 3, respectively.
  • the five blocks can be used as dictionary blocks to compress the incoming block. It is more desirable to select several blocks that are qualified to be the dictionary blocks or first blocks according to certain rules. For example, corresponding groups of first blocks need to have at least 2 records, so the first blocks will be block 3, block 5, and block 1. Or the first blocks are determined as the three largest groups, and then block 3, block 5, and block 1 are determined.
  • block 3 (the largest group) matches two incoming chunks in total (twice to chunk 2 and twice to chunk 3); in some embodiments, these matched chunks may be removed from other groups: the record (5, 2) corresponding to chunk 3 and the record (5, 3) corresponding to chunk 2 are deleted from Group 2 to avoid duplication.
  • the matched groups sorted by the number of matched chunks are given a corresponding priority, where the groups with more matched chunks are given a higher priority.
  • the chunks with the same hash value are purged, and then the groups are re-sorted by the number of matched chunks left.
  • After purging, the new groups are as follows:
  • Group 3, Group 4, Group 5 are empty after this step, and the first blocks are block 3 and block 5. Also, as block 5 only matches a single chunk in the incoming block, it may happen that only block 3 is deemed as the first block to compress the incoming block.
  • several second blocks are further selected from the first blocks when evaluating the efficiency of the first blocks as dictionary blocks.
  • string matching is performed between the first blocks and the incoming block while compressing the incoming block, where blocks that have longer matched strings are determined as the second blocks.
  • block 1, block 3, and block 5 all meet the demand to be the second blocks. Then the three blocks are useful dictionary blocks to compress the incoming block, but it may happen that only block 3 and block 5 are effectively used to compress the incoming block, where block 3 and block 5 include all strings needed to compress the incoming block.
  • the unused first blocks will be discarded and the compression result of the incoming block can be prefixed with block identifiers of block 3 and block 5.
  • FIG. 5 is a schematic diagram depicting a delta compression process provided in the application.
  • An incoming block 504 is divided into one incoming chunk (that is, it is not divided), and two incoming hash values are calculated using the selected bytes, as shown in FIG. 5.
  • three dictionary blocks (501, 502, 503) are determined. Then the string matching result is that block 501 and block 502 have 1 word matched to the incoming block 504, and block 503 has 3 words matched to the incoming block 504. Then the incoming block 504 can be compressed using the three dictionary blocks.
  • block 501 and block 502 can be discarded as they only have 1 word matched to the incoming block 504.
  • the incoming block is possibly compressed according to several dictionary blocks (the first blocks or the second blocks), which improves the overall compression ratio of the incoming block.
  • the incoming block can be compressed according to the dictionary blocks (the first blocks or the second blocks). If the number of the dictionary blocks is zero, the incoming block is compressed independently. If at least one dictionary block is effectively used, the resulting compressed data that is prefixed by block IDs of the effectively used dictionary blocks is stored or transferred by a network.
  • the matched part of the incoming block and the dictionary block can be denoted as (a, b), where a is an offset of the first byte for the matched strings and b is a total length of the matched strings.
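  • A toy sketch of this (a, b) representation, using a brute-force search instead of the hashed matching a real delta-compressor would use; delta_encode, min_match and the sample strings are illustrative assumptions.

      # Sketch: represent parts of the incoming block found in a dictionary block as
      # (offset, length) references and keep unmatched bytes as literals.
      def delta_encode(incoming: bytes, dictionary: bytes, min_match=4):
          out, i = [], 0
          while i < len(incoming):
              best_off, best_len = -1, 0
              for off in range(len(dictionary)):
                  l = 0
                  while (off + l < len(dictionary) and i + l < len(incoming)
                         and dictionary[off + l] == incoming[i + l]):
                      l += 1
                  if l > best_len:
                      best_off, best_len = off, l
              if best_len >= min_match:
                  out.append(("copy", best_off, best_len))   # (a, b) reference into the dictionary
                  i += best_len
              else:
                  out.append(("literal", incoming[i:i + 1]))
                  i += 1
          return out

      print(delta_encode(b"ZETTABYTE STORAGE", b"STORAGE OF ZETTABYTES"))
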
  • the coding result may differ depending on the coding method used, such as Huffman or arithmetic coding.
  • FIG. 6 is a schematic diagram depicting a delta compression process.
  • block 601, block 602 and block 603 are candidate dictionary blocks (may be the first blocks), and block 603 is the dictionary block (may be the second block).
  • the incoming block prefixed with the block ID of block 603 is then transferred to the compression area for compression.
  • block 601 and block 603 are selected as dictionary blocks to compress the incoming block 604, for example. Yet while doing the string matching, block 601 as a dictionary block may be discarded because the matched string “ZETTA” already exists in block 603.
  • FIG. 7 shows a key-value store updating result.
  • Bucket 407 is a new bucket and keeps an association of a hash value Y and records (10, 1).
  • Bucket 401, bucket 403 and bucket 404 are updated with records (10, 0), (10, 3), (10, 2), respectively.
  • FIG. 8 is a flowchart of an embodiment of a method 700 for data compression. The method includes the following steps.
  • S701 calculate hash values for one or multiple sets of bytes taken from chunks from an incoming block according to preset rules.
  • S702 find one or multiple candidate dictionaries for the incoming block according to the calculated hash values.
  • S703 prioritize candidate dictionaries by the number of hash values matching the incoming block.
  • S704 select one or multiple best candidate dictionaries for delta-compression followed by delta-compressing the incoming block.
  • S705 determine one or multiple effective dictionaries from the candidate dictionaries by the length of matched strings.
  • the length of matched strings is greater than a threshold.
  • S706 store or transmit the delta-compressed incoming block prefixed by IDs of effective dictionaries.
  • FIG. 9 is a flowchart of an embodiment of a method 800 for data compression.
  • the method 800 includes the following steps.
  • S801 divide an incoming block into fixed-sized chunks.
  • the incoming block can be divided into several fixed-sized chunks.
  • S802, select sets of bytes for hash calculation.
  • one or multiple sets of bytes may be selected for hash calculation. Selecting a small number of bytes instead of using the whole chunk for hash calculation reduces the computational costs involved in the compression process.
  • S803 calculate hash value(s) for each set of bytes.
  • S804 use hash values as keys to look up a key-value store, and collect (block ID, chunk ID) records with the same keys.
  • hash values are used as keys to look up the key-value store for the first blocks, which has been mentioned before.
  • S805 sort (block ID, chunk ID) records to find the max number of records with the same block ID1, and a smaller number of records with other block ID2, etc.
  • the blocks found in step S804 could be ranked by the number of chunks.
  • blocks with the number of chunks less than a threshold could be discarded to improve the dictionary block quality.
  • the block with a higher number of chunks could be given a higher priority.
  • the similar chunks (chunks with the same hash values) could be purged in blocks with a lower priority, which will improve the candidate dictionary block quality.
  • S806 retrieve blocks to be used as dictionary blocks to compress the incoming block.
  • the dictionary blocks and the incoming block are put into the buffer for compression.
  • the incoming block is non-compressible.
  • it may include image data that has already been compressed, so the incoming block is not compressed successfully.
  • it can be a candidate dictionary block for other blocks, so the key-value store needs to be updated with (block ID, chunk ID) records according to hash values of the incoming block.
  • the incoming block is compressed successfully, yet the dictionary blocks may or may not be used for the compression of the incoming block.
  • S8102 store the compressed block as non-delta.
  • in step S8101 and step S8102, the incoming block is compressed successfully, yet no dictionary block is used for the compression of the incoming block.
  • the dictionary blocks may not have a matched length longer than a threshold, so the incoming block is compressed independently.
  • the key-value store needs to be updated with the incoming block.
  • S812 store the compressed block as delta, prefix delta-compressed incoming block with used block IDs.
  • the incoming block is compressed successfully using one or multiple candidate dictionary blocks retrieved in step S806, which have matched strings longer than the threshold. Then the blocks not used can be discarded to improve the dictionary block quality.
  • the compressed block is prefixed with used dictionary block IDs.
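  • A sketch of such framing, assuming the compressed payload is simply prefixed with a count and the list of used dictionary block IDs; the exact layout (struct formats) is an assumption, not specified in the application.

      # Sketch: prefix a delta-compressed block with the IDs of the dictionary blocks that
      # were effectively used, so a decompressor knows which processed blocks to fetch.
      import struct

      def frame_delta_block(used_block_ids, compressed_payload: bytes) -> bytes:
          header = struct.pack("<H", len(used_block_ids))             # number of used dictionaries
          header += b"".join(struct.pack("<I", bid) for bid in used_block_ids)
          return header + compressed_payload

      def parse_delta_block(data: bytes):
          (count,) = struct.unpack_from("<H", data, 0)
          ids = [struct.unpack_from("<I", data, 2 + 4 * i)[0] for i in range(count)]
          return ids, data[2 + 4 * count:]

      framed = frame_delta_block([3, 5], b"...delta payload...")
      print(parse_delta_block(framed))               # ([3, 5], b'...delta payload...')
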
  • FIG. 10 is a schematic block diagram of an electronic device 900 according to an embodiment of this application.
  • the electronic device 900 includes: a dividing module 901, configured to divide an incoming block into multiple incoming chunks, wherein each of the multiple incoming chunks comprises X bytes; a selecting module 902, configured to select one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, wherein the one or multiple sets of bytes comprise non-consecutive Y bytes in original order selected from the X bytes; a calculating module 903, configured to calculate one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and a compressing module 904, configured to compress the incoming block according to the incoming hash values calculated by the calculating module for the multiple incoming chunks.
  • the dividing module 901 is configured to divide the incoming block into one or multiple chunks.
  • the selecting module 902 is configured to select one or multiple sets of bytes for hash calculation by the calculating module 903.
  • the hash calculation result will be used by the compressing module 904 to compress the incoming block. A small number of bytes rather than the chunk as a whole are selected for hash calculation, which reduces the computational costs involved in the compression process and improves the efficiency of compression.
  • a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m, wherein m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
  • the preset rule includes: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies a modified Fibonacci sequence or a logarithmic function sequence.
  • the compressing module 904 is specifically configured to: determine one or multiple first blocks according to the incoming hash values, wherein the one or multiple first blocks are processed blocks and each of the one or multiple first blocks comprises one or multiple first chunks with a hash value the same as one of the incoming hash values; and compress the incoming block based on the one or multiple first blocks.
  • the compressing module 904 is configured to: determine the one or multiple first chunks based on the incoming hash values, wherein each of the one or multiple first chunks is a processed chunk and has the same hash value as one of the incoming hash values; and determine the one or multiple first blocks based on the one or multiple first chunks, wherein the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
  • the one or multiple first chunks are determined based on a key-value store.
  • the key-value store keeps association of hash values and corresponding chunk identifiers and block identifiers of the processed chunks.
  • the number of the one or multiple first chunks is greater than a first preset threshold.
  • the electronic device further comprises a determining module, configured to determine one or multiple second blocks from the one or multiple first blocks according to a length of matched string between the one or multiple first blocks and the incoming block, where the length of matched strings is greater than a second preset threshold.
  • the result of the compressed incoming block includes the block identifier of the one or multiple second blocks.
  • the electronic device further comprises an updating module, configured to update the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
  • an electronic device 1000 may include a transceiver 1001, a processor 1002, and a memory 1003.
  • the memory 1003 may be configured to store code, instructions, or the like executed by the processor 1002.
  • the processor 1002 may be an integrated circuit chip and has a signal processing capability. In an implementation process, steps of the foregoing method embodiments may be completed by using a hardware integrated logic circuit in the processor, or by using instructions in a form of software.
  • the processor may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor may implement or perform the methods, the steps, and the logical block diagrams that are disclosed in the embodiments of the present application.
  • the general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed with reference to the embodiments of the present application may be directly performed and completed by a hardware decoding processor, or may be performed and completed by using a combination of hardware in the decoding processor and a software module.
  • the software module may be located in a mature storage medium in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory, and the processor reads information in the memory and completes the steps of the foregoing methods in combination with hardware in the processor.
  • the memory 1003 in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory.
  • the nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (Random Access Memory, RAM) and is used as an external cache.
  • RAMs may be used, and are, for example, a static random access memory (Static RAM, SRAM), a dynamic random access memory (Dynamic RAM, DRAM), a synchronous dynamic random access memory (Synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), a synchronous link dynamic random access memory (Synchronous link DRAM, SLDRAM), and a direct ram bus random access memory (Direct Rambus RAM, DR RAM).
  • An embodiment of this application further provides a chip system, where the chip system includes an input/output interface, at least one processor, and at least one memory.
  • the at least one memory is configured to store instructions
  • the at least one processor is configured to invoke the instructions of the at least one memory to perform operations performed by the electronic device in the methods in the foregoing embodiments.
  • An embodiment of this application further provides a computer storage medium, where the computer storage medium may store a program instruction for performing the steps in the foregoing methods.
  • the storage medium may be specifically the memory 1003.
  • An embodiment of this application further provides a computer program product, where when the computer program product runs on an electronic device, the electronic device is enabled to perform the steps in the foregoing methods.
  • a person of ordinary skill in the art may be aware that, in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
  • “at least one” means one or more, and “a plurality of” means two or more.
  • the term “and/or” describes an association relationship between associated objects and represents that three relationships may exist.
  • a and/or B may represent the following three cases: Only A exists, both A and B exist, and only B exists, where A and B may be singular or plural.
  • the character “/” generally indicates an “or” relationship between the associated objects. “At least one of the following” and a similar expression thereof refers to any combination of these items, including any combination of one item or a plurality of items.
  • At least one of a, b, and c may indicate: a, b, c, a and b, a and c, b and c, or a, b, and c, where a, b, and c may be singular or plural.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the described apparatus embodiment is merely an example.
  • the unit division is merely logical function division and may be other division in actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of the embodiments.
  • When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer readable storage medium. Based on such an understanding, the technical solutions in this application essentially, or the part contributing to the prior art, or some of the technical solutions may be implemented in a form of a software product.
  • the computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application.
  • the foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of this application provide a method for similarity detection and delta-compression and an electronic device. The method includes: calculating hash values for one or multiple sets of bytes taken from chunks of an incoming block according to a preset rule; finding candidate dictionaries for delta-compression of the incoming block among already processed blocks according to the calculated hash values; prioritizing the candidate dictionaries by the number of hash values matching the incoming block; selecting a fixed number of the best candidate dictionaries (first blocks) for delta-compression, followed by delta-compression; determining, in the course of the delta-compression, effective dictionaries (second blocks) by the number of matched strings between each candidate dictionary and the delta-compressed incoming block, said number being bigger than a threshold; the delta-compressed incoming block is prefixed with the IDs of the effective dictionaries and stored or transmitted further. According to the technical solution, the method can reduce computational costs and improve compression efficiency.

Description

METHOD FOR DATA COMPRESSION AND ELECTRONIC DEVICE
TECHNICAL FIELD
[0001] Embodiments of the present application relate to the field of data similarity in data storage and transmission, and more specifically, to a method for data compression and an electronic device, which improves delta-compression in data storage and data transmission of an electronic device.
BACKGROUND
[0002] With the development of computer technologies, the amount of data to be stored or transmitted has been increasing rapidly. Lossless compression, a process of reducing a size of data without loss of information, is widely used to minimize amounts of stored and transmitted data.
[0003] Similarity detection makes it possible to apply deduplication or delta-compression for similar data, and further improves efficiency of data storage and transmission. Known methods of similarity detection are quite computationally expensive and time-consuming.
SUMMARY
[0004] Embodiments of this application provide a method of data similarity detection, which improves delta-compression in data storage and data transmission of an electronic device.
[0005] According to a first aspect, an embodiment of this application provides a method for data similarity detection, including: dividing an incoming block into multiple incoming chunks, where each of the multiple incoming chunks includes X bytes; selecting one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, where the one or multiple sets of bytes include non-consecutive Y bytes in original order selected from the X bytes; calculating one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and matching the incoming block to other blocks according to the calculated hash values for the one or multiple chunks to determine their similarity. Then, one or more of the best similarity candidate blocks, found in this way, are used as a dictionary to delta-compress the incoming block.
[0006] According to the above-mentioned technical solution, the incoming block is divided into multiple incoming chunks and a small number of bytes (sets of bytes) are selected from each incoming chunk for hash calculation. Selecting sets of bytes reduces the computational costs involved in the hash calculation as compared to using all bytes in a chunk for hash calculation. The calculated hash values can be employed to match the incoming block to other blocks to determine their similarity. Then, one or more of the best similarity candidate blocks, found in this way, are used as a dictionary to delta-compress the incoming block.
[0007] In one optional implementation, when a quantity of the sets of bytes is multiple for each incoming chunk, several sets of bytes are selected from each chunk and a respective number of hash values is calculated for each chunk. The first byte of each selected byte set has an offset from the start of the chunk equal to 0, 1, 2, 4 ... 2^m, where the m is a natural number unique for different sets of bytes.
[0008] According to the above-mentioned technical solution, if multiple sets of bytes are selected in one incoming chunk or multiple incoming hash values are to be calculated for one incoming chunk, the selected byte sets may shift by 0, 1, 2, 4 ... 2^m bytes, and this may help to detect similar features that are shifted due to insertions and deletions of 2^m bytes, which is quite frequent due to the ubiquitous 8-bit, 16-bit, 32-bit and 64-bit data types in modern SW, thus improving the similarity detection among blocks and, as a result, improving the delta-compression ratio.
[0009] In one optional implementation, the preset rule includes: intervals between the non-consecutive Y bytes are determined by a modified Fibonacci sequence, a logarithmic function, or an exponential function. The modified Fibonacci sequence is the Fibonacci sequence without the first 1, e.g. 1, 2, 3, 5 ...
[0010] In one optional implementation, the preset rule includes: for any two bytes in the each of the set of bytes, the two bytes are selected at nonadjacent positions from the incoming chunk.
[0011] According to the above-mentioned technical solution, the bytes are selected at intervals, which can reduce the number of bytes selected and computational costs involved for hash calculation and similarity detection.
[0012] In one optional implementation, the similarity detection of the incoming block to other blocks according to the incoming hash values for the multiple incoming chunks includes: determining one or multiple first blocks according to the incoming hash values, where the one or multiple first blocks are processed blocks and each of the one or multiple first (processed) blocks includes one or multiple first chunks with a hash value the same as one of the incoming hash values. Then, one or more of the best similarity candidate blocks, found in this way, are used as a dictionary to delta-compress the incoming block.
[0013] According to the above-mentioned technical solution, incoming hash values calculated for the incoming chunks of the incoming block are employed to look up a key-value store for information about processed blocks that include chunks with the same hash values.
[0014] Similarity detection of the incoming block to one or more processed data blocks is done by comparing incoming hash values to hash values calculated for chunks of already processed data blocks, further called respectively processed hash values and processed chunks, to find one or more pairs, each pair comprising an incoming chunk and a processed chunk whose incoming and processed hash values are equal, and each chunk is associated with a respective data block identifier, further referred to as incoming blockID and processed blockID.
[0015] The found processed chunks, with hash values equal to incoming hash values, are sorted by processed blockID so that the respective processed blockIDs are ordered by the number of chunks having hash values equal to hash values of the incoming block.
[0016] One or more processed blocks, matching the sorted blockIDs with the biggest number of chunks having hash values equal to hash values of the incoming block, are selected as candidate dictionaries for delta-compression of the incoming block.
[0017] Delta-compressing the incoming block using the selected processed blocks as candidate dictionaries is then done.
[0018] During delta-compression it is determined whether each candidate dictionary contains byte strings equal to byte strings in the incoming block to make delta-compression effective, that is, to encode a byte string in the incoming block as a reference to an equal byte string in the dictionary block using fewer bits than the number of bits in the respective byte string.
[0019] Dictionary candidates which do not contain any byte strings for effective delta-compression of incoming block are discarded.
[0020] Dictionary candidates, further referred to as effective dictionaries, which contain byte strings for effective delta-compression of the incoming block are retained.
[0021] Result of delta-compression of incoming block together with processed blockIDs of effective dictionaries is stored.
[0022] A subset of the incoming hash values to be stored as processed hash values, associated with the respective blockID, for future similarity detection among processed blocks and future incoming blocks is selected so that the subset includes only different hash values; further, the subset includes one or more minimal hash values and one or more maximal hash values.
[0023] The selected incoming hash values, associated with the incoming blockID, are stored to be used later for similarity detection among processed blocks and future incoming blocks in an appropriate data structure for fast lookup by hash as a key to get a processed blockID and to retrieve the respective processed block as a candidate dictionary for delta-compression of the incoming block.
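As an illustrative sketch only (not limiting the embodiments), the subset selection described in paragraph [0022] could look as follows in C++, assuming a tunable parameter k for the number of minimal and maximal hash values to keep; the helper name SelectHashSubset and the parameter k are hypothetical.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <vector>

    // Keeps only different hash values, then retains the k minimal and the k
    // maximal of them as the subset to be stored for future similarity detection.
    std::vector<std::uint64_t> SelectHashSubset(
            const std::vector<std::uint64_t>& incoming_hashes, std::size_t k) {
        std::set<std::uint64_t> unique(incoming_hashes.begin(), incoming_hashes.end());
        std::vector<std::uint64_t> sorted(unique.begin(), unique.end());
        std::vector<std::uint64_t> subset;
        for (std::size_t i = 0; i < sorted.size() && i < k; ++i)
            subset.push_back(sorted[i]);                          // minimal hash values
        std::size_t start = sorted.size() > 2 * k ? sorted.size() - k
                                                  : std::min(k, sorted.size());
        for (std::size_t i = start; i < sorted.size(); ++i)
            subset.push_back(sorted[i]);                          // maximal hash values
        return subset;
    }

Each value in the returned subset would then be associated with the incoming blockID in the fast-lookup structure mentioned in paragraph [0023].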
[0024] According to a second aspect, an electronic device is provided. The electronic device including: a dividing module, configured to divide an incoming block into multiple incoming chunks, where each of the multiple incoming chunks includes X bytes; a selecting module, configured to select one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, where the one or multiple sets of bytes include non-consecutive Y bytes in original order selected from the X bytes; a calculating module, configured to calculate one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and a compressing module, configured to compress the incoming block according to the incoming hash values calculated by the calculating module for the multiple incoming chunks.
[0025] According to the above-mentioned technical solution, the dividing module is configured to divide the incoming block into one or multiple chunks. The selecting module is configured to select one or multiple sets of bytes for hash calculation by the calculating module. The hash calculation result will be used by the compressing module to compress the incoming block. A small number of bytes rather than the chunk as a whole are selected for hash calculation, which reduces the computational costs involved in the compression process and improves the efficiency of compression.
[0026] In one optional implementation, when a quantity of the sets of bytes is multiple for each incoming chunk, a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m, where the m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
[0027] In one optional implementation, the preset rule includes: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies a modified Fibonacci sequence or logarithmic function sequence.
[0028] In one optional implementation, the compressing module is specifically configured to: determine one or multiple first blocks according to the incoming hash values, where the one or multiple first blocks are processed blocks and each of the one or multiple first blocks includes one or multiple first chunks with a hash value the same as one of the incoming hash values; and compress the incoming block based on the one or multiple first blocks.
[0029] In one optional implementation, the compressing module is configured to: determine the one or multiple first chunks based on the incoming hash values, where each of the one or multiple first chunks is a processed chunk and has a same hash value as one of the incoming hash values; and determine the one or multiple first blocks based on the one or multiple first chunks, where the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
[0030] In one optional implementation, where the one or multiple first chunks are determined based on a key-value store, where the key-value store keeps association of hash values and corresponding chunk identifiers and block identifiers of the processed chunks.
[0031] In one optional implementation, a quantity of the one or multiple first chunks is greater than a first preset threshold.
[0032] In one optional implementation, the electronic device further includes a determining module, configured to determine one or multiple second blocks from the one or multiple first blocks according to a length of matched string between the one or multiple first blocks and the incoming block, where a length of matched strings is greater than a second preset threshold.
[0033] In one optional implementation, the result of the compressed incoming block includes the block identifier of the one or multiple second blocks.
[0034] In one optional implementation, the electronic device further includes an updating module, configured to update the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
[0035] According to a third aspect, an embodiment of this application provides a method for data compression, including: dividing an incoming block into multiple incoming chunks, where each of the multiple incoming chunks includes X bytes; selecting one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, where the one or multiple sets of bytes include non-consecutive Y bytes in original order selected from the X bytes; calculating one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and compressing the incoming block according to the incoming hash values for the multiple incoming chunks.
[0036] According to the above-mentioned technical solution, the incoming block is divided into multiple incoming chunks and a small number of bytes (sets of bytes) are selected from the divided incoming chunks for hash calculation. Instead of using chunks as a whole to obtain features by hash calculation, selecting sets of bytes reduces computational costs involved in the hash calculation. The calculated incoming hash values can be employed to compress the incoming block.
[0037] Optionally, the incoming block can be used as a whole to select one or multiple sets of bytes for hash calculation, and then the incoming block can be compressed with hash values of processed blocks.
[0038] In one optional implementation, when a quantity of the sets of bytes is multiple for each incoming chunk, a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m, where the m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
[0039] According to the above-mentioned technical solution, if multiple sets of bytes are selected in one incoming chunk or multiple incoming hash values are to be calculated for one incoming chunk, the difference of the offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m; in other words, for different sets of bytes, the selected bytes may shift by 2^m bytes, and this may help to detect similar features that are shifted due to insertions and deletions of 2^m bytes, which is quite frequent due to the ubiquitous 8-bit, 16-bit, 32-bit and 64-bit data types in modern SW, thus improving the compression ratio. Optionally, the offset0 position of the first set of bytes is 0, the offset0 position of the second set of bytes is 1, and the offset0 position of the third set of bytes is 2, etc. In other words, the difference of the offset0 position between the first set of bytes and the following sets of bytes is 1, 2, 4, 8, 16...
[0040] In one optional implementation, the preset rule includes: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies a modified Fibonacci sequence or logarithmic function sequence.
[0041] According to the above-mentioned technical solution, the difference of the byte position between the set of bytes satisfies the modified Fibonacci sequence or logarithmic function sequence or exponential function sequence. The bytes are selected at intervals so that only a small number of bytes are selected for hash calculation, thereby reducing the computational costs and improving the compression speed.
[0042] In one optional implementation, the preset rule includes: for any two bytes in the each of the set of bytes, the two bytes are selected at nonadjacent positions from the incoming chunk.
[0043] According to the above-mentioned technical solution, the bytes are selected at intervals, which can reduce the number of bytes selected and computational costs involved.
[0044] In one optional implementation, the compressing the incoming block according to the incoming hash values for the multiple incoming chunks, includes: determining one or multiple first blocks according to the incoming hash values, where the one or multiple first blocks are processed blocks and each of the one or multiple first blocks includes one or multiple first chunks with a hash value the same as one of the incoming hash values; and compressing the incoming block based on the one or multiple first blocks.
[0045] According to the above-mentioned technical solution, incoming hash values calculated for the incoming chunks of the incoming block are employed to look up a key-value store for blocks that include chunks with the same hash values.
[0046] For one example, first blocks include all the blocks corresponding to the matched records in the key-value store.
[0047] For another example, first blocks can be selected from the blocks corresponding to the records taken from the key-value store.
[0048] For another example, blocks corresponding to the records taken from the key-value store can be given a priority level according to the number of chunks each block has. Then, if two chunks in different blocks are taken from the key-value store by the same hash value, the chunk in the block with a lower priority will be purged, since the chunk in the block with a higher priority can be used to compress the particular incoming chunk with the same hash value in the incoming block. Then the blocks are resorted according to the number of chunks left. The resorted blocks can be deemed as the first blocks that have the potential to perform delta compression on the incoming block. The operation will improve the quality of the first blocks as candidate dictionary blocks.
[0049] In one optional implementation, the determining one or multiple first blocks according to the incoming hash values includes: determining the one or multiple first chunks based on the incoming hash values, where each of the one or multiple first chunks is a processed chunk and has a same hash value as one of the incoming hash values; and determining the one or multiple first blocks based on the one or multiple first chunks, where the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
[0050] In one optional implementation, where the one or multiple first chunks are determined based on a key-value store, where the key-value store keeps association of hash values and corresponding chunk identifiers and block identifiers of the processed chunks.
[0051] In one optional implementation, a quantity of the one or multiple first chunks is greater than a first preset threshold.
[0052] According to the above-mentioned technical solution, the number of chunks included in the first blocks is greater than a first preset threshold, which improves the possibility of the first blocks being useful to compress the incoming block and improves the compression ratio.
[0053] In one optional implementation, where after the compressing the incoming block based on the one or multiple first blocks, the method includes: determining one or multiple second blocks from the one or multiple first blocks according to a length of matched string between the one or multiple first blocks and the incoming block, where the length of matched strings is greater than a second preset threshold.
[0054] According to the above-mentioned technical solution, the second blocks are selected from the first blocks according to the string matching result, and only first blocks with matched strings longer than a second preset threshold will be considered as the second blocks. The second blocks are blocks that are useful for compressing the incoming block.
[0055] In one optional implementation, the result of the compressed incoming block includes the block identifier of the one or multiple second blocks.
[0056] According to the above-mentioned technical solution, the compressed incoming block includes the block identifiers of the second blocks, for example, the compressed incoming block is prefixed with the block identifiers of the second blocks, which are useful in the compression of the incoming block.
[0057] In one optional implementation, where the method further includes: updating the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
[0058] According to the above-mentioned technical solution, after the compression of the incoming block, the key-value store is updated with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks, so that when compressing the following block, the chunk identifiers and the block identifiers of the incoming chunks can be employed for the similarity detecting process.
[0059] According to a fourth aspect, an embodiment of this application provides a computer-readable storage medium, including instructions. When the instructions run on a computer, the computer is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
[0060] According to a fifth aspect, an electronic device is provided, including a processor and a memory. The processor is connected to the memory. The memory is configured to store instructions, the processor is configured to execute the instructions. When the processor executes the instructions stored in the memory, the processor is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
[0061] According to a sixth aspect, a chip system is provided, where the chip system includes a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that an electronic device on which the chip system is disposed performs the method in the first aspect or any optional implementation of the first aspect.
[0062] According to a seventh aspect, a computer program product is provided, where when the computer program product runs on an electronic device, the electronic device is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
DESCRIPTION OF DRAWINGS
[0063] FIG. 1 shows a flowchart of an embodiment of a method 100 for data compression.
[0064] FIG. 2 shows an embodiment for selecting one set of bytes for each chunk in an incoming block.
[0065] FIG. 3 shows an embodiment for selecting more than one set of bytes for chunk 0 in FIG. 1.
[0066] FIG. 4 is a schematic diagram of a key-value store.
[0067] FIG. 5 is a schematic diagram depicting a delta compression process provided in this application.
[0068] FIG. 6 is a schematic diagram depicting a delta compression process.
[0069] FIG. 7 shows a key-value store updating result based on an incoming block.
[0070] FIG. 8 is a flowchart of an embodiment of a method 700 for data compressing.
[0071] FIG. 9 is a flowchart of an embodiment of a method 800 for data compressing.
[0072] FIG. 10 is a schematic block diagram of an electronic device 900.
[0073] FIG. 11 is a schematic block diagram of an electronic device 1000.
DESCRIPTION OF EMBODIMENTS
[0074] The following describes the technical solutions in this application with reference to the accompanying drawings.
[0075] Definitions:
[0076] Unless otherwise stated or implicit from context the following terms and phrases have the meanings provided below.
[0077] The term “block” refers to a sequence of bytes having a predetermined length.
[0078] When a chunk is divided from a block, the term “chunk” refers to a sequence of bytes with a predetermined length smaller than that of the block. In some cases, when the whole block is deemed as a chunk, the chunk is a sequence of bytes with the same length as that of the block.
[0079] The term “byte” refers to a sequence of eight bits, where a bit is the smallest unit in the computer science field.
[0080] The phrase “hash value” refers to the output that is returned after the application of a hash function algorithm. The phrases “hash function algorithm” and “hash function” refer to an algorithm or subroutine that maps large data sets (optionally of variable length) to smaller data sets that have a fixed size for a particular hash function. The values returned by the algorithm may also be called hash function values, hash codes, hash sums, checksums, or hash values.
[0081] Data compression, source coding, or bit-rate reduction is the process of encoding information by using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) is referred to as a decoder.
[0082] Compression is useful because it reduces resources required to store and transmit data. Computational resources are consumed in the compression and decompression processes. Data compression is subject to a space-time complexity trade-off. The design of data compression schemes involves trade-offs among various factors, including a degree of compression, an amount of distortion introduced (when using lossy data compression), and computational resources required to compress and decompress data.
[0083] Delta encoding is one kind of lossless compression method, which is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files; and more generally, this is known as data differencing. Delta encoding is sometimes called delta compression, particularly in a case where archival histories of changes are required (e.g., in revision control software).
[0084] Data compression ratio or compression ratio (CR), also known as compression power, is a measurement of the relative reduction in size of data representation produced by a data compression algorithm. It is typically expressed as the division of uncompressed size by compressed size, and is given by:
[0085] CR= Undressed Size
Compressed Size
[0086] When delta compression is applied, the incoming data is divided into smaller units and hash values are calculated for the units. Hash values are considered as fingerprints of these units. Therefore, for units with identical hash values, an original unit and a difference can be stored rather than two entire units.
[0087] However, there is a tradeoff between the number of hash values needed to be calculated and the computational costs for each calculated hash value. That is, when the divided units have fewer bytes/bits, the incoming data is divided into more units, thus increasing the overall computational costs for hash calculation, and when the divided units have more bytes/bits, each unit is longer and thus the computational cost for each hash value is higher. It is important to reduce the computational costs for compression.
[0088] The embodiments of the present application address the challenges of decreasing the computational costs of compression algorithms and speeding up the compression algorithm.
[0089] FIG. 1 is a flowchart of a method 100 for data compression according to an embodiment of the present application. The method 100 includes the following steps.
[0090] S102, divide an incoming block into multiple incoming chunks, where each of the multiple incoming chunks includes X bytes.
[0091] The incoming block can be received over a wired or wireless network from a host. The data to be compressed can be transferred via multiple blocks or the blocks can be defined by the device performing the data compression method. A block refers to a sequence of bytes with predetermined length of bytes. The incoming block is the block that is being processed. And before the incoming block, there may be several blocks that have been processed. After the incoming block, there may be several blocks to be processed later. The block size is not fixed. For example, the block can be 4KB (kilobyte) in size. The block can include a header to identify the block. The header may include information such as, but not limited to, a block identifier or block ID associated with the block. The block with the block ID 2 can be denoted as block2.
[0092] The incoming block can be divided into multiple incoming chunks. The chunk refers to a sequence of bytes with a predetermined length smaller than that of the block. For example, the incoming block of 4KB can be divided into 4 incoming chunks of 1024 B each. Also, there is a chunk ID for each chunk to identify the unique chunk. Therefore, four incoming chunks of the incoming block can be denoted as chunk 0, chunk 1, chunk 2 and chunk 3. To distinguish the chunks in different blocks, (block ID, chunk ID) used as the chunk unique ID (chunk UID) is introduced to help illustrate the present embodiment.
[0093] S104, select one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, where the one or multiple sets of bytes include non-consecutive Y bytes in original order selected from the X bytes.
[0094] In some embodiments, one set of bytes is selected from each of the chunks, and for all the chunks in the data stream, the bytes are selected according to a preset rule. The set of bytes includes non-consecutive Y bytes from the X bytes. In other words, the selected bytes are not continuous. The set of bytes is selected at incremental steps from an offset0 position, where the selected bytes are spaced by a predefined pattern. For example, the preset rule includes: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies the modified Fibonacci sequence or a logarithmic sequence. A logarithmic sequence is a set of real numbers a[i], i = 1, 2, ..., such that
(1) 0 <= a[i] <= 1,
(2) a[k] + a[l] < a[r] + a[s] if k*l < r*s,
(3) a[k] + a[l] = a[r] + a[s] if k*l = r*s,
[0095] In other words, the steps may increase logarithmically, exponentially, linearly, by modified Fibonacci or even by self-designed sequence.
[0096] The difference of the byte position between the non-consecutive Y bytes and the offset0 position is a result of a byte position or byte ID of one of the Y bytes minus the offset0 position. A case where the difference satisfies the modified Fibonacci sequence is taken as an example. The incoming block is a string of 64 bytes and is divided into four incoming chunks, denoted as chunk 0, chunk 1, chunk 2, and chunk 3, respectively. As shown in FIG. 2, each incoming chunk is a 16-byte string. Each number in a box in FIG. 2 represents a byte ID or byte position. In this application, the “offset0 position”, “byte number”, “byte ID”, and “byte position” all refer to the position of a byte in the original incoming chunk from which it is selected. Chunk 0 includes byte 0 to byte 15, chunk 1 includes byte 16 to byte 31, chunk 2 includes byte 32 to byte 47, and chunk 3 includes byte 48 to byte 63.
[0097] The Fibonacci sequence is [1, 1, 2, 3, 5, 8, 13, 21...]. We will replace the leading 1 by 0 to avoid repetition. Then the modified Fibonacci sequence will be [0, 1, 2, 3, 5, 8, 13, 21...].
[0098] If the offset0 position, where the first byte of the Y bytes is selected, is the start position of the incoming chunk, or the byte ID of the offset0 position is 0, then, as the position differences satisfy the modified Fibonacci sequence, for chunk 0, the byte IDs of the selected bytes are [0, 1, 2, 3, 5, 8, 13], which are emphasized in FIG. 2. Following the modified Fibonacci sequence, the byte IDs of the selected bytes for chunk 1, chunk 2, chunk 3 are [16, 17, 18, 19, 21, 24, 29], [32, 33, 34, 35, 37, 40, 45] and [48, 49, 50, 51, 53, 56, 61], respectively.
[0099] If the offset0 position changes to m, then, as the position differences satisfy the modified Fibonacci sequence, for chunk 0, the byte IDs of the selected bytes are [m, 1+m, 2+m, 3+m, 5+m, 8+m, 13+m]; byte IDs that exceed the biggest byte ID in chunk 0 are not considered here.
[0100] It should be noted that in the example above, a length of a set of bytes is 7 bytes for each chunk. Optionally, the selected bytes can follow any part of the modified Fibonacci sequence. The length can be limited by a small constant. For example, if the length of the selected bytes is 5 bytes, the byte IDs of the selected bytes for chunk 0 can be [0, 1, 2, 3, 5], [1, 2, 3, 5, 8] or [2, 3, 5, 8, 13]. Also, we can delete the leading 1 to get the modified Fibonacci sequence, and then the modified Fibonacci sequence will be [1, 2, 3, 5, 8, 13, 21...].
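As a minimal illustration of the selection rule above (not part of the claimed method), the following C++ sketch computes the byte positions of one set of bytes for a chunk, assuming the modified Fibonacci offsets [0, 1, 2, 3, 5, 8, 13] and a given offset0 position; the function name SelectBytePositions is hypothetical.

    #include <cstddef>
    #include <vector>

    // Offsets of the selected bytes relative to the offset0 position
    // (modified Fibonacci sequence with the leading 1 replaced by 0).
    static const int kFibOffsets[] = {0, 1, 2, 3, 5, 8, 13};

    // Returns the byte positions of one set of bytes for a chunk of chunk_size
    // bytes; positions beyond the chunk end are skipped, as in the example above.
    std::vector<std::size_t> SelectBytePositions(std::size_t offset0,
                                                 std::size_t chunk_size) {
        std::vector<std::size_t> positions;
        for (int d : kFibOffsets) {
            std::size_t pos = offset0 + static_cast<std::size_t>(d);
            if (pos < chunk_size)
                positions.push_back(pos);
        }
        return positions;
    }

For example, offset0 = 0 and chunk_size = 16 yields {0, 1, 2, 3, 5, 8, 13}, matching the byte IDs emphasized in FIG. 2 for chunk 0.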
[0101] It also should be noted that the preset rule could be: for any two bytes in each of the sets of bytes, the two bytes are selected at nonadjacent positions from a chunk, or the number of consecutive bytes selected from a chunk should be no more than n bytes, where n is a preset value. These preset rules will reduce the number of bytes selected and extend the selection range in a chunk, detecting more features and reducing the costs involved.
[0102] According to a third aspect, an embodiment of this application provides a computer-readable storage medium, including instructions. When the instructions run on a computer, the computer is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
[0103] According to a fourth aspect, an electronic device is provided, including a processor and a memory. The processor is connected to the memory. The memory is configured to store instructions, the processor is configured to execute the instructions. When the processor executes the instructions stored in the memory, the processor is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
[0104] According to a fifth aspect, a chip system is provided, where the chip system includes a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that an electronic device on which the chip system is disposed performs the method in the first aspect or any optional implementation of the first aspect.
[0105] According to a sixth aspect, a computer program product is provided, where when the computer program product runs on an electronic device, the electronic device is enabled to perform the method in the first aspect or any optional implementation of the first aspect.
[0106] In one optional implementation, if only one set of bytes is selected for one chunk, the offset0 position, that is, the start position of the set of bytes, can be chosen randomly. In this implementation, the offset0 position doesn't necessarily mean the first byte position of the set of bytes. For example, if the modified Fibonacci sequence is [0, 1, 2, 3, 5, 8, 13, 21...], the offset0 position is the position of the first byte in one set of bytes. However, if the modified Fibonacci sequence is [1, 2, 3, 5, 8, 13, 21...], the first byte of the set of bytes is the byte following the offset0 position.
[0107] In another optional implementation, when the quantity of the sets of bytes is multiple for each incoming chunk, a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2^m, where the m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes. In other words, the difference follows a base 2 exponential function, that is, the difference of the offset0 position for different sets of bytes of one chunk can be 1, 2, 4, 8, 16, 32... bytes.
For the different sets of bytes in a chunk, the value of m can be consecutive or non-consecutive integers. The offset0 positions for different sets of bytes in a chunk are intended to detect similar content that is shifted due to insertions and deletions by 2^m bytes, which is quite frequent due to the ubiquitous 8-bit, 16-bit, 32-bit, 64-bit data types in modern SW. For example, if chunk 0 in block 3 and chunk 1 in block 5 both have 16 bytes, where chunk 1 can be obtained by inserting a byte before chunk 0 and deleting the last byte of chunk 0, then the first fifteen bytes in chunk 0 are the same as the last fifteen bytes in chunk 1. If we select only one set of bytes for each chunk in the same way, then the matched fifteen bytes will not be detected by the selected set of bytes. However, if two sets of bytes are selected, and the two sets of bytes shift by one byte, then the matched fifteen bytes will be detected by the selection method proposed by this application.
[0108] Suppose one offset0 position (denoted as a reference offset0 position) is the first byte of a chunk, and the offset0 positions of the sets of bytes are spaced by 0, 1, 2, 4, 8, 16... bytes from the 1st byte of the chunk; in other words, the offset0 position is the 1st, 2nd, 3rd, 5th, 9th, 17th... byte of the chunk. As shown in FIG. 3, for example, for each incoming chunk, three sets of bytes are selected. For chunk 0, the byte IDs of the offset0 positions are 0, 1, and 2, which are the 1st, 2nd and 3rd bytes of the chunk. Each offset0 position is spaced by 0, 1, 2, 4, 8, 16, 32... bytes from the reference offset0 position. The byte IDs of the three sets of bytes are [0, 1, 2, 3, 5, 8, 13], [1, 2, 3, 4, 6, 9, 14] and [2, 3, 4, 5, 7, 10, 15], respectively. The multiple sets of bytes are selected for hash calculation to detect features of the chunk (discussed later). In this way, several sets of bytes are selected for the following hash calculation.
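For illustration only, the offset0 positions of the multiple sets of bytes could be generated as in the sketch below, which assumes that the reference offset0 position is the start of the chunk and that the differences are 0, 1, 2, 4, 8... (2^m) bytes; the function name Offset0Positions is hypothetical.

    #include <cstddef>
    #include <vector>

    // Generates the offset0 positions of the multiple sets of bytes of one chunk:
    // 0, 1, 2, 4, 8, ... (differences of 2^m from the reference set), stopping
    // once the offset reaches the chunk size or enough sets have been produced.
    std::vector<std::size_t> Offset0Positions(std::size_t num_sets,
                                              std::size_t chunk_size) {
        std::vector<std::size_t> offsets;
        std::size_t off = 0;     // reference offset0 position (start of the chunk)
        std::size_t step = 1;
        while (offsets.size() < num_sets && off < chunk_size) {
            offsets.push_back(off);
            off = step;          // 1, 2, 4, 8, ... = 2^m
            step *= 2;
        }
        return offsets;
    }

    // Example: num_sets = 3 and chunk_size = 16 gives {0, 1, 2}; combined with the
    // Fibonacci offsets above this reproduces the three byte sets [0,1,2,3,5,8,13],
    // [1,2,3,4,6,9,14] and [2,3,4,5,7,10,15] of FIG. 3.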
[0109] S106, calculate one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks.
[0110] Hash calculation can be explained as mapping of a set of bytes to a scalar value (one number). The calculated scalar value is used as an index to determine a bucket in a key-value store (discussed later). Hash calculation is based on the set of bytes selected. Therefore, if only one set of bytes is selected for one chunk, one hash value is calculated for one chunk; and if multiple sets of bytes are selected for one chunk, multiple hash values are calculated for one chunk. Yet it is notable that the multiple hash values for one chunk are not necessarily different. Also, the hash values for different chunks may also be the same if they have similar features.
[0111] The embodiment of the present application does not limit the algorithm for hash calculation. For example, hash calculation can be performed by Rabin-Karp, MD, SHA (secure hash algorithm), CRC32, etc.
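Since the hash algorithm is not limited, the sketch below uses 64-bit FNV-1a purely as one hypothetical choice; it hashes only the selected byte positions of a chunk (never the chunk as a whole), and any of the algorithms named above could be substituted. The function name HashSelectedBytes is illustrative.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hashes only the selected bytes of a chunk, in their original order,
    // with 64-bit FNV-1a.
    std::uint64_t HashSelectedBytes(const std::uint8_t* chunk,
                                    const std::vector<std::size_t>& positions) {
        std::uint64_t h = 14695981039346656037ULL;   // FNV offset basis
        for (std::size_t pos : positions) {
            h ^= chunk[pos];
            h *= 1099511628211ULL;                   // FNV prime
        }
        return h;
    }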
[0112] S108, compress the incoming block according to the incoming hash values for the multiple chunks.
[0113] In some embodiments, the compressing the incoming block according to the incoming hash values for the multiple incoming chunks includes: determining one or multiple first blocks according to the one or multiple incoming hash values, where the one or multiple first blocks are processed blocks and each of the one or multiple first blocks includes one or multiple first chunks with a hash value the same as one of the one or multiple incoming hash values; and compressing the incoming block based on the one or multiple first blocks.
[0114] For convenience of clear description of the technical solutions in the embodiments of this application, in the embodiments of this application, words such as "first", "second", and the like are used to distinguish between same objects or similar objects whose functions and purposes are basically the same. A person skilled in the art may understand that the terms such as "first" and "second" do not constitute a limitation on a quantity or an execution sequence, and that the terms such as "first" and "second" do not indicate a definite difference.
[0115] To obtain the one or multiple first blocks, the incoming hash values are used as keys to look up a key-value store (or a hash table).
[0116] In this implementation, the determining one or multiple first blocks according to the incoming hash values includes: determining the one or multiple first chunks based on the incoming hash values, wherein each of the one or multiple first chunks is a processed chunk and has a same hash value as one of the incoming hash values; and determining the one or multiple first blocks based on the one or multiple first chunks, wherein the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
[0117] Incoming hash values are used as keys and (block ID, chunk ID) records are used as values, and the key-value store keeps associations of hash values and corresponding (block ID, chunk ID) records of already processed chunks. The data stream to be compressed consists of multiple blocks, and processed chunks are from processed blocks that come before the incoming block. The processed blocks have been divided into one or multiple X bytes of chunks and sets of bytes of Y bytes have been selected from the chunks for hash calculation in the same way that we select sets of bytes for the incoming chunks. Provided that the first block of the data stream to be compressed is denoted as block 0, the incoming block is block 5, then the key-value store keeps associations of records of chunks from block 0, block 1, block 2, block 3 and block 4. It is obvious that the key-value store dynamically updates with blocks being processed.
[0118] For example, the key-value store may have several buckets keeping records of chunks with different hash values. Each bucket keeps records of chunks with the same hash values. Therefore, if one hash value is calculated for one chunk, records of four chunks from one processed block may be stored in up to 4 buckets of the key-value store.
[0119] Using hash values of incoming chunks as keys to look up the key-value store may result in several buckets. Then all the records in these buckets are taken for the incoming block. The chunks corresponding to the records determined based on the key-value store are the first chunks mentioned above, and the blocks from which they are divided are the first blocks that may be useful in the compression process of the incoming block.
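A minimal sketch of such a bucketed key-value store, assuming 64-bit hash keys and (block ID, chunk ID) records, is shown below; the type and function names are illustrative rather than mandated by the embodiment.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // One record of the key-value store: a processed chunk identified by the
    // block it belongs to and its chunk ID inside that block.
    struct ChunkRecord {
        std::uint32_t block_id;
        std::uint32_t chunk_id;
    };

    // Bucketed key-value store: every bucket keeps the records of processed
    // chunks that share the same hash value.
    using KeyValueStore =
        std::unordered_map<std::uint64_t, std::vector<ChunkRecord>>;

    // Insert a processed chunk under its hash value (used when updating the
    // store after a block has been processed).
    void Insert(KeyValueStore& store, std::uint64_t hash, ChunkRecord rec) {
        store[hash].push_back(rec);
    }

    // Look up all processed chunks whose hash equals an incoming hash value.
    std::vector<ChunkRecord> Lookup(const KeyValueStore& store, std::uint64_t hash) {
        auto it = store.find(hash);
        if (it == store.end()) return {};
        return it->second;
    }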
[0120] In some embodiments, the blocks corresponding to the records are used as first blocks to serve as dictionary blocks for compressing the incoming block. In other embodiments, the dictionary blocks are further selected from the first blocks. For example, the number of the one or multiple first chunks is greater than a first preset threshold, which means that blocks that have more chunks similar to the incoming block are qualified to be the first blocks.
[0121] To get the one or multiple first blocks for the incoming block, the method may include the following steps.
[0122] (a) The records are sorted by block IDs.
[0123] After the first sorting, records with the same block ID are grouped together, and each group includes records with the same block ID. Among the records in each group, there may be records with the same chunk ID. The duplicate records are purged, and then each group includes records with the same block ID and unique chunk IDs.
[0124] (b) The groups are sorted by the number of records (the number of chunks).
[0125] Blocks corresponding to groups that have more records or more chunks share more features (hash values) with the incoming block. Therefore, after the second sorting, N groups out of M groups could be selected as first blocks for the incoming block, where N ≤ M, M is the number of blocks that include at least one record, and N is the number of blocks determined from the M blocks based on certain conditions (e.g., first blocks include at least 2 records).
[0126] After the two sortings, the first blocks are selected as the dictionary blocks for the incoming block, and then the incoming block will be compressed based on the first blocks.
[0127] It should be noted that the two-step sorting process provided here is just one implementation of determining the first blocks; there may be other means to determine the first blocks, for example, the number of records corresponding to each block can simply be tracked in statistics tables. In any case, the first blocks will be obtained.
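As one illustrative implementation of steps (a) and (b), the sketch below groups the matched (block ID, chunk ID) records by block ID, purges duplicate chunk IDs within each group, and ranks the blocks by the number of distinct matching chunks. The parameters min_records and max_first_blocks stand in for the selection conditions mentioned above and are assumptions, as is the function name SelectFirstBlocks.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    using Record = std::pair<std::uint32_t, std::uint32_t>;   // (block ID, chunk ID)

    std::vector<std::uint32_t> SelectFirstBlocks(const std::vector<Record>& matches,
                                                 std::size_t min_records,
                                                 std::size_t max_first_blocks) {
        // Step (a): group by block ID; the set purges duplicate chunk IDs.
        std::map<std::uint32_t, std::set<std::uint32_t>> chunks_per_block;
        for (const Record& r : matches)
            chunks_per_block[r.first].insert(r.second);

        // Step (b): rank blocks by the number of distinct matching chunks.
        std::vector<std::pair<std::size_t, std::uint32_t>> ranked;
        for (const auto& kv : chunks_per_block)
            if (kv.second.size() >= min_records)
                ranked.push_back({kv.second.size(), kv.first});
        std::sort(ranked.rbegin(), ranked.rend());   // most matching chunks first

        std::vector<std::uint32_t> first_blocks;
        for (const auto& p : ranked) {
            if (first_blocks.size() == max_first_blocks) break;
            first_blocks.push_back(p.second);
        }
        return first_blocks;
    }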
[0128] If the number of the first blocks is zero, the incoming block is compressed independently, and stored or transferred by a network. The following procedure for this technical solution is: updating the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks. In other words, the calculated hash values and corresponding records of (block ID, chunk ID) are inserted into the key-value store for further search of the first blocks for the next incoming blocks.
[0129] It should be noted that if at least two hash values are calculated for one single chunk, blocks with one of the at least two hash values, rather than all of the at least two hash values, are qualified to be the candidate dictionary blocks, which can reduce the overhead computational costs of a determination process of the first blocks.
[0130] It should also be noted that if at least two sets of bytes are selected for each of the incoming chunks, then in one possible implementation, one set of hash values based on one set of bytes can be selected to insert into the key-value store.
[0131] In one optional implementation, after the compressing the incoming block based on the one or multiple first blocks, the method includes: determining one or multiple second blocks from the one or multiple first blocks according to a length of matched strings between the one or multiple first blocks and the incoming block, where the length of matched strings is greater than a second preset threshold.
[0132] In this implementation, the first blocks are blocks that have the potential to be dictionary blocks but are not necessarily useful for compressing the incoming block. The efficiency of the first blocks for compressing the incoming block is therefore evaluated, and the unused first blocks are discarded to decrease storage costs. The number of first blocks is variable; for example, there may be one or more first blocks after hash matching. The one or more first blocks are copied into the buffer, which is passed as input to a delta-compressor, followed by the incoming block. The output of the delta-compressor consists of a delta-compressed incoming block and statistics for how many sub-strings (“words”) from the incoming block were found as duplicates in each dictionary block.
[0133] Statistics can be collected via the following pseudo-code:
    if (input_word_in_dict(w, dictZ)) stats[dictZ]++;
    else if (input_word_in_dict(w, dictY)) stats[dictY]++;
    else if (input_word_in_dict(w, dictA)) stats[dictA]++;
[0134] The input_word_in_dict is a pseudo-code function that searches for a word (or substring) from the incoming block in a dictionary block (dictZ, dictY, dictA); it takes two parameters, w (the word or substring) and the dictionary block. The statistics for how many sub-strings (“words”) from the incoming block were found as duplicates in each dictionary block are stats[dictZ], stats[dictY] and stats[dictA] in the pseudo-code.
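A runnable counterpart of the pseudo-code above is sketched below. It assumes, purely for illustration, that a dictionary block is held as a std::string and that std::string::find stands in for the real word matcher, which the embodiments do not specify.

    #include <map>
    #include <string>
    #include <vector>

    // Count, per dictionary block, how many words of the incoming block are
    // found in that block; the first matching block wins, mimicking the
    // if / else-if chain of the pseudo-code.
    std::map<std::string, int> collect_stats(
            const std::vector<std::string>& incoming_words,
            const std::map<std::string, std::string>& dicts) {   // name -> contents
        std::map<std::string, int> stats;
        for (const std::string& w : incoming_words)
            for (const auto& d : dicts)
                if (d.second.find(w) != std::string::npos) {     // input_word_in_dict
                    ++stats[d.first];
                    break;
                }
        return stats;
    }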
[0135] The second blocks, selected from the first blocks based on specified conditions, are the blocks effectively used for performing delta compression on the incoming block. Whether a first block qualifies as a second block may be determined by certain rules, for example, the total length of strings matching between a first block and the incoming block is greater than a threshold. Using the pseudo-code above, the string matching result (stats[dictZ], stats[dictY] and stats[dictA]) between the incoming block and the dictionary blocks (dictZ, dictY and dictA) is obtained, and the second blocks are determined according to the matching result.
[0136] In some embodiments, the number of the second blocks for the incoming block is 0, and the incoming block is compressed independently.
[0137] For example, the number of the first blocks is one or more, yet no first block meets the requirement to be a second block, that is, the length of matching strings between every first block and the incoming block is zero or less than the threshold. As a result, all the first blocks are discarded.
[0138] In other embodiments, one or multiple second blocks are effectively used for compressing the incoming block. The one or multiple second blocks are selected from the first blocks whose length of matching strings with the incoming block is greater than the threshold. In this case, the incoming block can be compressed using the second blocks, with the compression result prefixed with the IDs of the second blocks (second block IDs).
[0139] It should be noted that the second blocks are the blocks that are useful in compressing the incoming block, and that the incoming block has already been compressed in the previous process. The determination of the second blocks serves to evaluate the efficiency of the first blocks, so that unused first blocks can be discarded and storage costs saved.
[0140] After the second blocks, which are evaluated as useful in compressing the incoming block, are determined, the result of the compressed incoming block is prefixed with the block identifier of the one or multiple second blocks.
[0141] Assume that the ID of the incoming block is block 10 and that only one hash value is calculated per chunk; the following case is demonstrated to elaborate the determination process of the first blocks and the second blocks.
[0142] The incoming block is divided into four incoming chunks with chunk IDs being chunk 0, chunk 1, chunk 2 and chunk 3, respectively.
[0143] As mentioned earlier, the key-value store keeps the association of hash values and corresponding (block ID, chunk ID) records of chunks divided from block 0 to block 9. As shown in FIG. 4, the present key-value store includes several buckets corresponding to records of (block ID, chunk ID) with unique hash values, where the block ID is between 0 and 9 and the chunk ID is between 0 and 3. For example, the bucket 401 keeps records (4, 2), (1, 3), (5, 0), and the corresponding chunks share a common hash value X. The hash values of the bucket 402, the bucket 403, the bucket 404, the bucket 405 and the bucket 406 are U, W, Z, V, and M, respectively. It should be understood that as the 10 blocks from block 0 to block 9 are divided into 40 chunks altogether and only one hash value is calculated for each chunk, there are 40 records of (block ID, chunk ID) in the present key-value store. In other words, there are other buckets and corresponding records not depicted in FIG. 4. As these records do not influence the determination of the first blocks and the second blocks, they have been omitted from FIG. 4.
[0144] The first blocks can then be selected from the processed blocks according to the hash values calculated for the incoming block. Specifically, for chunk 0, the hash value X matches the bucket 401, so all the records in the bucket 401 are taken, including (4, 2), (1, 3), (5, 0). For chunk 1, no bucket matches the hash value Y (assuming that the hash values of buckets not shown in FIG. 4 are not Y). For chunk 2 and chunk 3, the hash values are Z and W, and the records (3, 3), (3, 1), (5, 3) from the bucket 404 and the records (3, 2), (1, 1), (3, 0), (5, 2), (6, 3) from the bucket 403 are taken. According to these records, block 6, block 4, block 5, block 3, and block 1 can be used as dictionary blocks to compress the incoming block.
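For illustration, the lookup of FIG. 4 can be reproduced with the hypothetical KeyValueStore and Record types sketched earlier; the numeric hash constants below are placeholders, not the real hash values X, Z and W.

    // Rebuild buckets 401, 404 and 403 and look up the incoming hashes.
    std::vector<Record> demo_lookup_fig4() {
        KeyValueStore store;
        const uint64_t X = 0x0A, Z = 0x0B, W = 0x0C;                 // placeholder hashes
        store.buckets[X] = {{4, 2}, {1, 3}, {5, 0}};                 // bucket 401
        store.buckets[Z] = {{3, 3}, {3, 1}, {5, 3}};                 // bucket 404
        store.buckets[W] = {{3, 2}, {1, 1}, {3, 0}, {5, 2}, {6, 3}}; // bucket 403

        std::vector<Record> records;
        for (uint64_t h : {X, Z, W}) {                               // chunks 0, 2, 3; chunk 1 misses
            auto hits = store.lookup(h);
            records.insert(records.end(), hits.begin(), hits.end());
        }
        return records;   // names blocks 1, 3, 4, 5 and 6 as candidates
    }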
[0145] To compress more efficiently, several blocks can be selected from the above five blocks as first blocks. The result after the two sorting steps of the records is as follows:
Group1: (3, 3), (3, 1), (3, 2), (3, 0);
Group2: (5, 0), (5, 2), (5, 3);
Group3: (1,3), (1,1);
Group4: (4,2); and
Group5: (6,3).
[0146] Group1, Group2, Group3, Group4, and Group5 represent block 3, block 5, block 1, block 4, and block 6, respectively, where Group4 and Group5 both have one record. Block 3 has the potential to be a dictionary block based on the hash values calculated for chunk 2 and chunk 3 of the incoming block. Block 5 may help compress chunk 0, chunk 2 and chunk 3. Block 1 may serve as a dictionary block to compress chunk 0 and chunk 3. Block 4 and block 6 are useful for compressing chunk 0 and chunk 3, respectively.
[0147] As has been explained before, the five blocks can be used as dictionary blocks to compress the incoming block. It is more desirable, however, to select only the blocks that qualify as dictionary blocks or first blocks according to certain rules. For example, if the corresponding groups of first blocks need to have at least 2 records, the first blocks will be block 3, block 5, and block 1. Alternatively, if the first blocks are determined as the three largest groups, block 3, block 5, and block 1 are again determined.
[0148] Block 3 (the largest group) matches two incoming chunks in total (twice to chunk 2 and twice to chunk 3). In some embodiments, the chunks already matched by a higher-priority group are removed from the other groups: the record (5, 2) corresponding to chunk 3 and the record (5, 3) corresponding to chunk 2 are deleted from Group2 to avoid duplication. In other words, the matched groups, sorted by the number of matched chunks, are given a corresponding priority, where groups with more matched chunks receive a higher priority; chunks with the same hash value are purged from lower-priority groups, and the groups are then re-sorted by the number of matched chunks left. After similar steps are performed on Group3, Group4 and Group5, the new groups are as follows:
Group1: (3, 3), (3, 1), (3, 2), (3, 0); and
Group2: (5, 0).
[0149] Group3, Group4 and Group5 are empty after this step, and the first blocks are block 3 and block 5. Also, as block 5 now matches only a single chunk of the incoming block, it may happen that only block 3 is taken as the first block to compress the incoming block.
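The priority-based purge described in this example can be sketched as follows. The Match structure, which pairs a candidate block with the incoming chunk whose hash produced the match, is an assumption introduced only for this illustration.

    #include <algorithm>
    #include <cstdint>
    #include <set>
    #include <vector>

    struct Match {
        uint64_t block_id;         // candidate dictionary block
        uint32_t incoming_chunk;   // incoming chunk whose hash matched
    };

    // Groups are assumed to be ordered from the largest (highest priority) to
    // the smallest. A lower-priority group loses every match whose incoming
    // chunk is already covered, and empty groups are dropped, as in the
    // example where only Group1 and Group2 remain.
    void purge_covered_chunks(std::vector<std::vector<Match>>& groups) {
        std::set<uint32_t> covered;
        for (auto& g : groups) {
            g.erase(std::remove_if(g.begin(), g.end(),
                                   [&](const Match& m) { return covered.count(m.incoming_chunk) != 0; }),
                    g.end());
            for (const Match& m : g)
                covered.insert(m.incoming_chunk);
        }
        groups.erase(std::remove_if(groups.begin(), groups.end(),
                                    [](const std::vector<Match>& g) { return g.empty(); }),
                     groups.end());
    }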
[0150] In an optional implementation, several second blocks are further selected from the first blocks by evaluating the efficiency of the first blocks when used as dictionary blocks. As has been elaborated before, string matching is performed between the first blocks and the incoming block while compressing the incoming block, and the blocks that have longer matched strings are determined as the second blocks.
[0151] For example, block 1, block 3, and block 5 all meet the requirement to be second blocks. The three blocks are then useful dictionary blocks for compressing the incoming block, but it may happen that only block 3 and block 5 are effectively used to compress the incoming block, because block 3 and block 5 already include all the strings needed to compress the incoming block.
[0152] The unused first blocks will be discarded and the compression result of the incoming block can be prefixed with block identifiers of block 3 and block 5.
[0153] FIG. 5 is a schematic diagram depicting a delta compression process provided in the application.
[0154] An incoming block 504 is divided into one incoming chunk (or is not divided), and two incoming hash values are calculated using the selected bytes, as shown in FIG. 5. After using the hash values as keys to look up the key-value store, three dictionary blocks (501, 502, 503) are determined. The string matching result is that block 501 and block 502 each have 1 word matched to the incoming block 504, while block 503 has 3 words matched to the incoming block 504. The incoming block 504 can then be compressed using the three dictionary blocks. Optionally, block 501 and block 502 can be discarded as they only have 1 word matched to the incoming block 504.
[0155] In the technical solutions provided by the embodiments of the present application, the incoming block can be compressed according to several dictionary blocks (the first blocks or the second blocks), which improves the overall compression ratio of the incoming block.
[0156] After finishing all the steps above, the incoming block can be compressed according to the dictionary blocks (the first blocks or the second blocks). If the number of dictionary blocks is zero, the incoming block is compressed independently. If at least one dictionary block is effectively used, the resulting compressed data, prefixed by the block IDs of the effectively used dictionary blocks, is stored or transferred over a network. A matched part of the incoming block and a dictionary block can be denoted as (a, b), where a is the offset of the first byte of the matched strings and b is the total length of the matched strings. The coding result differs between compression algorithms, for example Huffman coding or arithmetic coding.
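As a sketch only, a match and the stored output can be represented by the structures below; the field and type names are assumptions, and the entropy-coding stage (Huffman or arithmetic coding) is deliberately left out.

    #include <cstdint>
    #include <vector>

    // One (a, b) reference into the dictionary buffer: 'offset' is the
    // position of the first matched byte, 'length' the total length of the
    // matched strings.
    struct MatchToken {
        uint64_t offset;
        uint64_t length;
    };

    // Hypothetical layout of a stored delta-compressed block: the IDs of the
    // effectively used dictionary blocks form the prefix, followed by the
    // match tokens and the literal bytes that matched nothing.
    struct DeltaCompressedBlock {
        std::vector<uint64_t> used_dictionary_block_ids;   // the block-ID prefix
        std::vector<MatchToken> matches;
        std::vector<uint8_t> literals;
    };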
[0157] FIG. 6 is a schematic diagram depicting a delta compression process. As shown in FIG. 6, for an incoming block 604, block 601, block 602 and block 603 are candidate dictionary blocks (may be the first blocks), and block 603 is the dictionary block (may be the second block). The incoming block prefixed with the block ID of block 603 is then transferred to the compression area for compression.
[0158] If more than one block is selected as a dictionary block to compress the incoming block 604, for example block 601 and block 603, then while doing the string matching, block 601 may be discarded as a dictionary block because the matched string “ZETTA” already exists in block 603.
[0159] Now that the incoming block is compressed, the key-value store is updated with the hash calculation result of the incoming block. FIG. 7 shows the key-value store updating result. Bucket 407 is a new bucket and keeps the association of a hash value Y and the record (10, 1). Bucket 401, bucket 403 and bucket 404 are updated with the records (10, 0), (10, 3) and (10, 2), respectively.
[0160] Then similar steps are repeated on the new incoming block (block 11). After all the blocks are processed, the data compression process is finished.
[0161] It should be noted that in the following cases the key-value store may not keep all the records of processed blocks:
[0162] (a) When a block is deleted, information about it is removed from the key-value store.
[0163] (b) If there is not enough memory (RAM) for the key-value store, some information may be purged, for example, but not limited to, information about blocks that have not been selected as dictionary blocks for a long time.
[0164] (c) If recursive dictionary blocks are not desired, information about delta-compressed blocks is not stored in the key-value store, because recursive dictionary blocks will impact the speed of delta-decompression.
[0165] FIG. 8 is a flowchart of an embodiment of a method 700 for data compression. The method includes the following steps.
[0166] S701, calculate hash values for one or multiple sets of bytes taken from chunks from an incoming block according to preset rules.
[0167] S702, find one or multiple candidate dictionaries for the incoming block according to the calculated hash values.
[0168] S703, prioritize candidate dictionaries by the number of hash values matching the incoming block.
[0169] S704, select one or multiple best candidate dictionaries for delta-compression followed by delta-compressing the incoming block.
[0170] S705, determine one or multiple effective dictionaries from the candidate dictionaries by the length of matched strings.
[0171] In S705, the length of matched strings is greater than a threshold.
[0172] S706, store or transmit the delta-compressed incoming block prefixed by IDs of effective dictionaries.
[0173] FIG. 9 is a flowchart of an embodiment of a method 800 for data compression.
The method 800 includes the following steps.
[0174] S801, divide an incoming block into fixed-sized chunks.
[0175] In this step, the incoming block can be divided into several fixed-sized chunks.
S802, select sets of bytes for hash calculation.
[0176] In this step, for one chunk, one or multiple sets of bytes may be selected for hash calculation. Selecting a small number of bytes instead of using the whole chunk for hash calculation reduces the computational costs involved in the compression process.
[0177] S803, calculate hash value(s) for each set of bytes.
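Steps S802 and S803 can be sketched together as follows. The gap pattern used for selecting non-consecutive bytes and the FNV-1a hash are stand-ins chosen for the example; the actual preset rule (e.g., the modified Fibonacci sequence) and hash function are defined by the embodiments, not by this sketch.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Select Y non-consecutive bytes from one X-byte chunk, starting at
    // offset0 and growing the gaps in a Fibonacci-like way (1, 2, 3, 5, ...).
    std::vector<uint8_t> select_bytes(const std::vector<uint8_t>& chunk,
                                      std::size_t offset0, std::size_t y) {
        std::vector<uint8_t> out;
        std::size_t pos = offset0, gap_a = 1, gap_b = 2;
        while (out.size() < y && pos < chunk.size()) {
            out.push_back(chunk[pos]);
            pos += gap_a;
            std::size_t next = gap_a + gap_b;
            gap_a = gap_b;
            gap_b = next;
        }
        return out;
    }

    // Hash the selected bytes (64-bit FNV-1a, used here only as an example).
    uint64_t hash_bytes(const std::vector<uint8_t>& bytes) {
        uint64_t h = 14695981039346656037ULL;    // FNV offset basis
        for (uint8_t b : bytes) {
            h ^= b;
            h *= 1099511628211ULL;               // FNV prime
        }
        return h;
    }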
[0178] S804, use hash values as keys to look up a key-value store, and collect (block ID, chunk ID) records with the same keys.
[0179] In this step, hash values are used as keys to look up the key-value store for the first blocks, which has been mentioned before.
[0180] S805, sort the (block ID, chunk ID) records to find the maximum number of records with the same block ID1, a smaller number of records with another block ID2, and so on.
[0181] In this step, the blocks found in step S804 can be ranked by the number of chunks.
[0182] In one optional embodiment, blocks whose number of chunks is less than a threshold could be discarded to improve the dictionary block quality.
[0183] In another optional embodiment, the block with a higher number of chunks could be given a higher priority. The similar chunks (chunks with the same hash values) could be purged in blocks with a lower priority, which will improve the candidate dictionary block quality.
[0184] S806, retrieve blocks to be used as dictionary blocks to compress the incoming block.
[0185] S807, prepare buffer for delta-compression: copy found dictionary blocks, and copy the incoming block.
[0186] In this step, the dictionary blocks and the incoming block are put into the buffer for compression.
[0187] S808, compress the incoming block in respect to the dictionary blocks.
[0188] S809, compress successfully?
[0189] S8091, update key-value store with hash values of the incoming block.
[0190] S8092, store a non-compressible incoming block.
[0191] In step S8091 and step S8092, the incoming block is non-compressible. For example, it may include image data that has already been compressed, so the incoming block cannot be compressed successfully. Still, it can be a candidate dictionary block for other blocks, so the key-value store needs to be updated with (block ID, chunk ID) records according to the hash values of the incoming block.
[0192] S810, are the dictionary blocks used?
[0193] In this step, the incoming block has been compressed successfully, yet the dictionary blocks may or may not have been used for the compression of the incoming block.
[0194] S8101, update the key-value store according to hash values of the incoming block.
[0195] S8102, store the compressed block as non-delta.
[0196] In step S8101 and step S8102, the incoming block is compressed successfully yet no dictionary block is used for the compression of the incoming block. The dictionary blocks may not have a matched length longer than a threshold, so the incoming block is compressed independently. The key-value store needs to be updated with the incoming block.
[0197] S811, discard unused dictionary blocks.
[0198] S812, store the compressed block as delta, prefix delta-compressed incoming block with used block IDs.
[0199] In this step, the incoming block is compressed successfully using one or multiple candidate dictionary blocks retrieved in step S806, which have matched strings longer than the threshold. Then the blocks not used can be discarded to improve the dictionary block quality. The compressed block is prefixed with used dictionary block IDs.
[0200] FIG. 10 is a schematic block diagram of an electronic device 900 according to an embodiment of this application. As shown in FIG. 10, the electronic device 900 includes: a dividing module 901, configured to divide an incoming block into multiple incoming chunks, wherein each of the multiple incoming chunks comprises X bytes; a selecting module 902, configured to select one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, wherein the one or multiple sets of bytes comprise non-consecutive Y bytes in original order selected from the X bytes; a calculating module 903, configured to calculate one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and a compressing module 904, configured to compress the incoming block according to the incoming hash values calculated by the calculating module for the multiple incoming chunks.
[0201] In the above-mentioned technical solution, the dividing module 901 is configured to divide the incoming block into one or multiple chunks. The selecting module 902 is configured to select one or multiple sets of bytes for hash calculation by the calculating module 903. The hash calculation result is used by the compressing module 904 to compress the incoming block. A small number of bytes, rather than the chunk as a whole, is selected for hash calculation, which reduces the computational costs involved in the compression process and improves the efficiency of compression.
[0202] In some embodiments, when a quantity of the sets of bytes is multiple for each incoming chunk, a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2m, wherein m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
[0203] In some embodiments, the preset rule includes: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies a modified Fibonacci sequence or a logarithmic function sequence.
[0204] In some embodiments, the compressing module 904 is specifically configured to: determine one or multiple first blocks according to the incoming hash values, wherein the one or multiple first blocks are processed blocks and each of the one or multiple first blocks comprises one or multiple first chunks with a hash value the same as one of the incoming hash values; and compress the incoming block based on the one or multiple first blocks.
[0205] In some embodiments, the compressing module 904 is configured to: determine the one or multiple first chunks based on the incoming hash values, wherein each of the one or multiple first chunks are processed chunks, and has the same hash value as one of the incoming hash values; and determine the one or multiple first blocks based on the one or multiple first chunks, wherein the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
[0206] In some embodiments, the one or multiple first chunks are determined based on a key-value store, wherein the key-value store keeps association of hash values and corresponding chunk identifiers and block identifiers of the processed chunks.
[0207] In some embodiments, the number of the one or multiple first chunks is greater than a first preset threshold.
[0208] In some embodiments, the electronic device further comprises a determining module, configured to determine one or multiple second blocks from the one or multiple first blocks according to a length of matched string between the one or multiple first blocks and the incoming block, where the length of matched strings is greater than a second preset threshold.
[0209] In some embodiments, the result of the compressed incoming block includes the block identifier of the one or multiple second blocks.
[0210] In some embodiments, the electronic device further comprises an updating module, configured to update the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
[0211] As shown in FIG. 11, an electronic device 1000 may include a transceiver 1001, a processor 1002, and a memory 1003. The memory 1003 may be configured to store code, instructions, or the like executed by the processor 1002.
[0212] It should be understood that the processor 1002 may be an integrated circuit chip and has a signal processing capability. In an implementation process, steps of the foregoing method embodiments may be completed by using a hardware integrated logic circuit in the processor, or by using instructions in a form of software. The processor may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, the steps, and the logical block diagrams that are disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed with reference to the embodiments of the present application may be directly performed and completed by a hardware decoding processor, or may be performed and completed by using a combination of hardware in the decoding processor and a software module. The software module may be located in a mature storage medium in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads information in the memory and completes the steps of the foregoing methods in combination with hardware in the processor.
[0213] It may be understood that the memory 1003 in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory. The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM) and is used as an external cache. By way of example rather than limitation, many forms of RAMs may be used, and are, for example, a static random access memory (Static RAM, SRAM), a dynamic random access memory (Dynamic RAM, DRAM), a synchronous dynamic random access memory (Synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), a synchronous link dynamic random access memory (Synchronous link DRAM, SLDRAM), and a direct ram bus random access memory (Direct Rambus RAM, DR RAM).
[0214] An embodiment of this application further provides a chip system, where the chip includes an input/output interface, at least one processor, and at least one memory. The at least one memory is configured to store instructions, and the at least one processor is configured to invoke the instructions of the at least one memory to perform operations performed by the electronic device in the methods in the foregoing embodiments.
[0215] An embodiment of this application further provides a computer storage medium, where the computer storage medium may store a program instruction for performing the steps in the foregoing methods.
[0216] Optionally, the storage medium may be specifically the memory 1003.
[0217] An embodiment of this application further provides a computer program product, where when the computer program product runs on an electronic device, the electronic device is enabled to perform the steps in the foregoing methods.
[0218] A person of ordinary skill in the art may be aware that, in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
[0219] It may be clearly understood by a person skilled in the art that, for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, refer to a corresponding process in the foregoing method embodiment. Details are not described herein again.
[0220] In the embodiments of this application, “at least one” means one or more, and “a plurality of” means two or more. The term “and/or” describes an association relationship between associated objects and represents that three relationships may exist. For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. The character “/” generally indicates an “or” relationship between the associated objects. “At least one of the following” and a similar expression thereof refers to any combination of these items, including any combination of one item or a plurality of items. For example, at least one of a, b, and c may indicate: a, b, c, a and b, a and c, b and c, or a, b, and c, where a, b, and c may be singular or plural.
[0221] In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely an example. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
[0222] The units described as separate parts may be or may not be physically separate, and parts displayed as units may be or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of the embodiments.
[0223] In addition, functional units in the embodiments of this application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
[0224] When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer readable storage medium. Based on such an understanding, the technical solutions in this application essentially, or the part contributing to the prior art, or some of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
[0225] The foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

CLAIMS
What is claimed is:
1. A method for data compression, comprising: dividing an incoming block into multiple incoming chunks, wherein each of the multiple incoming chunks comprises X bytes; selecting one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, wherein the one or multiple sets of bytes comprise non-consecutive Y bytes in original order selected from the X bytes; calculating one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and compressing the incoming block according to the incoming hash values for the multiple incoming chunks.
2. The method according to claim 1, wherein when a quantity of the sets of bytes is multiple for each incoming chunk, a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2m, wherein the m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
3. The method according to claim 2, wherein the preset rule comprises: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies a modified Fibonacci sequence or logarithmic function sequence.
4. The method according to any one of claims 1 to 3, wherein the compressing the incoming block according to the incoming hash values for the multiple incoming chunks, comprises: determining one or multiple first blocks according to the incoming hash values, wherein the one or multiple first blocks are processed blocks and each of the one or multiple first blocks comprises one or multiple first chunks with a hash value the same as one of the incoming hash values; and compressing the incoming block based on the one or multiple first blocks.
5. The method according to claim 4, wherein the determining one or multiple first blocks according to the incoming hash values comprises: determining the one or multiple first chunks based on the incoming hash values, wherein each of the one or multiple first chunks are processed chunks, and has a same hash value as one of the incoming hash values; and determining the one or multiple first blocks based on the one or multiple first chunks, wherein the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
6. The method according to claim 4 or 5, wherein the one or multiple first chunks are determined based on a key-value store, wherein the key-value store keeps association of hash values and corresponding chunk identifiers and block identifiers of the processed chunks.
7. The method according to any one of the claims 4 to 6, wherein a quantity of the one or multiple first chunks is greater than a first preset threshold.
8. The method according to any one of claims 4 to 7, wherein after the compressing the incoming block based on the one or multiple first blocks, the method comprises: determining one or multiple second blocks from the one or multiple first blocks according to a length of matched string between the one or multiple first blocks and the incoming block, wherein the length of matched strings is greater than a second preset threshold.
9. The method according to claim 8, wherein a result of the compressed incoming block comprises a block identifier of the one or multiple second blocks.
10. The method according to any one of claims 1 to 9, wherein the method further comprises: updating the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
11. An electronic device, comprising: a dividing module, configured to divide an incoming block into multiple incoming chunks, wherein each of the multiple incoming chunks comprises X bytes; a selecting module, configured to select one or multiple sets of bytes from the X bytes of each of the multiple incoming chunks according to a preset rule, wherein the one or multiple sets of bytes comprise non-consecutive Y bytes in original order selected from the X bytes; a calculating module, configured to calculate one or multiple incoming hash values based on the one or multiple sets of bytes for each of the multiple incoming chunks; and a compressing module, configured to compress the incoming block according to the incoming hash values calculated by the calculating module for the multiple incoming chunks.
12. The electronic device according to claim 11, wherein when a quantity of the sets of bytes is multiple for each incoming chunk, a difference of an offset0 position between a first set of bytes and any one of the other sets of bytes is 2m, wherein the m is a natural number unique for different sets of bytes, the first set of bytes is one of the multiple sets of bytes, and the offset0 position is a position of the first byte for each of the one or multiple sets of bytes.
13. The electronic device according to claim 11, wherein the preset rule comprises: a difference of a byte position between the non-consecutive Y bytes and the offset0 position satisfies a modified Fibonacci sequence or logarithmic function sequence.
14. The electronic device according to any one of claims 11 to 13, wherein the compressing module is configured to: determine one or multiple first blocks according to the incoming hash values, wherein the one or multiple first blocks are processed blocks and each of the one or multiple first blocks comprises one or multiple first chunks with a hash value the same as one of the incoming hash values; and compress the incoming block based on the one or multiple first blocks.
15. The electronic device according to claim 14, wherein the compressing module is configured to: determine the one or multiple first chunks based on the incoming hash values, wherein each of the one or multiple first chunks are processed chunks, and has the same hash value as one of the incoming hash values; and determine the one or multiple first blocks based on the one or multiple first chunks, wherein the one or multiple first blocks are the blocks from which the one or multiple first chunks are divided.
16. The electronic device according to claim 14 or 15, wherein the one or multiple first chunks are determined based on a key-value store, wherein the key-value store keeps association of hash values and corresponding chunk identifiers and block identifiers of the processed chunks.
17. The electronic device according to any one of the claims 14 to 16, wherein a quantity of the one or multiple first chunks is greater than a first preset threshold.
18. The electronic device according to any one of claims 14 to 17, wherein the electronic device further comprises a determining module, configured to: determine one or multiple second blocks from the one or multiple first blocks according to a length of matched string between the one or multiple first blocks and the incoming block, wherein the length of matched string is greater than a second preset threshold.
19. The electronic device according to claim 18, wherein the result of the compressed incoming block comprises a block identifier of the one or multiple second blocks.
20. The electronic device according to any one of the claims 11 to 19, wherein the electronic device further comprises an updating module, configured to update the key-value store with the chunk identifiers, block identifiers and corresponding incoming hash values of the incoming chunks.
21. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions, and when the instructions run on a device, the device is enabled to perform the method according to any one of claims 1 to 10.
22. A computer program product, wherein when the computer program product runs on a device, the device is enabled to perform the method according to any one of claims 1 to 10.
23. A chip system, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that a device on which the chip system is disposed performs the method according to any one of claims 1 to 10.
PCT/RU2022/000246 2022-08-02 2022-08-02 Method for data compression and electronic device Ceased WO2024030041A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/RU2022/000246 WO2024030041A1 (en) 2022-08-02 2022-08-02 Method for data compression and electronic device
EP22765251.8A EP4508755A1 (en) 2022-08-02 2022-08-02 Method for data compression and electronic device
CN202280095524.6A CN119111038B (en) 2022-08-02 2022-08-02 Data compression method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2022/000246 WO2024030041A1 (en) 2022-08-02 2022-08-02 Method for data compression and electronic device

Publications (1)

Publication Number Publication Date
WO2024030041A1

Family

ID=83193568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2022/000246 Ceased WO2024030041A1 (en) 2022-08-02 2022-08-02 Method for data compression and electronic device

Country Status (3)

Country Link
EP (1) EP4508755A1 (en)
CN (1) CN119111038B (en)
WO (1) WO2024030041A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060059207A1 (en) * 2004-09-15 2006-03-16 Diligent Technologies Corporation Systems and methods for searching of storage data with reduced bandwidth requirements
US20120016882A1 (en) * 2010-07-19 2012-01-19 Quantum Corporation Delta chunks and delta hashes
WO2022139626A1 (en) * 2020-12-22 2022-06-30 Huawei Technologies Co., Ltd. Method for storing a data page in a data storage device using similarity based data reduction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103997346B (en) * 2014-05-12 2017-02-15 东南大学 Data matching method and device based on assembly line
CN105868305B (en) * 2016-03-25 2019-03-26 西安电子科技大学 A kind of cloud storage data deduplication method for supporting fuzzy matching


Also Published As

Publication number Publication date
CN119111038B (en) 2025-11-28
EP4508755A1 (en) 2025-02-19
CN119111038A (en) 2024-12-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22765251

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280095524.6

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 2022765251

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022765251

Country of ref document: EP

Effective date: 20241115

NENP Non-entry into the national phase

Ref country code: DE