CN108121810A - A kind of data duplicate removal method, system, central server and distributed server - Google Patents
- Publication number
- CN108121810A (application number CN201711434983.XA)
- Authority
- CN
- China
- Prior art keywords
- hash value
- processed
- current data
- data
- hash
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
The embodiment of the invention discloses a data deduplication method and system, a central server and a distributed server. The method includes: the central server performs a hash calculation on the current data to be processed using a first hash algorithm to obtain a first hash value, and distributes the first hash value corresponding to the current data to be processed to the corresponding distributed server; the distributed server then performs a hash calculation on that first hash value using a second hash algorithm to obtain a corresponding second hash value; finally, the distributed server performs deduplication processing on the current data to be processed according to its second hash value. In the embodiment of the invention, having the central server calculate the first hash value of the data to be processed with the first hash algorithm and distribute it to the corresponding distributed server solves the problem of data loss caused by a distributed server going down during data deduplication, and thereby improves the accuracy of data deduplication.
Description
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data deduplication method, a data deduplication system, a central server and a distributed server.
Background
Data deduplication is an important component in the field of big data preprocessing, and the performance of data deduplication and the accuracy of data deduplication directly influence the performance, resources and data accuracy of numerous modules such as data preprocessing and later data mining, cleaning, analysis and storage.
Current data deduplication is performed with a bloom filter (BloomFilter) backed by Redis memory. Since both the BloomFilter and Redis support a bit-level storage format, the Redis memory and the BloomFilter's filtering function can be combined. Specifically, the Redis memory provides the storage space for the BloomFilter; a Hash function calculates the bit positions corresponding to each piece of to-be-processed data in the BloomFilter, the calculated positions are set to 1, and the values of the bits at the corresponding positions in the Redis BitMap are compared, thereby implementing data filtering and deduplication in a distributed cluster environment.
In the prior art, the Redis-based BloomFilter deduplication method is mainly applied in crawler framework scripts (Python) to deduplicate network Uniform Resource Locators (URLs). When deduplicating web-crawler data, the crawled records enter the BloomFilter one by one for filtering: multiple Hash function calculations are performed in the BloomFilter, the positions of the corresponding bits are determined from the Hash results and set to 1, and the values of the bits at the corresponding positions in the BitMap in Redis memory are compared, thereby filtering and deduplicating the data.
However, when deduplicating, the script plug-in in current web crawlers does not consider the case where a distributed server goes down. If a distributed server goes down, the data being filtered in that server is lost, which affects the accuracy of database deduplication.
Disclosure of Invention
The embodiment of the invention provides a data deduplication method, a data deduplication system, a central server and a distributed server, which can prevent the influence on data deduplication when the distributed server is down, and further can improve the accuracy of data deduplication.
In a first aspect, an embodiment of the present invention provides a data deduplication method, including:
the central server performs hash calculation on the current data to be processed by adopting a first hash algorithm to obtain a first hash value corresponding to the current data to be processed, and distributes the first hash value corresponding to the current data to be processed to the corresponding distributed server;
the distributed server performs hash calculation on a first hash value corresponding to the current data to be processed by adopting a second hash algorithm to obtain a second hash value corresponding to the current data to be processed;
and the distributed server performs duplicate removal processing on the current data to be processed according to a second hash value corresponding to the current data to be processed.
In a second aspect, an embodiment of the present invention further provides a central server, where the central server includes: the system comprises a first calculation module and a distribution module; wherein,
the first calculation module is used for performing hash calculation on the current data to be processed by adopting a first hash algorithm to obtain a first hash value corresponding to the current data to be processed;
the distribution module is used for distributing the first hash value corresponding to the current data to be processed to the corresponding distributed server.
In a third aspect, an embodiment of the present invention further provides a distributed server, where the distributed server includes: a second calculation module and a deduplication module; wherein,
the second calculation module is used for performing hash calculation on a first hash value corresponding to the current data to be processed by adopting a second hash algorithm to obtain a second hash value corresponding to the current data to be processed;
and the duplication elimination module is used for carrying out duplication elimination processing on the current data to be processed according to the second hash value corresponding to the current data to be processed.
In a fourth aspect, an embodiment of the present invention further provides a data deduplication system, where the system includes: a central server and a distributed server; wherein,
the central server is used for performing hash calculation on the current data to be processed by adopting a first hash algorithm, acquiring a first hash value corresponding to the current data to be processed, and distributing the first hash value corresponding to the current data to be processed to the corresponding distributed server;
the distributed server is used for performing hash calculation on a first hash value corresponding to the current data to be processed by adopting a second hash algorithm to obtain a second hash value corresponding to the current data to be processed; and performing deduplication processing on the current data to be processed according to the second hash value corresponding to the current data to be processed.
The embodiment of the invention provides a data deduplication method and system, a central server and a distributed server. The central server uses a first hash algorithm to obtain a first hash value corresponding to the current data to be processed and distributes that value to the corresponding distributed server; the distributed server uses a second hash algorithm to perform a hash calculation on the first hash value to obtain a second hash value, and then deduplicates the current data to be processed according to the second hash value. The existing deduplication method applies the second hash algorithm directly to the data to be processed to obtain the second hash value, after which the distributed servers deduplicate the data. Compared with that prior art, having the central server compute the first hash value with the first hash algorithm and distribute it to the corresponding distributed server prevents a distributed-server outage from affecting data deduplication, and thus improves the accuracy of data deduplication; moreover, the technical scheme of the embodiment of the invention is simple to implement, easy to popularize, and widely applicable.
Drawings
Fig. 1A is a flowchart of a data deduplication method according to an embodiment of the present invention;
fig. 1B is a schematic diagram of distributed storage allocation corresponding to a first hash algorithm in a data deduplication method according to an embodiment of the present invention;
fig. 1C is a data flow diagram of the deduplication filtering implemented inside a distributed server in the data deduplication method according to the first embodiment of the present invention;
fig. 2 is a schematic structural diagram of a central server according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a distributed server according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a data deduplication system according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before proceeding with the detailed description, the related terms involved in the present invention will be explained; all of them are known and understood by those skilled in the art. A Hash algorithm is a one-way cryptosystem: there is only an encryption process and no decryption process. A Hash function maps input of any length to output of a fixed length; this one-way property and the fixed length of the output make Hash functions suitable for generating fingerprints of messages or data.
A server cluster is a solution for improving the overall computing power of servers: a parallel or distributed system formed by servers connected together. In the computer field, there are two major directions for mass data processing. The first direction is centralized computing, which increases the computing power of a single computer by adding processors, thereby increasing the speed of processing data. The second direction is distributed computing: a group of computers connected through a network form a distributed system; a large amount of data to be processed is split into several parts and sent to the computers in the distributed system for simultaneous computation, and the partial results are finally merged into the final result. The servers performing distributed computing are called distributed servers. Although the computing power of a single machine in a distributed system is not strong, the overall speed of processing data is much higher than that of a single computer, because each machine computes only a portion of the data and many machines compute simultaneously.
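The scatter/compute/merge flow described above can be sketched as follows. This is a minimal illustration only, using threads in place of networked machines; the function names are hypothetical and not part of the patent:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each "server" deduplicates only its own portion of the data.
    return set(chunk)

def distributed_dedup(data, workers=4):
    # Scatter: split the data into one chunk per worker.
    chunks = [data[i::workers] for i in range(workers)]
    # Compute: every chunk is processed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Gather: merge the partial results into the final answer.
    merged = set()
    for partial in partials:
        merged |= partial
    return merged

print(sorted(distributed_dedup([1, 2, 2, 3, 3, 3, 4])))  # [1, 2, 3, 4]
```

In a real cluster the chunks would travel over the network to distributed servers rather than to local threads, but the split/parallel-compute/merge structure is the same.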
A Virtual Machine (Virtual Machine) is a special kind of software in computer architecture. It creates an environment between a computer platform and an end user in which software can operate; a virtual machine is a software implementation of a computer that can run programs like a real machine. In the computer field, data transmission and interaction between different servers are realized through nodes.
A Bit (Bit) is the smallest storage unit of a computer; a bit takes the value 0 or 1. Word size (also called word length) refers to the number of binary digits a processor can handle in one operation. Computer word sizes include 8, 16, 32 and 64 bits; 8 bits (Bit) are commonly called one Byte, 16 bits one Word, 32 bits one double word, and 64 bits two double words. A bit array is a fixed-length collection of bits; for example, a bit array 10000 bits long can store exactly 10000 bit values.
Example one
Fig. 1A is a flowchart of a data deduplication method according to an embodiment of the present invention. The present embodiment is applicable to a situation where a data deduplication system performs deduplication filtering on massive data, and the method may be executed by a data deduplication device, and the data deduplication device may be implemented in a software and/or hardware manner. Referring to fig. 1A, the method specifically includes the steps of:
s110, the central server performs hash calculation on the current data to be processed by adopting a first hash algorithm, obtains a first hash value corresponding to the current data to be processed, and distributes the first hash value corresponding to the current data to be processed to the corresponding distributed server.
In an embodiment of the present invention, the data to be processed refers to data that needs deduplication filtering. The central server is the server that preprocesses the data to be processed before it enters the servers that perform the deduplication filtering. The first Hash algorithm refers to consistent Hashing: a distributed Hash algorithm that performs failover in a multi-server environment, improves system availability, and corrects the problems caused by a simple modulo Hash. The consistent Hash algorithm introduces the concept of a ring-shaped Hash space; the placement of server nodes and the placement of data are two independent processes, the association between data and server nodes is not established directly through the Hash value, and a change in one node does not affect the whole distributed system. The first Hash value is the Hash value obtained after the data to be processed undergoes the consistent Hash calculation, and it is mapped into the ring-shaped Hash space. A distributed server is a server used to filter and deduplicate the data to be processed; the distributed servers may be several servers built from real machines or virtual machines.
For example, before data to be deduplicated enters the distributed servers for filtering, it is first preprocessed by the central server. The purpose of this preprocessing is to prevent data loss when a distributed server goes down; at the same time, thanks to the virtual nodes, the load among the server cluster can be further balanced. The central server preprocesses the data with the first Hash algorithm, i.e., the consistent Hash calculation, which yields a first Hash value for each piece of data. The first Hash value obtained from the consistent Hash calculation over each piece of data to be processed (the whole record or a specific field, depending on the business scenario) serves as that record's data identifier, and this identifier is uniquely determined.
Optionally, the central server determines a target hash value interval in which the first hash value corresponding to the current data to be processed is located according to the first hash value corresponding to the current data to be processed and predetermined hash value intervals corresponding to the distributed servers;
the central server acquires the working state of the distributed server corresponding to the target hash value interval; wherein, operating condition includes: a normal working state and an abnormal working state;
and when the working state of the distributed server corresponding to the target hash value interval is a normal working state, the central server distributes the first hash value corresponding to the current data to be processed to the distributed server corresponding to the target hash value interval.
The predetermined Hash value interval corresponding to each distributed server means that each distributed server obtains a Hash value through the consistent Hash algorithm and is thereby placed onto a ring-shaped Hash space of size 0 to 2^32; the interval is determined by the server's position on that ring. The target Hash value interval means that each piece of data preprocessed by the central server obtains a unique data identifier that is mapped into the ring-shaped Hash space; different first Hash values map to different positions and therefore to different Hash value intervals.
As an example, applying the first Hash algorithm to the data to be processed can be understood as follows: the data undergoes the consistent Hash calculation to obtain a first Hash value, and the data is distributed according to that value. Different intervals of the ring-shaped Hash space are preset; different first Hash values map to different intervals, and the intervals correspond to different distributed servers, so different pieces of data are distributed to different distributed servers.
The central server also acquires the working state of the distributed server corresponding to the target hash value interval. A distributed server is either in a normal or an abnormal working state. In the normal state, the server works normally and deduplicates the data that enters it. If the server is in an abnormal state, the data being deduplicated in it cannot be filtered normally; the data then circulates clockwise around the closed ring-shaped Hash space and is distributed to the next adjacent distributed server for filtering.
Fig. 1B is a schematic diagram of the distributed storage allocation corresponding to the first hash algorithm in the data deduplication method according to an embodiment of the present invention. In the figure, node A1, node A2, node A3, node B1, node B2, node C1 and node C2 are virtual nodes, and each virtual node corresponds to a distributed server: machine node A carries the data of virtual nodes A1, A2 and A3; machine node B carries B1 and B2; machine node C carries C1 and C2. When each piece of data to be processed is stored in a distributed manner, its first Hash value obtained from the consistent Hash calculation corresponds to a position on the ring. If key1 maps to the position shown in the figure, the machine node found clockwise is NodeB2; if NodeB2 is in a normal working state, key1 is stored on NodeB2. If NodeB2 is in an abnormal working state, the data it is processing and the data queued to enter it circulate clockwise and are distributed to the next adjacent distributed server, NodeA2, for filtering.
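The ring lookup and clockwise failover described above can be sketched as follows. This is a minimal illustration under stated assumptions: the use of MD5, the node names, and the virtual-node count are hypothetical choices, not mandated by the patent:

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # the ring spans hash values 0 .. 2^32 - 1

def ring_hash(key):
    # Map any string onto the ring (MD5 here is illustrative only).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=3):
        # Each physical node is placed on the ring several times
        # ("virtual nodes") to even out the load across machines.
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.down = set()  # nodes currently in the abnormal state

    def locate(self, key):
        # Walk clockwise from the key's position to the first live node.
        points = [p for p, _ in self.ring]
        idx = bisect.bisect_right(points, ring_hash(key)) % len(self.ring)
        for step in range(len(self.ring)):
            node = self.ring[(idx + step) % len(self.ring)][1]
            if node not in self.down:
                return node
        raise RuntimeError("no live node on the ring")

ring = ConsistentHashRing(["NodeA", "NodeB", "NodeC"])
primary = ring.locate("key1")
ring.down.add(primary)          # simulate that server going down
fallback = ring.locate("key1")  # the key moves to the next live node
assert fallback != primary
```

Because only the keys between a failed node and its predecessor move, the rest of the distributed system is unaffected, which is the property the text attributes to consistent Hashing.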
Optionally, when the working state of the distributed server corresponding to the target hash value interval is an abnormal working state, the central server determines a next target hash value interval according to the target hash value interval;
and the central server distributes the first hash value corresponding to the current data to be processed to the distributed server corresponding to the next target hash value interval.
The next target hash value interval is used when the distributed server corresponding to the current target interval has failed; the search proceeds clockwise, which is equivalent to extending the original target hash value interval to the next adjacent interval.
For example, taking fig. 1B: if key1 corresponds to the position shown in the figure and the machine node found clockwise, NodeB2, is in a normal working state, key1 is stored on NodeB2. If NodeB2 is in an abnormal working state and cannot perform the deduplication, the clockwise search restarts from the position key1 maps to, and key1 is saved to the first distributed server found, namely NodeA2 in fig. 1B; if no distributed server is found even after passing position 2^32, the data is saved to the first distributed server, NodeA1. Because the virtual nodes are numerous and evenly distributed, even if one node goes down the system will not crash from excessive pressure on a single node.
Illustratively, consider deduplicating a specific record, say the Chinese character record "China". First the first hash algorithm is applied in the central server: the key "key1" of the record "China" is obtained, and suppose the first hash value of key1 is "110". The value "110" maps to the position of key1 in fig. 1B; searching clockwise finds node NodeB2, and if NodeB2 is in a normal working state, key1 is stored on NodeB2. That is, the record "China" is distributed to the distributed server NodeB2 and enters NodeB2 for deduplication filtering. If NodeB2 is in an abnormal working state and cannot filter, the clockwise search restarts from the position key1 maps to, and key1 is saved to the first distributed server found, NodeA2 in fig. 1B: the record "China" is distributed to NodeA2 and enters it for deduplication. If no distributed server is found even after passing position 2^32, the record is saved to the first distributed server, NodeA1, and the record "China" enters NodeA1 for deduplication.
And S120, the distributed server performs hash calculation on the first hash value corresponding to the current data to be processed by adopting a second hash algorithm to obtain a second hash value corresponding to the current data to be processed.
In an embodiment of the present invention, the filtering and deduplication inside a distributed server are implemented with a bloom filter (BloomFilter). The basic idea of a BloomFilter is to map a data element to positions in a bit array (Bit array) through Hash functions and to decide whether the element already exists in the data set by checking whether the values at those positions are 1. The second Hash algorithm consists of several preset Hash functions inside the BloomFilter of each distributed server, which perform random Hash calculations on the first Hash value in order to avoid and reduce hash collisions. The second Hash value refers to the several Hash values obtained by running the first Hash value through these Hash functions; they are the positions in the bit array that the BloomFilter should set to 1.
Fig. 1C is taken as an example; it is a data flow diagram of the deduplication filtering implemented inside a distributed server in the data deduplication method provided in the embodiment of the present invention. First, an M-bit array is created with all bits initialized to 0, and K different hash functions are selected. The i-th hash function maps a data identifier "str" to h(i, str), where h(i, str) ranges from 0 to M-1. An empty BloomFilter is an M-bit array whose bits are all 0, with K different hash functions defined; each element hashes to positions within M, uniformly and randomly distributed. Typically K is a constant much smaller than M, and M is proportional to the number of elements to be inserted. For the string "str", h(1, str), h(2, str), ..., h(K, str) are calculated, and the corresponding bits of the bit array are then set to 1.
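The M-bit array with K hash functions can be sketched as follows. This is an illustrative simulation: salting a single SHA-256 digest stands in for the K independent hash functions, which the patent does not specify concretely:

```python
import hashlib

M = 18  # bit-array length, matching the M = 18 example in the text
K = 4   # number of hash functions, matching K = 4

def positions(value, m=M, k=K):
    # Derive the K positions h(1, str) .. h(K, str); one salted
    # digest per index stands in for K independent hash functions.
    return [
        int(hashlib.sha256(f"{i}:{value}".encode()).hexdigest(), 16) % m
        for i in range(k)
    ]

class BloomFilter:
    def __init__(self, m=M):
        self.bits = [0] * m  # all bits start at 0

    def add(self, value):
        for pos in positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # If ANY of the K bits is 0, the value was definitely never added.
        return all(self.bits[pos] for pos in positions(value))

bf = BloomFilter()
assert not bf.might_contain("str")
bf.add("str")
assert bf.might_contain("str")
```

Note the asymmetry: a negative answer is certain, while a positive answer may be a false positive, which is exactly the failure mode discussed next.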
It should be noted that, because the length of the bit array is finite, more and more bits are set to 1 as the amount of processed data grows. The bit positions of different records may coincide after the multiple Hash calculations, and once the bit occupancy exceeds a certain proportion, false positives occur. When too much data is stored in the BloomFilter, only very few positions remain 0. Suppose one record is the Arabic numeral "1000" and the 4 Hash functions yield B4, B1, B9 and B15, so those four positions are set to 1. Suppose another record, the English letters "abc", yields B6, B19, B2 and B13, so those four positions are also set to 1. Now, with both "1000" and "abc" stored in the BloomFilter, the to-be-processed record, the Chinese character "China", yields B6, B1, B9 and B13 after the 4 Hash functions. Since B6, B1, B9 and B13 are all already 1, the BloomFilter judges that "China" already exists among the processed data and filters it out, even though "China" was never actually recorded: a false positive. The more of the BloomFilter's storage space is occupied, the higher the false positive rate. Therefore, for an ordinary BloomFilter, the storage space must be cleaned regularly to prevent an excessive false positive rate from harming the accuracy of deduplication.
Assume the storage space of the BloomFilter is an M-bit array with K different hash functions defined.
The false positive rate is derived as follows:
Assuming each Hash function selects every one of the M bit positions with equal probability, the probability that a given bit is not set to 1 by one Hash function is: 1 - 1/M.
With K Hash functions, the probability that a given bit is not set to 1 is: (1 - 1/M)^K.
With K Hash functions and n elements inserted into the BloomFilter, the probability that a given bit is still not set to 1 is: (1 - 1/M)^(Kn) ≈ e^(-Kn/M).
The false positive rate can then be approximated as: P ≈ (1 - (1 - 1/M)^(Kn))^K ≈ (1 - e^(-Kn/M))^K.
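The derivation above can be checked numerically; the symbols M, K and n are as in the text, and the concrete values below are arbitrary examples:

```python
import math

def false_positive_rate(m, k, n):
    # Exact form from the derivation: (1 - (1 - 1/M)^(K*n))^K
    return (1 - (1 - 1 / m) ** (k * n)) ** k

def false_positive_rate_approx(m, k, n):
    # Exponential approximation: (1 - e^(-K*n/M))^K
    return (1 - math.exp(-k * n / m)) ** k

# With a generously sized bit array the rate stays small,
# and the exact and approximate forms agree closely.
exact = false_positive_rate(m=10000, k=4, n=1000)
approx = false_positive_rate_approx(m=10000, k=4, n=1000)
assert abs(exact - approx) < 1e-3
assert exact < 0.05
```

Shrinking M while holding K and n fixed drives the rate toward 1, which matches the text's observation that a fuller bit array produces more false positives.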
for example, fig. 1C only shows the BloomFilter internal implementation data flow diagram when M is 18 and K is 4. The center server carries out consistent Hash calculation on the data Chinese character to be processed to obtain a first Hash value 110, and the first Hash value 110 is distributed to a BloomFilter of a distributed server NodeB2 to filter out the weight of the distributed server. The BloomFilter performs 4 Hash function calculations on the '110' to obtain a string of Bit positions, the obtained results are B6, B1, B9 and B13, the values of the four positions are all set to be 1, and the result is a second Hash value obtained by the first Hash value '110' of the Chinese character 'China' of the data to be processed.
S130, the distributed server performs duplicate removal processing on the current data to be processed according to a second hash value corresponding to the current data to be processed.
In a specific embodiment of the present invention, the BloomFilter in a distributed server deduplicates as follows. As in the example above, for the string "str", h(1, str), h(2, str), ..., h(K, str) have been calculated; it is then checked whether all of those bits of the bit array are 1. If any one of them is not 1, "str" has definitely not been recorded; if all of them are 1, the string "str" is considered to already exist.
Illustratively, in the above example, the BloomFilter in the distributed server obtains the second hash value after the second hash calculation, that is, the positions in the Bit array that should be set to 1. If checking shows that all the corresponding positions are already set to 1, the current data already exists and must be discarded. If any of the corresponding positions is not set to 1, the current data has not been recorded; it must be stored, and the corresponding Bit positions set to 1.
Optionally, the distributed server determines whether a second hash value corresponding to the current data to be processed exists in the BitMap of the Redis memory;
when a second hash value corresponding to the current data to be processed does not exist in the BitMap of the Redis memory, the distributed server stores the second hash value corresponding to the current data to be processed into the BitMap of the Redis memory;
and when a second hash value corresponding to the current data to be processed exists in the BitMap of the Redis memory, the distributed server discards the current data to be processed.
The Redis memory is the memory of a Redis database, and it supports a bit-level data storage format. The BitMap is a Redis storage structure that accesses elements as a bit array; when judging whether an element exists, elements are accessed at the granularity of single bits, which greatly saves storage space and is suitable for lookup, deletion and membership tests on massive data. In the bit encoding, bit numbers increase from right to left: the first bit is numbered 0 and the most significant bit is numbered N-1, as can be understood with reference to fig. 1C and the foregoing.
Illustratively, the BloomFilter in the distributed server is interfaced with the Redis memory. The Redis memory provides storage space for the BloomFilter: the bit positions corresponding to the data to be processed are calculated in the BloomFilter using hash functions, the values of the bits at the corresponding positions in the BitMap of the Redis memory are compared, and the calculated positions are then set to 1, thereby realizing data filtering in a distributed cluster environment.
For example, still taking the Chinese character "China" as the data to be processed, the BloomFilter in the distributed server performs a second hash calculation on the first hash value "110" to obtain second hash values identifying four bit positions B6, B1, B9 and B13. The BloomFilter checks the values of these four bits in the BitMap of the Redis memory and judges whether they are all 1. If B6, B1, B9 and B13 are all 1, the character "China" already exists in the Redis memory, and the occurrence currently being processed needs to be filtered out; if any of B6, B1, B9 and B13 is 0, the character "China" does not yet exist in the Redis memory, it needs to be stored, and all four bit positions B6, B1, B9 and B13 are set to 1.
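A minimal sketch of this BitMap interaction, using an in-memory stand-in for Redis's SETBIT/GETBIT commands (a real deployment would issue redis-py's `setbit`/`getbit` against a server; the `BitMap` stand-in and the `is_duplicate` helper are assumptions for illustration):

```python
class BitMap:
    """In-memory stand-in mirroring Redis's SETBIT/GETBIT semantics."""

    def __init__(self):
        self.data = bytearray()

    def setbit(self, offset, value):
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self.data):          # Redis grows the string on demand
            self.data.extend(b"\x00" * (byte_index - len(self.data) + 1))
        mask = 1 << (7 - bit_index)               # Redis numbers bits from the MSB of each byte
        if value:
            self.data[byte_index] |= mask
        else:
            self.data[byte_index] &= ~mask

    def getbit(self, offset):
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self.data):
            return 0
        return (self.data[byte_index] >> (7 - bit_index)) & 1

def is_duplicate(bitmap, positions):
    """Filter step from the example: duplicate only if every bit is already 1;
    otherwise record the element by setting its bits."""
    if all(bitmap.getbit(p) for p in positions):
        return True
    for p in positions:
        bitmap.setbit(p, 1)
    return False

bm = BitMap()
positions = [6, 1, 9, 13]            # B6, B1, B9, B13 from the example
print(is_duplicate(bm, positions))   # False: "China" is stored, its bits are set
print(is_duplicate(bm, positions))   # True: all four bits are 1, filter it out
```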
For example, when data filtering is performed in the distributed servers and the target is mass data transmitted by a real-time processing system, the data is transmitted to the distributed servers in batches, and the BloomFilter in each distributed server filters the data piece by piece. When virtual machines are configured on the machines, a large number of distributed servers are generated and the BloomFilters in all of them work simultaneously, producing data in batches; if this data were written into the Redis memory one piece at a time, a large amount of memory resources would be occupied and the read-write speed of the Redis memory would slow down. Therefore, the Redis pipeline technique is used: different commands are classified and submitted in batches, and data belonging to the same command is placed in the same pipeline and submitted to the Redis memory in one batch, which improves the performance of the Redis memory and reduces memory-resource consumption.
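The batching pattern can be sketched as follows, with a tiny in-memory stub standing in for the Redis server (the `FakeRedis` and `Pipeline` names are assumptions made for this sketch; redis-py exposes the same queue-then-`execute()` shape via `redis.Redis().pipeline()`):

```python
class FakeRedis:
    """In-memory stub standing in for a Redis server, so the batching
    pattern can be shown without a live connection."""

    def __init__(self):
        self.bits = {}

    def setbit(self, key, offset, value):
        self.bits.setdefault(key, set())
        if value:
            self.bits[key].add(offset)
        else:
            self.bits[key].discard(offset)

    def pipeline(self):
        return Pipeline(self)

class Pipeline:
    """Buffers commands and submits them to the server in one batch,
    mirroring redis-py's pipeline(): queue with setbit(), flush with execute()."""

    def __init__(self, server):
        self.server = server
        self.commands = []

    def setbit(self, key, offset, value):
        self.commands.append((key, offset, value))  # queued, not yet sent
        return self

    def execute(self):
        # One round trip instead of len(self.commands) round trips.
        for key, offset, value in self.commands:
            self.server.setbit(key, offset, value)
        n = len(self.commands)
        self.commands.clear()
        return n

r = FakeRedis()
pipe = r.pipeline()
for offset in [6, 1, 9, 13]:         # the four bit positions from the example
    pipe.setbit("bitmap", offset, 1)
sent = pipe.execute()                # all SETBITs submitted in a single batch
print(sent, sorted(r.bits["bitmap"]))  # prints: 4 [1, 6, 9, 13]
```

The saving comes from amortizing the network round trip: one flush carries many commands, which is what keeps the Redis memory responsive under batch writes from many BloomFilters.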
It should be noted that the batch commit technique of the Redis memory may be used selectively and adjusted adaptively according to the actual situation, which is not limited herein.
In the technical scheme of this embodiment, the central server performs hash calculation on the current data to be processed by using a first hash algorithm to obtain a first hash value corresponding to the current data to be processed, and then distributes the first hash value to the distributed server corresponding to the first hash value; the distributed server performs hash calculation on the first hash value by using a second hash algorithm to obtain a corresponding second hash value, and performs deduplication processing on the current data to be processed according to the second hash value. Therefore, compared with the prior art, because the data to be processed are hashed with the first hash algorithm and distributed to the corresponding distributed servers, data loss when a distributed server goes down can be prevented and the accuracy of data deduplication improved; setting up virtual machines to increase the number of distributed servers effectively balances the overall load of the distributed servers and improves server performance; the batch submission technique of the Redis memory suits the data deduplication needs of a real-time system; moreover, the technical scheme of the embodiment of the invention is simple and convenient to implement, easy to popularize, and has a wide application range.
Example two
Fig. 2 is a schematic structural diagram of a central server according to a second embodiment of the present invention, illustrating a block diagram of an exemplary central server suitable for implementing embodiments of the present invention. The central server shown in fig. 2 is only an example and should not limit the functions or scope of use of the embodiments of the present invention.
As shown in fig. 2, the central server includes: a first calculation module 21 and an assignment module 22, wherein,
the first calculating module 21 is configured to perform hash calculation on current data to be processed by using a first hash algorithm, and obtain a first hash value corresponding to the current data to be processed;
and the allocating module 22 is configured to allocate the first hash value corresponding to the current data to be processed to the corresponding distributed server.
Illustratively, the central server typically includes a variety of computer system readable media. Such media may be any available media that is accessible by the central server and includes both volatile and nonvolatile media, removable and non-removable media.
Wherein, the central server can be represented in the form of a general computing device, and the components of the central server can include but are not limited to: one or more processors or processing units, a system memory, and a bus connecting the various system components (including the system memory and the processing units). A bus represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. The system memory may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) and/or cache memory. The central server may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, the storage system may be used to read from and write to non-removable, nonvolatile magnetic media (commonly referred to as "hard disk drives"). Although not shown in FIG. 2, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus by one or more data media interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules, such as a first computing module 21 and an assignment module 22, that are configured to perform the functions of embodiments of the invention. 
A program/utility having a set (at least one) of program modules may be stored, for example, in the memory; such program modules include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. The program modules generally perform the functions and/or methodologies of the described embodiments of the invention. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the central server, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others. The processing unit executes various functional applications and data processing by executing programs stored in the system memory, for example, to implement the data deduplication method provided by the embodiment of the present invention.
Further, the distribution module 22 includes a determination unit 221, an acquisition unit 222, and a distribution unit 223; wherein,
the determining unit 221 is configured to determine, according to the first hash value corresponding to each piece of data to be processed and the predetermined hash value interval corresponding to each distributed server, a target hash value interval corresponding to each first hash value;
an obtaining unit 222, configured to obtain a working state of a distributed server corresponding to the target hash value interval; wherein, operating condition includes: a normal working state and an abnormal working state;
and the allocating unit 223 is configured to, when the working state of the distributed server corresponding to the target hash value interval is a normal working state, allocate the first hash value corresponding to the current data to be processed to the distributed server corresponding to the target hash value interval.
Further, the determining unit 221 is further configured to determine, when the working state of the distributed server corresponding to the target hash value interval is an abnormal working state, a next target hash value interval according to the target hash value interval;
the allocating unit 223 is further configured to allocate the first hash value corresponding to each to-be-processed data to the distributed server corresponding to the next target hash value interval.
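A hypothetical sketch of how the determining, obtaining and allocating units could cooperate (the interval bounds, server names and working states below are invented for illustration; the patent does not specify concrete values):

```python
# First-hash intervals, one per distributed server (illustrative assumptions).
INTERVALS = [
    (0, 99, "server-A"),
    (100, 199, "server-B"),
    (200, 299, "server-C"),
]

# Working states as obtained by the obtaining unit (also assumed).
STATES = {"server-A": "normal", "server-B": "abnormal", "server-C": "normal"}

def allocate(first_hash):
    """Determine the target hash value interval for a first hash value; if the
    corresponding server is in an abnormal working state, fall over to the
    server of the next target interval."""
    idx = next(i for i, (lo, hi, _) in enumerate(INTERVALS) if lo <= first_hash <= hi)
    for step in range(len(INTERVALS)):
        server = INTERVALS[(idx + step) % len(INTERVALS)][2]
        if STATES[server] == "normal":
            return server
    raise RuntimeError("no distributed server in a normal working state")

print(allocate(110))  # server-B is abnormal -> routed to next interval's server-C
print(allocate(50))   # server-A is normal -> allocated directly
```

This is what prevents data loss on a downed server: the first hash value is simply re-routed to the server of the next interval instead of being dropped.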
According to the technical scheme of this embodiment, the preprocessing process before filtering massive data is achieved through the cooperation of the above modules. Compared with the prior art, the data to be processed are distributed before entering the distributed servers for filtering, so the impact on deduplication of a distributed server going down can be prevented, thereby improving the accuracy of data deduplication; the technical scheme in the embodiment of the invention is simple and convenient to implement, easy to popularize, and has a wide application range.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a distributed server according to a third embodiment of the present invention, illustrating a block diagram of an exemplary distributed server suitable for implementing embodiments of the present invention. The distributed server shown in fig. 3 is only an example and should not limit the functions or scope of use of the embodiments of the present invention.
As shown in fig. 3, the distributed server includes: a second calculation module 31 and a deduplication module 32, wherein,
the second calculating module 31 is configured to perform hash calculation on a first hash value corresponding to the current data to be processed by using a second hash algorithm, and obtain a second hash value corresponding to the current data to be processed;
and the duplication elimination module 32 is configured to perform duplication elimination on the current data to be processed according to the second hash value corresponding to the current data to be processed.
Illustratively, a distributed server typically includes a variety of computer system readable media. Such media may be any available media that is accessible by the distributed server and includes both volatile and nonvolatile media, removable and non-removable media.
The distributed server can be implemented in the form of a general or virtual computing device; one machine can host a plurality of virtual machines, and the number of virtual machines can be set according to service requirements. Of course, virtual machines may also not be used; the specific configuration may be adjusted adaptively according to the actual situation, which is not limited herein. Distributed servers are similar to the central server, except that the functions implemented are different: the distributed server is mainly used for filtering and deduplicating data. After being preprocessed by the central server, the data to be processed are distributed to the distributed servers for filtering, and each distributed server is provided with a BloomFilter plug-in for realizing the filtering function. For the description of the BloomFilter, refer to the above embodiments, which are not repeated here.
The second calculation module 31 in the distributed server performs hash calculation on the first hash value corresponding to the data to be processed to obtain the second hash value corresponding to the current data to be processed, and the deduplication module 32 then filters out duplicates. The filtering and deduplication process can also be seen in the description of the previous embodiment.
Optionally, the deduplication module 32 includes: a determination unit 321 and a deduplication unit 322; wherein,
the determining unit 321 is configured to determine whether a second hash value corresponding to current data to be processed exists in a BitMap of the Redis memory;
the deduplication unit 322 is configured to, when a second hash value corresponding to the current data to be processed does not exist in the BitMap of the Redis memory, store the second hash value corresponding to the current data to be processed into the BitMap of the Redis memory; and when a second hash value corresponding to the current data to be processed exists in the BitMap of the Redis memory, discarding the current data to be processed.
For example, when data filtering is performed in the distributed servers and the target is mass data transmitted by a real-time processing system, the data is transmitted to the distributed servers in batches, and the BloomFilter in each distributed server filters the data piece by piece. When virtual machines are configured on the machines, a large number of distributed servers are generated and the BloomFilters in all of them work simultaneously, producing data in batches; if this data were written into the Redis memory one piece at a time, a large amount of memory resources would be occupied and the read-write speed of the Redis memory would slow down. Therefore, the Redis pipeline technique is used: different commands are classified and submitted in batches, and data belonging to the same command is placed in the same pipeline and submitted to the Redis memory in one batch, which improves the performance of the Redis memory and reduces memory-resource consumption.
It should be noted that the batch commit technique of the Redis memory may be used selectively and adjusted adaptively according to the actual situation, which is not limited herein.
According to the technical scheme of this embodiment, the process of filtering massive data is realized through the cooperation of the above modules. Setting up virtual machines while using the batch submission technique of the Redis memory balances the overall load of the distributed servers, realizes batch storage of data, saves memory resources, and improves the read-write performance of the Redis memory; moreover, the technical scheme in the embodiment of the invention is simple and convenient to implement, easy to popularize, and has a wide application range.
Example four
Fig. 4 is a schematic structural diagram of a data deduplication system according to a fourth embodiment of the present invention. The fourth embodiment of the present invention further provides a data deduplication system, where the system includes: a central server 41 and a distributed server 42; wherein,
the central server 41 is configured to perform hash calculation on the current data to be processed by using a first hash algorithm, acquire a first hash value corresponding to the current data to be processed, and allocate the first hash value corresponding to the current data to be processed to the corresponding distributed server;
the distributed server 42 is configured to perform hash calculation on a first hash value corresponding to the current data to be processed by using a second hash algorithm, and obtain a second hash value corresponding to the current data to be processed; and performing deduplication processing on the current data to be processed according to a second hash value corresponding to the current data to be processed.
Illustratively, the data deduplication system includes a central server 41 and distributed servers 42; the number of central servers 41 and distributed servers 42 is not limited and may be adjusted adaptively according to specific service needs. The data deduplication system can implement all the methods and functions in the foregoing embodiments; for the descriptions of the central server 41 and the distributed servers 42, refer to the foregoing embodiments, which are not repeated here.
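Putting the two levels together, a toy end-to-end flow might look like this (the MD5/SHA-1 choices, server count, bit-array size and helper names are assumptions made for this sketch; the patent does not fix concrete algorithms):

```python
import hashlib

NUM_SERVERS = 3  # assumed number of distributed servers

def first_hash(data):
    """Central server: first hash algorithm, also used to pick a distributed server."""
    return int(hashlib.md5(data.encode()).hexdigest(), 16)

class DistributedServer:
    def __init__(self):
        self.bitmap = set()   # stands in for the BitMap of the Redis memory

    def deduplicate(self, first_hash_value):
        """Second hash calculation plus check-then-set deduplication."""
        second = {int(hashlib.sha1(f"{i}:{first_hash_value}".encode()).hexdigest(), 16) % 1024
                  for i in range(4)}
        if second <= self.bitmap:     # every bit already set -> duplicate
            return "discarded"
        self.bitmap |= second         # record the element's bits
        return "stored"

servers = [DistributedServer() for _ in range(NUM_SERVERS)]

def process(data):
    h1 = first_hash(data)                 # central server: first hash value
    server = servers[h1 % NUM_SERVERS]    # allocate to the corresponding server
    return server.deduplicate(h1)         # distributed server: second hash + dedup

print(process("China"))   # stored: first occurrence
print(process("China"))   # discarded: duplicate
```

Because the first hash deterministically routes equal data to the same distributed server, duplicates always meet the same BitMap, which is what makes per-server deduplication globally correct.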
The computer storage media in a data deduplication system may employ any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, or a combination thereof, including an object-oriented programming language such as Java. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the technical scheme of this embodiment, the central server performs hash calculation on the current data to be processed by using a first hash algorithm to obtain a first hash value corresponding to the current data to be processed, and then distributes the first hash value to the distributed server corresponding to the first hash value; the distributed server performs hash calculation on the first hash value by using a second hash algorithm to obtain a corresponding second hash value, and performs deduplication processing on the current data to be processed according to the second hash value. Therefore, compared with the prior art, because the data to be processed are hashed with the first hash algorithm and distributed to the corresponding distributed servers, loss of the data to be processed when a distributed server goes down can be prevented and the accuracy of data deduplication improved; setting up virtual machines to increase the number of distributed servers effectively balances the overall load of the distributed servers and improves server performance; the batch submission technique of the Redis memory suits the data deduplication needs of a real-time system; moreover, the technical scheme of the embodiment of the invention is simple and convenient to implement, easy to popularize, and has a wide application range.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A method for data deduplication, the method comprising:
the central server performs hash calculation on the current data to be processed by adopting a first hash algorithm to obtain a first hash value corresponding to the current data to be processed, and distributes the first hash value corresponding to the current data to be processed to the corresponding distributed server;
the distributed server performs hash calculation on a first hash value corresponding to the current data to be processed by adopting a second hash algorithm to obtain a second hash value corresponding to the current data to be processed;
and the distributed server performs duplicate removal processing on the current data to be processed according to a second hash value corresponding to the current data to be processed.
2. The method according to claim 1, wherein the allocating the first hash value corresponding to the current data to be processed to the distributed server corresponding to the current data to be processed comprises:
the central server determines a target hash value interval where a first hash value corresponding to the current data to be processed is located according to the first hash value corresponding to the current data to be processed and predetermined hash value intervals corresponding to the distributed servers;
the central server acquires the working state of the distributed server corresponding to the target hash value interval; wherein the operating state comprises: a normal working state and an abnormal working state;
and when the working state of the distributed server corresponding to the target hash value interval is a normal working state, the central server distributes the first hash value corresponding to the current data to be processed to the distributed server corresponding to the target hash value interval.
3. The method of claim 2, further comprising:
when the working state of the distributed server corresponding to the target hash value interval is an abnormal working state, the central server determines the next target hash value interval according to the target hash value interval;
and the central server distributes the first hash value corresponding to the current data to be processed to the distributed server corresponding to the next target hash value interval.
4. The method according to claim 1, wherein the performing, by the distributed server, deduplication processing on the current data to be processed according to a second hash value corresponding to the current data to be processed includes:
the distributed server judges whether a second hash value corresponding to the current data to be processed exists in the BitMap of the Redis memory;
when a second hash value corresponding to the current data to be processed does not exist in the BitMap of the Redis memory, the distributed server stores the second hash value corresponding to the current data to be processed into the BitMap of the Redis memory;
and when a second hash value corresponding to the current data to be processed exists in the BitMap of the Redis memory, the distributed server discards the current data to be processed.
5. A central server, characterized in that the central server comprises: the system comprises a first calculation module and a distribution module; wherein,
the first calculation module is used for performing hash calculation on the current data to be processed by adopting a first hash algorithm to obtain a first hash value corresponding to the current data to be processed;
the distribution module is used for distributing the first hash value corresponding to the current data to be processed to the corresponding distributed server.
6. The central server according to claim 5, wherein the distribution module comprises: the device comprises a determining unit, an acquiring unit and a distributing unit; wherein,
the determining unit is used for determining a target hash value interval corresponding to each first hash value according to the first hash value corresponding to each data to be processed and a predetermined hash value interval corresponding to each distributed server;
the acquisition unit is used for acquiring the working state of the distributed server corresponding to the target hash value interval; wherein the operating state comprises: a normal working state and an abnormal working state;
and the distribution unit is used for distributing the first hash value corresponding to the current data to be processed to the distributed server corresponding to the target hash value interval when the working state of the distributed server corresponding to the target hash value interval is a normal working state.
7. The center server according to claim 6, wherein the determining unit is further configured to determine a next target hash value interval according to the target hash value interval when the operating state of the distributed server corresponding to the target hash value interval is an abnormal operating state;
the allocation unit is further configured to allocate the first hash value corresponding to each piece of data to be processed to the distributed server corresponding to the next target hash value interval.
8. A distributed server, comprising: a second calculation module and a deduplication module; wherein,
the second calculation module is used for performing hash calculation on a first hash value corresponding to the current data to be processed by adopting a second hash algorithm to obtain a second hash value corresponding to the current data to be processed;
and the duplication elimination module is used for carrying out duplication elimination processing on the current data to be processed according to the second hash value corresponding to the current data to be processed.
9. The distributed server of claim 8, wherein the deduplication module comprises: a judging unit and a duplicate removal unit; wherein,
the judging unit is used for judging whether a second hash value corresponding to the current data to be processed exists in the BitMap of the Redis memory;
the duplicate removal unit is configured to, when a second hash value corresponding to the current data to be processed does not exist in the BitMap of the Redis memory, store the second hash value corresponding to the current data to be processed into the BitMap of the Redis memory; and when a second hash value corresponding to the current data to be processed exists in the BitMap of the Redis memory, discarding the current data to be processed.
10. A data deduplication system, the system comprising: a central server and a distributed server; wherein,
the central server is used for performing hash calculation on the current data to be processed by adopting a first hash algorithm, acquiring a first hash value corresponding to the current data to be processed, and distributing the first hash value corresponding to the current data to be processed to the corresponding distributed server;
the distributed server is used for performing hash calculation on a first hash value corresponding to the current data to be processed by adopting a second hash algorithm to obtain a second hash value corresponding to the current data to be processed; and performing deduplication processing on the current data to be processed according to a second hash value corresponding to the current data to be processed.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711434983.XA CN108121810A (en) | 2017-12-26 | 2017-12-26 | A kind of data duplicate removal method, system, central server and distributed server |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711434983.XA CN108121810A (en) | 2017-12-26 | 2017-12-26 | A kind of data duplicate removal method, system, central server and distributed server |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN108121810A true CN108121810A (en) | 2018-06-05 |
Family
ID=62231871
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201711434983.XA Pending CN108121810A (en) | 2017-12-26 | 2017-12-26 | A kind of data duplicate removal method, system, central server and distributed server |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN108121810A (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102609446A (en) * | 2012-01-05 | 2012-07-25 | 厦门市美亚柏科信息股份有限公司 | Distributed Bloom filter system and application method thereof |
| CN104050270A (en) * | 2014-06-23 | 2014-09-17 | 成都康赛信息技术有限公司 | Distributed storage method based on consistent Hash algorithm |
| US20150186222A1 (en) * | 2008-01-24 | 2015-07-02 | Quantum Corporation | Methods and Systems For Vectored Data De-Duplication |
| CN106649346A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Data repeatability check method and apparatus |
| CN107220005A (en) * | 2017-05-27 | 2017-09-29 | 郑州云海信息技术有限公司 | Data manipulation method and system |
- 2017-12-26 CN CN201711434983.XA patent/CN108121810A/en active Pending
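The two-level hashing scheme summarized in the abstract — a central server computes a first hash to route each record to a distributed server, which then computes a second hash (a different algorithm) and discards records it has already seen — can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are made up here, and MD5/SHA-256 stand in for the unspecified "first" and "second" hash algorithms.

```python
import hashlib

NUM_SERVERS = 4  # illustrative number of distributed servers

def first_hash(record: bytes) -> int:
    # First hash algorithm (MD5 here, as an assumption), used only for
    # routing: it decides which distributed server owns the record.
    return int(hashlib.md5(record).hexdigest(), 16)

def second_hash(record: bytes) -> str:
    # Second hash algorithm (SHA-256 here, as an assumption), used for
    # the actual duplicate check on the owning server.
    return hashlib.sha256(record).hexdigest()

# Per-server sets of already-seen second hashes (stands in for each
# distributed server's local state).
servers = [set() for _ in range(NUM_SERVERS)]

def deduplicate(records):
    unique = []
    for rec in records:
        sid = first_hash(rec) % NUM_SERVERS   # central server routes the record
        h2 = second_hash(rec)                 # distributed server hashes it
        if h2 not in servers[sid]:            # local duplicate check
            servers[sid].add(h2)
            unique.append(rec)
    return unique

print(deduplicate([b"alpha", b"beta", b"alpha"]))  # duplicate b"alpha" dropped
```

Because identical records always produce the same first hash, all copies of a record land on the same server, so each server's local seen-set is sufficient and no cross-server lookup is needed; a crashed server affects only its own partition, which is the fault-isolation property the abstract claims.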
Cited By (28)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109816536B (en) * | 2018-12-14 | 2023-08-25 | 中国平安财产保险股份有限公司 | List deduplication method, device and computer equipment |
| CN109816536A (en) * | 2018-12-14 | 2019-05-28 | 中国平安财产保险股份有限公司 | List deduplication method, device and computer equipment |
| CN109933739A (en) * | 2019-03-01 | 2019-06-25 | 重庆邮电大学移通学院 | Web page ranking method and system based on transition probability |
| CN110489405A (en) * | 2019-07-12 | 2019-11-22 | 平安科技(深圳)有限公司 | Data processing method, apparatus and server |
| CN110489405B (en) * | 2019-07-12 | 2024-01-12 | 平安科技(深圳)有限公司 | Data processing method, device and server |
| CN112685219A (en) * | 2019-10-17 | 2021-04-20 | 伊姆西Ip控股有限责任公司 | Method, apparatus and computer program product for backing up data |
| CN111061559A (en) * | 2019-11-13 | 2020-04-24 | 成都安思科技有限公司 | Distributed data mining and statistical method based on data deduplication |
| CN110866166A (en) * | 2019-11-14 | 2020-03-06 | 北京京航计算通讯研究所 | Distributed web crawler performance optimization system for mass data acquisition |
| CN110874429A (en) * | 2019-11-14 | 2020-03-10 | 北京京航计算通讯研究所 | Distributed web crawler performance optimization method oriented to mass data acquisition |
| CN111182043A (en) * | 2019-12-23 | 2020-05-19 | 南京亚信智网科技有限公司 | Hash value distribution method and device |
| CN111208978B (en) * | 2019-12-31 | 2023-05-23 | 杭州安恒信息技术股份有限公司 | Character bloom filter implemented in C++ with a Python interface, and implementation method thereof |
| CN111208978A (en) * | 2019-12-31 | 2020-05-29 | 杭州安恒信息技术股份有限公司 | Character bloom filter implemented in C++ with a Python interface |
| CN111258966A (en) * | 2020-01-14 | 2020-06-09 | 软通动力信息技术有限公司 | Data deduplication method, device, equipment and storage medium |
| CN112068958A (en) * | 2020-08-31 | 2020-12-11 | 常州微亿智造科技有限公司 | Bloom filter and data processing method |
| CN112162975A (en) * | 2020-09-25 | 2021-01-01 | 华南理工大学 | Data deduplication method based on a single-hash evenly distributed Bloom filter |
| CN113590890A (en) * | 2021-08-04 | 2021-11-02 | 拉卡拉支付股份有限公司 | Information storage method, information storage device, electronic apparatus, storage medium, and program product |
| CN113590890B (en) * | 2021-08-04 | 2024-03-26 | 拉卡拉支付股份有限公司 | Information storage method, apparatus, electronic device, storage medium, and program product |
| CN113868434A (en) * | 2021-09-28 | 2021-12-31 | 北京百度网讯科技有限公司 | Data processing method, device and storage medium for graph database |
| CN114064621A (en) * | 2021-10-28 | 2022-02-18 | 江苏未至科技股份有限公司 | Method for detecting duplicate data |
| CN114064621B (en) * | 2021-10-28 | 2022-07-15 | 江苏未至科技股份有限公司 | Method for detecting duplicate data |
| WO2023087769A1 (en) * | 2021-11-16 | 2023-05-25 | 北京锐安科技有限公司 | Method for real-time deduplication of key fields based on the distributed stream computing engine Flink |
| CN114443629A (en) * | 2021-12-23 | 2022-05-06 | 厦门市美亚柏科信息股份有限公司 | Clustered Bloom filter data deduplication method, terminal device and storage medium |
| CN114637781A (en) * | 2022-03-29 | 2022-06-17 | 北京奇艺世纪科技有限公司 | Data filtering method and device, electronic equipment and readable storage medium |
| CN114880297A (en) * | 2022-04-07 | 2022-08-09 | 中国电信股份有限公司河南分公司 | Distributed data deduplication method and system based on fingerprints |
| CN114756784A (en) * | 2022-04-11 | 2022-07-15 | 新疆大学 | High-efficiency and high-precision network data capturing method and device |
| CN115982146A (en) * | 2023-01-05 | 2023-04-18 | 中国联合网络通信集团有限公司 | Data processing method, device and storage medium |
| CN117235085A (en) * | 2023-08-21 | 2023-12-15 | 北京亚鸿世纪科技发展有限公司 | A method to quickly identify new changes in massive data |
| CN120994655A (en) * | 2025-10-24 | 2025-11-21 | 首实安保科技有限责任公司 | A distributed data deduplication method and product |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN108121810A (en) | Data deduplication method and system, central server and distributed server | |
| US8332367B2 (en) | Parallel data redundancy removal | |
| CN111090645B (en) | Cloud storage-based data transmission method and device and computer equipment | |
| US9405589B2 (en) | System and method of optimization of in-memory data grid placement | |
| US10346066B2 (en) | Efficient erasure coding of large data objects | |
| US8898422B2 (en) | Workload-aware distributed data processing apparatus and method for processing large data based on hardware acceleration | |
| US11226778B2 (en) | Method, apparatus and computer program product for managing metadata migration | |
| CA2897338A1 (en) | Data stream splitting for low-latency data access | |
| US20190050168A1 (en) | Lock-free raid implementation in multi-queue architecture | |
| US20190347165A1 (en) | Apparatus and method for recovering distributed file system | |
| US12224775B2 (en) | System and method for data compaction and security with extended functionality | |
| CN106712928A (en) | Rainbow-table-based big data decryption method and device | |
| EP4044014B1 (en) | Data reduction method and apparatus, computing device, and storage medium | |
| US9086936B2 (en) | Method of entropy distribution on a parallel computer | |
| US8549223B1 (en) | Systems and methods for reclaiming storage space on striped volumes | |
| US12489459B2 (en) | System and method for data compaction and security with extended functionality | |
| US9684668B1 (en) | Systems and methods for performing lookups on distributed deduplicated data systems | |
| US20230216520A1 (en) | System and method for data compression with encryption | |
| US20150106884A1 (en) | Memcached multi-tenancy offload | |
| CN109800184B (en) | Caching method, system, device and storage medium for small-block input | |
| US8533423B2 (en) | Systems and methods for performing parallel multi-level data computations | |
| CN109522299A (en) | Data processing method, device, system and storage medium | |
| CN114721586B (en) | Method, electronic device and computer program product for storage management | |
| US9639630B1 (en) | System for business intelligence data integration | |
| US10168963B2 (en) | Storage conditioning with intelligent rebuild |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| RJ01 | Rejection of invention patent application after publication | | |
Application publication date: 2018-06-05