
CN110297879B - Method, device and storage medium for data deduplication based on big data - Google Patents


Info

Publication number
CN110297879B
CN110297879B (application CN201910401427.5A)
Authority
CN
China
Prior art keywords
hash function
binary string
text data
sample library
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910401427.5A
Other languages
Chinese (zh)
Other versions
CN110297879A (en)
Inventor
王保军 (Wang Baojun)
江腾飞 (Jiang Tengfei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910401427.5A priority Critical patent/CN110297879B/en
Priority to PCT/CN2019/103446 priority patent/WO2020228182A1/en
Publication of CN110297879A publication Critical patent/CN110297879A/en
Application granted granted Critical
Publication of CN110297879B publication Critical patent/CN110297879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/316 - Indexing structures
    • G06F16/325 - Hash tables
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application belong to the Internet field and relate to a method, an apparatus and a storage medium for data deduplication based on big data. The method includes: collecting at least two pieces of text data according to preset keywords; generating, for each piece of text data, a k-bit binary string from a similarity hash function and a hash function; adjusting the arrangement order of the j sub-binary strings obtained from the k-bit string, generating j corresponding sets with a different sub-binary string at the front of each, and storing the j sets in a preset sample library; matching the front-most binary string of each of the j sets against the sample library to obtain the candidate results returned for each set; and calculating the Hamming distance between any two pieces of text data from their candidate results, deduplicating when the Hamming distance is less than or equal to a threshold. Because the hash algorithm reduces the dimensionality of the data, the comparison time for two texts and the cost of storing texts are both reduced.

Description

Method, device and storage medium for data deduplication based on big data
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, a computer device, and a storage medium for data deduplication based on big data.
Background
Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within a reasonable time frame; it is a massive, fast-growing and diversified information asset that requires new processing modes to provide stronger decision-making power, insight discovery and process optimization.
Big data has the following major characteristics:
capacity (Volume): the size of the data determines its value and the potential information it contains;
category (Variety): the diversity of data types;
speed (Velocity): the speed at which data can be obtained;
variability (Variability): inconsistency in the data, which can hamper the processing and effective management of the data;
authenticity (Veracity): the quality of the data;
complexity (Complexity): the data volume is huge and the sources are multi-channel;
value (Value): using big data reasonably creates high value at low cost.
The smallest basic unit of big data is the bit; in order, the units are: bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, DB. Apart from 1 Byte = 8 bits, each successive unit is 1024 (i.e., 2^10) times the previous one.
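The 1024-based ladder above can be sketched as a small conversion helper; the `UNITS` list and `to_bits` function are illustrative names, not from the application:

```python
# Sketch of the unit ladder described above: 1 Byte = 8 bits, and each
# successive unit is 1024 (2**10) times the previous one.
UNITS = ["Byte", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bits(value, unit):
    """Convert a value expressed in `unit` into bits."""
    exponent = UNITS.index(unit)           # factors of 1024 above one Byte
    return value * (1024 ** exponent) * 8  # 1 Byte = 8 bits

print(to_bits(1, "KB"))  # 1 KB = 1024 Bytes = 8192 bits
```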
With the advent of the information explosion age and the application of cloud technology, big data has attracted more and more attention, and big data processing technologies mainly include a large-scale parallel processing (MPP) database, data mining, a distributed file system, a distributed database, a cloud computing platform, the internet and an extensible storage system.
When the same file is backed up multiple times from the same directory, or from multiple addresses in the network, duplicate data appears. This greatly increases the I/O and CPU load on the analysis system; without deduplication, analysis efficiency drops and the hardware overhead of the analysis system rises, and for services billed by the total volume of data analyzed, the cost of the redundant analysis is unacceptable. Duplicate data is a particularly serious problem in big data processing, because the Internet is currently filled with a large amount of near-duplicate information; for big data mining, duplicate data can lead to erroneous judgments, i.e., invalid big data.
Therefore, it is necessary to deduplicate the duplicate data to avoid the occurrence of the above-described problem.
In the prior art, one data deduplication technique compares data according to its payload, its full content, or custom rules to judge whether duplication exists, and then filters out the redundant data.
Yet another prior art data deduplication technique compares the similarity of two texts, most often by segmenting the words and converting each text into a feature vector, then measuring the distance between the vectors, e.g., the common Euclidean distance, Hamming distance, or cosine angle.
These data deduplication techniques work well in scenarios with small data volumes, but when a large amount of duplicate data exists on the Internet, they are difficult to apply to massive data processing; otherwise the I/O and CPU load of the analysis system increases greatly and resources are wasted.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, computer equipment and a storage medium for data deduplication based on big data, wherein a hash algorithm is used for reducing the dimension of the data, so that the comparison time of two texts can be reduced, and the cost for text storage is reduced.
In order to solve the above technical problems, the embodiments of the present application provide a method for data deduplication based on big data, which adopts the following technical scheme:
collecting at least two text data according to preset keywords;
generating, for each piece of text data, a k-bit binary string from a similarity hash function and a hash function, where k = 2^n and n is a positive integer greater than or equal to 2;
equally dividing the k-bit binary string into j sub-binary strings, wherein j is a positive integer greater than or equal to 1;
adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets with a different sub-binary string at the front of each, and storing the j sets in a preset sample library;
matching the front-most binary string of each of the j sets against the sample library to obtain the candidate results returned by the sample library for each set;
and calculating the Hamming distance of any two text data according to the candidate result of each text data, and performing deduplication if the Hamming distance is smaller than or equal to a threshold value.
Optionally, for each text data, generating the k-bit binary string according to the similar hash function and the hash function specifically includes:
selecting the number k of bits of the similar hash function;
initializing each bit of the similar hash function to 0;
extracting word segments from each piece of text data to obtain a plurality of segment_weight pairs;
applying the hash function to the segment in each segment_weight pair;
accumulating the hash-processed segment_weight pairs column-wise, per bit, to generate k values;
and converting the generated k number values into a k-bit binary string.
Optionally, the selecting the bit number k of the similar hash function specifically includes: the number of bits k of the similar hash function is selected according to the storage cost and the size of the data set.
Optionally, applying the hash function to the segment in each segment_weight pair specifically includes: calculating, with a k-bit hash function, the hash code of each group of a predetermined number of letters of the text data.
Optionally, the collecting the at least two text data according to the preset keyword specifically includes: and capturing at least two text data related to the keywords according to preset keywords by utilizing a web crawler technology.
Optionally, matching the sample library with the front-most binary string of each of the j sets and obtaining the candidate results returned by the sample library specifically includes: determining whether the front-most binary string of each of the j sets is identical to a front-most binary string stored in the in-memory sample library; if so, the candidate result currently fed back by the sample library is a correct candidate result; if not, it is an incorrect candidate result.
Optionally, accumulating the hash-processed segment_weight pairs column-wise to generate k values specifically includes: accumulating the hash-processed segment_weight pairs per bit, adding 1 when a bit is 1 and subtracting 1 when a bit is 0, finally generating k values.
In order to solve the above technical problem, the embodiment of the present application further provides a device for data deduplication based on big data, which adopts the following technical scheme, where the device for data deduplication based on big data includes:
The collection module is used for collecting at least two text data according to preset keywords;
a processing module, configured to generate, for each piece of text data, a k-bit binary string from a similarity hash function and a hash function, where k = 2^n and n is a positive integer greater than or equal to 2;
the splitting module is used for equally dividing the binary string with k bits into j sub binary strings, wherein j is a positive integer greater than or equal to 1;
the adjustment module is used for adjusting the arrangement sequence of the j number of sub-binary strings, generating corresponding j sets by taking different number of sub-binary strings as the binary string at the forefront end, and storing the j sets in a preset sample library;
the matching module is used for matching the sample library by using the binary string at the forefront end of each set of the j sets to acquire candidate results of the sets returned by the sample library;
and the calculation module is used for calculating the Hamming distance of any two text data according to the candidate result of each text data, and performing deduplication if the Hamming distance is smaller than or equal to a threshold value.
In order to solve the above technical problems, the embodiments of the present application further provide a computer device, which adopts the following technical schemes:
the computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the big data based data deduplication method when executing the computer program.
In order to solve the above technical problems, embodiments of the present application further provide a computer readable storage medium, which adopts the following technical solutions:
the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the big data based data deduplication method.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
Generating, for each piece of text data, a k-bit binary string from a similarity hash function and a hash function, where k = 2^n; equally dividing the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1; adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets with a different sub-binary string at the front of each, and storing the j sets in a preset sample library; matching the front-most binary string of each of the j sets against the sample library to obtain the candidate results returned for each set; and calculating the Hamming distance between any two pieces of text data from their candidate results, deduplicating when the Hamming distance is less than or equal to a threshold. In this way the hash algorithm reduces the comparison time for two texts and reduces the cost of text storage.
Drawings
For a clearer description of the solution in the present application, a brief description will be given below of the drawings that are needed in the description of the embodiments of the present application, it being obvious that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow chart of one embodiment of a method for big data based data deduplication in accordance with the present application;
FIG. 2 is a flow chart of one embodiment of step 102 of FIG. 1;
FIG. 3 is a schematic diagram of an embodiment of an apparatus for big data based data deduplication in accordance with the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device according to the present application.
Reference numerals: 301-collecting module, 302-processing module, 303-splitting module, 304-adjusting module, 305-matching module, 306-calculating module, 307-bus, 41-memory, 42-processor and 43-network interface
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a method for data deduplication based on big data according to an embodiment of the present application, where the method for data deduplication based on big data may be as follows.
Step 101: collect at least two pieces of text data according to the preset keywords.
For example, the web crawler technology is utilized to capture at least two text data related to the keywords according to preset keywords, and the at least two text data are stored in a data warehouse of a buffer memory or a storage.
A web crawler (also known as a web spider, web robot, or web chaser) is a program or script that automatically crawls web information according to certain rules. It is an important component of search engines, downloading web pages from the World Wide Web. A traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs on those pages, and, while crawling, continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
In this embodiment, the plurality of text data may be collected by a focused web crawler, which obtains the required information based on a given crawling objective (e.g., investment category information of a customer), selected access log records, APP feedback, WeChat messages, or web pages on the World Wide Web and their related links. For example, when searching with a focused web crawler, keywords related to investment data are set; the keywords may be: name, identification number, address, telephone number, bank account number, mailbox address, city, zip code, password type (e.g., account inquiry password, withdrawal password, login password), organization name, business license number, transaction date, transaction amount, etc. The web crawler then captures text data related to these keywords from log records, APP feedback, WeChat or web pages on the World Wide Web, e.g., text data containing the keywords, and stores the collected text data in a data warehouse in a buffer or storage according to various dimensions; the data in the data warehouse is thus big data.
Step 102: for each piece of text data, generate a k-bit binary string from a similarity hash function (simhash) and a hash function (hash), where k = 2^n and n is a positive integer greater than or equal to 2.
For example, for each text data (e.g., doc text, web text) of the big data, the text data is converted into hash codes (hashcode) using a simhash function, as described in detail below.
For example, the following three pieces of text are used as an illustration: p1 = "the cat sat on the mat"; p2 = "the cat sat on a mat"; p3 = "we all scream for ice cream". The entire process can be described as follows; FIG. 2 is a flowchart of one embodiment of step 102 in FIG. 1.
Step 1021, selecting the bit number k of the simhash function.
For example, the number of bits k of the simhash is selected according to the storage cost and the size of the data set, where k = 2^n and n is a positive integer greater than or equal to 2, e.g., k = 16, 32, 64 or 128 bits.
Step 1022, initializing each bit of the simhash function to 0.
Step 1023, extracting the word segmentation from each text data, and extracting a plurality of word segmentation_weight pairs.
For example, word segments are extracted from each piece of text data (including word segmentation and weight calculation), e.g., n segment_weight pairs (feature_weight_pairs) are extracted, denoted feature_weight_pairs = [fw1, fw2, ..., fwn], where fwn = (feature_n, weight_n) and n is a positive integer greater than or equal to 2.
For example, splitting into groups of a predetermined number of letters is generally adopted, e.g., groups of 2 or 3. For "the cat sat on the mat", splitting two letters at a time yields: { "th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma" }, where a space also counts as a letter.
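The two-by-two splitting above can be sketched as follows; `letter_shingles` is an illustrative name, not from the application:

```python
def letter_shingles(text, n=2):
    """Split text into overlapping n-letter fragments; a space counts as a letter."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

tokens = letter_shingles("the cat sat on the mat")
# 15 distinct two-letter fragments, including "th", "he", "ca" and " c"
```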
Step 1024, performing a hash function processing on the segmentation (feature) in each segmentation_weight pair (feature_weight_pairs).
For example, a 32-bit hash function is used to calculate the hash code of each group of a predetermined number of letters of the text data, i.e., the hash code of every 2 or 3 letters, such as: "th".hash = -502157718, "he".hash = -369049682, and so on.
Step 1025, performing longitudinal accumulation on the word segmentation_weight pairs processed by the hash function to generate k numerical values.
For example, the word-splitting weight pair processed by the hash function is subjected to longitudinal accumulation of bits, if the bit is 1, 1 is added, and if the bit is 0, 1 is subtracted, and k (i.e. bits_count) values are finally generated.
For example, with a 32-bit hash function, the number of generated bits is bits_count = 32. For each bit of each segment's hash code, if the bit is 1, the value at the corresponding simhash position is incremented by 1; otherwise it is decremented by 1, finally yielding 32 values (i.e., the simhash comprises 32 values).
Step 1026, converting the generated k number values into a k-bit binary string.
For example, for each of the 32 resulting simhash values, if the value is greater than 0, the corresponding bit is set to 1; otherwise it is set to 0.
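Steps 1021 to 1026 can be sketched end to end. This is a minimal illustration, not the application's exact implementation: it assumes MD5 truncated to k bits as the per-segment hash function and gives every two-letter segment a weight of 1, since the application fixes neither the hash function nor the weighting:

```python
import hashlib

def simhash(text, k=32, n=2):
    """Sketch of steps 1021-1026: shingle the text into n-letter fragments
    (each with weight 1), hash every fragment to k bits, accumulate +1/-1
    per bit position, then threshold positive counts to 1."""
    counts = [0] * k                                    # steps 1021-1022: k bits, all 0
    for i in range(len(text) - n + 1):                  # step 1023: n-letter shingles
        fragment = text[i:i + n]
        digest = hashlib.md5(fragment.encode()).digest()
        code = int.from_bytes(digest[:k // 8], "big")   # step 1024: k-bit hash code
        for bit in range(k):                            # step 1025: column-wise +1/-1
            counts[bit] += 1 if (code >> bit) & 1 else -1
    # step 1026: positive counts become 1, the rest 0
    return "".join("1" if c > 0 else "0" for c in counts)

p1 = simhash("the cat sat on the mat")
p2 = simhash("the cat sat on a mat")
```

Because similar texts share most shingles, their column counts mostly agree, so p1 and p2 end up with fingerprints that differ in only a few bit positions.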
In another embodiment of the present application, a 64 or 128 bit binary string may also be generated, and the present embodiment is not limited.
The use of simhash should produce results similar to the following:
irb(main):003:0> p1.simhash => 851459198 00110010110000000011110001111110
irb(main):004:0> p2.simhash => 847263864 00110010100000000011100001111000
irb(main):002:0> p3.simhash => 984968088 00111010101101010110101110011000
After the simhash operation, the Hamming distance between any two of the three texts is the number of differing bits in their two binary strings.
Step 103: equally divide the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1.
For example, a 32-bit or 64-bit binary string is divided equally into four parts: when a 32-bit binary string is divided into four, each part is an 8-bit sub-binary string; when a 64-bit binary string is divided into four, each part is a 16-bit sub-binary string. For example, a 64-bit binary string is divided equally into four 16-bit sub-binary strings L1-16, L17-32, L33-48 and L49-64, each comprising 16 bits.
The above embodiment is described by taking the example of equally dividing a binary string of 32 bits or 64 bits into four, but the embodiments of the present application are not limited to dividing into a plurality of parts, for example, j is a positive integer of 1 or more, for example, j may be 2, 3, 4, 5, 6, 7, 8, or the like.
Step 104: adjust the arrangement order of the j sub-binary strings, generate j corresponding sets with a different sub-binary string at the front of each, and store the j sets in a preset sample library.
When the 64-bit binary string is divided equally into four parts, any one of the 16-bit sub-binary strings may be moved to the front of the four. For example, with the sub-binary strings L1-16, L17-32, L33-48 and L49-64 each taking a turn as the front-most binary string, there are 4 sets, which may be stored as tables in a preset sample library, e.g., in a preset memory (i.e., 4 tables in memory). The 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), (L49-64, L1-16, L17-32, L33-48).
The above embodiment performs set classification only by the first sub-binary string; how the later sub-binary strings are arranged is not limited. In another embodiment of the present application, the set classification may also be performed in other ways, such as dividing a 64-bit binary string equally into two parts, each a 32-bit sub-binary string: L1-32 and L33-64. Either 32-bit sub-binary string may be moved to the front; with L1-32 and L33-64 each taking a turn as the front-most binary string, there are 2 sets, which may be stored as tables in the in-memory sample library (i.e., 2 tables in memory): (L1-32, L33-64) and (L33-64, L1-32).
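Steps 103 and 104 can be sketched as follows; the function names and the sample fingerprint are illustrative, not from the application:

```python
def split_blocks(fingerprint, j=4):
    """Step 103: equally divide a k-bit binary string into j sub-strings."""
    size = len(fingerprint) // j
    return [fingerprint[i * size:(i + 1) * size] for i in range(j)]

def rotated_tables(fingerprint, j=4):
    """Step 104: one ordering per block, each with a different block at the front."""
    blocks = split_blocks(fingerprint, j)
    return [[blocks[i]] + blocks[:i] + blocks[i + 1:] for i in range(j)]

fp = ("0000000000000000" "1111111111111111"
      "0000111100001111" "1010101010101010")  # an illustrative 64-bit fingerprint
tables = rotated_tables(fp)  # 4 orderings; table i is keyed on its front block
```

Each of the 4 orderings would then be stored as one table in the sample library, keyed on its front 16-bit block.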
Step 105: match the front-most binary string of each of the j sets against the sample library, and obtain the candidate results returned by the sample library for each set.
For example, the front-most binary string of each of the j sets is matched against the sample library. If the sample library holds 2^m hash fingerprints in total, then each set returns on the order of 2^(m - k/j) candidate results, since matching is performed on a front block of k/j bits, where m is an integer greater than k/j.
For example, when the 64-bit binary string generates four tables, the first 16-bit sub-binary string is searched by exact matching. If 2^34 (about 1.7 x 10^10) hash fingerprints are stored in the sample library, then each table returns 2^(34-16) = 2^18 = 262,144 candidate results; compared with the prior art's 2^34 comparisons, the cost of Hamming distance calculation is greatly reduced.
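The candidate-count arithmetic for this example (2^34 stored fingerprints, four tables each keyed on a 16-bit front block) can be checked directly:

```python
m, k, j = 34, 64, 4                 # 2**m fingerprints, k-bit strings, j tables
prefix_bits = k // j                # each table is keyed on a 16-bit front block
total = 2 ** m                      # fingerprints in the sample library
per_table = 2 ** (m - prefix_bits)  # expected matches for one 16-bit prefix
print(total, per_table)             # 17179869184 shrinks to 262144 per table
```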
In another embodiment of the present application, matching the sample library with the front-most binary string of each of the j sets specifically includes: determining whether the front-most binary string of each of the j sets is identical to a front-most binary string stored in the memory; if so, the candidate result currently fed back by the sample library is a correct candidate result; if not, it is an incorrect candidate result.
Step 106: calculate the Hamming distance between any two pieces of text data from their candidate results, and if the Hamming distance is less than or equal to a threshold, deduplicate (i.e., discard or delete one of the texts).
For example, the Hamming distance between binary strings A and B is the number of 1s in the binary result of A xor B.
For example, with binary string A = 100111 and B = 101010, hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3.
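The xor-and-count definition can be sketched directly:

```python
def hamming_distance(a, b):
    """Number of 1 bits in a XOR b, i.e. the count of differing bit positions."""
    return bin(a ^ b).count("1")

A = 0b100111
B = 0b101010
print(hamming_distance(A, B))  # A xor B = 001101, which has three 1s -> 3
```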
The simhash algorithm maps a high-dimensional feature vector to an f-bit fingerprint, and whether two texts are duplicates or highly similar is determined by comparing the Hamming distance between their f-bit fingerprints: the smaller the Hamming distance, the more similar the texts; when the Hamming distance equals zero, the two compared texts are identical; the larger the Hamming distance, the less similar they are.
For example, the simhash results of the three texts p1, p2 and p3 above give Hamming distances (p1, p2) = 4, (p1, p3) = 16 and (p2, p3) = 12; the similarity between p1 and p2 is therefore much greater than the similarity of either to p3.
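The step 106 decision can be sketched with the three fingerprints from the irb output above; the threshold of 4 is illustrative, since the application leaves the threshold configurable:

```python
def bit_difference(a, b):
    """Hamming distance between two equal-length binary strings."""
    return sum(x != y for x, y in zip(a, b))

def deduplicate(fingerprints, threshold=4):
    """Step 106 sketch: keep a fingerprint only if no already-kept fingerprint
    lies within `threshold` Hamming distance; otherwise discard it as a duplicate.
    The threshold value is illustrative, not fixed by the method."""
    kept = []
    for fp in fingerprints:
        if all(bit_difference(fp, other) > threshold for other in kept):
            kept.append(fp)
    return kept

p1 = "00110010110000000011110001111110"
p2 = "00110010100000000011100001111000"
p3 = "00111010101101010110101110011000"
survivors = deduplicate([p1, p2, p3])  # p2 is within distance 4 of p1 -> dropped
```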
In summary, in the big-data-based deduplication method described in the above embodiments, the biggest difference from an ordinary hash function is this: although an ordinary hash function can also be used to map texts and compare them for duplication, two texts differing by perhaps only one byte would be mapped to two completely different hash results, whereas the simhash function maps similar texts to similar hash results. For example, setting the simhash function to 64 bits, i.e., f = 64, maps the weighted feature set of a text onto a 64-bit hash fingerprint.
For example, with the simhash function set to 64 bits, the 64-bit binary string is divided equally into 4 sub-binary strings; the 64-bit binary is then rearranged with each sub-binary string in turn as the first 16 bits, giving four combinations in total, and the four resulting tables are stored in the sample library. The first 16 bits are searched by exact matching: if the sample library stores 2^34 (about 1.7 x 10^10) hash fingerprints, then each table returns 2^(34-16) = 262,144 candidate results, greatly reducing the cost of Hamming distance calculation.
Therefore, the big-data-based deduplication method described in the embodiments of this application uses a hash algorithm to reduce the dimensionality of big data, which reduces the comparison time for two texts and reduces the cost of text storage.
It should be noted that, the method for deduplication based on big data provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the method and apparatus for deduplication based on big data are generally set in the server/terminal device. The terminal device may be a wireless terminal, which may be a device providing voice and/or data connectivity to a user, a handheld device having wireless connectivity, or other processing device connected to a wireless modem, or a wired terminal. The terminal may be a portable, pocket, hand-held, computer-built-in or vehicle-mounted mobile device.
It should be understood that the numbers of terminal devices, networks, and servers are merely illustrative; there may be any number of each, as the implementation requires.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored in a computer-readable storage medium which, when executed, performs the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise a plurality of sub-steps or stages that are not necessarily performed at the same moment but may be executed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 1, the present application provides an embodiment of a data deduplication apparatus based on big data, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus may be specifically applied in various electronic devices.
As shown in fig. 3, the data deduplication apparatus 300 based on big data according to the present embodiment includes: a collection module 301, a processing module 302, a splitting module 303, an adjustment module 304, a matching module 305, a calculation module 306, and a bus 307. The collection module 301, the processing module 302, the splitting module 303, the adjustment module 304, the matching module 305, and the calculation module 306 are connected to one another through the bus 307. The module division in this embodiment is merely illustrative; the modules may instead be divided logically according to the corresponding method actions.
The bus 307 is used to enable communication between these connected components. For example, the bus 307 may be an industry standard architecture (Industry Standard Architecture, ISA) bus, a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus system may be classified into an address bus, a data bus, a control bus, etc. For ease of illustration, only one bold line is shown in the figures, but this does not mean there is only one bus or only one type of bus.
The collecting module 301 is configured to collect at least two text data according to a preset keyword;
the processing module 302 is configured to form, for each text data, a k-bit binary string according to a similar hash function (simhash) and a hash function (hash), where k = 2^n and n is a positive integer greater than or equal to 2;
the splitting module 303 is configured to equally divide the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1;
the adjustment module 304 is configured to adjust the arrangement order of the j sub-binary strings, generating j corresponding sets by placing a different sub-binary string at the front end of each set, and to store the j sets in a preset sample library;
the matching module 305 is configured to match the sample library using the front-end binary string of each of the j sets, and to obtain the candidate results of each set returned by the sample library. For example, if the sample library stores a total of 2^m hash fingerprints, the matching module 305 obtains 2^(m-j) candidate results for each set, where m is an integer greater than 2 and m > j;
The calculating module 306 is configured to calculate a hamming distance between any two text data according to the candidate result of each text data, and perform deduplication if the hamming distance is less than or equal to a threshold.
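As a minimal sketch of the deduplication step performed by the calculation module (brute force, without the table lookup that the matching module uses to narrow the candidates; all names here are ours):

```python
def deduplicate(fingerprints, threshold=3):
    """Keep a fingerprint only if it differs from every fingerprint
    kept so far by more than `threshold` Hamming bits."""
    kept = []
    for fp in fingerprints:
        # popcount of XOR = Hamming distance between the two fingerprints
        if all(bin(fp ^ other).count("1") > threshold for other in kept):
            kept.append(fp)
    return kept
```

In the embodiment itself, only the candidates returned by the sample library for each set would be compared, rather than every stored fingerprint.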
The following three pieces of text are used as an illustration: p1 = the cat sat on the mat; p2 = the cat sat on a mat; p3 = we all scream for ice cream.
In another embodiment of the present application, the processing module 302 is configured to convert each text data item of the big data (for example, doc text or web text) into a hash code using the simhash function. For example, the processing module 302 further includes: a selection subunit, an initialization subunit, an extraction subunit, a hash function processing subunit, an accumulation subunit, and a processing subunit, any two of which may be communicatively connected to each other.
The selection subunit is configured to select the number of bits k of the similar hash function. For example, the selection subunit selects the number of bits k of the simhash according to the storage cost and the size of the data set, where k = 2^n and n is a positive integer greater than or equal to 2, e.g., k = 16, 32, 64, or 128 bits.
An initialization subunit, configured to initialize each bit of the similar hash function to 0;
The extraction subunit is configured to extract word segments from each text data and to extract a plurality of segment_weight pairs. For example, the extraction subunit calculates, using a k-bit hash function, the hash code of every predetermined number of letters of each text data, where the predetermined number is, e.g., 2 or 3.
The extraction subunit is further configured to perform word-segment extraction (including segmentation and weight calculation) on each text data, for example extracting n (segment, weight) pairs, denoted feature_weight_pairs = [fw1, fw2 … fwn], where fwn = (feature_n, weight_n) and n is a positive integer greater than or equal to 2. A predetermined shingle length is typically used, for example 2 or 3; for "the cat sat on the mat", splitting two letters at a time yields: { "th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma" }, where a space also counts as a letter.
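The two-letter splitting in this example can be sketched as follows (the helper name is ours; a space counts as a letter, as noted above):

```python
def shingles(text, n=2):
    """Overlapping letter n-grams of `text`; a space counts as a letter."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = shingles("the cat sat on the mat")
# yields 'th', 'he', 'e ', ' c', 'ca', 'at', 't ', ' s', 'sa', ...
```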
The hash function processing subunit is configured to apply the hash function to the word segment in each segment_weight pair. For example, the hash function processing subunit calculates the hash code of every predetermined number of letters (e.g., every 2 or 3 letters) of the text data using a 32-bit hash function, such as: "th".hash = -502157718, "he".hash = -369049682, … .
The accumulation subunit is configured to accumulate, bit by bit, the segment_weight pairs processed by the hash function to generate k numerical values: if a bit is 1 the weight is added, and if the bit is 0 the weight is subtracted. For example, with a 32-bit hash function the number of bits generated is bits_count = 32; for each bit of each segment's hash code, if the bit is 1 the corresponding simhash value is incremented by 1, and otherwise decremented by 1, finally yielding 32 values (i.e., the simhash comprises 32 values).
The processing subunit is configured to convert the generated k values into a k-bit binary string, where k is, for example, 32, 64, or 128. For example, for the 32 resulting simhash values, the processing subunit sets a bit to 1 if its value is greater than 0, and to 0 otherwise.
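Putting the subunits above together, a minimal simhash sketch might look as follows. It assumes a unit weight for every n-gram and uses a truncated MD5 as the k-bit hash function; both are our simplifications rather than choices made in the embodiment:

```python
import hashlib

def simhash(text, k=32, n=2):
    """Minimal simhash: hash letter n-grams, accumulate each bit with
    weight +1/-1, then threshold the k accumulators at zero."""
    counts = [0] * k  # each bit's accumulator is initialised to 0
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # truncated MD5 stands in for the k-bit hash function
        h = int(hashlib.md5(gram.encode()).hexdigest(), 16) & ((1 << k) - 1)
        for b in range(k):
            counts[b] += 1 if (h >> b) & 1 else -1
    # bit b of the fingerprint is 1 iff its accumulator is positive
    return sum(1 << b for b in range(k) if counts[b] > 0)
```

With a real word segmenter, the per-gram increment of 1 would be replaced by each segment's weight.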
In another embodiment of the present application, the processing subunit may also generate a 64 or 128 bit binary string, which is not limited in this embodiment.
The use of simhash should produce results similar to the following:
irb(main):003:0>p1.simhash=>851459198 00110010110000000011110001111110
irb(main):004:0>p2.simhash=>847263864 00110010100000000011100001111000
irb(main):002:0>p3.simhash=>984968088 00111010101101010110101110011000。
After the simhash operation, the Hamming distance (hamming distance) between any two of the three texts is the number of bits that differ between their two binary strings.
In another embodiment of the present application, the matching module 305 being configured to match the sample library with the front-end binary string of each of the j sets specifically includes: the matching module 305 is configured to determine whether the front-end binary string of each of the j sets is identical to the front-end binary string of each set stored in the sample library in the memory; if they are identical, the candidate result currently fed back by the sample library is determined to be a correct candidate result, and if they are not identical, the candidate result currently fed back by the sample library is determined to be an incorrect candidate result.
In another embodiment of the present application, the splitting module 303 is further configured to equally divide a 32-bit or 64-bit binary string into four parts. For example, when a 32-bit binary string is equally divided into four parts, each part is an 8-bit sub-binary string; when a 64-bit binary string is equally divided into four parts, each part is a 16-bit sub-binary string. For example, the splitting module 303 splits the 64-bit binary string equally into four 16-bit sub-binary strings L1-16, L17-32, L33-48, and L49-64, each comprising 16 bits.
The above embodiment takes a 32-bit or 64-bit binary string as an example, but the embodiments of the present application are not limited to dividing the string into a particular number of parts; j is a positive integer greater than or equal to 1, and may for example be an even number such as 2, 4, 6, or 8.
In another embodiment of the present application, the adjustment module 304 is further configured, when the 64-bit binary string is equally divided into four parts, to adjust any one 16-bit sub-binary string to be the front-end binary string of the four sub-binary strings. For example, each of the sub-binary strings L1-16, L17-32, L33-48, and L49-64 may in turn be adjusted to the front of the whole string, giving 4 sets, and the corresponding tables may be stored in the memory, i.e., 4 tables are stored. For example, the 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), and (L49-64, L1-16, L17-32, L33-48).
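The rotated sets can be generated mechanically. The sketch below (names ours) returns one lookup key per table, each with a different part of the fingerprint moved to the front:

```python
def permuted_keys(fingerprint, k=64, parts=4):
    """Split a k-bit fingerprint into `parts` equal sub-strings and
    produce one key per table, rotating each sub-string to the front."""
    width = k // parts
    bits = format(fingerprint, "0{}b".format(k))
    chunks = [bits[i * width:(i + 1) * width] for i in range(parts)]
    # key i = chunk i first, then the remaining chunks in original order
    return [chunks[i] + "".join(chunks[:i] + chunks[i + 1:])
            for i in range(parts)]
```

Exact matching on the first `width` bits of key i then corresponds to a prefix search in table i.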
For example, the calculation module 306 is further configured to calculate the Hamming distance between two text data items (e.g., first text data and second text data), where the Hamming distance between binary string A and binary string B is the number of 1s in the binary string A xor B.
For example, for binary string A = 100111 and binary string B = 101010, hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3.
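The same calculation in code (the popcount of the XOR; the helper name is ours):

```python
def hamming_distance(a, b):
    """Number of differing bits between integers a and b."""
    return bin(a ^ b).count("1")

print(hamming_distance(0b100111, 0b101010))  # 3
```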
The calculation module 306 is further configured to map a high-dimensional feature vector into an f-bit fingerprint (fingerprint) via the simhash algorithm, and to determine whether two texts are duplicates or highly similar by comparing the Hamming distance (Hamming Distance) between their f-bit fingerprints: the smaller the Hamming distance, the more similar the texts; when the Hamming distance equals zero, the two compared texts are identical; and the larger the Hamming distance, the more dissimilar the texts.
For example, for the three texts p1, p2, and p3 above, the Hamming distances between the simhash results are (p1, p2) = 4, (p1, p3) = 16, and (p2, p3) = 12; the similarity between p1 and p2 is thus much greater than their similarity to p3.
In this embodiment, the modules may be implemented by one or more processors, chips, or integrated circuits, which is not limited in this embodiment.
Therefore, the big-data-based data deduplication apparatus described in the embodiments of the present application uses a hash algorithm to reduce the dimensionality of big data, which shortens the time needed to compare two texts and reduces the cost of storing them.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, one or more processors 42, and a network interface 43, which are communicatively connected to each other via a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components need be implemented; more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory), a random access memory (RAM), a static random-access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store the operating system and the various types of application software installed on the computer device 4, such as the program code of the data processing method described above. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program code or process the data stored in the memory 41, for example the program code of the data deduplication method based on big data.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
The processor 42 is configured to: collect at least two text data according to a preset keyword; form, for each text data, a k-bit binary string according to a similar hash function (simhash) and a hash function (hash), where k = 2^n and n is a positive integer greater than or equal to 2; equally divide the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1; adjust the arrangement order of the j sub-binary strings, generating j corresponding sets by placing a different sub-binary string at the front end of each set, and store the j sets in a preset sample library; match the sample library using the front-end binary string of each of the j sets, obtaining the candidate results of each set returned by the sample library; and calculate the Hamming distance between any two text data according to the candidate result of each text data, performing deduplication if the Hamming distance is less than or equal to a threshold.
In another embodiment of the present application, the processor 42 is further configured to: when the 64-bit binary string is equally divided into four parts, adjust any one 16-bit sub-binary string to be the front-end binary string of the four sub-binary strings. For example, each of the sub-binary strings L1-16, L17-32, L33-48, and L49-64 may in turn be adjusted to the front of the whole string, giving 4 sets, and the corresponding tables (table) may be stored in the memory, i.e., 4 tables are stored. For example, the 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), and (L49-64, L1-16, L17-32, L33-48).
The processor 42 is further configured to calculate the Hamming distance between two text data items (e.g., first text data and second text data); for example, the Hamming distance between binary string A and binary string B is the number of 1s in A xor B.
For example, for binary string A = 100111 and binary string B = 101010, hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3.
The processor is further configured to map a high-dimensional feature vector into an f-bit (f-bit) fingerprint (fingerprint) via the simhash algorithm, where f is an integer greater than or equal to 2, and to determine whether two texts are duplicates or highly similar by comparing the Hamming distance (Hamming Distance) between their f-bit fingerprints: the smaller the Hamming distance, the more similar the texts; when the Hamming distance equals zero, the two compared texts are identical; and the larger the Hamming distance, the more dissimilar the texts.
For example, for the three texts p1, p2, and p3 above, the Hamming distances between the simhash results are (p1, p2) = 4, (p1, p3) = 16, and (p2, p3) = 12; the similarity between p1 and p2 is thus much greater than their similarity to p3.
The present application also provides another embodiment, namely, a computer-readable storage medium storing a data processing program executable by at least one processor to cause the at least one processor to perform the steps of the data processing method as described above.
For example, when executed by at least one processor, the data processing program performs the following: collecting at least two text data according to a preset keyword; forming, for each text data, a k-bit binary string according to a similar hash function (simhash) and a hash function (hash), where k = 2^n and n is a positive integer greater than or equal to 2; equally dividing the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1; adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets by placing a different sub-binary string at the front end of each set, and storing the j sets in a preset sample library; matching the sample library using the front-end binary string of each of the j sets, to obtain the candidate results of each set returned by the sample library; and calculating the Hamming distance between any two text data according to the candidate result of each text data, performing deduplication if the Hamming distance is less than or equal to a threshold.
In another embodiment of the present application, when executed by at least one processor, the data processing program performs the following: when the 64-bit binary string is equally divided into four parts, any one 16-bit sub-binary string may be adjusted to be the front-end binary string of the four sub-binary strings. For example, each of the sub-binary strings L1-16, L17-32, L33-48, and L49-64 may in turn be adjusted to the front of the whole string, giving 4 sets, and the corresponding tables (table) may be stored in the memory, i.e., 4 tables are stored. For example, the 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), and (L49-64, L1-16, L17-32, L33-48).
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, or the part contributing to the prior art, may essentially be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to perform the methods described in the embodiments of the present application.
It is apparent that the embodiments described above are only some of the embodiments of the present application, not all of them; the drawings show preferred embodiments of the present application but do not limit its patent scope. This application may be embodied in many different forms; these embodiments are provided so that the present disclosure will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. Any equivalent structure made using the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the protection scope of the present application.

Claims (8)

1. A method for deduplication of data based on big data, comprising:
collecting at least two text data according to preset keywords;
generating, for each text data, a k-bit binary string from a similar hash function and a hash function, wherein k = 2^n and n is a positive integer greater than or equal to 2;
Equally dividing the k-bit binary string into j sub-binary strings, wherein j is a positive integer greater than or equal to 1;
adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets by placing a different sub-binary string at the front end of each set, and storing the j sets in a preset sample library;
matching the sample library by using a binary string at the forefront end of each set of the j sets to obtain candidate results of the sets returned by the sample library;
calculating the Hamming distance of any two text data according to the candidate result of each text data, and performing deduplication if the Hamming distance is smaller than or equal to a threshold value;
the generating the k-bit binary string for each text data according to the similar hash function and the hash function specifically comprises:
selecting the number k of bits of the similar hash function;
initializing each bit of the similar hash function to 0;
extracting each text data by segmentation, and extracting a plurality of segmentation_weight pairs;
carrying out hash function processing on the segmented words in each segmented word weight pair;
longitudinally accumulating the word segmentation weight pairs processed by the hash function to generate k numerical values;
converting the generated k number values into a k-bit binary string;
The step of matching the sample library by the binary string at the forefront of each set of the j sets, and the step of obtaining the candidate results of each set returned by the sample library specifically comprises the following steps:
and determining whether the front-end binary string of each of the j sets is identical to the front-end binary string of each set stored in the sample library in the memory; if so, determining that the candidate result currently fed back by the sample library is a correct candidate result, and if not, determining that the candidate result currently fed back by the sample library is an incorrect candidate result.
2. The method for data deduplication based on big data according to claim 1, wherein the selecting the number of bits k of the similar hash function specifically comprises:
the number of bits k of the similar hash function is selected according to the storage cost and the size of the data set.
3. The method for data deduplication based on big data according to claim 1, wherein the hash function processing on the word in each word_weight pair specifically comprises:
a hash code of a predetermined number of word-segmentation letters of each text data is calculated using a k-bit hash function.
4. The big data based data deduplication method of claim 3, wherein the collecting the at least two text data according to the preset keyword specifically comprises: and capturing at least two text data related to the keywords according to preset keywords by utilizing a web crawler technology.
5. The method for data deduplication based on big data according to claim 1, wherein the performing the longitudinal accumulation of the bits on the word-segmentation_weight pair processed by the hash function, generating k number values specifically includes:
and carrying out longitudinal accumulation on the word segmentation_weight pair processed by the hash function, adding 1 if the bit is 1, subtracting 1 if the bit is 0, and finally generating k numerical values.
6. An apparatus for deduplication based on big data, comprising:
the collection module is used for collecting at least two text data according to preset keywords;
a processing module for generating, for each text data, a k-bit binary string from a similar hash function and a hash function, wherein k = 2^n and n is a positive integer greater than or equal to 2;
the splitting module is used for equally dividing the binary string with k bits into j sub binary strings, wherein j is a positive integer greater than or equal to 1;
the adjustment module is used for adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets by placing a different sub-binary string at the front end of each set, and storing the j sets in a preset sample library;
the matching module is used for matching the sample library by using the binary string at the forefront end of each set of the j sets to acquire candidate results of the sets returned by the sample library;
The calculation module is used for calculating the Hamming distance of any two text data according to the candidate result of each text data, and performing deduplication if the Hamming distance is smaller than or equal to a threshold value;
the generating the k-bit binary string for each text data according to the similar hash function and the hash function specifically comprises:
selecting the number k of bits of the similar hash function;
initializing each bit of the similar hash function to 0;
extracting each text data by segmentation, and extracting a plurality of segmentation_weight pairs;
carrying out hash function processing on the segmented words in each segmented word weight pair;
longitudinally accumulating the word segmentation weight pairs processed by the hash function to generate k numerical values;
converting the generated k number values into a k-bit binary string;
the step of matching the sample library by the binary string at the forefront of each set of the j sets, and the step of obtaining the candidate results of each set returned by the sample library specifically comprises the following steps:
and determining whether the front-end binary string of each of the j sets is identical to the front-end binary string of each set stored in the sample library in the memory; if so, determining that the candidate result currently fed back by the sample library is a correct candidate result, and if not, determining that the candidate result currently fed back by the sample library is an incorrect candidate result.
7. A computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the big data based data deduplication method of any of claims 1 to 5 when the computer program is executed.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the steps of the big data based data deduplication method of any of claims 1 to 5.
CN201910401427.5A 2019-05-15 2019-05-15 Method, device and storage medium for data deduplication based on big data Active CN110297879B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910401427.5A CN110297879B (en) 2019-05-15 2019-05-15 Method, device and storage medium for data deduplication based on big data
PCT/CN2019/103446 WO2020228182A1 (en) 2019-05-15 2019-08-29 Big data-based data deduplication method and apparatus, device, and storage medium

Publications (2)

Publication Number Publication Date
CN110297879A CN110297879A (en) 2019-10-01
CN110297879B true CN110297879B (en) 2023-05-30

Family

ID=68026845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910401427.5A Active CN110297879B (en) 2019-05-15 2019-05-15 Method, device and storage medium for data deduplication based on big data

Country Status (2)

Country Link
CN (1) CN110297879B (en)
WO (1) WO2020228182A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765756B (en) * 2019-10-29 2023-12-01 北京齐尔布莱特科技有限公司 Text processing method, device, computing equipment and medium
CN113221532A (en) * 2020-01-21 2021-08-06 阿里巴巴集团控股有限公司 Data processing method, device, computing equipment and medium
CN112364625A (en) * 2020-11-19 2021-02-12 深圳壹账通智能科技有限公司 Text screening method, device, equipment and storage medium
CN112733140B (en) * 2020-12-28 2023-12-22 上海观安信息技术股份有限公司 Detection method and system for model inclination attack
CN112861505B (en) * 2021-02-04 2025-07-01 北京百度网讯科技有限公司 Repeatability detection method, device and electronic equipment
CN113129056A (en) * 2021-04-15 2021-07-16 微梦创科网络科技(中国)有限公司 Method and system for controlling advertisement putting frequency
CN113377294B (en) * 2021-08-11 2021-10-22 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion
CN113836208B (en) * 2021-08-16 2024-08-06 深圳希施玛数据科技有限公司 Data processing method and device and terminal equipment
CN114579362B (en) * 2022-02-25 2025-05-30 南华大学 A method and device for predicting redundant data in cloud storage based on similar data detection
CN115186138B (en) * 2022-06-20 2025-06-06 国网福建省电力有限公司经济技术研究院 A distribution network data comparison method and terminal
CN115186293B (en) * 2022-06-29 2025-08-22 武汉理工大学 Secure iris privacy protection method based on K-nearest neighbor
CN116743486A (en) * 2023-07-18 2023-09-12 中移(苏州)软件技术有限公司 Instruction approval method and device for fort machine and electronic equipment
CN117150518A (en) * 2023-08-04 2023-12-01 中国移动通信集团四川有限公司 A communication operator data security encryption method and system
CN117251445B (en) * 2023-10-11 2024-06-04 杭州今元标矩科技有限公司 Deep learning-based CRM data screening method, system and medium
CN118193669B (en) * 2024-04-19 2025-01-28 广东电网有限责任公司 An application method and device for a wind and solar storage text file library
CN119180610A (en) * 2024-09-06 2024-12-24 深圳市天勤慧创科技有限公司 Method and device for avoiding multiple supervision of same supervision task
CN119357311B (en) * 2024-12-26 2025-04-08 中科云谷科技有限公司 Data updating method, device and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8661341B1 (en) * 2011-01-19 2014-02-25 Google, Inc. Simhash based spell correction
CN105095162A (en) * 2014-05-19 2015-11-25 腾讯科技(深圳)有限公司 Text similarity determining method and device, electronic equipment and system
CN107977347A (en) * 2017-12-04 2018-05-01 海南云江科技有限公司 A kind of topic De-weight method and computing device
CN108280127A (en) * 2017-12-15 2018-07-13 广州艾媒数聚信息咨询股份有限公司 A kind of similar news duplicate checking of magnanimity selects method, system and device
CN108345586A (en) * 2018-02-09 2018-07-31 重庆誉存大数据科技有限公司 A kind of text De-weight method and system
CN108573045A (en) * 2018-04-18 2018-09-25 同方知网数字出版技术股份有限公司 A Similarity Retrieval Method of Alignment Matrix Based on Multi-stage Fingerprint
CN109271487A (en) * 2018-09-29 2019-01-25 浪潮软件股份有限公司 A kind of Similar Text analysis method
CN109359183A (en) * 2018-10-11 2019-02-19 南京中孚信息技术有限公司 Method, device and electronic device for checking text information
CN109670153A (en) * 2018-12-21 2019-04-23 北京城市网邻信息技术有限公司 A kind of determination method, apparatus, storage medium and the terminal of similar model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9996764B2 (en) * 2014-04-29 2018-06-12 Institute Of Automation Chinese Academy Of Sciences Image matching method based on cascaded binary encoding
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
CN108132929A (en) * 2017-12-25 2018-06-08 上海大学 A kind of similarity calculation method of magnanimity non-structured text

Also Published As

Publication number Publication date
WO2020228182A1 (en) 2020-11-19
CN110297879A (en) 2019-10-01

Similar Documents

Publication Publication Date Title
CN110297879B (en) Method, device and storage medium for data deduplication based on big data
CN110825949B (en) Information retrieval method based on convolutional neural network and related equipment thereof
CN112347244B (en) Yellow-based and gambling-based website detection method based on mixed feature analysis
WO2020215667A1 (en) Text content quick duplicate removal method and apparatus, computer device, and storage medium
CN111339427B (en) Book information recommendation method, device and system and storage medium
CN111221944B (en) Text intention recognition method, device, equipment and storage medium
US8364686B1 (en) Document near-duplicate detection
CN112395875A (en) Keyword extraction method, device, terminal and storage medium
CN111324802A (en) Identification-based data auditing method, device and computer equipment
CN112364625A (en) Text screening method, device, equipment and storage medium
CN112733545A (en) Text blocking method and device, computer equipment and storage medium
CN103714118A (en) Book cross-reading method
CN113037729B (en) Hierarchical detection method and system for phishing web pages based on deep learning
CN105740808A (en) Human face identification method and device
CN117493645B (en) Big data-based electronic archive recommendation system
CN103577547A (en) Webpage type identification method and device
Xu et al. Multi‐pyramid image spatial structure based on coarse‐to‐fine pyramid and scale space
CN104462347A (en) Keyword classifying method and device
CN115883111A (en) A phishing website identification method, device, electronic equipment and storage medium
CN119415694A (en) Sensitive information detection method, device, computer equipment and readable storage medium
CN109460500B (en) Hotspot event discovery method and device, computer equipment and storage medium
US11709798B2 (en) Hash suppression
CN114417102B (en) Text deduplication method, device and electronic device
CN115934571A (en) Interface test case generation method and device based on Bayesian classification algorithm
CN114528375A (en) Similar public opinion text recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant