CN110297879A

CN110297879A - Method, apparatus, and storage medium for data deduplication based on big data

Info

Publication number: CN110297879A
Application number: CN201910401427.5A
Authority: CN
Inventors: 王保军; 江腾飞
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-05-15
Filing date: 2019-05-15
Publication date: 2019-10-01
Anticipated expiration: 2039-05-15
Also published as: CN110297879B; WO2020228182A1

Abstract

The embodiments of the present application belong to the Internet field and relate to a method, apparatus, and storage medium for data deduplication based on big data. The method includes: collecting at least two pieces of text data according to a preset keyword; for each piece of text data, generating a k-bit binary string according to a similarity hash function and a hash function; adjusting the order of the j sub-binary strings, using a different sub-binary string as the leading binary string each time to generate j corresponding sets, and storing the sets in a preset sample database; matching the leading binary string of each of the j sets against the sample database to obtain the candidate results that the sample database returns for each set; and calculating the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and, if the Hamming distance is less than or equal to a threshold, performing deduplication. Using a hash algorithm to reduce the dimensionality of the data shortens the time needed to compare two texts and lowers the storage overhead for text.

Description

Method, apparatus, and storage medium for data deduplication based on big data

Technical field

This application relates to the field of Internet technology, and in particular to a method, apparatus, computer device, and storage medium for data deduplication based on big data.

Background art

Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a certain period of time. It is a massive, fast-growing, and diversified information asset that requires new processing modes in order to deliver stronger decision-making power, insight discovery, and process optimization capability.

Big data has the following major characteristics:

Volume: the size of the data determines the value and potential information of the data under consideration;

Variety: the diversity of data types;

Velocity: the speed at which data is obtained;

Variability: hinders the process of handling and effectively managing data;

Veracity: the quality of the data;

Complexity: the data volume is huge and comes from many channels;

Value: reasonable use of big data creates high value at low cost.

The smallest basic unit of big data is the bit. The units, in order, are: bit, byte (Byte), KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, and DB. Except for 1 Byte = 8 bit, each unit is 1024 (2 to the tenth power) times the preceding unit.

With the arrival of the information explosion era and the adoption of cloud technology, big data has attracted more and more attention. Big data processing technologies mainly include massively parallel processing (MPP) databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

When the same file is backed up repeatedly under the same directory on a network, or the same file is backed up from multiple addresses, duplicate data is produced. Duplicate data greatly increases the I/O and CPU processing load of the analysis system. Without deduplication, the efficiency of data analysis drops and the hardware cost of the analysis system rises; and for items billed by total analyzed traffic, the extra analysis cost is unacceptable. Duplicate data is especially serious in big data processing: because the Internet is currently flooded with near-duplicate information, duplicate data can cause big data mining to misjudge some aspect, which makes the big data invalid.

Therefore, duplicate data needs to be deduplicated to avoid the above problems.

One prior-art data deduplication technique compares data according to the data payload (payload), the complete data, or custom rules to judge whether duplication exists, and then filters out the redundant data.

Another prior-art data deduplication technique compares the similarity of two texts. In most cases the texts are segmented into words and then converted into a measure of distance between feature vectors, such as the common Euclidean distance, Hamming distance, or cosine angle.

The data deduplication techniques described above work well when the amount of data is small. When a large amount of duplicate data exists on the Internet, however, they are difficult to apply to massive data processing, because they greatly increase the I/O and CPU processing load of the analysis system and waste resources.

Summary of the invention

The purpose of the embodiments of the present application is to provide a method, apparatus, computer device, and storage medium for data deduplication based on big data that use a hash algorithm to reduce the dimensionality of the data, which shortens the time needed to compare two texts and lowers the storage overhead for text.

To solve the above technical problem, an embodiment of the present application provides a method for data deduplication based on big data, which adopts the following technical solution:

collecting at least two pieces of text data according to a preset keyword;

for each piece of text data, generating a k-bit binary string according to a similarity hash function and a hash function, where k = 2ⁿ and n is a positive integer greater than or equal to 2;

dividing the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1;

adjusting the order of the j sub-binary strings, using a different sub-binary string as the leading binary string each time to generate j corresponding sets, and storing the sets in a preset sample database;

matching the leading binary string of each of the j sets against the sample database to obtain the candidate results that the sample database returns for each set;

calculating the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and, if the Hamming distance is less than or equal to a threshold, performing deduplication.

Optionally, generating a k-bit binary string for each piece of text data according to the similarity hash function and the hash function specifically includes:

selecting the number of bits k of the similarity hash function;

initializing every bit of the similarity hash function to 0;

performing token extraction on each piece of text data to extract multiple token-weight pairs;

applying the hash function to the token in each token-weight pair;

accumulating, bit position by bit position, the token-weight pairs processed by the hash function to generate k values;

converting the generated k values into a k-bit binary string.

Optionally, selecting the number of bits k of the similarity hash function specifically includes: selecting the number of bits k of the similarity hash function according to the storage cost and the size of the data set.

Optionally, applying the hash function to the token in each token-weight pair specifically includes: using a k-bit hash function to calculate the hash code of each token, consisting of a predetermined number of letters, of each piece of text data.

Optionally, collecting at least two pieces of text data according to the preset keyword specifically includes: using web crawler technology to crawl, according to the preset keyword, at least two pieces of text data related to the keyword.

Optionally, matching the leading binary string of each of the j sets against the sample database and obtaining the candidate results that the sample database returns for each set specifically includes: determining whether the leading binary string of each of the j sets is identical to a leading binary string stored in the sample database in the memory; if they are identical, determining that the candidate result currently returned by the sample database is a correct candidate result; if they are different, determining that the candidate result currently returned by the sample database is an incorrect candidate result.

Optionally, accumulating, bit position by bit position, the token-weight pairs processed by the hash function to generate k values specifically includes: for each bit position of each processed token-weight pair, adding 1 if the bit is 1 and subtracting 1 if the bit is 0, finally generating k values.

To solve the above technical problem, an embodiment of the present application also provides an apparatus for data deduplication based on big data, which adopts the following technical solution. The apparatus for data deduplication based on big data includes:

a collection module, configured to collect at least two pieces of text data according to a preset keyword;

a processing module, configured to generate, for each piece of text data, a k-bit binary string according to a similarity hash function and a hash function, where k = 2ⁿ and n is a positive integer greater than or equal to 2;

a division module, configured to divide the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1;

an adjustment module, configured to adjust the order of the j sub-binary strings, use a different sub-binary string as the leading binary string each time to generate j corresponding sets, and store the sets in a preset sample database;

a matching module, configured to match the leading binary string of each of the j sets against the sample database and obtain the candidate results that the sample database returns for each set;

a computing module, configured to calculate the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and, if the Hamming distance is less than or equal to a threshold, perform deduplication.

To solve the above technical problem, an embodiment of the present application also provides a computer device, which adopts the following technical solution:

The computer device includes a memory and a processor. A computer program is stored in the memory, and the processor implements the steps of the method for data deduplication based on big data when executing the computer program.

To solve the above technical problem, an embodiment of the present application also provides a computer-readable storage medium, which adopts the following technical solution:

A computer program is stored on the computer-readable storage medium, and the steps of the method for data deduplication based on big data are implemented when the computer program is executed by a processor.

Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects:

For each piece of text data, a k-bit binary string is generated according to a similarity hash function and a hash function, where k = 2ⁿ and n is a positive integer greater than or equal to 2. The k-bit binary string is divided into j sub-binary strings, where j is a positive integer greater than or equal to 1. The order of the j sub-binary strings is adjusted, a different sub-binary string is used as the leading binary string each time to generate j corresponding sets, and the sets are stored in a preset sample database. The leading binary string of each of the j sets is matched against the sample database to obtain the candidate results that the sample database returns for each set. The Hamming distance between any two pieces of text data is calculated according to the candidate results of each piece of text data, and deduplication is performed if the Hamming distance is less than or equal to a threshold. Using a hash algorithm to reduce the dimensionality of big data in this way shortens the time needed to compare two texts and lowers the storage overhead for text.

Brief description of the drawings

To illustrate the solution of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of one embodiment of a method for data deduplication based on big data according to the present application;

Fig. 2 is a flow chart of a specific embodiment of step 102 in Fig. 1;

Fig. 3 is a structural schematic diagram of one embodiment of an apparatus for data deduplication based on big data according to the present application;

Fig. 4 is a structural schematic diagram of one embodiment of a computer device according to the present application.

Reference numerals: 301 - collection module, 302 - processing module, 303 - division module, 304 - adjustment module, 305 - matching module, 306 - computing module, 307 - bus, 41 - memory, 42 - processor, and 43 - network interface.

Detailed description of the embodiments

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the description of the application are intended only to describe specific embodiments and are not intended to limit the application. The terms "include" and "have" and any variations thereof in the description, the claims, and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the description, the claims, or the above drawings are used to distinguish different objects rather than to describe a particular order.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings.

Fig. 1 is a schematic flow chart of a method for data deduplication based on big data according to one embodiment of the present application. The method for data deduplication based on big data can be as described below.

Step 101: collect at least two pieces of text data according to a preset keyword.

For example, web crawler technology is used to crawl, according to the preset keyword, at least two pieces of text data related to the keyword, and the at least two pieces of text data are stored in a buffer or in the data warehouse of a memory.

A web crawler (also known as a web spider, web robot, or web chaser) is a program or script that automatically crawls information on the World Wide Web according to certain rules. It downloads web pages from the World Wide Web for search engines and is an important component of a search engine. A traditional crawler starts from the URLs of one or several initial web pages and obtains the URLs on those pages; while crawling web pages, it continually extracts new URLs from the current page and puts them into a queue until a certain stop condition of the system is met.
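The queue-driven crawl loop described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `crawl`, the `get_links` callback, and the `max_pages` stop condition are hypothetical, and an in-memory link graph stands in for real network access.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          get_links: Callable[[str], Iterable[str]],
          max_pages: int) -> List[str]:
    """Visit pages breadth-first, starting from the seed URLs.

    `get_links` is a hypothetical callback that returns the URLs found on a
    page; `max_pages` is an assumed stop condition (the text only says that
    crawling continues until some stop condition of the system is met).
    """
    seeds = list(seeds)
    queue = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph used instead of real HTTP requests.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}
order = crawl(["a"], lambda url: graph.get(url, []), max_pages=3)
```

A focused crawler would additionally filter the extracted links or pages against the preset keywords before queueing or storing them.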

In this embodiment, multiple pieces of text data may be collected by a focused web crawler. According to a predetermined crawl target (for example, a client's investment-related information), the focused web crawler selectively accesses log records, APP feedback, WeChat, or web pages and related links on the World Wide Web to obtain the required information. For example, when searching with the focused web crawler, keywords related to investment data are set. The keywords may be, for example: name, ID card number, address, telephone number, bank account number, e-mail address, city, postal code, password type (such as account inquiry password, withdrawal password, or login password), organization name, business license number, account number, transaction date, transaction amount, and so on. The web crawler then crawls text data related to the keywords from log records, APP feedback, WeChat, or web pages on the World Wide Web, where "related" means, for example, that the text data contains the keyword. The collected text data is stored by dimension in a buffer or in the data warehouse of a memory, and the data in the data warehouse is the big data.

Step 102: for each piece of text data, generate a k-bit binary string according to a similarity hash function (simhash) and a hash function (hash), where k = 2ⁿ and n is a positive integer greater than or equal to 2.

For example, for each piece of text data in the big data (for example, Doc text or web text), the simhash function is used to convert the text data into a hash code (hashcode), as described in detail below.

For example, take the following three texts: p1 = the cat sat on the mat; p2 = the cat sat on a mat; p3 = we all scream for ice cream. The whole process can be as described below. Fig. 2 is a flow chart of a specific embodiment of step 102 in Fig. 1.

Step 1021: select the number of bits k of the simhash function.

For example, the number of bits k of simhash is selected according to the storage cost and the size of the data set, where k = 2ⁿ and n is a positive integer greater than or equal to 2, for example k = 16, 32, 64, or 128.

Step 1022: initialize every bit of the simhash function to 0.

Step 1023: perform token extraction on each piece of text data to extract multiple token-weight pairs.

For example, token extraction (which includes word segmentation and weight calculation) is performed on each piece of text data to extract n token-weight pairs (feature_weight_pairs), denoted feature_weight_pairs = [fw1, fw2, ..., fwn], where fwn = (feature_n, weight_n) and n is a positive integer greater than or equal to 2.

For example, tokens of a predetermined number of letters are generally used, where the predetermined number is, for example, 2 or 3. For "the cat sat on the mat", taking the letters two at a time gives the following result: {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}, where a space also counts as a letter.
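The two-letter segmentation in this example can be reproduced with a short sketch. The function name `shingles` and its `width` parameter are illustrative assumptions, and returning each distinct token once, in order of first appearance, is inferred from the set listed above rather than stated in the text.

```python
from typing import List


def shingles(text: str, width: int = 2) -> List[str]:
    """Return the distinct tokens of `width` consecutive characters in `text`.

    Spaces are treated as ordinary letters, and each token is reported once,
    in the order in which it first appears.
    """
    seen = set()
    result = []
    for i in range(len(text) - width + 1):
        gram = text[i:i + width]
        if gram not in seen:
            seen.add(gram)
            result.append(gram)
    return result


tokens = shingles("the cat sat on the mat")
# tokens == ["th", "he", "e ", " c", "ca", "at", "t ", " s", "sa",
#            " o", "on", "n ", " t", " m", "ma"]
```

Calling `shingles(text, width=3)` would give the three-letter variant mentioned in the text.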

Step 1024: apply the hash function to the token (feature) in each token-weight pair (feature_weight_pairs).

For example, a 32-bit hash function is used to calculate the hash code (hashcode) of each token (word) of the predetermined number of letters in the text data, that is, the hash code of every 2 or 3 letters of the text data, for example: "th".hash = -502157718, "he".hash = -369049682, and so on.

Step 1025: accumulate, bit position by bit position, the token-weight pairs processed by the hash function to generate k values.

For example, for each bit position of each token-weight pair processed by the hash function, add 1 if the bit is 1 and subtract 1 if the bit is 0, finally generating k (that is, bits_count) values.

For example, with a 32-bit hash function, the number of bits generated by the hash is bits_count = 32. For each bit of the hashcode of each token (word), if the bit is 1, the value at the corresponding position of simhash is increased by 1; otherwise it is decreased by 1. This yields 32 values (that is, simhash contains 32 values).

Step 1026: convert the generated k values into a k-bit binary string.

For example, for each position of the 32-position simhash finally obtained, the bit is set to 1 if the value at that position is greater than 1; otherwise it is set to 0.

In another embodiment of the present application, a 64-bit or 128-bit binary string may also be generated; this embodiment imposes no limitation.
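Steps 1021 to 1026 can be sketched as follows. This is a minimal illustration, not the applicant's implementation. Three points are assumptions: `token_hash` uses a truncated SHA-256 digest as a stand-in hash, so its values differ from the `"th".hash` values quoted above; each weight is added or subtracted (a weight of 1 reproduces the plus 1 / minus 1 rule in the text); and the final bit is set when the accumulated value is greater than 0, which is the conventional simhash sign rule, whereas the text above states "greater than 1".

```python
import hashlib
from typing import Iterable, Tuple


def token_hash(token: str, bits: int = 32) -> int:
    """Stand-in hash (assumption): the low `bits` bits of SHA-256(token)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(feature_weight_pairs: Iterable[Tuple[str, int]], bits: int = 32) -> int:
    """Compute a `bits`-bit fingerprint from (feature, weight) pairs."""
    counts = [0] * bits                        # step 1022: every position starts at 0
    for feature, weight in feature_weight_pairs:
        h = token_hash(feature, bits)          # step 1024: hash the token
        for pos in range(bits):                # step 1025: per-position accumulation
            if (h >> pos) & 1:
                counts[pos] += weight
            else:
                counts[pos] -= weight
    fingerprint = 0
    for pos in range(bits):                    # step 1026: values -> binary string
        if counts[pos] > 0:                    # assumed sign rule, see lead-in
            fingerprint |= 1 << pos
    return fingerprint


pairs = [(gram, 1) for gram in ("th", "he", "e ", " c", "ca", "at")]
fingerprint = simhash(pairs, bits=32)
binary_string = format(fingerprint, "032b")    # the k-bit binary string
```

Passing `bits=64` or `bits=128` gives the longer fingerprints mentioned in the text, since the stand-in hash provides up to 256 bits.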

Applying simhash should produce results similar to the following:

irb(main):003:0> p1.simhash => 851459198 00110010110000000011110001111110

irb(main):004:0> p2.simhash => 847263864 00110010100000000011100001111000

irb(main):002:0> p3.simhash => 984968088 00111010101101010110101110011000

After the simhash operation, the Hamming distance (hamming distance) between any two of these three texts is the number of bit positions at which their two binary strings differ.

Step 103: divide the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1.

For example, a 32-bit or 64-bit binary string is divided into four parts. When a 32-bit binary string is divided into four parts, each part is an 8-bit sub-binary string; when a 64-bit binary string is divided into four parts, each part is a 16-bit sub-binary string. For example, a 64-bit binary string is divided into four 16-bit sub-binary strings L_1-16, L_17-32, L_33-48, and L_49-64, each of which contains 16 bits.

The above embodiment only takes dividing a 32-bit or 64-bit binary string into four parts as an example; the embodiments of the present application do not limit the number of parts. For example, j is a positive integer greater than or equal to 1, such as 2, 3, 4, 5, 6, 7, or 8.
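The division in step 103 can be sketched in a few lines. The function name `split_bits` is a hypothetical helper, and requiring the length to be divisible by j is an assumption made here to keep all parts the same size, as in the four-part examples above.

```python
from typing import List


def split_bits(bits: str, j: int) -> List[str]:
    """Split a binary string into j equally sized sub-binary strings."""
    if j < 1 or len(bits) % j != 0:
        raise ValueError("length of the binary string must be divisible by j")
    size = len(bits) // j
    return [bits[i:i + size] for i in range(0, len(bits), size)]


# The 32-bit fingerprint listed for p1 above, split into four 8-bit parts.
parts = split_bits("00110010110000000011110001111110", 4)
# parts == ["00110010", "11000000", "00111100", "01111110"]
```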

Step 104: adjust the order of the j sub-binary strings, use a different sub-binary string as the leading binary string each time to generate j corresponding sets, and store the sets in a preset sample database.

When a 64-bit binary string is divided into four parts, any one of the 16-bit sub-binary strings can be moved to the front of the four sub-binary strings. For example, the sub-binary strings L_1-16, L_17-32, L_33-48, and L_49-64 can each in turn be placed at the front of the whole binary string, which gives 4 sets. The sets can be stored as tables (table) in the preset sample database, for example in a preset memory, so that 4 tables are stored in the memory. The 4 sets are, for example: (L_1-16, L_17-32, L_33-48, L_49-64), (L_17-32, L_1-16, L_33-48, L_49-64), (L_33-48, L_1-16, L_17-32, L_49-64), and (L_49-64, L_1-16, L_17-32, L_33-48).

The above embodiment classifies the sets only by the sub-binary string placed at the front; how the subsequent sub-binary strings are arranged is not limited. In another embodiment of the present application, the sets can also be classified in other ways. For example, a 64-bit binary string is divided into two parts, each a 32-bit sub-binary string, such as L_1-32 and L_33-64. Either 32-bit sub-binary string can be placed at the front; for example, L_1-32 and L_33-64 are each in turn placed at the front, which gives 2 sets. These can be stored as tables (table) in the sample database of the memory, so that 2 tables are stored in the memory. The 2 sets are, for example, (L_1-32, L_33-64) and (L_33-64, L_1-32).
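The reordering in step 104 can be illustrated with the sketch below. The function name `front_permutations` is hypothetical. Keeping the remaining parts in their original order follows the four-part example above; the text notes that the order of the remaining parts is not limited.

```python
from typing import List, Sequence


def front_permutations(parts: Sequence[str]) -> List[List[str]]:
    """Return j orderings, each with a different part moved to the front.

    The remaining parts keep their original relative order.
    """
    parts = list(parts)
    return [[parts[i]] + parts[:i] + parts[i + 1:] for i in range(len(parts))]


# Symbolic names stand in for the four 16-bit sub-binary strings.
sets = front_permutations(["L_1-16", "L_17-32", "L_33-48", "L_49-64"])
```

Each ordering corresponds to one table in the sample database, keyed by its leading sub-binary string.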

Step 105: match the leading binary string of each of the j sets against the sample database, and obtain the candidate results that the sample database returns for each set.

For example, the leading binary string of each of the j sets is matched against the sample database. If the sample database holds a total of 2^m hash fingerprints, 2^(m-j) candidate results are returned for each set, where m is an integer greater than 2 and m > j.

For example, when the above 64-bit binary string generates four tables, the leading 16-bit sub-binary string is looked up by exact matching. If the sample database holds 2³⁴ (about 17 billion) hash fingerprints, each table returns 2^(34-16) = 262144 candidate results. Compared with the prior art, which returns all 2³⁴ hash fingerprints, this greatly reduces the cost of calculating Hamming distances.
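The table lookup in step 105 can be sketched with in-memory dictionaries. The names `build_tables` and `candidates` are hypothetical, and a Python `dict` stands in for the sample database. Keying table i by the i-th sub-binary string is equivalent to moving that sub-string to the front and matching it exactly. In the worked example above, the exponent that is subtracted (16) is the length of the leading sub-binary string, and the figure 2^(34-16) holds on average when fingerprints are spread evenly over the 2^16 possible prefixes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_tables(fingerprints: Iterable[int], k: int, j: int) -> List[Dict[str, List[int]]]:
    """Build j tables; table i maps the i-th sub-binary string to fingerprints."""
    size = k // j
    tables = [defaultdict(list) for _ in range(j)]
    for fp in fingerprints:
        bits = format(fp, f"0{k}b")
        for i in range(j):
            tables[i][bits[i * size:(i + 1) * size]].append(fp)
    return tables


def candidates(tables: List[Dict[str, List[int]]], query: int, k: int, j: int) -> Set[int]:
    """Return stored fingerprints that share at least one sub-string with `query`."""
    size = k // j
    bits = format(query, f"0{k}b")
    found = set()
    for i in range(j):
        found.update(tables[i].get(bits[i * size:(i + 1) * size], []))
    return found


# The three 32-bit fingerprints listed earlier, with j = 4 (8-bit sub-strings).
p1, p2, p3 = 851459198, 847263864, 984968088
tables = build_tables([p2, p3], k=32, j=4)
close_to_p1 = candidates(tables, p1, k=32, j=4)   # only p2 shares a sub-string with p1
```

By the pigeonhole principle, two fingerprints that differ in at most j - 1 bits always agree on at least one of the j sub-strings, so a lookup of this kind cannot miss a pair within that distance.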

In another embodiment of the present application, matching the leading binary string of each of the j sets against the sample database specifically includes: determining whether the leading binary string of each of the j sets is identical to a leading binary string stored in the memory. If they are identical, the match succeeds, that is, the candidate result currently returned by the sample database is determined to be a correct candidate result. If they are different, the match fails, that is, the candidate result currently returned by the sample database is determined to be an incorrect candidate result.

Step 106: calculate the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and, if the Hamming distance is less than or equal to a threshold, perform deduplication (discard or delete one of the texts).

For example, the Hamming distance between binary string A and binary string B is the number of 1 bits in the binary result of A xor B.

For example, if binary string A = 100111 and binary string B = 101010, then hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3.
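The XOR-and-count calculation above translates directly into code. The function name `hamming_distance` follows the notation in the text; representing the binary strings as Python integers is an assumption of this sketch.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which a and b differ (1 bits of a XOR b)."""
    return bin(a ^ b).count("1")


distance = hamming_distance(0b100111, 0b101010)   # 100111 xor 101010 = 001101
# distance == 3
```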

The simhash algorithm maps a high-dimensional feature vector to an f-bit fingerprint (fingerprint). Comparing the Hamming distance (Hamming Distance) between the f-bit fingerprints of two texts determines whether the two texts are duplicates or highly similar: the smaller the Hamming distance, the more similar the texts. A Hamming distance of zero indicates that the two compared texts are identical, and the larger the Hamming distance, the less similar they are.

For example, for the simhash results of the three texts p1, p2, and p3 above, the pairwise Hamming distances are (p1, p2) = 4, (p1, p3) = 16, and (p2, p3) = 12. Comparing the texts pairwise, the similarity between p1 and p2 is therefore far greater than the similarity of either to p3.
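The threshold rule of step 106 can be applied to the three fingerprints above as follows. The function name `deduplicate` and the threshold value 4 are illustrative assumptions; the text does not fix a threshold. For brevity, this sketch compares every fingerprint with every retained one instead of restricting the comparison to candidates returned by the sample database.

```python
from typing import Iterable, List


def deduplicate(fingerprints: Iterable[int], threshold: int) -> List[int]:
    """Keep a fingerprint only if it is farther than `threshold` bits
    from every fingerprint that has already been kept."""
    kept = []
    for fp in fingerprints:
        if all(bin(fp ^ other).count("1") > threshold for other in kept):
            kept.append(fp)
    return kept


p1, p2, p3 = 851459198, 847263864, 984968088
unique = deduplicate([p1, p2, p3], threshold=4)
# p2 is 4 bits from p1 (<= threshold) and is dropped; p3 is 16 bits away and is kept.
```

With a threshold of 3, all three fingerprints would be retained, because the smallest pairwise distance is 4.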

In summary, in the method for data deduplication based on big data described in the above embodiments, the biggest difference between the simhash operation and an ordinary hash operation is the following. Although an ordinary hash function can also be used to map texts so that they can be compared for duplicates, texts that differ by only one byte may be mapped to two completely different hash results, whereas the simhash function maps similar texts to similar hash results. For example, with a 64-bit simhash function, that is, f = 64, the weighted feature set of a text is mapped to a 64-bit hash fingerprint (fingerprint).

For example, with a 64-bit simhash function, the 64-bit binary string is divided into 4 sub-binary strings, and the order is then adjusted so that each sub-binary string in turn becomes the first 16 bits. This gives four combinations in total, which generate four tables that are stored in the sample database. The first 16 bits are looked up by exact matching. If the sample database holds 2³⁴ (about 17 billion) hash fingerprints, each table returns 2^(34-16) = 262144 candidate results, which greatly reduces the cost of calculating Hamming distances.

Therefore, the method for data deduplication based on big data described in the embodiments of the present application uses a hash algorithm to reduce the dimensionality of big data, which shortens the time needed to compare two texts and lowers the storage overhead for text.

It should be noted that the method for data deduplication based on big data provided by the embodiments of the present application is generally executed by a server or terminal device; correspondingly, the apparatus for data deduplication based on big data is generally arranged in the server or terminal device. The terminal device may be a wireless terminal or a wired terminal. A wireless terminal may be a device that provides voice and/or data connectivity to a user, a handheld device with a wireless connection function, or another processing device connected to a wireless modem. The terminal may be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device.

It should be understood that the numbers of terminal devices, networks, and servers are only illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).

It should be understood that although the steps in the flow charts of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that sequence. Unless expressly stated otherwise herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in the flow charts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but can be executed at different times, and they are not necessarily executed one after another but can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

With further reference to Fig. 3, as an implementation of the method shown in Fig. 1, the present application provides an embodiment of a big-data-based data deduplication apparatus. This apparatus embodiment corresponds to the method embodiment shown in Fig. 1, and the apparatus can be applied in various electronic devices.

As shown in Fig. 3, the big-data-based data deduplication apparatus 300 of this embodiment includes: a collection module 301, a processing module 302, a division module 303, an adjustment module 304, a matching module 305, a computing module 306, and a bus 307. The collection module 301, the processing module 302, the division module 303, the adjustment module 304, the matching module 305, and the computing module 306 are connected to one another through the bus 307. The module division of this embodiment is merely illustrative; other logical divisions may be made according to the respective method actions.

The bus 307 is used to implement connection and communication among these components. For example, the bus 307 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. The bus system can be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.

The collection module 301 is configured to collect at least two pieces of text data according to a preset keyword;

The processing module 302 is configured to generate, for each piece of text data, a k-bit binary string according to a similarity hash function (simhash) and a hash function (hash), where k=2ⁿ and n is a positive integer greater than or equal to 2;

The division module 303 is configured to divide the k-bit binary string equally into j sub-binary strings, where j is a positive integer greater than or equal to 1;

The adjustment module 304 is configured to adjust the order of the j sub-binary strings so that a different sub-binary string serves as the leading binary string each time, generating j corresponding sets that are stored in a preset sample library;

The matching module 305 is configured to match the sample library with the leading binary string of each of the j sets and to obtain the candidate results that the sample library returns for each set. For example, if the sample library holds 2^m hash fingerprints in total, 2^(m-j) candidate results are returned for each set, where m is an integer greater than 2 and m > j;

The computing module 306 is configured to calculate the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, and to perform deduplication if the Hamming distance is less than or equal to a threshold.

Three texts are used below for illustration: p1 = the cat sat on the mat; p2 = the cat sat on a mat; p3 = we all scream for ice cream.

In another embodiment of the present application, for example, the processing module 302 is configured to convert each piece of text data in the big data (for example, a Doc text or a web text) into a hash code (hashcode) using the simhash function. For example, the processing module 302 further includes: a selection subunit, an initialization subunit, an extraction subunit, a hash function processing subunit, an accumulation subunit, and a processing subunit, where any two of these subunits can be communicatively connected to each other.

The selection subunit is configured to select the number of bits k of the similarity hash function. For example, the selection subunit selects the number of bits k of simhash according to the storage cost and the size of the data set, where k=2ⁿ and n is a positive integer greater than or equal to 2, for example k = 16, 32, 64, or 128.

The initialization subunit is configured to initialize every bit of the similarity hash function to 0;

The extraction subunit is configured to perform token extraction on each piece of text data to extract multiple token-weight pairs. For example, the extraction subunit applies hash function processing to the token in each token-weight pair; for example, it uses a k-bit hash function to calculate the hash code of each group of a predetermined number of letters in each piece of text data. For example, the predetermined number is 2 or 3.

The extraction subunit is further configured to perform token extraction on each piece of text data (including tokenizing and calculating weights), for example extracting n (token, weight) pairs (feature_weight_pairs), denoted feature_weight_pairs = [fw1, fw2, ..., fwn], where fwn = (feature_n, weight_n) and n is a positive integer greater than or equal to 2. Tokens are generally formed from a predetermined number of letters, for example 2 or 3. For example, splitting "the cat sat on the mat" into two-letter tokens gives the following result: {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}, where a space also counts as a letter.
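The two-letter tokenization in the example above can be reproduced with a short Python sketch. The function names `char_ngrams` and `unique_in_order` are illustrative and do not come from the application, and weights are not computed here.

```python
def char_ngrams(text, n=2):
    """Return every overlapping run of n characters; spaces count as characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def unique_in_order(items):
    """Drop repeated items while keeping first-seen order."""
    return list(dict.fromkeys(items))


tokens = unique_in_order(char_ngrams("the cat sat on the mat", n=2))
print(tokens)
```

For this sentence the sketch yields 15 distinct two-letter tokens out of 21 overlapping ones, in the same order as the set listed above.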

The hash function processing subunit is configured to apply hash function processing to the token in each token-weight pair. For example, the hash function processing subunit uses a 32-bit hash function to calculate the hash code (hashcode) of each token (word) of the predetermined number of letters in the text data, that is, the hash code of every 2 or 3 letters of the text data, such as: "th".hash = -502157718, "he".hash = -369049682, ...
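The hash codes quoted above appear to be signed 32-bit integers whose values depend on the hash function of the runtime that produced them, so they are not reproduced here. As a minimal sketch, assuming CRC-32 from Python's standard `zlib` module as a stand-in 32-bit hash (the application does not name a specific hash function), each token could be hashed as follows; `hash32` and `to_signed32` are hypothetical helper names.

```python
import zlib


def hash32(token):
    """Stand-in 32-bit hash: CRC-32 of the UTF-8 bytes, as an unsigned integer."""
    return zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed one (two's complement)."""
    return value - (1 << 32) if value & (1 << 31) else value


for token in ("th", "he"):
    print(token, to_signed32(hash32(token)))
```

Any deterministic hash of the chosen width could be substituted; only the bit pattern of each hash code matters for the steps that follow.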

The accumulation subunit is configured to accumulate the hashed token-weight pairs bit by bit (column-wise) to generate k values. For example, for each bit position of each hashed token, the weight is added if the bit is 1 and subtracted if the bit is 0, finally producing k values. For example, if the accumulation subunit uses a 32-bit hash function, the number of bits generated by the hash is bits_count = 32; for each bit of the hashcode of each token (word), if the bit is 1, the corresponding simhash position is incremented by 1, and otherwise it is decremented by 1, giving 32 values (that is, the simhash comprises 32 values).
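The column-wise accumulation can be sketched as follows. This is an illustration under assumptions: hash codes are taken as non-negative integers, index i of the result corresponds to bit i counted from the least significant bit, and the name `accumulate` is not from the application.

```python
def accumulate(hashed_pairs, k):
    """Sum weights per bit position: add the weight for a 1 bit, subtract it for a 0 bit."""
    totals = [0] * k  # every position starts at 0
    for hash_code, weight in hashed_pairs:
        for bit in range(k):
            if (hash_code >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    return totals


# Two 4-bit hash codes with weights 2 and 1 (toy values chosen for readability).
print(accumulate([(0b1010, 2), (0b0110, 1)], k=4))  # [-3, 3, -1, 1]
```

Passing a weight of 1 for every token gives the unweighted add-1 or subtract-1 variant described for the 32-bit case.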

The processing subunit is configured to convert the generated k values into a k-bit binary string. For example, k is 32, 64, or 128. For example, for each of the 32 simhash values finally obtained, the processing subunit sets the bit to 1 if the value is greater than 1, and otherwise sets it to 0.

In another embodiment of the present application, the processing subunit may also generate a 64-bit or 128-bit binary string; this embodiment imposes no limitation.
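Converting the k accumulated values into a k-bit binary string can be sketched as below. The cut-off is exposed as a parameter and defaults to 1 to follow the wording above; descriptions of simhash commonly use a cut-off of 0 (the sign of each value), which can be passed instead. The name `to_binary_string` and the convention that values[i] is bit i from the least significant end are assumptions.

```python
def to_binary_string(values, cutoff=1):
    """Set bit i to 1 when values[i] exceeds the cutoff, otherwise to 0."""
    fingerprint = 0
    for bit, value in enumerate(values):
        if value > cutoff:
            fingerprint |= 1 << bit
    # Zero-pad so that the string always has k = len(values) characters.
    return format(fingerprint, "0{}b".format(len(values)))


values = [-3, 3, -1, 1]
print(to_binary_string(values))            # 0010
print(to_binary_string(values, cutoff=0))  # 1010
```

The same function works unchanged for 32, 64, or 128 values, since the width is taken from the length of the input list.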

Using simhash should produce results similar to the following:

irb(main):003:0> p1.simhash
=> 851459198
00110010110000000011110001111110

irb(main):004:0> p2.simhash
=> 847263864
00110010100000000011100001111000

In another embodiment of the present application, matching the sample library with the leading binary string of each of the j sets specifically includes: the matching module 305 compares the leading binary string of each of the j sets with the leading binary string of each set stored in memory to judge whether they are identical. If they are identical, a match is determined, that is, the candidate result currently returned by the sample library is determined to be a correct candidate result; if they differ, a mismatch is determined, that is, the candidate result currently returned by the sample library is determined to be an incorrect candidate result.
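One way to picture this identity check is a set of in-memory lookup tables, one per block position, keyed by the sub-binary string at that position. The sketch below assumes fingerprints have already been divided into blocks and uses Python dictionaries as the tables; `build_tables`, `candidates`, and the sample identifiers are hypothetical, and the application does not prescribe this data structure.

```python
from collections import defaultdict


def build_tables(sample):
    """sample maps doc_id -> list of j blocks; table i is keyed by block i."""
    if not sample:
        return []
    j = len(next(iter(sample.values())))
    tables = [defaultdict(list) for _ in range(j)]
    for doc_id, blocks in sample.items():
        for i, block in enumerate(blocks):
            tables[i][block].append(doc_id)
    return tables


def candidates(tables, query_blocks):
    """Return ids whose block i is identical to the query's block i for some i."""
    found = set()
    for table, block in zip(tables, query_blocks):
        found.update(table.get(block, []))
    return found


# The p1 and p2 rows are the 8-bit blocks of the two 32-bit example values above.
sample = {
    "p1": [0x32, 0xC0, 0x3C, 0x7E],
    "p2": [0x32, 0x80, 0x38, 0x78],
    "other": [0x00, 0x01, 0x02, 0x03],
}
tables = build_tables(sample)
print(sorted(candidates(tables, [0x32, 0xC0, 0x3C, 0x7E])))  # ['p1', 'p2']
```

Here p2 is returned as a candidate for p1 because the two share an identical first block; entries with no identical block are never compared further.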

In another embodiment of the present application, the division module 303 is further configured to divide a 32-bit or 64-bit binary string equally into four parts. For example, when a 32-bit binary string is divided into four parts, each part is an 8-bit sub-binary string; when a 64-bit binary string is divided into four parts, each part is a 16-bit sub-binary string. For example, the division module 303 divides a 64-bit binary string into four 16-bit sub-binary strings L_1-16, L_17-32, L_33-48, and L_48-64, each of which contains 16 bits.

The above embodiment is illustrated only by dividing a 32-bit or 64-bit binary string into four parts, but the embodiments of the present application do not limit the number of parts. For example, j is a positive integer greater than or equal to 1; for instance, j may be an even number such as 2, 4, 6, or 8.
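The equal division can be sketched as follows, assuming the fingerprint is held as a non-negative integer and that bit 1 in the L_1-16 notation is the most significant bit. The function name `split_blocks` is illustrative.

```python
def split_blocks(fingerprint, k=64, j=4):
    """Divide a k-bit fingerprint into j equal blocks, most significant block first."""
    if j < 1 or k % j != 0:
        raise ValueError("k must be divisible by j")
    width = k // j
    mask = (1 << width) - 1
    return [(fingerprint >> (k - width * (i + 1))) & mask for i in range(j)]


# A 64-bit value divides into four 16-bit blocks, a 32-bit value into four 8-bit blocks.
print([hex(b) for b in split_blocks(0x1111222233334444)])
print([hex(b) for b in split_blocks(851459198, k=32, j=4)])
```

Other values of j work the same way as long as j divides k evenly; the sketch rejects combinations that would leave blocks of unequal width.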

In another embodiment of the present application, when a 64-bit binary string is divided into four parts, the adjustment module 304 is further configured to move any one of the 16-bit sub-binary strings to the front as the leading binary string of the four sub-binary strings. For example, the sub-binary strings L_1-16, L_17-32, L_33-48, and L_48-64 can each in turn be moved to the front of the whole binary string, which yields 4 sets. The sets can be stored in memory as tables (table), that is, 4 tables are stored in memory. For example, the 4 sets are: (L_1-16, L_17-32, L_33-48, L_48-64), (L_17-32, L_1-16, L_33-48, L_48-64), (L_33-48, L_1-16, L_17-32, L_48-64), and (L_48-64, L_1-16, L_17-32, L_33-48).
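The four arrangements listed above follow one pattern: one block is moved to the front and the remaining blocks keep their original order. A short sketch of that pattern, with the hypothetical helper name `leading_arrangements` and the block labels used as plain strings:

```python
def leading_arrangements(blocks):
    """For each block, move it to the front and keep the other blocks in order."""
    return [[blocks[i]] + blocks[:i] + blocks[i + 1:] for i in range(len(blocks))]


labels = ["L_1-16", "L_17-32", "L_33-48", "L_48-64"]
for arrangement in leading_arrangements(labels):
    print(tuple(arrangement))
```

With j blocks the function returns j arrangements, one per table, so the same sketch covers divisions other than four parts.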

For example, the computing module 306 is further configured to calculate the Hamming distance between two pieces of text data (for example, first text data and second text data). The Hamming distance between binary string A and binary string B is the number of 1s in the binary result of A xor B.
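As a worked example, the XOR-and-count definition can be applied to the two 32-bit simhash values quoted earlier for p1 and p2 (851459198 and 847263864). The function name `hamming_distance` is illustrative.

```python
def hamming_distance(a, b):
    """Count the 1 bits in a XOR b for two non-negative integers."""
    return bin(a ^ b).count("1")


p1 = 851459198  # 00110010110000000011110001111110
p2 = 847263864  # 00110010100000000011100001111000
print(format(p1 ^ p2, "032b"))   # 00000000010000000000010000000110
print(hamming_distance(p1, p2))  # 4
```

The two example sentences differ by a single word, and their fingerprints differ in only 4 of 32 bit positions.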

The computing module 306 is further configured to map a high-dimensional feature vector to an f-bit fingerprint (fingerprint) through the simhash algorithm, and to determine whether two texts are duplicates or highly similar by comparing the Hamming distance (Hamming Distance) between their f-bit fingerprints. The smaller the Hamming distance, the more similar the texts; a Hamming distance of zero indicates that the two compared texts are identical; the larger the Hamming distance, the less similar the texts.
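A minimal deduplication sketch based on this comparison is shown below. It assumes fingerprints are already computed and compares every text against the texts kept so far, which is quadratic and ignores the candidate lookup; the threshold values and the names `deduplicate` and `fingerprints` are hypothetical, since the application does not fix a threshold.

```python
def deduplicate(fingerprints, threshold):
    """Keep a text only if it is more than `threshold` bits away from every kept text."""
    kept = {}
    for doc_id, fp in fingerprints.items():
        is_duplicate = any(
            bin(fp ^ kept_fp).count("1") <= threshold for kept_fp in kept.values()
        )
        if not is_duplicate:
            kept[doc_id] = fp
    return list(kept)


fingerprints = {"p1": 851459198, "p2": 847263864}
print(deduplicate(fingerprints, threshold=4))  # ['p1']
print(deduplicate(fingerprints, threshold=3))  # ['p1', 'p2']
```

The two calls show how the choice of threshold decides whether p2 is treated as a duplicate of p1, whose fingerprints are 4 bits apart.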

In this embodiment, the above modules may be implemented by one or more processors, chips, or integrated circuits; this embodiment imposes no limitation.

Therefore, the big-data-based data deduplication apparatus described in the embodiments of the present application uses a hash algorithm to reduce the dimensionality of the big data, which can reduce the time needed to compare two texts and reduce the storage overhead for the texts.

To solve the above technical problem, an embodiment of the present application further provides a computer device. Refer to Fig. 4, which is a block diagram of the basic structure of the computer device of this embodiment.

The computer device 4 includes a memory 41, one or more processors 42, and a network interface 43 that are communicatively connected to one another through a system bus. It should be pointed out that the figure shows only the computer device 4 with components 41-43; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, etc.

The computer device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touch pad, a voice-control device, or the like.

The memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (Static Random-Access Memory, SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), programmable read-only memory (Programmable Read-Only Memory, PROM), magnetic memory, a magnetic disk, an optical disc, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SmartMedia Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and the various application software installed on the computer device 4, such as the program code of the above data processing method. In addition, the memory 41 can also be used to temporarily store various data that have been output or are to be output.

In some embodiments, the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program code stored in the memory 41 or to process data, for example to run the program code of the big-data-based data deduplication method.

The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 4 and other electronic devices.

The processor 42 is configured to: collect at least two pieces of text data according to a preset keyword; generate, for each piece of text data, a k-bit binary string according to a similarity hash function (simhash) and a hash function (hash), where k=2ⁿ and n is a positive integer greater than or equal to 2; divide the k-bit binary string equally into j sub-binary strings, where j is a positive integer greater than or equal to 1; adjust the order of the j sub-binary strings so that a different sub-binary string serves as the leading binary string each time, generating j corresponding sets that are stored in a preset sample library; match the sample library with the leading binary string of each of the j sets to obtain the candidate results that the sample library returns for each set; and calculate the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, performing deduplication if the Hamming distance is less than or equal to a threshold.

In another embodiment of the present application, the processor 42 is further configured to: when a 64-bit binary string is divided into four parts, move any one of the 16-bit sub-binary strings to the front as the leading binary string of the four sub-binary strings. For example, the sub-binary strings L_1-16, L_17-32, L_33-48, and L_48-64 can each in turn be moved to the front of the whole binary string, which yields 4 sets. The sets can be stored in memory as tables (table), that is, 4 tables are stored in memory. For example, the 4 sets are: (L_1-16, L_17-32, L_33-48, L_48-64), (L_17-32, L_1-16, L_33-48, L_48-64), (L_33-48, L_1-16, L_17-32, L_48-64), and (L_48-64, L_1-16, L_17-32, L_33-48).

The processor 42 is further configured to calculate the Hamming distance between two pieces of text data (for example, first text data and second text data). For example, the Hamming distance between binary string A and binary string B is the number of 1s in the binary result of A xor B.

The processor is further configured to map a high-dimensional feature vector to an f-bit fingerprint (fingerprint) through the simhash algorithm, where f is an integer greater than or equal to 2, and to determine whether two texts are duplicates or highly similar by comparing the Hamming distance (Hamming Distance) between their f-bit fingerprints. The smaller the Hamming distance, the more similar the texts; a Hamming distance of zero indicates that the two compared texts are identical; the larger the Hamming distance, the less similar the texts.

The present application further provides another embodiment, namely a computer-readable storage medium. The computer-readable storage medium stores a data processing program that can be executed by at least one processor, so that the at least one processor performs the steps of the data processing method described above.

For example, when the data processing program is executed by at least one processor, the following is performed: collecting at least two pieces of text data according to a preset keyword; generating, for each piece of text data, a k-bit binary string according to a similarity hash function (simhash) and a hash function (hash), where k=2ⁿ and n is a positive integer greater than or equal to 2; dividing the k-bit binary string equally into j sub-binary strings, where j is a positive integer greater than or equal to 1; adjusting the order of the j sub-binary strings so that a different sub-binary string serves as the leading binary string each time, generating j corresponding sets that are stored in a preset sample library; matching the sample library with the leading binary string of each of the j sets to obtain the candidate results that the sample library returns for each set; and calculating the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, performing deduplication if the Hamming distance is less than or equal to a threshold.

In another embodiment of the present application, when the data processing program is executed by at least one processor, the following is performed: when a 64-bit binary string is divided into four parts, any one of the 16-bit sub-binary strings can be moved to the front as the leading binary string of the four sub-binary strings. For example, the sub-binary strings L_1-16, L_17-32, L_33-48, and L_48-64 can each in turn be moved to the front of the whole binary string, which yields 4 sets. The sets can be stored in memory as tables (table), that is, 4 tables are stored in memory. For example, the 4 sets are: (L_1-16, L_17-32, L_33-48, L_48-64), (L_17-32, L_1-16, L_33-48, L_48-64), (L_33-48, L_1-16, L_17-32, L_48-64), and (L_48-64, L_1-16, L_17-32, L_33-48).

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the embodiments of the present application.

Obviously, the embodiments described above are merely some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application but do not limit its patent scope. The present application can be implemented in many different forms; these embodiments are provided so that the disclosure of the present application will be understood more thoroughly and comprehensively. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims

1. A big-data-based data deduplication method, characterized by comprising:

collecting at least two pieces of text data according to a preset keyword;

generating, for each piece of text data, a k-bit binary string according to a similarity hash function and a hash function, where k=2ⁿ and n is a positive integer greater than or equal to 2;

adjusting the order of the j sub-binary strings so that a different sub-binary string serves as the leading binary string each time, generating j corresponding sets that are stored in a preset sample library;

matching the sample library with the leading binary string of each of the j sets to obtain the candidate results that the sample library returns for each set;

calculating the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, and performing deduplication if the Hamming distance is less than or equal to a threshold.

2. The big-data-based data deduplication method according to claim 1, characterized in that generating, for each piece of text data, a k-bit binary string according to the similarity hash function and the hash function specifically comprises:

selecting the number of bits k of the similarity hash function;

initializing every bit of the similarity hash function to 0;

converting the generated k values into a k-bit binary string.

3. The big-data-based data deduplication method according to claim 2, characterized in that selecting the number of bits k of the similarity hash function specifically comprises:

selecting the number of bits k of the similarity hash function according to the storage cost and the size of the data set.

4. The big-data-based data deduplication method according to claim 2, characterized in that applying hash function processing to the token in each token-weight pair specifically comprises:

using a k-bit hash function to calculate the hash code of each token of a predetermined number of letters in each piece of text data.

5. The big-data-based data deduplication method according to claim 4, characterized in that collecting at least two pieces of text data according to the preset keyword specifically comprises: using web crawler technology to crawl, according to the preset keyword, at least two pieces of text data related to the keyword.

6. The big-data-based data deduplication method according to any one of claims 1-5, characterized in that matching the sample library with the leading binary string of each of the j sets to obtain the candidate results that the sample library returns for each set specifically comprises:

determining whether the leading binary string of each of the j binary strings is identical to the leading binary string stored for the sample library in memory; if they are identical, determining that the candidate result currently returned by the sample library is a correct candidate result; if they differ, determining that the candidate result currently returned by the sample library is an incorrect candidate result.

7. The big-data-based data deduplication method according to claim 2, characterized in that accumulating the hashed token-weight pairs bit by bit to generate k values specifically comprises:

accumulating the hashed token-weight pairs bit by bit, adding 1 if the bit is 1 and subtracting 1 if the bit is 0, finally producing k values.

8. A big-data-based data deduplication apparatus, characterized by comprising:

a processing module, configured to generate, for each piece of text data, a k-bit binary string according to a similarity hash function and a hash function, where k=2ⁿ and n is a positive integer greater than or equal to 2;

a division module, configured to divide the k-bit binary string equally into j sub-binary strings, where j is a positive integer greater than or equal to 1;

an adjustment module, configured to adjust the order of the j sub-binary strings so that a different sub-binary string serves as the leading binary string each time, generating j corresponding sets that are stored in a preset sample library;

a matching module, configured to match the sample library with the leading binary string of each of the j sets to obtain the candidate results that the sample library returns for each set;

a computing module, configured to calculate the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, and to perform deduplication if the Hamming distance is less than or equal to a threshold.

9. A computer device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the computer program, implements the steps of the big-data-based data deduplication method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the big-data-based data deduplication method according to any one of claims 1 to 7.