Summary of the invention
This application provides an address similarity calculation method and system. The first technical problem to be solved is how to accurately calculate the similarity of two addresses in the presence of human interference. The second technical problem to be solved is how to effectively increase comparison speed, without affecting precision, when a large-scale address library is involved.
To solve the above problems, this application discloses an address similarity calculation method, comprising:
Obtaining a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one binary sequence extracted from the corresponding address;
Weighting according to the positions of the binary sequences in the addresses and calculating the similarity of the first vector and the second vector, wherein, in the weighting, the weight of a binary sequence that appears earlier in an address is higher than the weight of a binary sequence that appears later in that address.
In a preferred embodiment, calculating the similarity of the first vector and the second vector includes calculating the cosine distance between the first vector and the second vector.
This application also discloses an address similarity calculation method, comprising:
Obtaining a third address;
Extracting at least one binary sequence from the third address to constitute a third vector;
Obtaining a binary sequence inverted index of the addresses in a first address set;
Querying the inverted index according to the binary sequences in the third vector to obtain a second address set whose addresses each include at least one of the binary sequences in the third vector, counting, for each address in the second address set, the number of binary sequences of the third vector that the address includes, and selecting, according to the counts, the N addresses that include the most binary sequences of the third vector as a third address set, wherein N is a preset positive integer;
Calculating, according to the address similarity calculation method described above, the address similarity between the third address and each address in the third address set.
In a preferred embodiment, before extracting at least one binary sequence from the third address, the method further includes:
Preprocessing the third address.
In a preferred embodiment, the preprocessing includes one of the following or any combination thereof:
Converting Traditional Chinese characters to Simplified Chinese characters;
Converting pinyin to Chinese characters;
Removing punctuation marks;
Normalizing numbers;
Identifying the characters that represent the province, city, and district in the address, extracting those characters separately, and removing them from the original address.
In a preferred embodiment, extracting at least one binary sequence from the first address further comprises:
If consecutive digits are detected, treating the run of consecutive digits as a single unit when forming binary sequences.
In a preferred embodiment, the inverted index is generated in the following manner:
Performing the preprocessing on each address in the first address set;
For each preprocessed address, extracting at least one binary sequence from the address to constitute a vector corresponding to the address;
Generating the inverted index according to the binary sequences corresponding to each address.
This application also discloses an address similarity calculation system, comprising:
A vector obtaining module, configured to obtain a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one binary sequence extracted from the corresponding address;
A weighted calculation module, configured to weight according to the positions of the binary sequences in the addresses and calculate the similarity of the first vector and the second vector, wherein, in the weighting, the weight of a binary sequence that appears earlier in an address is higher than the weight of a binary sequence that appears later in that address.
In a preferred embodiment, calculating the similarity of the first vector and the second vector includes calculating the cosine distance between the first vector and the second vector.
This application also discloses an address similarity calculation system, comprising:
An address acquisition module, configured to obtain a third address;
A binary sequence extraction module, configured to extract at least one binary sequence from the third address to constitute a third vector;
An inverted index module, configured to obtain a binary sequence inverted index of the addresses in a first address set;
A query statistics module, configured to query the inverted index according to the binary sequences in the third vector to obtain a second address set whose addresses each include at least one of the binary sequences in the third vector, to count, for each address in the second address set, the number of binary sequences of the third vector that the address includes, and to select, according to the counts, the N addresses that include the most binary sequences of the third vector as a third address set, wherein N is a preset positive integer;
A similarity calculation module, configured to calculate, according to the address similarity calculation method described above, the address similarity between the third address and each address in the third address set.
In a preferred embodiment, the system further includes a preprocessing module, configured to preprocess the first address and output the processing result to the address acquisition module.
In a preferred embodiment, the preprocessing includes one of the following or any combination thereof:
Converting Traditional Chinese characters to Simplified Chinese characters;
Converting pinyin to Chinese characters;
Removing punctuation marks;
Normalizing numbers;
Identifying the characters that represent the province, city, and district in the address, extracting those characters separately, and removing them from the original address.
In a preferred embodiment, when extracting binary sequences, the binary sequence extraction module treats a detected run of consecutive digits as a single unit when forming binary sequences.
This application also discloses an address similarity calculation system, comprising:
A memory, configured to store computer-executable instructions; and
A processor, configured to implement the steps of the method described above when executing the computer-executable instructions.
This application also discloses a computer-readable storage medium in which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, implement the steps of the method described above.
In the embodiments of this application, at least one binary sequence is extracted from an address to constitute a vector, and when the distance between the vectors representing the addresses is calculated, weighting is applied according to the position of each binary sequence in its original address, wherein a binary sequence that appears later in the address has a smaller weight. This can effectively improve the precision of address similarity matching.
In similarity matching, an inverted index is used to narrow the comparison down to a candidate set, which effectively increases comparison speed without affecting precision.
Preprocessing means such as Traditional-to-Simplified conversion, pinyin-to-Chinese-character conversion, punctuation removal, number normalization, and separately extracting the characters that represent the province, city, and district and removing them from the original address can effectively counter common means of manual address fraud.
Removing the administrative division string from the address before the binary sequence comparison can further improve comparison efficiency.
A large number of technical features are described in this specification and are distributed across the various technical solutions. Enumerating every possible combination of the technical features of this application (that is, every technical solution) would make the specification excessively long. To avoid this problem, the technical features disclosed in the above summary, the technical features disclosed in the embodiments and examples below, and the technical features disclosed in the drawings may be freely combined with one another to constitute various new technical solutions (all of which are regarded as having been recorded in this specification), unless such a combination of technical features is technically infeasible. For example, suppose features A+B+C are disclosed in one example and features A+B+D+E are disclosed in another, where features C and D are equivalent technical means that serve the same purpose, so that technically only one of them can be chosen and they cannot be used at the same time, and where feature E can technically be combined with feature C. Then the solution A+B+C+D should not be regarded as recorded, because it is technically infeasible, whereas the solution A+B+C+E should be regarded as recorded.
Specific embodiment
In the following description, many technical details are set forth to help the reader better understand this application. However, those of ordinary skill in the art will appreciate that the technical solutions claimed in this application can be implemented even without these technical details and with various changes and modifications based on the following embodiments.
Explanation of some concepts:
Binary sequence (bigram): a sequence formed by two adjacent units, for example a sequence composed of two adjacent Chinese characters.
To make the purposes, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the drawings.
The first embodiment of this application relates to an address similarity calculation method. This embodiment concerns the calculation of similarity between two addresses. The process is shown in Fig. 1, and the method includes the following steps:
In step 101, a first vector corresponding to a first address and a second vector corresponding to a second address are obtained, wherein each vector includes at least one binary sequence (bigram) extracted from the corresponding address. In other words, the first vector includes at least one binary sequence extracted from the first address, and the second vector includes at least one binary sequence extracted from the second address. In this embodiment, the granularity at which addresses are split is the bigram. Granularity based on word segmentation suffers from sparsity: changing a single character may cause a large difference (for example, "Zhujiang Road" and a variant of it written with one different character). Single-character granularity, in turn, lacks prior knowledge of character order (the ordering knowledge in an address) and easily leads to misidentification (for example, "Zhujiang Road" and "Jiangzhu Road" have identical vectors at single-character granularity). A compromise is therefore used here: bigrams at character granularity (for example, when "Qing Baima Bay Community" is written once with the character for "horse" and once with the character for "code", the bigram containing the changed character differs, but other bigrams, such as the first one and the one for "Bay", are still the same). In one embodiment, digits are handled specially: each run of consecutive digits is treated as a single character when building bigrams, so that, for example, "No. 696" forms one bigram.
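The extraction described in step 101 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of this application: the function names `split_units` and `extract_bigrams` are hypothetical, and the Chinese sample string is an assumed rendering of the example address "Room 302, No. 696 Zhujiang Road". Each run of consecutive digits is kept as one unit, and every other character is its own unit.

```python
import re


def split_units(address: str) -> list[str]:
    """Split an address into units: a run of digits is one unit,
    every other character is a unit on its own."""
    return re.findall(r"\d+|\D", address)


def extract_bigrams(address: str) -> list[str]:
    """Return the bigrams (pairs of adjacent units) in address order."""
    units = split_units(address)
    return [units[i] + units[i + 1] for i in range(len(units) - 1)]


# Assumed Chinese form of "Room 302, No. 696 Zhujiang Road".
print(extract_bigrams("珠江路696号302室"))
# ['珠江', '江路', '路696', '696号', '号302', '302室']
```

Keeping the result as an ordered list, rather than a set, preserves the position information that the weighting in step 102 relies on.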
Then, in step 102, weighting is applied according to the positions of the binary sequences in the addresses, and the similarity of the first vector and the second vector is calculated, wherein, in the weighting, the weight of a binary sequence that appears earlier in an address is higher than the weight of a binary sequence that appears later in that address. In other words, for any two binary sequences (say X and Y) in the vector corresponding to an address (say address D), if X appears in address D before Y does, then the weight of X is greater than the weight of Y when the similarity is calculated with weighting. That is, the later a binary sequence appears in the address, the smaller its weight. The similarity of the first vector and the second vector is the similarity of the first address and the second address.
In a preferred embodiment, the similarity of the first vector and the second vector is calculated as the cosine distance between them. For example, suppose the binary sequences of the first vector, corresponding to the first address, are (a1, a2, a3, a4, a5), and the binary sequences of the second vector, corresponding to the second address, are (b1, b2, b3, b4, b5), where a1 = b1, a2 = b2, and a4 = b4. The whole vector space, arranged in order, is then (a1, a2, a3, b3, a4, a5, b5), with corresponding weights (w1, w2, w3, w4, w5, w6, w7), and

$$
\text{Cosine} = \frac{w_1 a_1 \cdot w_1 b_1 + w_2 a_2 \cdot w_2 b_2 + w_5 a_4 \cdot w_5 b_4}{\sqrt{(w_1 a_1)^2 + (w_2 a_2)^2 + (w_3 a_3)^2 + (w_5 a_4)^2 + (w_6 a_5)^2}\;\sqrt{(w_1 b_1)^2 + (w_2 b_2)^2 + (w_4 b_3)^2 + (w_5 b_4)^2 + (w_7 b_5)^2}}
$$

Here the weight at position x is $w_x = x^{-m}$, where m < 1 and x denotes the position. For example, x = 1 is the first binary sequence, and its weight w1 is 1. The later the position, the smaller the weight: with m = 0.5, the weight at the fourth position (x = 4) is w4 = 0.5, and the weight at the ninth position (x = 9) is w9 = 0.33.
In other embodiments, the similarity of the first vector and the second vector may also be calculated in ways other than the cosine distance.
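The position-weighted cosine of step 102 can be sketched in Python as follows. This is a sketch under several assumptions that the text does not state: every component value is taken to be 1 (a binary sequence is either present or absent), the two sequences are merged index by index so that the worked example's arrangement (a1, a2, a3, b3, a4, a5, b5) is reproduced, and the weight at position x is taken to be x ** (-m), which matches the values quoted above (w1 = 1, and for m = 0.5, w4 = 0.5 and w9 = 0.33). The function names are hypothetical.

```python
import math


def position_weight(x: int, m: float = 0.5) -> float:
    """Assumed weight form: decreases with position x (1-based), m < 1."""
    return x ** (-m)


def merged_order(a: list[str], b: list[str]) -> list[str]:
    """Merge two bigram sequences index by index, keeping first occurrences.
    This ordering rule is an assumption chosen to match the worked example."""
    order, seen = [], set()
    for i in range(max(len(a), len(b))):
        for seq in (a, b):
            if i < len(seq) and seq[i] not in seen:
                seen.add(seq[i])
                order.append(seq[i])
    return order


def weighted_cosine(a: list[str], b: list[str], m: float = 0.5) -> float:
    """Cosine of two presence vectors after scaling each dimension by its weight."""
    if not a or not b:
        return 0.0
    weights = {g: position_weight(i + 1, m) for i, g in enumerate(merged_order(a, b))}
    set_a, set_b = set(a), set(b)
    dot = sum(weights[g] ** 2 for g in set_a & set_b)
    norm_a = math.sqrt(sum(weights[g] ** 2 for g in set_a))
    norm_b = math.sqrt(sum(weights[g] ** 2 for g in set_b))
    return dot / (norm_a * norm_b)


first = ["a1", "a2", "a3", "a4", "a5"]
second = ["a1", "a2", "b3", "a4", "b5"]  # a1 = b1, a2 = b2, a4 = b4
print(merged_order(first, second))
# ['a1', 'a2', 'a3', 'b3', 'a4', 'a5', 'b5']
print(weighted_cosine(first, second))
```

Under these assumptions, a mismatch near the start of an address lowers the score more than the same mismatch near the end, which is the behavior the weighting is meant to produce.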
The second embodiment of this application relates to an address similarity calculation method. This embodiment concerns finding, in a large set of existing addresses (for example, an existing address library), the addresses most similar to a specified address, and outputting the similarity between the specified address (the third address in this embodiment) and those most similar addresses. The process is shown in Fig. 2, and the method includes the following steps:
In step 201, a third address is obtained.
Then, in step 202, the third address is preprocessed. This step is optional.
In one embodiment, the preprocessing includes two parts:
The first part converts Traditional Chinese characters to Simplified Chinese characters, converts pinyin to Chinese characters, removes punctuation marks, normalizes numbers, and so on.
The second part identifies the characters in the address that represent the province, city, and district, extracts them separately, and removes them from the original address. For example, "Room 302, No. 696 Zhujiang Road, Baixia District, Nanjing" is preprocessed into "Room 302, No. 696 Zhujiang Road". In one embodiment, the preprocessing performs the first part before the second part, so that Traditional characters, pinyin, punctuation, and similar problems in the character string representing the province, city, and district can be handled. The characters representing the province, city, and district that are removed from the address can be normalized against a preset vocabulary of administrative divisions; the administrative divisions are then compared separately, and the result of that comparison is considered together with the comparison result of the rest of the address. Removing the administrative division string from the address before the binary sequence comparison can further improve comparison efficiency, and the effect is especially significant when addresses are compared at large scale.
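Part of the preprocessing in step 202 can be sketched in Python as follows. This is a partial sketch under stated assumptions: Traditional-to-Simplified conversion, pinyin conversion, and conversion of Chinese numerals are omitted because they need external dictionaries; number normalization is reduced to Unicode NFKC normalization, which maps full-width digits to ASCII digits; and `ADMIN_DIVISIONS` is a hypothetical stand-in for the preset vocabulary of administrative divisions. The Chinese strings are assumed renderings of the example address.

```python
import unicodedata

# Hypothetical stand-in for the preset administrative division vocabulary.
ADMIN_DIVISIONS = ["江苏省", "南京市", "白下区"]


def normalize_text(address: str) -> str:
    """First part (partial): NFKC-normalize, then drop punctuation and whitespace."""
    text = unicodedata.normalize("NFKC", address)
    return "".join(
        ch for ch in text
        if not (unicodedata.category(ch).startswith("P") or ch.isspace())
    )


def split_admin(address: str, vocabulary=ADMIN_DIVISIONS) -> tuple[list[str], str]:
    """Second part: extract province/city/district names and remove them."""
    found = []
    for name in vocabulary:
        if name in address:
            found.append(name)
            address = address.replace(name, "", 1)
    return found, address


def preprocess(address: str) -> tuple[list[str], str]:
    """Run the first part before the second, as the text describes."""
    return split_admin(normalize_text(address))


divisions, rest = preprocess("南京市白下区,珠江路696号 302室。")
print(divisions, rest)
# ['南京市', '白下区'] 珠江路696号302室
```

Returning the extracted divisions separately leaves room for the independent comparison of administrative divisions that the text mentions.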
Then into step 203, from least one binary sequence of third address extraction, to constitute third vector.Binary sequence
The extracting mode of column and the extracting mode of first embodiment are identical, please refer in step 101 from address extraction binary sequence
Method.In one embodiment, if detecting continuous number, this regard continuous number as a binary sequence.
Then into step 204, the binary sequence inverted index of each address in the first address set is obtained.The inverted index
It is that the address containing the binary sequence is inquired according to binary sequence.In one embodiment, the first address set is existing
Address base.In one embodiment, inverted index is generated in the following manner:
1, each address in the first address set is pre-processed;Pretreated mode is identical as step 202.
2, for passing through pretreated each address, at least one binary sequence is extracted from address respectively to constitute and ground
The corresponding vector in location.Binary sequence extracting mode is identical as in step 203.
3, the binary sequence according to corresponding to each address generates inverted index.
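The third of the steps above can be sketched in Python as follows. This is a minimal sketch assuming that preprocessing and binary sequence extraction have already been done, so the input is a mapping from an address identifier to that address's binary sequences; the function name `build_inverted_index` and the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(bigrams_by_address: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each bigram to the set of address ids whose address contains it."""
    index = defaultdict(set)
    for address_id, bigrams in bigrams_by_address.items():
        for gram in bigrams:
            index[gram].add(address_id)
    return dict(index)


library = {
    "addr-1": ["珠江", "江路", "路696", "696号"],
    "addr-2": ["长江", "江路", "路696", "696号"],
}
index = build_inverted_index(library)
print(sorted(index["江路"]))
# ['addr-1', 'addr-2']
print(sorted(index["珠江"]))
# ['addr-1']
```

Storing address identifiers rather than full address strings keeps the index small; the index only needs to be rebuilt or extended when the address library changes.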
Then, in step 205, the inverted index is queried according to the binary sequences in the third vector to obtain the set of addresses that each include at least one of the binary sequences in the third vector (referred to as the second address set). For each address in the second address set, the number of binary sequences of the third vector that the address includes is counted, and, according to the counts, the N addresses that include the most binary sequences of the third vector are selected as the third address set, wherein N is a preset positive integer. In one embodiment, N = 1000; in other embodiments, a suitable N can be selected according to the application scenario.
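The query and counting in step 205 can be sketched in Python as follows. This is a minimal sketch assuming the inverted index is a mapping from a binary sequence to a set of address identifiers; the function name `select_candidates` and the sample data are hypothetical, and ties in the counts are broken arbitrarily.

```python
from collections import Counter


def select_candidates(query_bigrams: list[str],
                      index: dict[str, set[str]],
                      n: int = 1000) -> list[str]:
    """Return up to n address ids sharing the most bigrams with the query."""
    hits = Counter()
    for gram in set(query_bigrams):          # each distinct query bigram counts once
        for address_id in index.get(gram, ()):
            hits[address_id] += 1            # every address seen here is in the second address set
    return [address_id for address_id, _ in hits.most_common(n)]


index = {"ab": {"A", "B"}, "bc": {"A"}, "cd": {"C"}}
print(select_candidates(["ab", "bc", "xx"], index, n=1))
# ['A']
```

Only addresses that share at least one binary sequence with the query are ever touched, which is what keeps the later pairwise similarity calculation limited to at most N addresses.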
Then, in step 206, the address similarity between the third address and each address in the third address set is calculated one by one according to the method of the first embodiment. It can then be determined whether the third address is the same address as some address in the first address set, or the address with the highest similarity to the third address can be found.
After all addresses in the address library have been preprocessed and split into binary sequences, an inverted index is built over the binary sequences of all addresses. The index is used to obtain the candidate set of addresses whose similarity to a new address is to be compared, which narrows the comparison range and improves comparison efficiency. Even when the address library used for comparison is very large, the computational efficiency can still meet commercial requirements.
Fig. 5 shows the processing method in one embodiment. The full merchant address library contains a massive number of addresses, and the C2C business new-address library contains newly added addresses. The addresses in the two libraries need to be matched in order to find, for each address in the new-address library, the similar addresses in the full merchant address library. In advance, each address in the full merchant address library is preprocessed and split into character-granularity bigrams (that is, the binary sequences in each address are extracted), and an inverted index is then built from the bigrams, yielding a bigram inverted index. For each address in the new-address library, the same address preprocessing and character-granularity bigram extraction are performed first. The bigram inverted index is then queried to obtain the 1000 addresses with the most matching bigrams (the Top 1000 in descending order) as the candidate address set. Finally, the similarity between this new address and each of the 1000 addresses in the candidate address set is calculated according to the method of the first embodiment, with weighting by the positions of the binary sequences in the addresses, in order to find the most similar address or to determine whether an identical address exists.
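The flow described for Fig. 5 can be sketched end to end in Python as follows. This is a compact sketch under stated assumptions, not the implementation of this application: addresses are assumed to be already preprocessed, component values are taken as 1, the weight at position x is assumed to be x ** (-m), the merged ordering of the two sequences is an assumed rule, and all function names and sample addresses are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def bigrams(address):
    # A run of digits is one unit; every other character is one unit.
    units = re.findall(r"\d+|\D", address)
    return [units[i] + units[i + 1] for i in range(len(units) - 1)]


def weighted_cosine(a, b, m=0.5):
    # Merge both sequences index by index to assign positions (assumed rule).
    order, seen = [], set()
    for i in range(max(len(a), len(b))):
        for seq in (a, b):
            if i < len(seq) and seq[i] not in seen:
                seen.add(seq[i])
                order.append(seq[i])
    # Squared weight of each dimension: (x ** -m) ** 2 for 1-based position x.
    w2 = {g: (pos + 1) ** (-2 * m) for pos, g in enumerate(order)}
    norm_a = sum(w2[g] for g in set(a))
    norm_b = sum(w2[g] for g in set(b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(w2[g] for g in set(a) & set(b)) / math.sqrt(norm_a * norm_b)


def build_index(library):
    # Done once, in advance, for the full address library.
    index = defaultdict(set)
    for address_id, address in enumerate(library):
        for gram in bigrams(address):
            index[gram].add(address_id)
    return index


def most_similar(new_address, library, index, n=1000):
    # Top-n candidates by shared bigram count, then weighted cosine on each.
    query = bigrams(new_address)
    hits = Counter(i for gram in set(query) for i in index.get(gram, ()))
    scored = [(weighted_cosine(query, bigrams(library[i])), library[i])
              for i, _ in hits.most_common(n)]
    return max(scored, default=(0.0, None))


library = ["珠江路696号302室", "珠江路696号402室", "长江路696号302室"]
index = build_index(library)
score, best = most_similar("珠江路696号302室", library, index)
print(best, round(score, 3))
```

Deciding whether the best match counts as an identical address would need a similarity threshold, which the text leaves to the application scenario.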
The effect of this application is illustrated below by comparison with conventional methods.
Conventional address similarity calculation methods include:
1) using the edit distance between character strings;
2) calculating the cosine using vectors obtained after word segmentation or character splitting.
For the following addresses, the distance calculated by the conventional methods is large, whereas the technical solution of this application determines that they are the same address:
Room 302, No. 696 Zhujiang Road, Baixia District, Nanjing
Nanjing City, Room 302, No. 696 Zhujiang Road, Baixia District // no province word
No. 696-302 Zhujiang Road, Nanjing // no "room" word, no province, no district
Room 302, No. 696 Zhujiang Road, Nanjing // digits written as Chinese numerals
The following addresses are in fact different, but the distance calculated by the conventional methods is relatively small, whereas the technical solution of this application can accurately determine that they are different:
Room 302, No. 696 Zhujiang Road, Baixia District, Nanjing
Room 402, No. 696 Zhujiang Road, Baixia District, Nanjing // Room 402 vs. Room 302
Room 302, No. 196 Zhujiang Road, Baixia District, Nanjing // No. 196 vs. No. 696
Room 302, No. 696 Changjiang Road, Baixia District, Nanjing // Changjiang Road vs. Zhujiang Road
The third embodiment of this application relates to an address similarity calculation system, whose basic structure is shown in Fig. 3. The system includes:
A vector obtaining module 301, configured to obtain a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one binary sequence extracted from the corresponding address;
A weighted calculation module 302, configured to weight according to the positions of the binary sequences in the addresses and calculate the similarity of the first vector and the second vector, wherein, in the weighting, the weight of a binary sequence that appears earlier in an address is higher than the weight of a binary sequence that appears later in that address.
In one embodiment, calculating the similarity of the first vector and the second vector includes calculating the cosine distance between the first vector and the second vector.
The first embodiment is the method embodiment corresponding to this embodiment. The technical details of the first embodiment can be applied to this embodiment, and the technical details of this embodiment can likewise be applied to the first embodiment.
The fourth embodiment of this application relates to an address similarity calculation system, whose basic structure is shown in Fig. 4. The system includes:
An address acquisition module 401, configured to obtain a third address.
A preprocessing module 402, configured to preprocess the first address and output the processing result to the address acquisition module. The preprocessing module is optional. In one embodiment, the preprocessing includes two parts:
First, Traditional Chinese characters are converted to Simplified Chinese characters, pinyin is converted to Chinese characters, punctuation marks are removed, and numbers are normalized.
Second, the characters in the address that represent the province, city, and district are identified, extracted separately, and removed from the original address.
A binary sequence extraction module 403, configured to extract at least one binary sequence from the third address to constitute a third vector. In one embodiment, if consecutive digits are detected, the run of consecutive digits is treated as a single unit when forming binary sequences.
An inverted index module 404, configured to obtain a binary sequence inverted index of the addresses in the first address set.
A query statistics module 405, configured to query the inverted index according to the binary sequences in the third vector to obtain a second address set whose addresses each include at least one of the binary sequences in the third vector, to count, for each address in the second address set, the number of binary sequences of the third vector that the address includes, and to select, according to the counts, the N addresses that include the most binary sequences of the third vector as a third address set, wherein N is a preset positive integer.
A similarity calculation module 406, configured to calculate, one by one and according to the method of the first embodiment, the address similarity between the third address and each address in the third address set. The similarity calculation module can be implemented using the technical solution of the third embodiment.
The second embodiment is the method embodiment corresponding to this embodiment. The technical details of the second embodiment can be applied to this embodiment, and the technical details of this embodiment can likewise be applied to the second embodiment.
It should be noted that those skilled in the art will understand that the functions of the modules shown in the third and fourth embodiments above may be implemented by a program (executable instructions) running on a processor, or by specific logic circuits. If the address similarity calculation system of the third and fourth embodiments of this application is implemented in the form of software functional modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods of the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of this application are not limited to any specific combination of hardware and software.
Correspondingly, another embodiment of this application also provides a computer storage medium in which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, implement the method embodiments of this application.
In addition, another embodiment of this application also provides an address similarity calculation system, including a memory for storing computer-executable instructions, and a processor. The processor is configured to implement the steps of the above method embodiments when executing the computer-executable instructions in the memory.
It should be noted that, in the application documents of this patent, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element qualified by the phrase "including a" does not exclude the existence of other identical elements in the process, method, article, or device that includes the element. In the application documents of this patent, a statement that an action is performed according to an element means that the action is performed at least according to that element, which covers two cases: performing the action according to that element only, and performing the action according to that element and other elements. Expressions such as "a plurality of", "multiple times", and "multiple kinds" include two, twice, and two kinds, as well as more than two, more than twice, and more than two kinds.
All documents mentioned in this application are considered to be incorporated in their entirety into the disclosure of this application, so that they may serve as a basis for amendment where necessary. In addition, it should be understood that, after reading the above disclosure of this application, those skilled in the art may make various changes or modifications to this application, and such equivalent forms likewise fall within the scope claimed by this application.