
CN109657034A - Address similarity calculating method and its system - Google Patents

Address similarity calculating method and its system

Info

Publication number
CN109657034A
Authority
CN
China
Prior art keywords
address
binary sequence
vector
similarity
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811309162.8A
Other languages
Chinese (zh)
Inventor
祝慧佳
赵智源
周书恒
郭亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811309162.8A priority Critical patent/CN109657034A/en
Publication of CN109657034A publication Critical patent/CN109657034A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application relates to the technical field of information processing and discloses an address similarity calculation method and system that can effectively improve the precision of address similarity matching. The method comprises: obtaining a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one bigram (a sequence of two adjacent units) extracted from the corresponding address; and calculating the similarity of the first vector and the second vector, weighted according to the position of each bigram in its address, wherein a bigram that appears earlier in the address is given a higher weight than a bigram that appears later in the address.

Description

Address similarity calculating method and its system
Technical field
This application relates to the technical field of information processing, and in particular to address comparison technology.
Background
Unlike online e-commerce, many offline operational campaigns revolve around physical merchants, and address information is an important source for identifying these stores and merchant entities. Address information makes it possible to locate the merchants and the users taking part in a campaign. In such campaigns, however, some merchants or offline customers commit address fraud to obtain more rebates and rewards, which seriously harms the campaign's effectiveness. For example, in a referral program in which users recommend merchants to apply for scan-to-pay, fraudsters who want the rebate also need the promotional materials to be deliverable to a fake merchant, so they have users submit similar addresses (small variations of one address, such as pinyin or numeral substitutions, that a person can recognize but a machine cannot) to bypass same-address detection. In addition, some merchants submit false store information to obtain rebates. These problems seriously disrupt business activities and call for address-related anti-fraud technology.
The key to address anti-fraud is how to calculate the similarity of addresses accurately despite deliberate human interference. This has become a technical problem that urgently needs to be solved in this field.
Summary of the invention
This application provides an address similarity calculation method and system. The first technical problem to be solved is how to calculate the similarity of two addresses accurately despite human interference. The second technical problem to be solved is how to increase comparison speed effectively, without affecting precision, when a large-scale address base is involved.
To solve the above problems, this application discloses an address similarity calculation method, comprising:
obtaining a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one bigram (a sequence of two adjacent units) extracted from the corresponding address;
calculating the similarity of the first vector and the second vector, weighted according to the position of each bigram in its address, wherein, in the weighting, a bigram that appears earlier in the address has a higher weight than a bigram that appears later in the address.
In a preferred embodiment, calculating the similarity of the first vector and the second vector includes calculating the cosine distance between the first vector and the second vector.
This application also discloses an address similarity calculation method, comprising:
obtaining a third address;
extracting at least one bigram from the third address to form a third vector;
obtaining a bigram inverted index of the addresses in a first address set;
querying the inverted index with the bigrams in the third vector to obtain a second address set of addresses that each contain at least one bigram of the third vector, counting, for each address in the second address set, how many bigrams of the third vector it contains, and selecting, according to the counts, the N addresses that contain the most bigrams of the third vector as a third address set, wherein N is a preset positive integer;
calculating, according to the method described above, the address similarity between the third address and each address in the third address set.
In a preferred embodiment, before the at least one bigram is extracted from the third address, the method further comprises:
preprocessing the third address.
In a preferred embodiment, the preprocessing includes one of the following or any combination thereof:
converting traditional Chinese characters to simplified Chinese characters;
converting pinyin to Chinese characters;
removing punctuation marks;
normalizing digits;
identifying the characters in the address that denote the province, city, and district, extracting those characters separately, and removing them from the original address.
In a preferred embodiment, extracting at least one bigram from the address further comprises:
if consecutive digits are detected, treating the run of consecutive digits as one bigram.
In a preferred embodiment, the inverted index is generated as follows:
performing the preprocessing on each address in the first address set;
for each preprocessed address, extracting at least one bigram from the address to form the vector corresponding to that address;
generating the inverted index from the bigrams corresponding to each address.
This application also discloses an address similarity calculation system, comprising:
a vector obtaining module, for obtaining a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one bigram extracted from the corresponding address;
a weighted calculation module, for calculating the similarity of the first vector and the second vector, weighted according to the position of each bigram in its address, wherein, in the weighting, a bigram that appears earlier in the address has a higher weight than a bigram that appears later in the address.
In a preferred embodiment, calculating the similarity of the first vector and the second vector includes calculating the cosine distance between the first vector and the second vector.
This application also discloses an address similarity calculation system, comprising:
an address acquisition module, for obtaining a third address;
a bigram extraction module, for extracting at least one bigram from the third address to form a third vector;
an inverted index module, for obtaining a bigram inverted index of the addresses in a first address set;
a query statistics module, for querying the inverted index with the bigrams in the third vector to obtain a second address set of addresses that each contain at least one bigram of the third vector, counting, for each address in the second address set, how many bigrams of the third vector it contains, and selecting, according to the counts, the N addresses that contain the most bigrams of the third vector as a third address set, wherein N is a preset positive integer;
a similarity calculation module, for calculating, according to the method described above, the address similarity between the third address and each address in the third address set.
In a preferred embodiment, the system further includes a preprocessing module, for preprocessing the first address and outputting the result to the address acquisition module.
In a preferred embodiment, the preprocessing includes one of the following or any combination thereof:
converting traditional Chinese characters to simplified Chinese characters;
converting pinyin to Chinese characters;
removing punctuation marks;
normalizing digits;
identifying the characters in the address that denote the province, city, and district, extracting those characters separately, and removing them from the original address.
In a preferred embodiment, when the bigram extraction module extracts bigrams and detects consecutive digits, it treats the run of consecutive digits as one bigram.
This application also discloses an address similarity calculation system, comprising:
a memory, for storing computer-executable instructions; and
a processor, for implementing the steps of the method described above when executing the computer-executable instructions.
This application also discloses a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the method described above.
In the embodiments of this application, at least one bigram is extracted from each address to form a vector, and when the distance between the vectors representing the addresses is calculated, it is weighted according to each bigram's position in the original address, with bigrams further back in the address receiving smaller weights. This effectively improves the precision of address similarity matching.
During similarity matching, an inverted index is used to narrow the candidate set for comparison, which effectively increases comparison speed without affecting precision.
Preprocessing measures such as traditional-to-simplified conversion, pinyin-to-character conversion, punctuation removal, digit normalization, and separately extracting the characters that denote the province, city, and district and removing them from the original address effectively counter common address fraud techniques.
Before the bigram comparison, removing the administrative-division strings from the address further improves comparison efficiency.
The specification of this application describes a large number of technical features, distributed across the various technical solutions. Enumerating every possible combination of these features (that is, every technical solution) would make the specification excessively long. To avoid this, the technical features disclosed in the summary above, in the embodiments and examples below, and in the drawings may be freely combined with one another to form new technical solutions (which are deemed to have been described in this specification), unless such a combination is technically infeasible. For example, suppose one example discloses features A+B+C and another discloses features A+B+D+E, where C and D are equivalent technical means serving the same function, so that only one of them can be used at a time, and where feature E can technically be combined with feature C. Then the solution A+B+C+D should not be deemed described, because it is technically infeasible, whereas the solution A+B+C+E should be deemed described.
Brief description of the drawings
Fig. 1 is a flow diagram of the address similarity calculation method according to the first embodiment of this application.
Fig. 2 is a flow diagram of the address similarity calculation method according to the second embodiment of this application.
Fig. 3 is a structural diagram of the address similarity calculation system according to the third embodiment of this application.
Fig. 4 is a structural diagram of the address similarity calculation system according to the fourth embodiment of this application.
Fig. 5 is a schematic diagram of address matching between two address bases according to one embodiment of this application.
Detailed description
In the following description, many technical details are set out to help the reader understand this application better. Those of ordinary skill in the art will appreciate, however, that the technical solutions claimed in this application can be implemented even without these technical details and with various changes and modifications based on the following embodiments.
Explanation of some concepts:
Bigram (binary sequence): a sequence formed from two adjacent units, for example two adjacent Chinese characters.
To make the purpose, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the drawings.
The first embodiment of this application relates to an address similarity calculation method. This embodiment concerns the similarity calculation between two addresses. Its flow is shown in Fig. 1, and the method includes the following steps:
In step 101, a first vector corresponding to a first address and a second vector corresponding to a second address are obtained, wherein each vector includes at least one bigram extracted from the corresponding address. In other words, the first vector includes at least one bigram extracted from the first address, and the second vector includes at least one bigram extracted from the second address. In this embodiment, the extraction granularity for addresses is the bigram. Word-segmentation granularity suffers from sparsity: changing a single character can produce a large difference (for example, "Zhujiang River road" and "Zhu Jianglu"). Single-character granularity, on the other hand, lacks prior knowledge of character order (the ordering information in an address) and easily leads to misidentification (for example, "Zhujiang River road" and "Jiang Zhulu" have identical vectors at character granularity). A compromise is therefore used here: bigrams at character granularity (for example, for "bluish white horse bay cell" and "bluish white code bay cell", the bigrams "white horse" and "white code" differ, but "bluish white" and "bay" are still the same). In one embodiment, digits are handled specially: each run of consecutive digits is treated as a single character when bigrams are built, so that, for example, "No. 696" is one bigram.
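The bigram extraction of step 101 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it follows the reading in which a run of digits counts as one unit, so that a digit run plus the following character forms a single bigram. The function names and the Chinese sample string (a guess at what the translated example address looks like in the original) are assumptions.

```python
import re


def split_units(address: str) -> list[str]:
    """Split an address into units: a run of ASCII digits is one unit,
    every other character is its own unit (assumed reading of step 101)."""
    return re.findall(r"[0-9]+|[^0-9]", address)


def extract_bigrams(address: str) -> list[str]:
    """Return the bigrams of an address in order of appearance.
    Each bigram joins two adjacent units."""
    units = split_units(address)
    return [units[i] + units[i + 1] for i in range(len(units) - 1)]


if __name__ == "__main__":
    # Illustrative address; "696号" (No. 696) comes out as one bigram.
    print(extract_bigrams("珠江路696号302室"))
```

Keeping the bigrams in order of appearance matters because the next step weights each bigram by its position in the address.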
The method then proceeds to step 102, in which the similarity of the first vector and the second vector is calculated, weighted according to the position of each bigram in its address, wherein a bigram that appears earlier in the address has a higher weight than a bigram that appears later in the address. In other words, for any two bigrams (say X and Y) in the vector corresponding to an address (say address D), if X appears in address D before Y does, then the weight of X is greater than the weight of Y when the weighted similarity is calculated. Put differently, the further back a bigram is in the address, the smaller its weight. The similarity of the first vector and the second vector is the similarity of the first address and the second address.
In a preferred embodiment, the similarity of the first vector and the second vector is calculated as the cosine distance between them. For example, suppose the bigrams of the first vector, corresponding to the first address, are (a1, a2, a3, a4, a5), and the bigrams of the second vector, corresponding to the second address, are (b1, b2, b3, b4, b5), where a1 = b1, a2 = b2, and a4 = b4. The whole vector space, arranged in order, is then (a1, a2, a3, b3, a4, a5, b5), with respective weights (w1, w2, w3, w4, w5, w6, w7), and
$$\text{Cosine} = \frac{(w_1 a_1)(w_1 b_1) + (w_2 a_2)(w_2 b_2) + (w_5 a_4)(w_5 b_4)}{\sqrt{(w_1 a_1)^2 + (w_2 a_2)^2 + (w_3 a_3)^2 + (w_5 a_4)^2 + (w_6 a_5)^2}\;\sqrt{(w_1 b_1)^2 + (w_2 b_2)^2 + (w_4 b_3)^2 + (w_5 b_4)^2 + (w_7 b_5)^2}}$$
Here the weight w_x depends on the position x and a parameter m < 1. For example, x = 1 is the first bigram, and its weight w1 is 1; the later the position, the smaller the weight. With m = 0.5, the 4th position (x = 4) has weight w4 = 0.5, and the 9th position (x = 9) has weight w9 = 0.33. (These values are consistent with w_x = x^(-m).)
In other embodiments, the similarity of the first vector and the second vector may also be calculated by means other than the cosine distance.
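The position-weighted cosine calculation of the first embodiment can be sketched as below. Several points are assumptions rather than statements of the patent: the weight function w_x = x^(-m) is inferred from the numeric examples (w1 = 1, w4 = 0.5, w9 ≈ 0.33 for m = 0.5), the merged ordering interleaves the two bigram lists by position and drops repeats, and each component value is taken to be the bigram's occurrence count. All function names are hypothetical.

```python
import math
from collections import Counter


def position_weight(x: int, m: float = 0.5) -> float:
    """Assumed weight for the 1-based position x: x ** -m."""
    return x ** -m


def merged_order(a: list[str], b: list[str]) -> list[str]:
    """Merge two bigram lists by position, keeping the first occurrence
    of each bigram (assumed way of arranging the joint vector space)."""
    seen, order = set(), []
    for i in range(max(len(a), len(b))):
        for seq in (a, b):
            if i < len(seq) and seq[i] not in seen:
                seen.add(seq[i])
                order.append(seq[i])
    return order


def weighted_cosine(a: list[str], b: list[str], m: float = 0.5) -> float:
    """Cosine similarity of two bigram lists with position weights."""
    order = merged_order(a, b)
    weight = {g: position_weight(x, m) for x, g in enumerate(order, start=1)}
    ca, cb = Counter(a), Counter(b)  # missing bigrams count as 0
    dot = sum((weight[g] * ca[g]) * (weight[g] * cb[g]) for g in order)
    norm_a = math.sqrt(sum((weight[g] * ca[g]) ** 2 for g in order))
    norm_b = math.sqrt(sum((weight[g] * cb[g]) ** 2 for g in order))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    first = ["a1", "a2", "a3", "a4", "a5"]
    second = ["a1", "a2", "b3", "a4", "b5"]
    print(merged_order(first, second))
    print(weighted_cosine(first, second))
```

With this weighting, a mismatch near the start of the address lowers the similarity more than a mismatch near the end, which matches the stated intent of giving earlier bigrams a higher weight.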
The second embodiment of this application relates to an address similarity calculation method. This embodiment concerns finding, in a large set of existing addresses (such as an existing address base), the several addresses most similar to a specified address, and outputting the similarity between the specified address (the third address in this embodiment) and each of these most similar addresses. Its flow is shown in Fig. 2, and the method includes the following steps:
In step 201, a third address is obtained.
The method then proceeds to step 202, in which the third address is preprocessed. This step is optional.
In one embodiment, the preprocessing includes two parts:
The first part converts traditional Chinese characters to simplified Chinese characters, converts pinyin to Chinese characters, removes punctuation marks, normalizes digits, and so on.
The second part identifies the characters in the address that denote the province, city, and district, extracts those characters separately, and removes them from the original address. For example, "Nanjing Baixia District Zhujiang Road No. 696 Room 302" is preprocessed into "Zhujiang Road No. 696 Room 302". In one embodiment, the preprocessing performs the first part before the second part, so that it can cope with traditional characters, pinyin, punctuation, and similar issues occurring inside the strings that denote the province, city, and district. The characters denoting the province, city, and district that are removed from the address can be standardized against a preset vocabulary of administrative divisions; the administrative divisions are then compared on their own, and the result of that comparison is considered together with the result of the subsequent address comparison. Removing the administrative-division strings from the address before the bigram comparison further improves comparison efficiency, and the effect is especially significant in large-scale address comparison.
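The two-part preprocessing of step 202 might look like the sketch below. It is illustrative only: the lookup tables are tiny hypothetical samples, pinyin-to-character conversion is left out because it needs a real dictionary, digit normalization is reduced to folding full-width digits to ASCII with Unicode NFKC normalization, and the Chinese sample address is an assumed rendering of the translated example.

```python
import unicodedata

# Hypothetical sample tables; a real system would need complete ones.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans({"區": "区", "號": "号", "樓": "楼"})
ADMIN_DIVISIONS = ["江苏省", "南京市", "白下区"]


def preprocess(address: str) -> tuple[list[str], str]:
    """Return (administrative divisions found, remaining address)."""
    # Part 1: normalize characters.
    text = unicodedata.normalize("NFKC", address)  # full-width digits -> ASCII
    text = text.translate(TRADITIONAL_TO_SIMPLIFIED)
    text = "".join(
        ch for ch in text
        if not (ch.isspace() or unicodedata.category(ch).startswith("P"))
    )  # drop whitespace and punctuation
    # Part 2: pull out province/city/district strings and remove them.
    divisions = []
    for name in ADMIN_DIVISIONS:
        if name in text:
            divisions.append(name)
            text = text.replace(name, "", 1)
    return divisions, text


if __name__ == "__main__":
    print(preprocess("南京市白下區珠江路６９６號，302室"))
```

Running part 1 before part 2 is what lets the district written with a traditional character ("白下區") still match the simplified vocabulary entry, as the text describes.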
The method then proceeds to step 203, in which at least one bigram is extracted from the third address to form a third vector. The bigrams are extracted in the same way as in the first embodiment; see the method of extracting bigrams from an address in step 101. In one embodiment, if consecutive digits are detected, the run of consecutive digits is treated as one bigram.
The method then proceeds to step 204, in which a bigram inverted index of the addresses in a first address set is obtained. The inverted index is used to look up, by bigram, the addresses that contain that bigram. In one embodiment, the first address set is an existing address base. In one embodiment, the inverted index is generated as follows:
1. Each address in the first address set is preprocessed, in the same way as in step 202.
2. For each preprocessed address, at least one bigram is extracted from the address to form the vector corresponding to that address. The bigrams are extracted in the same way as in step 203.
3. The inverted index is generated from the bigrams corresponding to each address.
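The three steps above end in an index from each bigram to the addresses that contain it. A minimal sketch of that last step follows, assuming the bigrams of every address have already been extracted; the address identifiers and the function name are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(address_bigrams: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each bigram to the set of address ids whose vector contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for address_id, bigrams in address_bigrams.items():
        for bigram in bigrams:
            index[bigram].add(address_id)
    return dict(index)


if __name__ == "__main__":
    # Hypothetical address base with bigrams already extracted.
    base = {
        "addr-1": ["珠江", "江路", "路696", "696号"],
        "addr-2": ["长江", "江路", "路696", "696号"],
    }
    index = build_inverted_index(base)
    print(sorted(index["江路"]))
```

Because the index is built once over the existing address base, each later query only touches the posting sets of its own bigrams instead of scanning every address.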
The method then proceeds to step 205, in which the inverted index is queried with the bigrams in the third vector to obtain the set of addresses that each contain at least one bigram of the third vector (referred to as the second address set). For each address in the second address set, the number of bigrams of the third vector that it contains is counted, and, according to the counts, the N addresses that contain the most bigrams of the third vector are selected as the third address set, where N is a preset positive integer. In one embodiment, N = 1000; in other embodiments, a suitable N can be selected according to the application scenario.
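The candidate selection of step 205 can be sketched as a counting pass over the posting sets. This is an illustration under assumptions: the index is a plain dictionary from bigram to a set of address identifiers, repeated bigrams in the query are counted once, and ties at the cut-off are broken arbitrarily; the names are hypothetical.

```python
from collections import Counter


def select_candidates(query_bigrams: list[str],
                      inverted_index: dict[str, set[str]],
                      n: int = 1000) -> list[str]:
    """Return up to n address ids that share the most bigrams with the query."""
    hits: Counter = Counter()
    for bigram in set(query_bigrams):  # count each distinct bigram once
        for address_id in inverted_index.get(bigram, ()):
            hits[address_id] += 1
    # Addresses with no shared bigram never enter `hits` (second address set).
    return [address_id for address_id, _ in hits.most_common(n)]


if __name__ == "__main__":
    index = {"ab": {"x", "y"}, "bc": {"x"}, "cd": {"z"}}
    print(select_candidates(["ab", "bc", "zz"], index, n=1))
```

Only the selected candidates then go through the more expensive weighted similarity calculation, which is where the speed-up comes from.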
The method then proceeds to step 206, in which the address similarity between the third address and each address in the third address set is calculated one by one according to the method of the first embodiment. It can then be confirmed whether the third address is the same address as some address in the first address set, or the address with the highest similarity to the third address can be found.
After all addresses in the address base have been preprocessed and split into bigrams, an inverted index is built over the bigrams of all addresses and is used to obtain the candidate set of addresses to be compared with a new address. This narrows the comparison range and improves comparison efficiency, so that computational efficiency can still meet commercial requirements even when the address base is very large.
Fig. 5 shows the processing method of one embodiment. The full merchant address base contains a massive number of addresses, and the C2C business new-address base contains newly added addresses. The addresses in the two bases need to be matched, to find, for each address in the new-address base, its similar addresses in the full merchant address base. In advance, each address in the full merchant address base is preprocessed and its character-granularity bigrams are extracted, and an inverted index is then built on the bigrams, yielding a bigram inverted index. For each address in the new-address base, the same preprocessing and bigram extraction are performed first; the bigram inverted index is then queried to obtain the 1000 addresses with the most bigram matches (the top 1000 in descending order) as the candidate address set; then, following the method of the first embodiment, with weighting according to the position of each bigram in its address, the similarity between this address and each of the 1000 candidate addresses is calculated, to find the most similar address or to determine whether an identical address exists.
The effect of this application is illustrated below by comparison with conventional methods.
Conventional address similarity calculation methods include:
1) using the edit distance between the strings;
2) calculating the cosine on vectors obtained after word segmentation or character splitting.
For the following addresses, the distance calculated with the conventional methods is large, whereas the technical solution of this application determines that they are the same address:
Nanjing Baixia District Zhujiang Road No. 696 Room 302
Nanjing city, Zhujiang Road No. 696, Baixia District, Room 302 // no "province" character
Nanjing Zhujiang Road No. 696, 302 // no "room" character, no province or district
Nanjing Zhujiang Road No. 696 Room 302 // digits written in Chinese
The following addresses, by contrast, are actually different, but the conventional methods compute a relatively small distance (a high similarity) for them, whereas the technical solution of this application can accurately determine that they are different:
Nanjing Baixia District Zhujiang Road No. 696 Room 302
Nanjing Baixia District Zhujiang Road No. 696 Room 402 // Room 402 vs. Room 302
Nanjing Baixia District Zhujiang Road No. 196 Room 302 // No. 196 vs. No. 696
Nanjing Baixia District Changjiang Road No. 696 Room 302 // Changjiang Road vs. Zhujiang Road
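For comparison, the conventional edit distance mentioned in item 1) can be sketched with the standard Levenshtein dynamic program. The Chinese strings are assumed renderings of the translated examples, used only to show the effect the text describes: a changed room number costs a single edit, while dropping the district from the same address costs three.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    reference = "南京市白下区珠江路696号302室"
    print(edit_distance(reference, "南京市白下区珠江路696号402室"))  # different room
    print(edit_distance(reference, "南京市珠江路696号302室"))        # same place, no district
```

On these strings, plain edit distance ranks the different address as closer than the abbreviated version of the same address.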
The third embodiment of this application relates to an address similarity calculation system, whose basic structure is shown in Fig. 3. The system includes:
a vector obtaining module 301, for obtaining a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one bigram extracted from the corresponding address;
a weighted calculation module 302, for calculating the similarity of the first vector and the second vector, weighted according to the position of each bigram in its address, wherein a bigram that appears earlier in the address has a higher weight than a bigram that appears later in the address.
In one embodiment, calculating the similarity of the first vector and the second vector includes calculating the cosine distance between the first vector and the second vector.
The first embodiment is the method embodiment corresponding to this embodiment. The technical details of the first embodiment apply to this embodiment, and the technical details of this embodiment likewise apply to the first embodiment.
The fourth embodiment of this application relates to an address similarity calculation system, whose basic structure is shown in Fig. 4. The system includes:
An address acquisition module 401, for obtaining a third address.
A preprocessing module 402, for preprocessing the first address and outputting the result to the address acquisition module. The preprocessing module is optional. In one embodiment, the preprocessing includes two parts:
First, traditional Chinese characters are converted to simplified Chinese characters, pinyin is converted to Chinese characters, punctuation marks are removed, and digits are normalized.
Second, the characters in the address that denote the province, city, and district are identified, extracted separately, and removed from the original address.
A bigram extraction module 403, for extracting at least one bigram from the third address to form a third vector. In one embodiment, if consecutive digits are detected, the run of consecutive digits is treated as one bigram.
An inverted index module 404, for obtaining a bigram inverted index of the addresses in a first address set.
A query statistics module 405, for querying the inverted index with the bigrams in the third vector to obtain a second address set of addresses that each contain at least one bigram of the third vector, counting, for each address in the second address set, how many bigrams of the third vector it contains, and selecting, according to the counts, the N addresses that contain the most bigrams of the third vector as a third address set, where N is a preset positive integer.
A similarity calculation module 406, for calculating one by one, according to the method of the first embodiment, the address similarity between the third address and each address in the third address set. The similarity calculation module can be implemented using the technical solution of the third embodiment.
The second embodiment is the method embodiment corresponding to this embodiment. The technical details of the second embodiment apply to this embodiment, and the technical details of this embodiment likewise apply to the second embodiment.
It should be noted that, as those skilled in the art will appreciate, the functions of the modules shown in the third and fourth embodiments above can be implemented by a program (executable instructions) running on a processor, or by specific logic circuits. If the address similarity calculation system of the third and fourth embodiments is implemented in the form of software functional modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solutions of the embodiments of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods of the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. The embodiments of this application are therefore not limited to any specific combination of hardware and software.
Correspondingly, another embodiment of this application provides a computer storage medium storing computer-executable instructions which, when executed by a processor, implement the method embodiments of this application.
In addition, another embodiment of this application provides an address similarity calculation system that includes a memory for storing computer-executable instructions, and a processor. The processor implements the steps of the method embodiments above when executing the computer-executable instructions in the memory.
It should be noted that relational terms such as first and second and the like are only in the application documents of this patent For distinguishing one entity or operation from another entity or operation, without necessarily requiring or implying these entities Or there are any actual relationship or orders between operation.Moreover, the terms "include", "comprise" or its any other Variant is intended to non-exclusive inclusion, so that the process, method, article or equipment including a series of elements is not only It including those elements, but also including other elements that are not explicitly listed, or further include for this process, method, object Product or the intrinsic element of equipment.In the absence of more restrictions, the element limited by sentence " including one ", not There is also other identical elements in the process, method, article or equipment for including element for exclusion.The application documents of this patent In, if it is mentioned that certain behavior is executed according to certain element, then refers to the meaning for executing the behavior according at least to the element, including Two kinds of situations: the behavior is executed according only to the element and the behavior is executed according to the element and other elements.Multiple, multiple, A variety of equal expression include 2,2 times, 2 kinds and 2 or more, 2 times or more, two or more.
All documents mentioned in the present application are deemed to be incorporated in their entirety into the disclosure of the present application, so that they may serve as a basis for amendment where necessary. In addition, it should be understood that, after reading the above disclosure, those skilled in the art may make various changes or modifications to the present application, and such equivalent forms likewise fall within the scope claimed by the present application.

Claims (15)

1. An address similarity calculating method, characterized by comprising:
obtaining a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one binary sequence extracted from the corresponding address;
calculating the similarity of the first vector and the second vector with weighting according to the positions of the binary sequences in the addresses, wherein, in the weighting, a binary sequence that appears earlier in an address has a higher weight than a binary sequence that appears later in that address.
2. The method of claim 1, characterized in that calculating the similarity of the first vector and the second vector includes calculating the cosine distance between the first vector and the second vector.
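The position-weighted comparison of claims 1 and 2 can be illustrated with a minimal sketch. Several points here are assumptions and are not taken from the claims: a "binary sequence" is read as a sequence of two adjacent characters, the weight of a sequence decays geometrically with its position (the `decay` parameter is hypothetical), and the cosine measure is computed over sparse vectors keyed by sequence. All function names are illustrative only.

```python
import math

def extract_sequences(address):
    """Return the overlapping two-character sequences in order of appearance."""
    return [address[i:i + 2] for i in range(len(address) - 1)]

def weighted_vector(address, decay=0.9):
    """Map each sequence to a weight that shrinks with its position (assumed scheme)."""
    vector = {}
    for position, sequence in enumerate(extract_sequences(address)):
        # Keep the weight of the earliest occurrence, which is the largest.
        vector.setdefault(sequence, decay ** position)
    return vector

def weighted_cosine(first, second):
    """Cosine of the angle between two sparse position-weighted vectors."""
    dot = sum(w * second[s] for s, w in first.items() if s in second)
    norm_first = math.sqrt(sum(w * w for w in first.values()))
    norm_second = math.sqrt(sum(w * w for w in second.values()))
    if norm_first == 0 or norm_second == 0:
        return 0.0  # an address shorter than two characters has no sequences
    return dot / (norm_first * norm_second)

def address_similarity(first_address, second_address, decay=0.9):
    """Compare two addresses, giving leading sequences more influence."""
    return weighted_cosine(weighted_vector(first_address, decay),
                           weighted_vector(second_address, decay))
```

Under these assumptions, two addresses that differ only near the end score higher than two addresses that differ near the beginning, which is the effect the weighting in claim 1 describes.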
3. An address similarity calculating method, characterized by comprising:
obtaining a third address;
extracting at least one binary sequence from the third address to constitute a third vector;
obtaining a binary sequence inverted index of the addresses in a first address set;
querying the inverted index according to the binary sequences in the third vector to obtain a second address set of addresses that each contain at least any one binary sequence of the third vector; counting, for each address in the second address set, the number of binary sequences of the third vector that the address contains; and selecting, according to the counting result, the N addresses that contain the most binary sequences of the third vector as a third address set, wherein N is a preset positive integer;
calculating, one by one, according to the method of claim 1, the address similarity between the third address and each address in the third address set.
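The candidate selection step of claim 3 can be sketched as follows. The sketch assumes the inverted index is a plain dictionary from a sequence to the identifiers of the addresses containing it. The function name, the tie-breaking rule, and the example index are hypothetical. The final similarity calculation of the claim is not repeated here.

```python
from collections import Counter

def select_candidates(query_sequences, inverted_index, n):
    """Return up to n address ids that share the most sequences with the query."""
    shared_counts = Counter()
    for sequence in set(query_sequences):
        # Every address listed under a query sequence belongs to the second set.
        for address_id in inverted_index.get(sequence, ()):
            shared_counts[address_id] += 1
    # Sort by descending count, then by id so that ties resolve deterministically.
    ranked = sorted(shared_counts.items(), key=lambda item: (-item[1], item[0]))
    return [address_id for address_id, _ in ranked[:n]]

# Hypothetical index: two-character sequence -> ids of addresses containing it.
example_index = {
    "ab": {0, 1},
    "bc": {0, 1, 2},
    "cd": {0, 2},
    "xy": {3},
}
```

Restricting the expensive pairwise similarity calculation to the N selected addresses is what keeps the query cost manageable when the first address set is large.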
4. The method of claim 3, characterized in that, before the extracting of at least one binary sequence from the third address, the method further comprises: pre-processing the third address.
5. The method of claim 4, characterized in that the pre-processing includes one of the following or any combination thereof:
converting traditional Chinese characters to simplified Chinese characters;
converting pinyin to Chinese characters;
removing punctuation marks;
standardizing digits;
identifying the characters in the address that represent the province, city, and district, extracting those characters separately, and removing them from the original address.
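Three of the pre-processing options listed in claim 5 can be sketched with the Python standard library. The sketch makes several assumptions: digit standardization is read as folding full-width digits to ASCII through Unicode NFKC normalization, punctuation is identified by Unicode category, and region names are matched against a tiny hypothetical gazetteer (`KNOWN_REGIONS`). Conversion from traditional to simplified characters and from pinyin to characters is omitted because it requires conversion dictionaries.

```python
import unicodedata

# Hypothetical gazetteer; a real system would load a full administrative list.
KNOWN_REGIONS = ("浙江省", "杭州市", "西湖区")

def normalize_digits(text):
    """Fold compatibility forms such as full-width digits to ASCII via NFKC."""
    return unicodedata.normalize("NFKC", text)

def remove_punctuation(text):
    """Drop every character whose Unicode category is a punctuation class."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )

def split_regions(text, regions=KNOWN_REGIONS):
    """Pull known region names out of the address and return both parts."""
    found = []
    for region in regions:
        if region in text:
            found.append(region)
            text = text.replace(region, "", 1)
    return found, text

def preprocess(address):
    """Apply digit folding, punctuation removal, then region extraction."""
    cleaned = remove_punctuation(normalize_digits(address))
    return split_regions(cleaned)
```

For example, `preprocess("浙江省杭州市西湖区，文三路１２３号")` separates the three region names from the remainder `文三路123号`.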
6. The method of claim 3, characterized in that the extracting of at least one binary sequence from the first address further comprises:
if consecutive digits are detected, treating the consecutive digits as one binary sequence.
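The digit handling of claim 6 can be sketched as below. The interpretation is an assumption: a run of consecutive digits is emitted whole as a single sequence, and two-character sequences are formed only from adjacent non-digit characters. The claim does not say how a character next to a digit run is paired, so this sketch simply leaves such pairs out.

```python
import re

# A run of digits is one token; any other single character is one token.
TOKEN_PATTERN = re.compile(r"\d+|\D")

def extract_sequences(address):
    """Extract sequences, keeping each run of digits as one sequence."""
    tokens = TOKEN_PATTERN.findall(address)
    sequences = []
    for i, token in enumerate(tokens):
        if token.isdecimal():
            # The whole digit run counts as a single sequence.
            sequences.append(token)
        elif i + 1 < len(tokens) and not tokens[i + 1].isdecimal():
            # Pair two adjacent non-digit characters.
            sequences.append(token + tokens[i + 1])
    return sequences
```

Keeping a house or building number intact avoids matching unrelated addresses on fragments of a number, for instance "12" inside both "123" and "512".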
7. The method of claim 5, characterized in that the inverted index is generated as follows:
performing the pre-processing on each address in the first address set;
for each pre-processed address, extracting at least one binary sequence from the address to constitute a vector corresponding to the address;
generating the inverted index according to the binary sequences corresponding to each address.
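The three index generation steps of claim 7 can be sketched as a single pass over the first address set. The pre-processing and extraction steps are passed in as functions. The two stand-ins shown (`strip_spaces`, `two_char_sequences`) are hypothetical placeholders, not the pre-processing or extraction defined by the application.

```python
from collections import defaultdict

def build_inverted_index(addresses, preprocess, extract):
    """Return (index, vectors): sequence -> address ids, and id -> sequences."""
    index = defaultdict(set)
    vectors = {}
    for address_id, raw_address in enumerate(addresses):
        # Step 1: pre-process; step 2: extract the sequences of this address.
        sequences = extract(preprocess(raw_address))
        vectors[address_id] = sequences
        # Step 3: register the address under every sequence it contains.
        for sequence in sequences:
            index[sequence].add(address_id)
    return dict(index), vectors

def strip_spaces(address):
    """Stand-in pre-processing: remove spaces only."""
    return address.replace(" ", "")

def two_char_sequences(address):
    """Stand-in extraction: overlapping two-character sequences."""
    return [address[i:i + 2] for i in range(len(address) - 1)]
```

The index is built once for the first address set and can then be reused for every query address.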
8. An address similarity calculation system, characterized by comprising:
a vector obtaining module for obtaining a first vector corresponding to a first address and a second vector corresponding to a second address, wherein each vector includes at least one binary sequence extracted from the corresponding address;
a weighted calculation module for calculating the similarity of the first vector and the second vector with weighting according to the positions of the binary sequences in the addresses, wherein, in the weighting, a binary sequence that appears earlier in an address has a higher weight than a binary sequence that appears later in that address.
9. The system of claim 8, characterized in that calculating the similarity of the first vector and the second vector includes calculating the cosine distance between the first vector and the second vector.
10. An address similarity calculation system, characterized by comprising:
an address acquisition module for obtaining a third address;
a binary sequence extraction module for extracting at least one binary sequence from the third address to constitute a third vector;
an inverted index module for obtaining a binary sequence inverted index of the addresses in a first address set;
a query statistics module for querying the inverted index according to the binary sequences in the third vector to obtain a second address set of addresses that each contain at least any one binary sequence of the third vector, counting, for each address in the second address set, the number of binary sequences of the third vector that the address contains, and selecting, according to the counting result, the N addresses that contain the most binary sequences of the third vector as a third address set, wherein N is a preset positive integer;
a similarity calculation module for calculating, one by one, according to the method of claim 1, the address similarity between the third address and each address in the third address set.
11. The system of claim 10, characterized by further comprising a pre-processing module for pre-processing the first address and outputting the processing result to the address acquisition module.
12. The system of claim 11, characterized in that the pre-processing includes one of the following or any combination thereof:
converting traditional Chinese characters to simplified Chinese characters;
converting pinyin to Chinese characters;
removing punctuation marks;
standardizing digits;
identifying the characters in the address that represent the province, city, and district, extracting those characters separately, and removing them from the original address.
13. The system of claim 10, characterized in that, when extracting binary sequences, the binary sequence extraction module treats consecutive digits, if detected, as one binary sequence.
14. An address similarity calculation system, characterized by comprising:
a memory for storing computer-executable instructions; and
a processor for implementing the steps of the method of any one of claims 1 to 7 when executing the computer-executable instructions.
15. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and the computer-executable instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811309162.8A CN109657034A (en) 2018-11-05 2018-11-05 Address similarity calculating method and its system

Publications (1)

Publication Number Publication Date
CN109657034A true CN109657034A (en) 2019-04-19

Family

ID=66110082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811309162.8A Pending CN109657034A (en) 2018-11-05 2018-11-05 Address similarity calculating method and its system

Country Status (1)

Country Link
CN (1) CN109657034A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101388023A (en) * 2008-09-12 2009-03-18 北京搜狗科技发展有限公司 Data redundancy detection method and system for point of interest in electronic map
WO2011003232A1 (en) * 2009-07-07 2011-01-13 Google Inc. Query parsing for map search
CN106096024A (en) * 2016-06-24 2016-11-09 北京京东尚科信息技术有限公司 The appraisal procedure of address similarity and apparatus for evaluating
CA2956158A1 (en) * 2016-01-28 2017-07-28 Neopost Technologies Method and apparatus for postal address matching

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767936A (en) * 2019-11-07 2020-10-13 北京沃东天骏信息技术有限公司 Address similarity detection method and device
CN115271834A (en) * 2022-09-29 2022-11-01 平安银行股份有限公司 House positioning method and device, computer equipment and readable storage medium
CN115271834B (en) * 2022-09-29 2023-02-03 平安银行股份有限公司 House positioning method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190419
