CN115905464B - Address matching method and device based on repetition weight - Google Patents

Address matching method and device based on repetition weight

Info

Publication number
CN115905464B
Authority
CN
China
Prior art keywords
address
vector
candidate
weight
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211353271.6A
Other languages
Chinese (zh)
Other versions
CN115905464A (en)
Inventor
陆启衡
侯方杰
陶闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Weizhi Zhuoxin Information Technology Co ltd
Original Assignee
Shanghai Weizhi Zhuoxin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Weizhi Zhuoxin Information Technology Co ltd
Priority to CN202211353271.6A
Publication of CN115905464A
Application granted
Publication of CN115905464B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses an address matching method and device based on repetition weight. The method comprises: vectorizing an address to be matched to obtain a corresponding address vector; performing similarity calculation between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any of the candidate addresses, wherein the weighted vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight, and the repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the corresponding address vector or candidate address vector may simultaneously point; and screening out, from the plurality of candidate addresses, the target address corresponding to the address to be matched according to the weighted vector similarity. The present invention can thereby effectively improve the accuracy and efficiency of address matching.

Description

Address matching method and device based on repetition weight
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an address matching method and apparatus based on repetition weight.
Background
With the development of data processing algorithms and the performance of processing equipment, more and more transportation enterprises have begun to adopt data processing technology to process address data. Address matching is an important link in this processing: accurately matching fuzzy or erroneous addresses can effectively improve an enterprise's operating efficiency and revenue. However, existing address matching technology generally considers only similarity calculation on the characterization data of the address characters, and does not consider the influence that the repetition degree of different addresses has on the similarity, so the matching accuracy is low. Existing address matching methods therefore have defects that need to be solved.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an address matching method and device based on repetition weight, which can effectively improve the accuracy and efficiency of address matching.
In order to solve the above technical problem, a first aspect of the present invention discloses an address matching method based on repetition weight, the method comprising:
vectorizing an address to be matched to obtain a corresponding address vector;
performing similarity calculation between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any candidate address, wherein the weighted vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight, and the repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the corresponding address vector or candidate address vector may simultaneously point;
and screening out, from the plurality of candidate addresses, the target address corresponding to the address to be matched according to the weighted vector similarity.
As an optional implementation manner, in the first aspect of the present invention, the repetition weight includes a level repetition weight and/or a scene repetition weight; the level repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the address level to which part or all of the address vector or the candidate address vector belongs may simultaneously point; the scene repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the scene type to which part or all of the address to be matched or the candidate address belongs may simultaneously point; and the data appearance includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern.
In the first aspect of the present invention, the address vector includes a plurality of address fragment vectors corresponding to a plurality of address levels of the address to be matched;
and the vectorizing of the address to be matched to obtain a corresponding address vector includes:
splitting the address to be matched to obtain address fragments to be matched corresponding to a plurality of address levels;
and carrying out vectorization processing on the address fragments to be matched to obtain corresponding address fragment vectors.
As an optional implementation manner, in the first aspect of the present invention, the performing of similarity calculation between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any one of the candidate addresses includes:
for any one of the plurality of candidate addresses in the preset address database, acquiring the candidate address fragment vectors of the candidate address fragments of the plurality of address levels corresponding to the candidate address;
calculating the weighted vector similarity between any one of the address fragment vectors corresponding to the address to be matched and the candidate address fragment vector of the same address level, wherein this weighted vector similarity is the product of the similarity between the address fragment vector and the candidate address fragment vector of the same level and the repetition weight;
and calculating the sum of the weighted vector similarities corresponding to at least two address fragment vectors of the address to be matched to obtain the weighted vector similarity between the address to be matched and the candidate address.
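The per-level calculation above can be sketched as follows. This is a minimal illustration only: the function names, the level names ("city", "road", "room"), the toy two-dimensional vectors, and the weight values are all assumptions introduced for the example and do not come from the original text. Cosine similarity is used as one of the similarity measures the text permits.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_address_similarity(query_fragments, candidate_fragments, level_weights):
    """Sum, over address levels present in both addresses, of
    (fragment similarity at that level) * (repetition weight of that level)."""
    total = 0.0
    for level, query_vec in query_fragments.items():
        if level not in candidate_fragments:
            continue  # no fragment of the same level to compare against
        sim = cosine_similarity(query_vec, candidate_fragments[level])
        # Assumption: a level without a configured weight counts fully (1.0).
        total += sim * level_weights.get(level, 1.0)
    return total


# Hypothetical fragment vectors keyed by address level.
query = {"city": [1.0, 0.0], "road": [0.0, 1.0], "room": [1.0, 1.0]}
candidate = {"city": [1.0, 0.0], "road": [1.0, 0.0], "room": [1.0, 1.0]}
# Hypothetical repetition weights: coarse levels count more than room numbers.
weights = {"city": 1.0, "road": 0.5, "room": 0.1}

score = weighted_address_similarity(query, candidate, weights)
```

In this toy case the matching room fragment adds only about 0.1 to the total, while the matching city fragment adds 1.0, which reflects the intent that highly duplicated levels influence the result less.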
As an optional implementation manner, in the first aspect of the present invention, the similarity includes a cosine distance and/or a Euclidean distance; and/or the level repetition weight is inversely proportional to the level refinement degree, among all address levels, of the address level to which part or all of the address vector or the candidate address vector belongs; and/or the scene type includes an arbitrary naming scene and a non-arbitrary naming scene, wherein the scene repetition weight corresponding to the arbitrary naming scene is lower than the scene repetition weight corresponding to the non-arbitrary naming scene.
As an optional implementation manner, in the first aspect of the present invention, the level repetition weight corresponding to any one of the address levels is calculated by the following steps:
for any address level, acquiring a plurality of pieces of address fragment information corresponding to the address level;
screening out a plurality of repeated fragment sets corresponding to the address level according to the entity address object corresponding to each piece of address fragment information, wherein each repeated fragment set includes a plurality of pieces of address fragment information that have the same data appearance but correspond to different entity address objects;
and determining the level repetition weight corresponding to the address level according to the number of pieces of address fragment information included in all the repeated fragment sets;
wherein the determining of the level repetition weight corresponding to the address level according to the number of pieces of address fragment information included in all the repeated fragment sets includes:
calculating a statistical value of the numbers of pieces of address fragment information included in all the repeated fragment sets, wherein the statistical value includes at least one of a sum, an average, and a weighted average;
and determining the level repetition weight corresponding to the address level according to the statistical value, wherein the level repetition weight is inversely proportional to the statistical value.
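The steps above can be sketched as follows, under stated assumptions: each piece of address fragment information is represented as a hypothetical `(appearance, entity_id)` pair, the size of a repeated fragment set is approximated by the number of distinct entities sharing one appearance, the weight is taken as the reciprocal of the statistical value, and a level with no repeated fragment set receives a weight of 1.0. None of these choices is specified in the original text, and only the sum and average statistics are shown.

```python
from collections import defaultdict


def level_repetition_weight(fragments, statistic="mean"):
    """Compute an illustrative level repetition weight.

    fragments: iterable of (appearance, entity_id) pairs for ONE address level.
    statistic: "sum" or "mean" over the sizes of the repeated fragment sets.
    """
    entities_by_appearance = defaultdict(set)
    for appearance, entity_id in fragments:
        entities_by_appearance[appearance].add(entity_id)

    # A repeated fragment set: one appearance shared by two or more entities.
    set_sizes = [len(ids) for ids in entities_by_appearance.values() if len(ids) > 1]
    if not set_sizes:
        return 1.0  # assumption: no duplication at this level means full weight

    if statistic == "sum":
        value = sum(set_sizes)
    elif statistic == "mean":
        value = sum(set_sizes) / len(set_sizes)
    else:
        raise ValueError(f"unsupported statistic: {statistic}")

    # Inverse proportionality, with the constant assumed to be 1.
    return 1.0 / value


# Hypothetical road-level data: two road names are shared by several entities.
road_fragments = [
    ("Zhongshan Road", "e1"), ("Zhongshan Road", "e2"), ("Zhongshan Road", "e3"),
    ("Jiefang Road", "e4"), ("Jiefang Road", "e5"),
    ("Century Avenue", "e6"),
]
mean_weight = level_repetition_weight(road_fragments, "mean")
sum_weight = level_repetition_weight(road_fragments, "sum")
```

Here the repeated fragment sets have sizes 3 and 2, so the average is 2.5 and the sum is 5; the more duplication a level shows, the smaller its weight becomes.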
In the first aspect of the present invention, the screening out, from the plurality of candidate addresses, of the target address corresponding to the address to be matched according to the weighted vector similarity includes:
sorting the plurality of candidate addresses in descending order of weighted vector similarity to obtain an address sequence;
determining a preset number of top-ranked candidate addresses in the address sequence as the target addresses corresponding to the address to be matched;
and/or,
screening out, from the plurality of candidate addresses, at least one candidate address whose weighted vector similarity is greater than a preset similarity threshold, and determining it as a target address corresponding to the address to be matched.
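The two screening options can be sketched as a single helper. The function name, the parameter names `top_k` and `threshold`, and the sample scores are hypothetical; the sketch assumes the weighted vector similarities have already been computed.

```python
def select_targets(scored_candidates, top_k=None, threshold=None):
    """Pick target addresses from (address, weighted_similarity) pairs.

    top_k:     keep at most this many highest-scoring candidates (optional).
    threshold: keep only candidates scoring strictly above it (optional).
    Either option may be used alone, or both together.
    """
    # Sort in descending order of weighted vector similarity.
    ranked = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [item for item in ranked if item[1] > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [address for address, _ in ranked]


scored = [("A", 0.4), ("B", 0.9), ("C", 0.7)]
by_rank = select_targets(scored, top_k=2)
by_threshold = select_targets(scored, threshold=0.5)
by_both = select_targets(scored, top_k=2, threshold=0.8)
```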
A second aspect of the present invention discloses an address matching device based on repetition weight, the device comprising:
an address processing module, configured to vectorize an address to be matched to obtain a corresponding address vector;
a similarity calculation module, configured to perform similarity calculation between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any candidate address, wherein the weighted vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight, and the repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the corresponding address vector or candidate address vector may simultaneously point;
and an address screening module, configured to screen out, from the plurality of candidate addresses, the target address corresponding to the address to be matched according to the weighted vector similarity.
In the second aspect of the present invention, the repetition weight includes a level repetition weight and/or a scene repetition weight; the level repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the address level to which part or all of the address vector or the candidate address vector belongs may simultaneously point; the scene repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the scene type to which part or all of the address to be matched or the candidate address belongs may simultaneously point; and the data appearance includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern.
As an optional implementation manner, in the second aspect of the present invention, the address vector includes a plurality of address fragment vectors corresponding to a plurality of address levels of the address to be matched, and the candidate address vector includes a plurality of candidate address fragment vectors corresponding to a plurality of address levels of the candidate address;
and the specific manner in which the address processing module vectorizes the address to be matched to obtain a corresponding address vector includes:
splitting the address to be matched to obtain address fragments to be matched corresponding to a plurality of address levels;
and carrying out vectorization processing on the address fragments to be matched to obtain corresponding address fragment vectors.
In the second aspect of the present invention, the specific manner in which the similarity calculation module performs similarity calculation between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any one of the candidate addresses includes:
for any one of the plurality of candidate addresses in the preset address database, acquiring the candidate address fragment vectors of the candidate address fragments of the plurality of address levels corresponding to the candidate address;
calculating the weighted vector similarity between any one of the address fragment vectors corresponding to the address to be matched and the candidate address fragment vector of the same address level, wherein this weighted vector similarity is the product of the similarity between the address fragment vector and the candidate address fragment vector of the same level and the repetition weight;
and calculating the sum of the weighted vector similarities corresponding to at least two address fragment vectors of the address to be matched to obtain the weighted vector similarity between the address to be matched and the candidate address.
As an optional implementation manner, in the second aspect of the present invention, the similarity includes a cosine distance and/or a Euclidean distance; and/or the level repetition weight is inversely proportional to the level refinement degree, among all address levels, of the address level to which part or all of the address vector or the candidate address vector belongs; and/or the scene type includes an arbitrary naming scene and a non-arbitrary naming scene, wherein the scene repetition weight corresponding to the arbitrary naming scene is lower than the scene repetition weight corresponding to the non-arbitrary naming scene.
As an optional implementation manner, in the second aspect of the present invention, the device further includes a weight calculation module, configured to calculate the level repetition weight corresponding to any one of the address levels by performing the following steps:
for any address level, acquiring a plurality of pieces of address fragment information corresponding to the address level;
screening out a plurality of repeated fragment sets corresponding to the address level according to the entity address object corresponding to each piece of address fragment information, wherein each repeated fragment set includes a plurality of pieces of address fragment information that have the same data appearance but correspond to different entity address objects;
and determining the level repetition weight corresponding to the address level according to the number of pieces of address fragment information included in all the repeated fragment sets;
wherein the specific manner in which the weight calculation module determines the level repetition weight corresponding to the address level according to the number of pieces of address fragment information included in all the repeated fragment sets includes:
calculating a statistical value of the numbers of pieces of address fragment information included in all the repeated fragment sets, wherein the statistical value includes at least one of a sum, an average, and a weighted average;
and determining the level repetition weight corresponding to the address level according to the statistical value, wherein the level repetition weight is inversely proportional to the statistical value.
In the second aspect of the present invention, the specific manner in which the address screening module screens out, from the plurality of candidate addresses, the target address corresponding to the address to be matched according to the weighted vector similarity includes:
sorting the plurality of candidate addresses in descending order of weighted vector similarity to obtain an address sequence;
determining a preset number of top-ranked candidate addresses in the address sequence as the target addresses corresponding to the address to be matched;
and/or,
screening out, from the plurality of candidate addresses, at least one candidate address whose weighted vector similarity is greater than a preset similarity threshold, and determining it as a target address corresponding to the address to be matched.
A third aspect of the present invention discloses another address matching device based on repetition weight, the device comprising:
a memory storing executable program code;
a processor coupled to the memory;
the processor invokes the executable program code stored in the memory to perform some or all of the steps in the address matching method based on the repetition weight disclosed in the first aspect of the present invention.
A fourth aspect of the present invention discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute some or all of the steps of the address matching method based on repetition weight disclosed in the first aspect of the present invention.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
The embodiment of the invention discloses an address matching method and device based on repetition weight. The method includes: vectorizing an address to be matched to obtain a corresponding address vector; performing similarity calculation between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any candidate address, wherein the weighted vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight, and the repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the corresponding address vector or candidate address vector may simultaneously point; and screening out, from the plurality of candidate addresses, the target address corresponding to the address to be matched according to the weighted vector similarity. The embodiment of the invention can therefore take the repetition weight fully into account when calculating the similarity between the address to be matched and the plurality of candidate addresses, so that the degree of similarity between each candidate address and the address to be matched is determined more accurately, and the accuracy and efficiency of address matching are effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an address matching method based on repetition weight according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an address matching device based on repetition weight according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of another address matching device based on repetition weight according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of protection of the present invention.
The terms "first," "second," and the like in the description, the claims, and the above-described figures are used to distinguish between different objects and not necessarily to describe a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the listed steps or elements but may optionally include other steps or elements not expressly listed or inherent to such process, method, article, or device.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The invention discloses an address matching method and device based on repetition weight, which can take the repetition weight fully into account when calculating the similarity between an address to be matched and a plurality of candidate addresses, so that the degree of similarity between each candidate address and the address to be matched is determined more accurately, thereby effectively improving the accuracy and efficiency of address matching. Details are described below.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of an address matching method based on repetition weight according to an embodiment of the present invention. The address matching method based on the repetition weight described in fig. 1 is applied to an address data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 1, the address matching method based on the repetition degree weight may include the following operations:
101. Vectorizing the address to be matched to obtain a corresponding address vector.
Optionally, the address to be matched may be an address input by a user or determined by the system according to a preset rule. It is generally an address that has not yet been associated with an entity address object, so the corresponding entity address object needs to be determined by matching. Optionally, the address to be matched may include information of a plurality of address levels.
Optionally, the vectorization processing of the present invention may be performed using a vectorization model for words or characters. For example, addresses or address fragments may be vectorized using a pre-trained word vector model, or using the feature extractor of a neural network model trained for word prediction.
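As a rough stand-in for the vectorization step, the following sketch hashes character bigrams of an address fragment into a fixed-length count vector. It is not the pre-trained word vector model or neural feature extractor that the text mentions; the function name `fragment_vector`, the dimension of 16, and the boundary markers `^` and `$` are assumptions made only so that the example is self-contained and deterministic.

```python
import hashlib


def fragment_vector(text, dim=16):
    """Map an address fragment to a fixed-length vector by hashing its
    character bigrams into `dim` buckets and counting the hits."""
    vec = [0.0] * dim
    padded = f"^{text}$"  # mark the start and end of the fragment
    for i in range(len(padded) - 1):
        bigram = padded[i:i + 2]
        digest = hashlib.md5(bigram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    return vec


road_vec = fragment_vector("Century Avenue")
same_vec = fragment_vector("Century Avenue")
```

Fragments that share many bigrams fall into many of the same buckets, so a cosine or Euclidean measure over these vectors gives a crude notion of textual similarity. A trained embedding model would normally replace this function.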
102. Performing similarity calculation between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain the weighted vector similarity between the address to be matched and any candidate address.
Specifically, the weighted vector similarity includes the product of the similarity between the address vector and the candidate address vector and a repetition weight, where the repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the corresponding address vector or candidate address vector may simultaneously point.
By setting the repetition weight, the degree to which the address corresponding to an address vector or candidate address vector is likely to share its name with other addresses can be effectively characterized. For example, if the address corresponding to an address vector is likely to correspond to a plurality of different addresses, the address vector has a high degree of name duplication, and its weight in calculating the total similarity should be reduced. Because the repetition weight is inversely proportional to the degree of name duplication, the accuracy of the finally calculated similarity can be effectively improved.
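As an illustrative formalization of this relationship, the weighted vector similarity between an address to be matched q and a candidate address c can be written as below. The symbols are not taken from the original text, and the proportionality constant k is an assumption.

```latex
S_w(q, c) = w \cdot \operatorname{sim}(\mathbf{v}_q, \mathbf{v}_c),
\qquad w = \frac{k}{n}, \quad k > 0,\ n \ge 1
```

Here n denotes the number of different entity objects to which the same data appearance may simultaneously point. Taking k = 1 and a raw similarity of 0.9, an appearance that points to a single entity (n = 1) keeps a weighted similarity of 0.9, whereas an appearance shared by four entities (n = 4) is reduced to 0.225.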
Optionally, the repetition weight may include a level repetition weight and/or a scene repetition weight; for example, it may be the product of the level repetition weight and the scene repetition weight.
The level repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the address level to which part or all of the address vector or candidate address vector belongs may simultaneously point. The address level may be a level obtained by dividing addresses into different levels manually or according to a preset rule. For example, it may be an administrative level, such as city, county, or district, or a finer living-area level, such as house, bedroom, or bathroom, which is not limited by the present invention.
Optionally, the level repetition weight may be inversely proportional to the level refinement degree, among all address levels, of the address level to which part or all of the address vector or candidate address vector belongs. This is because the finer an address level is, the more likely a name at that level is shared by multiple address objects. For example, province, city, and road names correspond to few same-name entities and contribute strongly to determining an address, whereas building numbers and room numbers are highly duplicated and contribute little.
The scene repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the scene type to which part or all of the address to be matched or the candidate address belongs may simultaneously point. Optionally, the scene type indicates the scene function type of part or all of the address to be matched or the candidate address. Scene function types may be defined along different dimensions; for example, a scene function type may be a residential community, a school, a hospital, or a restaurant. Because the degree of name duplication differs between addresses of different scenes, this indicator can be set to characterize it.
Optionally, the scene types include an arbitrary naming scene and a non-arbitrary naming scene, wherein the scene repetition weight corresponding to the arbitrary naming scene is lower than the scene repetition weight corresponding to the non-arbitrary naming scene. Optionally, the arbitrary naming scene indicates a scene address whose naming has a higher degree of freedom, such as a self-operated restaurant or an individually owned business; its name is more likely to be duplicated, so its scene repetition weight should be lower. Optionally, the non-arbitrary naming scene indicates a scene address whose naming has a lower degree of freedom, such as a residential community, school, or hospital address; its name is less likely to be duplicated, so its scene repetition weight should be higher.
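One way to combine the two weights is the product form mentioned above. In the sketch below, the numeric scene weights (0.5 and 1.0), the scene-type keys, and the function name are hypothetical; the text only requires that the arbitrary naming scene receive the lower weight.

```python
# Hypothetical scene repetition weights: freely named places (for example
# restaurants) are assumed to be duplicated more often than schools or hospitals.
SCENE_REPETITION_WEIGHTS = {
    "arbitrary": 0.5,
    "non_arbitrary": 1.0,
}


def combined_repetition_weight(level_weight, scene_type):
    """Repetition weight as the product of a level weight and a scene weight."""
    scene_weight = SCENE_REPETITION_WEIGHTS[scene_type]
    return level_weight * scene_weight


restaurant_weight = combined_repetition_weight(0.4, "arbitrary")
hospital_weight = combined_repetition_weight(0.4, "non_arbitrary")
```

With the same level weight of 0.4, the restaurant fragment ends up with half the influence of the hospital fragment on the final similarity.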
Optionally, the data appearance described in the present invention includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern. Preferably, the data appearance is a data name or a data identifier.
103. Screening out, from the plurality of candidate addresses, the target address corresponding to the address to be matched according to the weighted vector similarity.
Optionally, the entity address object indicated by the target address may be determined as the entity address object to be indicated by the address to be matched, so as to achieve determination of final practical meaning of the address to be matched, and further facilitate subsequent execution of a series of service operations or data processing operations related to the address to be matched according to the entity address object corresponding to the address to be matched.
Therefore, the embodiment of the invention can fully combine the repetition degree weight to calculate the similarity between the address to be matched and the plurality of candidate addresses, so that the similarity between different candidate addresses and the address to be matched can be more accurately determined, and the accuracy and the efficiency of address matching can be effectively improved.
As an optional implementation manner, the address vector comprises a plurality of address fragment vectors corresponding to a plurality of address levels corresponding to the addresses to be matched, and the candidate address vector comprises a plurality of candidate address fragment vectors corresponding to a plurality of address levels corresponding to the candidate addresses.
With this arrangement, similarity can subsequently be calculated between the address fragment vector and the candidate address fragment vector of the same level and then aggregated, so that a more accurate similarity is obtained.
Optionally, in step 101, performing vectorization processing on the address to be matched to obtain the corresponding address vector includes:
splitting the address to be matched to obtain address fragments to be matched corresponding to a plurality of address levels;
and carrying out vectorization processing on the address fragments to be matched to obtain corresponding address fragment vectors.
Optionally, the splitting of the address to be matched may be performed by using an address semantic analysis algorithm model, for example, a pre-trained address word segmentation neural network model or other algorithm models are used to split the address to be matched to obtain address fragments to be matched corresponding to multiple address levels.
Therefore, according to the alternative embodiment, the address to be matched can be split to obtain address fragments to be matched corresponding to a plurality of address levels, vectorization processing is carried out to obtain corresponding address fragment vectors, so that similarity calculation can be carried out on the address vectors of different levels subsequently, more accurate similarity can be obtained through calculation, and therefore accuracy and efficiency of address matching can be effectively improved.
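The splitting and vectorization steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: a hypothetical suffix-based regular expression stands in for the pre-trained address word segmentation model, and a hashed character-bigram count vector stands in for a pre-trained word vector model. The names `LEVEL_SUFFIXES`, `split_address`, `vectorize_fragment` and `vectorize_address`, as well as the English-style sample address, are assumptions introduced for illustration.

```python
import re
import zlib

# Hypothetical (level, suffix) pairs ordered from coarse to fine.
# A real system would use a trained address segmentation model instead.
LEVEL_SUFFIXES = [("city", "City"), ("district", "District"), ("road", "Road")]


def split_address(address: str) -> dict[str, str]:
    """Split an address into one fragment per address level."""
    fragments = {}
    rest = address.strip()
    for level, suffix in LEVEL_SUFFIXES:
        # Take the shortest leading text that ends with the level suffix.
        match = re.match(rf"\s*(.+?\b{suffix}\b)", rest)
        if match:
            fragments[level] = match.group(1).strip()
            rest = rest[match.end():]
    if rest.strip():
        # Whatever remains (e.g. a house number) is kept as the finest level.
        fragments["detail"] = rest.strip()
    return fragments


def vectorize_fragment(fragment: str, dim: int = 64) -> list[float]:
    """Stand-in vectorizer: counts of character bigrams hashed into `dim` buckets."""
    vector = [0.0] * dim
    text = fragment.lower()
    for i in range(len(text) - 1):
        bucket = zlib.crc32(text[i:i + 2].encode("utf-8")) % dim
        vector[bucket] += 1.0
    return vector


def vectorize_address(address: str) -> dict[str, list[float]]:
    """Return one address fragment vector per address level."""
    return {
        level: vectorize_fragment(fragment)
        for level, fragment in split_address(address).items()
    }
```

Keeping one vector per level, keyed by level name, is what allows the later similarity calculation to compare fragments of the same level only.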
As an optional implementation manner, in step 102, performing similarity calculation on the address vector and the candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain the weight vector similarity between the address to be matched and any candidate address includes:
For any one candidate address in a plurality of candidate addresses in a preset address database, acquiring candidate address fragment vectors of candidate address fragments of a plurality of address levels corresponding to the candidate address;
Calculating the weighted vector similarity between any address fragment vector corresponding to the address to be matched and the candidate address fragment vector of the same address hierarchy, wherein the weighted vector similarity is the product of the similarity between the address fragment vector and the candidate address fragment vector of the same hierarchy and the repetition weight;
And calculating the sum of the weighted vector similarity corresponding to at least two address fragment vectors corresponding to the addresses to be matched to obtain the weighted vector similarity between the addresses to be matched and the candidate addresses.
Optionally, the similarity described in the present invention may be based on a cosine distance and/or a Euclidean distance; it may be either one of the two or a weighted sum of both.
Therefore, according to the alternative embodiment, the sum of the weighted vector similarity corresponding to at least two address fragment vectors corresponding to the address to be matched can be calculated, so that the weighted vector similarity between the address to be matched and the candidate address is obtained, and therefore more accurate similarity can be calculated, and the accuracy and efficiency of address matching can be effectively improved.
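The per-level calculation described above can be sketched as follows: for each address level present in both the address to be matched and the candidate address, the fragment similarity is multiplied by the repetition weight of that level, and the products are summed. Cosine similarity is used here as one of the options mentioned above. The function names, the dictionary layout keyed by level name, and the default weight of 1.0 for a level without a configured weight are assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_vector_similarity(query_vectors, candidate_vectors, repetition_weights):
    """Sum over shared levels of (fragment similarity * repetition weight).

    query_vectors / candidate_vectors: {level: fragment vector}
    repetition_weights: {level: weight}; missing levels default to 1.0 (assumption).
    """
    total = 0.0
    for level, query_vector in query_vectors.items():
        if level not in candidate_vectors:
            continue  # only fragments of the same level are compared
        weight = repetition_weights.get(level, 1.0)
        total += weight * cosine_similarity(query_vector, candidate_vectors[level])
    return total
```

With a lower weight on a frequently duplicated level, a mismatch or match on that level moves the total less than one on a level whose names are nearly unique.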
As an optional implementation manner, the level repeatability weight corresponding to any address level may be calculated by the following steps:
For any address hierarchy, acquiring a plurality of address fragment information corresponding to the address hierarchy;
Screening a plurality of repeated segment sets corresponding to the address hierarchy according to the entity address object corresponding to each address segment information, wherein each repeated segment set comprises a plurality of address segment information which have the same data appearance but correspond to different entity address objects;
And determining the level repetition degree weight corresponding to the address level according to the number of the address fragment information included in all the repeated fragment sets.
Optionally, the plurality of address fragment information corresponding to the address hierarchy may be address fragment information associated with a specific address hierarchy stored in a preset address database, and the address fragment information may be address fragments of different data appearance types.
Therefore, according to the alternative embodiment, the level repeatability weight corresponding to the address level can be determined according to the number of the address fragment information included in all the repeated fragment sets, so that more accurate level repeatability weight can be calculated, and the accuracy and efficiency of address matching can be effectively improved.
As an optional implementation manner, in the step, determining the hierarchy repeatability weight corresponding to the address hierarchy according to the number of address fragment information included in all the repeated fragment sets includes:
calculating a statistical value of the number of address fragment information included in all the repeated fragment sets;
and determining the level repetition degree weight corresponding to the address level according to the statistic value.
Alternatively, the statistical value may include at least one of a sum value, an average value, and a weighted average value. In particular, the hierarchy repetition weight should be inversely proportional to the statistics. Alternatively, the hierarchy repeatability weights may be the inverse of the statistics, or other inversely proportional mathematical relationships.
In a particular embodiment, for each field in the address library, the average number of different entities referred to by the same name may be counted. The smaller this average number of duplicate names, the fewer entities a single name can refer to, and the more decisive that field is in determining the address. For example, "Shanghai City" is unique, whereas "Baoshan District" has duplicate names, so the average value of the city level is lower than the average value of the district level, and the final level repeatability weight can be the reciprocal of the average number of duplicate names.
Therefore, according to the alternative implementation mode, the hierarchy repeatability weight corresponding to the address hierarchy can be determined according to the inverse mathematical relation value of the statistical value of the number of the address fragment information included in all the repeated fragment sets, so that more accurate hierarchy repeatability weight can be calculated, and the accuracy and efficiency of address matching can be effectively improved.
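A minimal sketch of the weight calculation follows. It takes the variant described in the particular embodiment above, in which the statistic is the average number of different entities referred to by the same name at one address level and the weight is the reciprocal of that average. The `(name, entity_id)` record layout, the function name, and the choice to return 1.0 for a level with no records are assumptions for illustration.

```python
from collections import defaultdict


def level_repetition_weight(fragment_records):
    """Compute a level repetition weight from (name, entity_id) pairs of one level.

    Fragments with the same name (data appearance) but different entity ids
    form a repeated fragment set; the weight is the reciprocal of the average
    number of distinct entities per name.
    """
    entities_by_name = defaultdict(set)
    for name, entity_id in fragment_records:
        entities_by_name[name].add(entity_id)
    if not entities_by_name:
        return 1.0  # assumption: no data means no evidence of duplicate names
    average = sum(len(ids) for ids in entities_by_name.values()) / len(entities_by_name)
    return 1.0 / average
```

For a level in which every name refers to exactly one entity, the average is 1 and the weight is 1.0; each additional duplicate raises the average and lowers the weight.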
As an optional implementation manner, in step 103, the selecting, according to the similarity of the weight vectors, the target address corresponding to the address to be matched from the plurality of candidate addresses includes:
arranging a plurality of candidate addresses from large to small according to the similarity of the weight vectors to obtain an address sequence;
and determining the first preset number of candidate addresses in the address sequence as the target addresses corresponding to the address to be matched.
Therefore, through the optional implementation manner, the preset number of candidate addresses with the highest similarity of the weight vectors can be determined as the target addresses corresponding to the addresses to be matched, so that an accurate address matching result can be effectively and accurately obtained, and the accuracy and efficiency of address matching are improved.
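The ranking-based screening can be sketched as below, assuming the weight vector similarity of every candidate has already been computed. The function name and the `(address, similarity)` tuple layout are assumptions.

```python
def select_top_candidates(scored_candidates, preset_number):
    """Return the `preset_number` candidates with the highest similarity.

    scored_candidates: iterable of (candidate_address, weight_vector_similarity).
    """
    # Sort from large to small similarity to obtain the address sequence.
    address_sequence = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    return [address for address, _ in address_sequence[:preset_number]]
```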
As an optional implementation manner, in step 103, the selecting, according to the similarity of the weight vectors, the target address corresponding to the address to be matched from the plurality of candidate addresses includes:
Screening out, from the plurality of candidate addresses, at least one candidate address whose weight vector similarity is greater than a preset similarity threshold, and determining it as the target address corresponding to the address to be matched.
Therefore, through the optional implementation manner, at least one candidate address with the similarity of the weight vector larger than the preset similarity threshold value can be screened out from the plurality of candidate addresses and is determined to be the target address corresponding to the address to be matched, so that an accurate address matching result can be effectively and accurately obtained, and the accuracy and efficiency of address matching are improved.
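The threshold-based alternative can be sketched in the same style. The comparison is strictly "greater than", as stated above; the function name and tuple layout are assumptions.

```python
def select_above_threshold(scored_candidates, similarity_threshold):
    """Keep candidates whose weight vector similarity exceeds the threshold.

    scored_candidates: iterable of (candidate_address, weight_vector_similarity).
    """
    return [
        address
        for address, similarity in scored_candidates
        if similarity > similarity_threshold
    ]
```

Unlike the ranking-based variant, this one may return no candidate at all when nothing exceeds the threshold, which a caller would need to handle.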
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of an address matching device based on a repetition weight according to an embodiment of the present invention. The address matching device based on the repetition weight described in fig. 2 is applied to an address data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 2, the address matching device based on the repetition degree weight may include:
the address processing module 201 is configured to perform vectorization processing on an address to be matched, and obtain a corresponding address vector.
Optionally, the address to be matched may be an address input by a user or determined by the system according to a preset rule. It is generally an address whose corresponding entity address object has not yet been determined, so that the corresponding entity address object needs to be determined by matching. Optionally, the address to be matched may include information of a plurality of address levels.
Alternatively, the vectorization processing of the present invention may be performed by using a vectorization algorithm model of a corresponding word or character, for example, vectorization processing of addresses or address fragments may be performed by using some word vector models that are pre-trained, or vectorization processing may be performed by using feature extractors of some neural network models that are related to the trained word prediction.
The similarity calculation module 202 is configured to perform similarity calculation on the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database, so as to obtain a weight vector similarity between the address to be matched and any candidate address.
Specifically, the weight vector similarity includes a product of a similarity between the address vector and the candidate address vector and a repetition weight, where the repetition weight is inversely proportional to the number of objects of different entity objects to which the same data appearance of the corresponding address vector or candidate address vector may be simultaneously directed.
By setting the repetition degree weight, the degree to which the address vector or the candidate address vector may involve duplicate names can be effectively characterized. For example, if the address corresponding to an address vector is likely to correspond to a plurality of different addresses, the address vector has a high degree of duplicate naming, and its weight in the calculation of the total similarity should be reduced. Because the repetition degree weight is inversely proportional to the degree of duplicate naming, the accuracy of the finally calculated similarity can be effectively improved.
Alternatively, the repeatability weights may comprise hierarchical and/or scene repeatability weights, which may comprise, for example, the product of the hierarchical and scene repeatability weights.
Wherein the hierarchy repetition degree weight is inversely proportional to the number of objects of different entity objects to which the same data appearance of the address hierarchy to which some or all of the address vectors or candidate address vectors belong may be simultaneously directed. The address level may be a level obtained by dividing different levels of addresses manually or in a preset manner, for example, it may be an administrative level, such as a city, county, district, etc., or may be a further living area level, such as a house, bedroom, bathroom, etc., which is not limited by the present invention.
Optionally, the hierarchy repeatability weight may be inversely proportional to the degree of level refinement, among all address hierarchies, of the address hierarchy to which some or all of the address vector or the candidate address vector belongs. This is because the higher the degree of level refinement, the more likely names at that address hierarchy are duplicated and refer to more address objects. For example, provinces, cities and roads have few duplicate-named entities and therefore play a large role in determining an address, whereas building numbers and room numbers are highly duplicated and play a smaller role in determining an address.
The scene repetition degree weight is inversely proportional to the number of objects of different entity objects to which the same data appearance of the scene type to which some or all of the addresses in the addresses to be matched or the candidate addresses belong may be simultaneously pointed. Optionally, the scene type is used for indicating the scene function type of part or all of the addresses to be matched or the candidate addresses, and the scene function type can be defined along different dimensions; for example, it can be a district, a school, a hospital or a restaurant. Addresses in different scenes are subject to duplicate naming to different degrees, so this index can be set to characterize that difference.
Optionally, the scene types include an arbitrary naming scene and a non-arbitrary naming scene, wherein the scene repetition weight corresponding to the arbitrary naming scene is lower than the scene repetition weight corresponding to the non-arbitrary naming scene. Optionally, the arbitrary naming scene indicates a scene address with a higher degree of freedom in naming, such as a self-operated restaurant or an individual business, so that the probability of duplicate names is greater and the scene repetition weight should be lower. Optionally, the non-arbitrary naming scene indicates a scene address with a lower degree of freedom in naming, such as a residential community, school or hospital address, so that the probability of duplicate names is smaller and the scene repetition weight should be higher.
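One way to combine the two kinds of weight, following the product option mentioned above, is sketched here. The numeric scene weights are purely illustrative assumptions (the specification gives no values); they only preserve the stated ordering, in which freely named scenes receive a lower weight than scenes with restricted naming. The names `SCENE_WEIGHTS` and `combined_repetition_weight` are likewise hypothetical.

```python
# Hypothetical scene repetition weights. Only the ordering reflects the text:
# freely named scenes (restaurants, individual businesses) get lower weights
# than scenes with restricted naming (schools, hospitals).
SCENE_WEIGHTS = {
    "restaurant": 0.4,
    "individual_business": 0.4,
    "school": 0.9,
    "hospital": 0.9,
}


def combined_repetition_weight(level_weight, scene_type=None):
    """Repetition weight as the product of a level weight and a scene weight.

    An unknown or missing scene type contributes a neutral factor of 1.0
    (assumption), so the level weight is used on its own.
    """
    scene_weight = SCENE_WEIGHTS.get(scene_type, 1.0)
    return level_weight * scene_weight
```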
Optionally, the data appearance described in the present invention includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern. Preferably, it is a data name or a data identifier.
And the address screening module 203 is configured to screen a target address corresponding to the address to be matched from a plurality of candidate addresses according to the similarity of the weight vectors.
Optionally, the entity address object indicated by the target address may be determined as the entity address object to be indicated by the address to be matched, so as to achieve determination of final practical meaning of the address to be matched, and further facilitate subsequent execution of a series of service operations or data processing operations related to the address to be matched according to the entity address object corresponding to the address to be matched.
Therefore, the embodiment of the invention can fully combine the repetition degree weight to calculate the similarity between the address to be matched and the plurality of candidate addresses, so that the similarity between different candidate addresses and the address to be matched can be more accurately determined, and the accuracy and the efficiency of address matching can be effectively improved.
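The cooperation of the three modules can be sketched as a small composition of callables. This is an illustrative structure only: the class name `AddressMatchingDevice`, the field names, and the representation of each module as a plain function are assumptions, and each field merely mirrors the role of module 201, 202 or 203.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vector = List[float]


@dataclass
class AddressMatchingDevice:
    # Role of the address processing module 201: address text -> address vector.
    process_address: Callable[[str], Vector]
    # Role of the similarity calculation module 202: two vectors -> weight vector similarity.
    calculate_similarity: Callable[[Vector, Vector], float]
    # Role of the address screening module 203: scored candidates -> target addresses.
    screen_addresses: Callable[[List[Tuple[str, float]]], List[str]]

    def match(self, address: str, candidate_vectors: Dict[str, Vector]) -> List[str]:
        """Run the three modules in order for one address to be matched."""
        query_vector = self.process_address(address)
        scored = [
            (candidate, self.calculate_similarity(query_vector, vector))
            for candidate, vector in candidate_vectors.items()
        ]
        return self.screen_addresses(scored)
```

Passing the modules in as callables keeps the sketch independent of any particular vectorization model, similarity measure, or screening rule.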
As an optional implementation manner, the address vector comprises a plurality of address fragment vectors corresponding to a plurality of address levels corresponding to the addresses to be matched, and the candidate address vector comprises a plurality of candidate address fragment vectors corresponding to a plurality of address levels corresponding to the candidate addresses.
Through the arrangement, when the similarity is calculated later, the similarity calculation can be performed on the address fragment vector and the candidate address fragment vector of the same level, and then statistics can be performed, so that more accurate similarity can be obtained.
Optionally, the specific manner in which the address processing module 201 performs vectorization processing on the address to be matched to obtain the corresponding address vector includes:
splitting the address to be matched to obtain address fragments to be matched corresponding to a plurality of address levels;
and carrying out vectorization processing on the address fragments to be matched to obtain corresponding address fragment vectors.
Optionally, the splitting of the address to be matched may be performed by using an address semantic analysis algorithm model, for example, a pre-trained address word segmentation neural network model or other algorithm models are used to split the address to be matched to obtain address fragments to be matched corresponding to multiple address levels.
Therefore, according to the alternative embodiment, the address to be matched can be split to obtain address fragments to be matched corresponding to a plurality of address levels, vectorization processing is carried out to obtain corresponding address fragment vectors, so that similarity calculation can be carried out on the address vectors of different levels subsequently, more accurate similarity can be obtained through calculation, and therefore accuracy and efficiency of address matching can be effectively improved.
As an optional implementation manner, the specific manner of performing similarity calculation on the address vector and the candidate address vectors corresponding to the plurality of candidate addresses in the preset address database by the similarity calculation module 202 to obtain the similarity of the weight vector between the address to be matched and any candidate address includes:
For any one candidate address in a plurality of candidate addresses in a preset address database, acquiring candidate address fragment vectors of candidate address fragments of a plurality of address levels corresponding to the candidate address;
Calculating the weighted vector similarity between any address fragment vector corresponding to the address to be matched and the candidate address fragment vector of the same address hierarchy, wherein the weighted vector similarity is the product of the similarity between the address fragment vector and the candidate address fragment vector of the same hierarchy and the repetition weight;
And calculating the sum of the weighted vector similarity corresponding to at least two address fragment vectors corresponding to the addresses to be matched to obtain the weighted vector similarity between the addresses to be matched and the candidate addresses.
Optionally, the similarity described in the present invention may be based on a cosine distance and/or a Euclidean distance; it may be either one of the two or a weighted sum of both.
Therefore, according to the alternative embodiment, the sum of the weighted vector similarity corresponding to at least two address fragment vectors corresponding to the address to be matched can be calculated, so that the weighted vector similarity between the address to be matched and the candidate address is obtained, and therefore more accurate similarity can be calculated, and the accuracy and efficiency of address matching can be effectively improved.
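Where the fragment similarity is taken as a weighted sum of a cosine-based and a Euclidean-based measure, one possible form is sketched below. Two points are assumptions not stated in the specification: the Euclidean distance is converted to a similarity in (0, 1] via 1 / (1 + distance), and the mixing parameter `cosine_share` defaults to 0.5.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_similarity(a, b):
    """Map Euclidean distance to a similarity: identical vectors give 1.0 (assumed mapping)."""
    return 1.0 / (1.0 + math.dist(a, b))


def blended_similarity(a, b, cosine_share=0.5):
    """Weighted sum of the two measures; cosine_share=1.0 uses cosine only."""
    return (
        cosine_share * cosine_similarity(a, b)
        + (1.0 - cosine_share) * euclidean_similarity(a, b)
    )
```

Setting `cosine_share` to 1.0 or 0.0 recovers the single-measure options, so the same function covers all three cases mentioned above.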
As an optional implementation manner, the apparatus further includes a weight calculation module, configured to calculate a hierarchy repeatability weight corresponding to any address hierarchy by performing the following steps:
For any address hierarchy, acquiring a plurality of address fragment information corresponding to the address hierarchy;
Screening a plurality of repeated segment sets corresponding to the address hierarchy according to the entity address object corresponding to each address segment information, wherein each repeated segment set comprises a plurality of address segment information which have the same data appearance but correspond to different entity address objects;
And determining the level repetition degree weight corresponding to the address level according to the number of the address fragment information included in all the repeated fragment sets.
Optionally, the plurality of address fragment information corresponding to the address hierarchy may be address fragment information associated with a specific address hierarchy stored in a preset address database, and the address fragment information may be address fragments of different data appearance types.
Therefore, according to the alternative embodiment, the level repeatability weight corresponding to the address level can be determined according to the number of the address fragment information included in all the repeated fragment sets, so that more accurate level repeatability weight can be calculated, and the accuracy and efficiency of address matching can be effectively improved.
As an optional implementation manner, the specific manner of determining the level repeatability weight corresponding to the address level by the weight calculation module according to the number of address fragment information included in all the repeated fragment sets includes:
calculating a statistical value of the number of address fragment information included in all the repeated fragment sets;
and determining the level repetition degree weight corresponding to the address level according to the statistic value.
Alternatively, the statistical value may include at least one of a sum value, an average value, and a weighted average value. In particular, the hierarchy repetition weight should be inversely proportional to the statistics. Alternatively, the hierarchy repeatability weights may be the inverse of the statistics, or other inversely proportional mathematical relationships.
Therefore, according to the alternative implementation mode, the hierarchy repeatability weight corresponding to the address hierarchy can be determined according to the inverse mathematical relation value of the statistical value of the number of the address fragment information included in all the repeated fragment sets, so that more accurate hierarchy repeatability weight can be calculated, and the accuracy and efficiency of address matching can be effectively improved.
As an optional implementation manner, the specific manner of selecting, by the address screening module 203, the target address corresponding to the address to be matched from the plurality of candidate addresses according to the similarity of the weight vectors includes:
arranging a plurality of candidate addresses from large to small according to the similarity of the weight vectors to obtain an address sequence;
and determining the first preset number of candidate addresses in the address sequence as the target addresses corresponding to the address to be matched.
Therefore, through the optional implementation manner, the preset number of candidate addresses with the highest similarity of the weight vectors can be determined as the target addresses corresponding to the addresses to be matched, so that an accurate address matching result can be effectively and accurately obtained, and the accuracy and efficiency of address matching are improved.
As an optional implementation manner, the specific manner of selecting, by the address screening module 203, the target address corresponding to the address to be matched from the plurality of candidate addresses according to the similarity of the weight vectors includes:
Screening out, from the plurality of candidate addresses, at least one candidate address whose weight vector similarity is greater than a preset similarity threshold, and determining it as the target address corresponding to the address to be matched.
Therefore, through the optional implementation manner, at least one candidate address with the similarity of the weight vector larger than the preset similarity threshold value can be screened out from the plurality of candidate addresses and is determined to be the target address corresponding to the address to be matched, so that an accurate address matching result can be effectively and accurately obtained, and the accuracy and efficiency of address matching are improved.
Example three
Referring to fig. 3, fig. 3 is a schematic diagram illustrating another address matching device based on repetition weight according to an embodiment of the present invention. The address matching device based on the repetition weight described in fig. 3 is applied to an address data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 3, the address matching device based on the repetition degree weight may include:
a memory 301 storing executable program code;
a processor 302 coupled with the memory 301;
Wherein the processor 302 invokes executable program code stored in the memory 301 for performing the steps of the address matching method based on the repetition weight described in embodiment one.
Example four
The embodiment of the invention discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the steps of the address matching method based on the repetition weight described in embodiment one.
Example five
The embodiment of the invention discloses a computer program product, which comprises a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform the steps of the address matching method based on the repetition weight described in embodiment one.
The foregoing describes certain embodiments of the present disclosure, other embodiments being within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings do not necessarily have to be in the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, device and non-transitory computer-readable storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.
The apparatus, the device, the non-volatile computer-readable storage medium and the method provided in the embodiments of the present disclosure correspond to one another. Therefore, the apparatus, the device and the non-volatile computer storage medium have beneficial technical effects similar to those of the corresponding method. Because the beneficial technical effects of the method have been described in detail above, those of the corresponding apparatus, device and non-volatile computer storage medium are not repeated here.
In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to current method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained merely by slightly logic-programming the method flow in one of the hardware description languages described above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logic-program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided by function into various units. Of course, when this specification is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
Those skilled in the art will appreciate that embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include, among computer-readable media, volatile memory such as random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises that element.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
In this specification, the embodiments are described progressively; identical or similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the corresponding parts of the description of the method embodiments.
Finally, it should be noted that the address matching method and apparatus based on repetition weight disclosed in the embodiments of the present invention are merely preferred embodiments, intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. An address matching method based on repetition weight, characterized in that the method comprises:
performing vectorization on an address to be matched to obtain a corresponding address vector;
calculating similarity between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database, to obtain a weight vector similarity between the address to be matched and any of the candidate addresses; the weight vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight; the repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the corresponding address vector or candidate address vector may simultaneously point;
selecting, according to the weight vector similarity, a target address corresponding to the address to be matched from the plurality of candidate addresses;
wherein the repetition weight includes a level repetition weight and a scene repetition weight; the level repetition weight is inversely proportional to the number of different entity objects to which the same data appearance may simultaneously point within the address level to which some or all vectors in the address vector or the candidate address vector belong; the scene repetition weight is inversely proportional to the number of different entity objects to which the same data appearance may simultaneously point within the scene type to which some or all addresses in the address to be matched or the candidate address belong; the data appearance includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern;
wherein the level repetition weight corresponding to any address level is calculated by the following steps:
for any address level, obtaining a plurality of pieces of address fragment information corresponding to that address level;
screening out, according to the physical address object corresponding to each piece of address fragment information, a plurality of repeated fragment sets corresponding to that address level; each repeated fragment set includes a plurality of pieces of address fragment information that have the same data appearance but correspond to different physical address objects;
determining the level repetition weight corresponding to that address level according to the number of pieces of address fragment information included in all the repeated fragment sets;
and wherein determining the level repetition weight corresponding to that address level according to the number of pieces of address fragment information included in all the repeated fragment sets comprises:
calculating a statistical value of the number of pieces of address fragment information included in all the repeated fragment sets; the statistical value includes at least one of a sum, an average, and a weighted average;
determining the level repetition weight corresponding to that address level according to the statistical value; the level repetition weight is inversely proportional to the statistical value.
2. The address matching method based on repetition weight according to claim 1, characterized in that the address vector comprises a plurality of address fragment vectors corresponding to a plurality of address levels of the address to be matched; the candidate address vector comprises a plurality of candidate address fragment vectors corresponding to a plurality of address levels of the candidate address;
and performing vectorization on the address to be matched to obtain the corresponding address vector comprises:
splitting the address to be matched to obtain address fragments to be matched corresponding to a plurality of address levels;
performing vectorization on the address fragments to be matched to obtain the corresponding address fragment vectors.
3. The address matching method based on repetition weight according to claim 2, characterized in that calculating similarity between the address vector and the candidate address vectors corresponding to the plurality of candidate addresses in the preset address database, to obtain the weight vector similarity between the address to be matched and any of the candidate addresses, comprises:
for any candidate address among the plurality of candidate addresses in the preset address database, obtaining the candidate address fragment vectors of the candidate address fragments at a plurality of address levels corresponding to that candidate address;
calculating a weighted vector similarity between any address fragment vector corresponding to the address to be matched and the candidate address fragment vector at the same address level; the weighted vector similarity is the product of the similarity between the address fragment vector and the candidate address fragment vector at the same level and the repetition weight;
calculating the sum of the weighted vector similarities corresponding to at least two address fragment vectors of the address to be matched, to obtain the weight vector similarity between the address to be matched and that candidate address.
4. The address matching method based on repetition weight according to claim 1, characterized in that the similarity includes cosine distance and/or Euclidean distance; and/or the level repetition weight is inversely proportional to the degree of refinement, among all address levels, of the address level to which some or all vectors in the address vector or the candidate address vector belong; and/or the scene type includes an arbitrarily named scene and a non-arbitrarily named scene, wherein the scene repetition weight corresponding to the arbitrarily named scene is lower than the scene repetition weight corresponding to the non-arbitrarily named scene.
5. The address matching method based on repetition weight according to claim 1, characterized in that selecting, according to the weight vector similarity, the target address corresponding to the address to be matched from the plurality of candidate addresses comprises:
sorting the plurality of candidate addresses in descending order of weight vector similarity to obtain an address sequence;
determining the first preset number of candidate addresses in the address sequence as the target addresses corresponding to the address to be matched;
and/or,
screening out, from the plurality of candidate addresses, at least one candidate address whose weight vector similarity is greater than a preset similarity threshold, and determining it as the target address corresponding to the address to be matched.
6. An address matching device based on repetition weight, characterized in that the device is used to perform the address matching method based on repetition weight according to any one of claims 1 to 5, and the device comprises:
an address processing module, configured to perform vectorization on an address to be matched to obtain a corresponding address vector;
a similarity calculation module, configured to calculate similarity between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database, to obtain a weight vector similarity between the address to be matched and any of the candidate addresses; the weight vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight; the repetition weight is inversely proportional to the number of different entity objects to which the same data appearance of the corresponding address vector or candidate address vector may simultaneously point;
an address screening module, configured to select, according to the weight vector similarity, a target address corresponding to the address to be matched from the plurality of candidate addresses.
7. An address matching device based on repetition weight, characterized in that the device comprises:
a memory storing executable program code;
a processor coupled to the memory;
the processor calls the executable program code stored in the memory to execute the address matching method based on repetition weight according to any one of claims 1 to 5.
8. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the address matching method based on repetition weight according to any one of claims 1 to 5.
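For illustration only, the level repetition weight of claim 1, the per-level weighted sum of claim 3, and the selection step of claim 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names (level_repetition_weight, weighted_similarity, select_targets), the dictionary-based data layout, and the use of cosine similarity are all hypothetical choices. In particular, the weight is computed as 1 / (1 + sum), which decreases as the statistic grows but is only an approximation of strict inverse proportionality, chosen here to avoid division by zero when a level has no repeated fragments. The scene repetition weight is not modeled.

```python
import math
from collections import defaultdict


def level_repetition_weight(fragments):
    """Hypothetical weight for one address level.

    fragments: iterable of (appearance, entity_id) pairs.
    A "repeated fragment set" is taken to be all fragments that share one
    appearance (here, a name string) but refer to different entities.
    """
    entities_by_appearance = defaultdict(set)
    for appearance, entity_id in fragments:
        entities_by_appearance[appearance].add(entity_id)
    # Statistic: total size of all repeated fragment sets (the "sum" option).
    total = sum(len(ids) for ids in entities_by_appearance.values() if len(ids) > 1)
    # Assumed form: decreasing in the statistic and safe when total == 0.
    return 1.0 / (1.0 + total)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_similarity(query, candidate, weights):
    """Sum over shared levels of (level weight * fragment-vector similarity).

    query, candidate: dicts mapping level name -> fragment vector.
    weights: dict mapping level name -> repetition weight (default 1.0).
    """
    shared_levels = query.keys() & candidate.keys()
    return sum(
        weights.get(level, 1.0) * cosine(query[level], candidate[level])
        for level in shared_levels
    )


def select_targets(query, candidates, weights, top_k=None, threshold=None):
    """Rank candidates by weighted similarity, then apply threshold and/or top-k.

    candidates: dict mapping candidate name -> {level name: fragment vector}.
    """
    scored = sorted(
        ((weighted_similarity(query, vectors, weights), name)
         for name, vectors in candidates.items()),
        reverse=True,
    )
    if threshold is not None:
        scored = [(score, name) for score, name in scored if score > threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return [name for _, name in scored]
```

In this sketch, a level whose names are frequently shared by different entities (for example, common road names) receives a smaller weight than a level with unique names, so a match on the ambiguous level contributes less to the final score than a match on the unambiguous one.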
CN202211353271.6A 2022-11-01 2022-11-01 Address matching method and device based on repetition weight Active CN115905464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211353271.6A CN115905464B (en) 2022-11-01 2022-11-01 Address matching method and device based on repetition weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211353271.6A CN115905464B (en) 2022-11-01 2022-11-01 Address matching method and device based on repetition weight

Publications (2)

Publication Number Publication Date
CN115905464A CN115905464A (en) 2023-04-04
CN115905464B true CN115905464B (en) 2025-09-05

Family

ID=86485242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211353271.6A Active CN115905464B (en) 2022-11-01 2022-11-01 Address matching method and device based on repetition weight

Country Status (1)

Country Link
CN (1) CN115905464B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761089A (en) * 2020-10-26 2021-12-07 北京京东尚科信息技术有限公司 Address processing method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007143157A2 (en) * 2006-06-02 2007-12-13 Initiate Systems, Inc. Automatic weight generation for probabilistic matching

Also Published As

Publication number Publication date
CN115905464A (en) 2023-04-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20230404

Assignee: Shanghai Jixun Technology Co.,Ltd.

Assignor: Shanghai Weizhi Zhuoxin Information Technology Co.,Ltd.

Contract record no.: X2025980022503

Denomination of invention: Address matching method and device based on repetition weight

Granted publication date: 20250905

License type: Common License

Record date: 20250916