CN115905464B

CN115905464B - Address matching method and device based on repetition weight

Info

Publication number: CN115905464B
Application number: CN202211353271.6A
Authority: CN
Inventors: 陆启衡; 侯方杰; 陶闯
Original assignee: Shanghai Weizhi Zhuoxin Information Technology Co ltd
Current assignee: Shanghai Weizhi Zhuoxin Information Technology Co ltd
Priority date: 2022-11-01
Filing date: 2022-11-01
Publication date: 2025-09-05
Anticipated expiration: 2042-11-01
Also published as: CN115905464A

Abstract

The present invention discloses a method and device for address matching based on repetition weights. The method comprises: vectorizing an address to be matched to obtain a corresponding address vector; calculating similarity between the address vector and candidate address vectors corresponding to multiple candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any of the candidate addresses; the weighted vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight; the repetition weight is inversely proportional to the number of different physical objects that the same data appearance of the corresponding address vector or candidate address vector may simultaneously point to; and screening the target address corresponding to the address to be matched from the multiple candidate addresses based on the weighted vector similarity. It can be seen that the present invention can effectively improve the accuracy and efficiency of address matching.

Description

Address matching method and device based on repetition weight

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to an address matching method and apparatus based on repetition weight.

Background

With the development of data processing algorithms and the performance of processing equipment, more and more transportation enterprises have begun to use data processing technology to handle address data. Address matching is an important part of this: accurately matching fuzzy or incorrect addresses can effectively improve an enterprise's operating efficiency and revenue. However, existing address matching techniques generally compute similarity only on the characterization data of the address characters and do not consider how the degree of repetition of different addresses affects that similarity, so the matching accuracy is low. Existing address matching methods therefore have defects that need to be addressed.

Disclosure of Invention

The invention aims to solve the technical problem of providing an address matching method and device based on repetition weight, which can effectively improve the accuracy and efficiency of address matching.

In order to solve the technical problem, the first aspect of the present invention discloses an address matching method based on repetition weight, which comprises:

vectorizing the addresses to be matched to obtain corresponding address vectors;

performing similarity calculation on the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any candidate address, wherein the weighted vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight, and the repetition weight is inversely proportional to the number of different entity objects that the same data appearance of the corresponding address vector or candidate address vector may simultaneously point to;

and screening the target address corresponding to the address to be matched from the plurality of candidate addresses according to the weighted vector similarity.

In a first aspect of the present invention, as an optional implementation manner, the repetition weight includes a level repetition weight and/or a scene repetition weight. The level repetition weight is inversely proportional to the number of different entity objects that the same data appearance of the address level, to which some or all of the address vector or the candidate address vector belongs, may simultaneously point to. The scene repetition weight is inversely proportional to the number of different entity objects that the same data appearance of the scene type, to which some or all of the address to be matched or the candidate address belongs, may simultaneously point to. The data appearance includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern.

In a first aspect of the present invention, the address vector includes a plurality of address fragment vectors corresponding to a plurality of address levels corresponding to the addresses to be matched;

And performing vectorization processing on the addresses to be matched to obtain corresponding address vectors, wherein the vectorization processing comprises the following steps:

splitting the address to be matched to obtain address fragments to be matched corresponding to a plurality of address levels;

and carrying out vectorization processing on the address fragments to be matched to obtain corresponding address fragment vectors.

In a first aspect of the present invention, as an optional implementation manner, the calculating the similarity between the address vector and a candidate address vector corresponding to a plurality of candidate addresses in a preset address database to obtain a weight vector similarity between the address to be matched and any one of the candidate addresses includes:

For any one of a plurality of candidate addresses in a preset address database, acquiring candidate address fragment vectors of candidate address fragments of a plurality of address levels corresponding to the candidate address;

calculating the weighted vector similarity between any one of the address fragment vectors corresponding to the address to be matched and the candidate address fragment vector of the same address level, wherein the weighted vector similarity is the product of the similarity between the address fragment vector and the candidate address fragment vector of the same level and the repetition weight;

And calculating the sum of the weighted vector similarity corresponding to at least two address fragment vectors corresponding to the address to be matched to obtain the weighted vector similarity between the address to be matched and the candidate address.

In a first aspect of the present invention, as an optional implementation manner, the similarity includes a cosine distance and/or a Euclidean distance; and/or the level repetition weight is inversely proportional to the refinement degree, among all address levels, of the address level to which some or all of the address vector or the candidate address vector belongs; and/or the scene type includes an arbitrary naming scene and a non-arbitrary naming scene, wherein the scene repetition weight corresponding to the arbitrary naming scene is lower than the scene repetition weight corresponding to the non-arbitrary naming scene.

As an optional implementation manner, in the first aspect of the present invention, the level repeatability weight corresponding to any one of the address levels may be calculated by:

For any address hierarchy, acquiring a plurality of address fragment information corresponding to the address hierarchy;

Screening a plurality of repeated segment sets corresponding to the address hierarchy according to the entity address object corresponding to each address segment information, wherein each repeated segment set comprises a plurality of address segment information which have the same data appearance but correspond to different entity address objects;

Determining the level repetition degree weight corresponding to the address level according to the quantity of the address fragment information included in all the repeated fragment sets;

and determining the hierarchy repeatability weight corresponding to the address hierarchy according to the number of the address fragment information included in all the repeated fragment sets, including:

Calculating a statistical value of the number of address fragment information included in all the repeated fragment sets, wherein the statistical value comprises at least one of a sum value, an average value and a weighted average value;

and determining the level repeatability weight corresponding to the address level according to the statistic value, wherein the level repeatability weight is inversely proportional to the statistic value.

In a first aspect of the present invention, the selecting, according to the similarity of the weight vectors, the target address corresponding to the address to be matched from the plurality of candidate addresses includes:

Arranging the plurality of candidate addresses from large to small according to the similarity of the weight vectors to obtain an address sequence;

Determining a preset number of candidate addresses in the address sequence as target addresses corresponding to the addresses to be matched;

and/or,

And screening at least one candidate address with the similarity of the weight vector larger than a preset similarity threshold value from the plurality of candidate addresses, and determining the candidate address as a target address corresponding to the address to be matched.

The second aspect of the present invention discloses an address matching device based on repetition degree weight, the device comprising:

The address processing module is used for vectorizing the address to be matched to obtain a corresponding address vector;

the similarity calculation module is used for calculating the similarity between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain the similarity of the weight vector between the address to be matched and any candidate address, wherein the similarity of the weight vector comprises the product of the similarity between the address vector and the candidate address vector and the weight of the repetition degree, and the weight of the repetition degree is inversely proportional to the number of objects of different entity objects to which the corresponding address vector or the same data appearance of the candidate address vector possibly points at the same time;

and the address screening module is used for screening the target address corresponding to the address to be matched from the plurality of candidate addresses according to the similarity of the weight vectors.

In a second aspect of the present invention, as an optional implementation manner, the repetition weight includes a level repetition weight and/or a scene repetition weight. The level repetition weight is inversely proportional to the number of different entity objects that the same data appearance of the address level, to which some or all of the address vector or the candidate address vector belongs, may simultaneously point to. The scene repetition weight is inversely proportional to the number of different entity objects that the same data appearance of the scene type, to which some or all of the address to be matched or the candidate address belongs, may simultaneously point to. The data appearance includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern.

As an optional implementation manner, in the second aspect of the present invention, the address vector includes a plurality of address fragment vectors corresponding to a plurality of address levels corresponding to the addresses to be matched; the candidate address vector comprises a plurality of candidate address fragment vectors corresponding to a plurality of address levels corresponding to the candidate address;

And the address processing module carries out vectorization processing on the address to be matched to obtain a specific mode of a corresponding address vector, and the specific mode comprises the following steps:

In a second aspect of the present invention, the method for calculating the similarity between the address vector and a candidate address vector corresponding to a plurality of candidate addresses in a preset address database to obtain a weight vector similarity between the address to be matched and any one of the candidate addresses includes:

In a second aspect of the present invention, as an optional implementation manner, the similarity includes a cosine distance and/or a Euclidean distance; and/or the level repetition weight is inversely proportional to the refinement degree, among all address levels, of the address level to which some or all of the address vector or the candidate address vector belongs; and/or the scene type includes an arbitrary naming scene and a non-arbitrary naming scene, wherein the scene repetition weight corresponding to the arbitrary naming scene is lower than the scene repetition weight corresponding to the non-arbitrary naming scene.

As an optional implementation manner, in the second aspect of the present invention, the apparatus further includes a weight calculation module, configured to calculate the level repeatability weight corresponding to any of the address levels by performing the following steps:

And the weight calculation module determines a specific mode of the hierarchy repetition degree weight corresponding to the address hierarchy according to the number of the address fragment information included in all the repeated fragment sets, and the specific mode comprises the following steps:

In a second aspect of the present invention, the specific manner of the address screening module for screening the target address corresponding to the address to be matched from the plurality of candidate addresses according to the similarity of the weight vectors includes:

and/or,

The third aspect of the present invention discloses another address matching device based on repetition degree weight, the device comprises:

a memory storing executable program code;

a processor coupled to the memory;

the processor invokes the executable program code stored in the memory to perform some or all of the steps in the address matching method based on the repetition weight disclosed in the first aspect of the present invention.

A fourth aspect of the present invention discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute some or all of the steps of the address matching method based on repetition weight disclosed in the first aspect of the present invention.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

The embodiment of the invention discloses an address matching method and device based on repetition weight, wherein the method comprises the steps of carrying out vectorization processing on an address to be matched to obtain a corresponding address vector, carrying out similarity calculation on the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain weight vector similarity between the address to be matched and any candidate address, wherein the weight vector similarity comprises the product of the similarity between the address vector and the candidate address vector and the repetition weight, the repetition weight is inversely proportional to the number of objects of different entity objects possibly pointed by the same data appearance of the corresponding address vector or the candidate address vector at the same time, and screening target addresses corresponding to the address to be matched from the plurality of candidate addresses according to the weight vector similarity. Therefore, the embodiment of the invention can fully combine the repetition degree weight to calculate the similarity between the address to be matched and the plurality of candidate addresses, so that the similarity between different candidate addresses and the address to be matched can be more accurately determined, and the accuracy and the efficiency of address matching can be effectively improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of an address matching method based on repetition weight according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of an address matching device based on repetition weight according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of another address matching device based on repetition weight according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of protection of the present invention.

The terms "first," "second," and the like in the description and in the claims and in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or elements but may, in the alternative, include other steps or elements not expressly listed or inherent to such process, method, article, or device.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

The invention discloses an address matching method and device based on repetition degree weight, which can fully combine the repetition degree weight to calculate the similarity between an address to be matched and a plurality of candidate addresses so as to more accurately determine the similarity degree between different candidate addresses and the address to be matched, thereby effectively improving the accuracy and efficiency of address matching. The following will describe in detail.

Example 1

Referring to fig. 1, fig. 1 is a flow chart of an address matching method based on repetition weight according to an embodiment of the present invention. The address matching method based on the repetition weight described in fig. 1 is applied to an address data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 1, the address matching method based on the repetition degree weight may include the following operations:

101. And carrying out vectorization processing on the addresses to be matched to obtain corresponding address vectors.

Optionally, the address to be matched may be an address input by a user or determined by the system according to a preset rule. It is generally an address that has not yet been associated with a physical address object, and the corresponding physical address object needs to be determined by matching. Optionally, the address to be matched may include information for multiple address levels.

Optionally, the vectorization processing of the present invention may use a vectorization model for words or characters. For example, addresses or address fragments may be vectorized using a pre-trained word vector model, or using the feature extractor of a trained neural network model for word prediction.

102. And performing similarity calculation on the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain the similarity of the weight vector between the address to be matched and any candidate address.

Specifically, the weighted vector similarity includes the product of the similarity between the address vector and the candidate address vector and a repetition weight, where the repetition weight is inversely proportional to the number of different entity objects that the same data appearance of the corresponding address vector or candidate address vector may simultaneously point to.

The repetition weight effectively characterizes how likely the address vector or candidate address vector is to share its name with other addresses. For example, if the address corresponding to an address vector is likely to correspond to several different physical addresses, the address vector has a high degree of duplicate naming, and its weight in the total similarity should be reduced. Because the repetition weight is inversely proportional to the degree of duplicate naming, the accuracy of the final similarity can be effectively improved.
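The product described above can be illustrated with a minimal sketch. It assumes cosine similarity as the base similarity and takes the repetition weight to be the reciprocal of the number of distinct entities that share a name; the function names and the toy vectors are hypothetical and are not taken from the patent.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def repetition_weight(num_entities):
    """Assumed form: reciprocal of the number of distinct entities sharing one name."""
    return 1.0 / max(1, num_entities)


def weighted_similarity(u, v, num_entities):
    """Base similarity multiplied by the repetition weight."""
    return cosine_similarity(u, v) * repetition_weight(num_entities)


# Toy vectors (hypothetical): an exact match on a unique name versus an exact
# match on a name shared by four different entities.
query = [1.0, 0.0, 1.0]
candidate = [1.0, 0.0, 1.0]
unique_name = weighted_similarity(query, candidate, num_entities=1)
common_name = weighted_similarity(query, candidate, num_entities=4)
```

In this sketch an exact match on a heavily duplicated name contributes only a quarter of what an exact match on a unique name contributes, which is the down-weighting effect the paragraph describes.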

Alternatively, the repeatability weights may comprise hierarchical and/or scene repeatability weights, which may comprise, for example, the product of the hierarchical and scene repeatability weights.

The level repetition weight is inversely proportional to the number of different entity objects that the same data appearance of the address level, to which some or all of the address vector or candidate address vector belongs, may simultaneously point to. The address levels may be obtained by dividing addresses into different levels manually or in a preset manner; for example, they may be administrative levels such as city, county, and district, or finer living-area levels such as house, bedroom, and bathroom, which is not limited by the present invention.

Optionally, the level repetition weight may be inversely proportional to the refinement degree, among all address levels, of the address level to which some or all of the address vector or candidate address vector belongs. The finer the level, the more likely a name at that level is duplicated and refers to many address objects. For example, province, city, and road names have few duplicate entities and contribute strongly to determining an address, whereas building numbers and room numbers are heavily duplicated and contribute little.

The scene repetition weight is inversely proportional to the number of different entity objects that the same data appearance of the scene type, to which some or all of the address to be matched or the candidate address belongs, may simultaneously point to. Optionally, the scene type indicates the scene function type of some or all of the address to be matched or the candidate address, and scene function types may be defined along different dimensions, for example a residential community, a school, a hospital, or a restaurant. The degree of duplicate naming differs between addresses of different scenes, so this weight can be set to characterize it.

Optionally, the scene types include an arbitrary naming scene and a non-arbitrary naming scene, where the scene repetition weight corresponding to the arbitrary naming scene is lower than that corresponding to the non-arbitrary naming scene. Optionally, an arbitrary naming scene indicates a scene address whose name can be chosen with a high degree of freedom, such as a self-operated restaurant or an individual business, so duplicate names are more likely and the scene repetition weight should be lower. Optionally, a non-arbitrary naming scene indicates a scene address whose name can be chosen with a lower degree of freedom, such as a residential community, school, or hospital, so duplicate names are less likely and the scene repetition weight should be higher.
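One possible way to combine the two weights is sketched below, using the product of the level weight and the scene weight mentioned earlier. The numeric weights, the scene labels, and the fallback of 1.0 for an unknown scene are all illustrative assumptions; the patent only requires that the arbitrary naming scene receive the lower weight.

```python
# Hypothetical scene weights: only their ordering (arbitrary < non-arbitrary)
# comes from the description; the numbers themselves are made up.
SCENE_TYPE_WEIGHTS = {"arbitrary": 0.4, "non_arbitrary": 0.9}

# Hypothetical mapping from a scene function type to its naming scene type.
NAMING_TYPE_OF_SCENE = {
    "restaurant": "arbitrary",
    "individual_business": "arbitrary",
    "residential_community": "non_arbitrary",
    "school": "non_arbitrary",
    "hospital": "non_arbitrary",
}


def scene_repetition_weight(scene):
    """Look up the scene weight; unknown scenes get 1.0 (an assumption: no adjustment)."""
    naming_type = NAMING_TYPE_OF_SCENE.get(scene)
    if naming_type is None:
        return 1.0
    return SCENE_TYPE_WEIGHTS[naming_type]


def combined_repetition_weight(level_weight, scene):
    """Repetition weight taken as the product of the level weight and the scene weight."""
    return level_weight * scene_repetition_weight(scene)


restaurant_weight = combined_repetition_weight(0.5, "restaurant")
school_weight = combined_repetition_weight(0.5, "school")
```

With the same level weight, the restaurant fragment ends up with a smaller combined weight than the school fragment, so a name match on a freely named venue counts for less.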

Optionally, the data appearance described in the present invention includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern. Preferably, it is a data name or data identification.

103. And screening target addresses corresponding to the addresses to be matched from the plurality of candidate addresses according to the similarity of the weight vectors.

Optionally, the entity address object indicated by the target address may be determined as the entity address object to be indicated by the address to be matched, so as to achieve determination of final practical meaning of the address to be matched, and further facilitate subsequent execution of a series of service operations or data processing operations related to the address to be matched according to the entity address object corresponding to the address to be matched.

Therefore, the embodiment of the invention can fully combine the repetition degree weight to calculate the similarity between the address to be matched and the plurality of candidate addresses, so that the similarity between different candidate addresses and the address to be matched can be more accurately determined, and the accuracy and the efficiency of address matching can be effectively improved.

As an optional implementation manner, the address vector comprises a plurality of address fragment vectors corresponding to a plurality of address levels corresponding to the addresses to be matched, and the candidate address vector comprises a plurality of candidate address fragment vectors corresponding to a plurality of address levels corresponding to the candidate addresses.

Through the arrangement, when the similarity is calculated later, the similarity calculation can be performed on the address fragment vector and the candidate address fragment vector of the same level, and then statistics can be performed, so that more accurate similarity can be obtained.

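Optionally, in step 101, vectorizing the address to be matched to obtain the corresponding address vector includes:

splitting the address to be matched to obtain address fragments to be matched corresponding to a plurality of address levels;

and carrying out vectorization processing on the address fragments to be matched to obtain corresponding address fragment vectors.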

Optionally, the splitting of the address to be matched may be performed by using an address semantic analysis algorithm model, for example, a pre-trained address word segmentation neural network model or other algorithm models are used to split the address to be matched to obtain address fragments to be matched corresponding to multiple address levels.

Therefore, according to the alternative embodiment, the address to be matched can be split to obtain address fragments to be matched corresponding to a plurality of address levels, vectorization processing is carried out to obtain corresponding address fragment vectors, so that similarity calculation can be carried out on the address vectors of different levels subsequently, more accurate similarity can be obtained through calculation, and therefore accuracy and efficiency of address matching can be effectively improved.
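The split-then-vectorize flow can be sketched as follows. The description calls for a pre-trained address segmentation model and a word vector model; this sketch swaps in much simpler stand-ins, namely a suffix-based regular expression for splitting and a hashed character-bigram count vector for vectorization. The level names, the pattern, and the vector dimension are assumptions made for illustration only.

```python
import re
import zlib

# Hypothetical address levels, ordered from coarse to fine.
LEVELS = ("province", "city", "district", "road", "number")

# Naive suffix-based splitter standing in for a trained segmentation model.
_PATTERN = re.compile(
    r"(?P<province>[^省]+省)?"
    r"(?P<city>[^市]+市)?"
    r"(?P<district>[^区县]+[区县])?"
    r"(?P<road>[^路街]+[路街])?"
    r"(?P<number>\d+号)?"
)


def split_address(address):
    """Split an address string into fragments keyed by address level."""
    match = _PATTERN.match(address)  # always matches: every group is optional
    return {level: match.group(level) for level in LEVELS if match.group(level)}


def vectorize_fragment(fragment, dim=16):
    """Hashed character-bigram counts, a stand-in for a learned embedding."""
    vec = [0.0] * dim
    padded = f"^{fragment}$"  # boundary markers so single characters still form bigrams
    for i in range(len(padded) - 1):
        gram = padded[i:i + 2]
        # crc32 is used instead of hash() so the result is stable across runs.
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def vectorize_address(address, dim=16):
    """Return one fragment vector per address level found in the address."""
    return {level: vectorize_fragment(frag, dim)
            for level, frag in split_address(address).items()}


fragments = split_address("上海市宝山区友谊路100号")
vectors = vectorize_address("上海市宝山区友谊路100号")
```

Keeping the result as a mapping from level to vector makes it straightforward to compare fragments of the same level later on.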

As an optional implementation manner, in the step 102, performing similarity calculation on the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weight vector similarity between the address to be matched and any candidate address, where the method includes:

For any one candidate address in a plurality of candidate addresses in a preset address database, acquiring candidate address fragment vectors of candidate address fragments of a plurality of address levels corresponding to the candidate address;

calculating the weighted vector similarity between any address fragment vector corresponding to the address to be matched and the candidate address fragment vector of the same address level, wherein the weighted vector similarity is the product of the similarity between the address fragment vector and the candidate address fragment vector of the same level and the repetition weight;

And calculating the sum of the weighted vector similarity corresponding to at least two address fragment vectors corresponding to the addresses to be matched to obtain the weighted vector similarity between the addresses to be matched and the candidate addresses.

Optionally, the similarity described in the present invention may include a cosine distance and/or a Euclidean distance, and may be either one of them or a weighted sum of both.
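A weighted sum of the two measures might look like the sketch below. The patent does not say how a Euclidean distance is turned into a similarity or how the two terms are weighted, so the mapping 1 / (1 + d) and the mixing parameter `alpha` are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_similarity(u, v):
    """Map Euclidean distance d to (0, 1] via 1 / (1 + d); an assumed conversion."""
    return 1.0 / (1.0 + math.dist(u, v))


def blended_similarity(u, v, alpha=0.5):
    """Weighted sum of the two measures; alpha is a hypothetical mixing weight."""
    return alpha * cosine_similarity(u, v) + (1.0 - alpha) * euclidean_similarity(u, v)


same = blended_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = blended_similarity([1.0, 0.0], [0.0, 1.0])
```

Setting `alpha` to 1.0 or 0.0 recovers the pure cosine or pure Euclidean variant.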

Therefore, according to the alternative embodiment, the sum of the weighted vector similarity corresponding to at least two address fragment vectors corresponding to the address to be matched can be calculated, so that the weighted vector similarity between the address to be matched and the candidate address is obtained, and therefore more accurate similarity can be calculated, and the accuracy and efficiency of address matching can be effectively improved.
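The per-level computation and the final sum can be sketched as follows, assuming each address is already represented as a mapping from address level to fragment vector and that each level has a single repetition weight. The level names, weights, and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def address_weighted_similarity(query_vecs, candidate_vecs, level_weights):
    """Sum over shared levels of (fragment similarity x repetition weight of that level)."""
    total = 0.0
    for level, query_vec in query_vecs.items():
        candidate_vec = candidate_vecs.get(level)
        if candidate_vec is None:
            continue  # only fragments of the same level are compared
        total += cosine_similarity(query_vec, candidate_vec) * level_weights.get(level, 1.0)
    return total


# Hypothetical weights: coarse levels are rarely duplicated, house numbers often are.
level_weights = {"city": 1.0, "road": 0.5, "number": 0.1}
query = {"city": [1.0, 0.0], "road": [0.0, 1.0], "number": [1.0, 0.0]}
candidate_a = {"city": [1.0, 0.0], "road": [0.0, 1.0], "number": [0.0, 1.0]}
candidate_b = {"city": [0.0, 1.0], "road": [0.0, 1.0], "number": [1.0, 0.0]}

score_a = address_weighted_similarity(query, candidate_a, level_weights)
score_b = address_weighted_similarity(query, candidate_b, level_weights)
```

Candidate A matches on city and road but not on the house number, while candidate B matches on road and number but not on city. Because the city level carries the largest weight, A scores higher even though both candidates match two of the three fragments.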

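As an optional implementation manner, the level repetition weight corresponding to any address level may be calculated by the following steps:

for any address level, acquiring a plurality of pieces of address fragment information corresponding to the address level;

screening out a plurality of repeated fragment sets corresponding to the address level according to the entity address object corresponding to each piece of address fragment information, wherein each repeated fragment set comprises a plurality of pieces of address fragment information that have the same data appearance but correspond to different entity address objects;

and determining the level repetition weight corresponding to the address level according to the number of pieces of address fragment information included in all the repeated fragment sets.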

Optionally, the plurality of address fragment information corresponding to the address hierarchy may be address fragment information associated with a specific address hierarchy stored in a preset address database, and the address fragment information may be address fragments of different data appearance types.

Therefore, according to the alternative embodiment, the level repeatability weight corresponding to the address level can be determined according to the number of the address fragment information included in all the repeated fragment sets, so that more accurate level repeatability weight can be calculated, and the accuracy and efficiency of address matching can be effectively improved.

As an optional implementation manner, in the above step, determining the level repetition weight corresponding to the address level according to the number of pieces of address fragment information included in all the repeated fragment sets includes:

calculating a statistical value of the number of pieces of address fragment information included in all the repeated fragment sets;

and determining the level repetition weight corresponding to the address level according to the statistical value.

Optionally, the statistical value may include at least one of a sum, an average, and a weighted average. Specifically, the level repetition weight should be inversely proportional to the statistical value. Optionally, the level repetition weight may be the reciprocal of the statistical value, or may follow another inversely proportional mathematical relationship.

In a specific embodiment, for each field in the address database, the average number of different entities referred to by the same name may be counted. The smaller this average number of duplicate names, the fewer entities a single name can refer to, and the more decisive that part of the address is. For example, "Shanghai city" is unique, whereas "Baoshan district" is a duplicated name, so the average for the city level is lower than the average for the district level. The final level repetition weight may be the reciprocal of the average number of duplicate names.

It can be seen that, in this optional implementation manner, the level repetition weight corresponding to the address level is determined from a value that is inversely related to the statistical value of the number of pieces of address fragment information included in all the repeated fragment sets, so that a more accurate level repetition weight can be calculated and the accuracy and efficiency of address matching can be effectively improved.
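The reciprocal-of-average variant described in the specific embodiment above can be sketched in Python as follows. This is an illustrative sketch only: the function name, the representation of address fragment information as (name, entity identifier) pairs, the inclusion of unique names in the average, and the fallback weight of 1.0 for an empty level are assumptions rather than details given in the specification.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Set, Tuple


def level_repetition_weight(fragments: Iterable[Tuple[str, str]]) -> float:
    """Reciprocal of the average number of distinct entities sharing one name.

    `fragments` holds (name, entity_id) pairs for a single address level.
    """
    entities_by_name: Dict[str, Set[str]] = defaultdict(set)
    for name, entity_id in fragments:
        entities_by_name[name].add(entity_id)
    if not entities_by_name:
        return 1.0  # assumption: an empty level is treated as fully decisive
    # Average number of different entities per name at this level.
    average_duplicates = mean(len(ids) for ids in entities_by_name.values())
    return 1.0 / average_duplicates
```

For example, a level in which every name maps to exactly one entity yields a weight of 1.0, while a level in which one of two names is shared by two entities yields an average of 1.5 and a weight of about 0.67.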

As an optional implementation manner, in step 103, screening out the target address corresponding to the address to be matched from the plurality of candidate addresses according to the weighted vector similarity includes:

arranging the plurality of candidate addresses in descending order of weighted vector similarity to obtain an address sequence;

and determining the first preset number of candidate addresses in the address sequence as the target addresses corresponding to the address to be matched.

It can be seen that, in this optional implementation manner, the preset number of candidate addresses with the highest weighted vector similarity are determined as the target addresses corresponding to the address to be matched, so that an accurate address matching result can be obtained and the accuracy and efficiency of address matching are improved.

It can also be seen that, in an optional implementation manner, at least one candidate address whose weighted vector similarity is greater than a preset similarity threshold can be screened out from the plurality of candidate addresses and determined as the target address corresponding to the address to be matched, so that an accurate address matching result can be obtained and the accuracy and efficiency of address matching are improved.
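The two screening strategies above (keeping the first preset number of candidates, and keeping candidates above a similarity threshold) can be sketched together in Python. The function name and the parameters `top_k` and `threshold` are hypothetical names introduced for illustration; the specification only states that the number and the threshold are preset.

```python
from typing import List, Optional, Sequence, Tuple


def screen_targets(
    scored: Sequence[Tuple[str, float]],
    top_k: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Rank (candidate, weighted similarity) pairs and apply either or both filters."""
    # Descending order of weighted vector similarity gives the address sequence.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        # Keep only candidates strictly greater than the preset threshold.
        ranked = [pair for pair in ranked if pair[1] > threshold]
    if top_k is not None:
        # Keep the first preset number of candidates in the sequence.
        ranked = ranked[:top_k]
    return ranked
```

Passing only `top_k` reproduces the ranking-based option, passing only `threshold` reproduces the threshold-based option, and passing both corresponds to the "and/or" combination of the two.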

Example two

Referring to fig. 2, fig. 2 is a schematic structural diagram of an address matching device based on a repetition weight according to an embodiment of the present invention. The address matching device based on the repetition weight described in fig. 2 is applied to an address data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 2, the address matching device based on the repetition degree weight may include:

the address processing module 201, configured to perform vectorization processing on an address to be matched to obtain a corresponding address vector;

the similarity calculation module 202, configured to perform similarity calculation on the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database, so as to obtain a weighted vector similarity between the address to be matched and any candidate address;

and the address screening module 203, configured to screen out a target address corresponding to the address to be matched from the plurality of candidate addresses according to the weighted vector similarity.
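One way to picture how the three modules cooperate is the following Python sketch, in which each module is modeled as a callable supplied from outside. The class name, field names, and the `match` method are hypothetical; the specification defines only the responsibilities of modules 201, 202, and 203, not their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class AddressMatchingDevice:
    # Role of address processing module 201: address text -> address vector.
    vectorize: Callable[[str], Vector]
    # Role of similarity calculation module 202: two vectors -> weighted similarity.
    score: Callable[[Vector, Vector], float]
    # Role of address screening module 203: scored candidates -> target addresses.
    screen: Callable[[List[Tuple[str, float]]], List[str]]

    def match(self, address: str, database: Dict[str, Vector]) -> List[str]:
        """Run the three modules in order against a preset address database."""
        query_vector = self.vectorize(address)
        scored = [
            (candidate, self.score(query_vector, candidate_vector))
            for candidate, candidate_vector in database.items()
        ]
        return self.screen(scored)
```

Keeping the modules as injected callables is a design choice of this sketch: it lets the vectorization, weighting, and screening strategies be swapped independently, which mirrors the optional implementation manners listed for each module.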

Optionally, the specific manner in which the address processing module 201 performs vectorization processing on the address to be matched to obtain the corresponding address vector includes:

As an optional implementation manner, the specific manner of performing similarity calculation on the address vector and the candidate address vectors corresponding to the plurality of candidate addresses in the preset address database by the similarity calculation module 202 to obtain the similarity of the weight vector between the address to be matched and any candidate address includes:

As an optional implementation manner, the apparatus further includes a weight calculation module, configured to calculate a hierarchy repeatability weight corresponding to any address hierarchy by performing the following steps:

As an optional implementation manner, the specific manner of determining the level repeatability weight corresponding to the address level by the weight calculation module according to the number of address fragment information included in all the repeated fragment sets includes:

As an optional implementation manner, the specific manner in which the address screening module 203 screens out the target address corresponding to the address to be matched from the plurality of candidate addresses according to the weighted vector similarity includes:

arranging the plurality of candidate addresses in descending order of weighted vector similarity to obtain an address sequence; and determining the first preset number of candidate addresses in the address sequence as the target addresses corresponding to the address to be matched;

Example three

Referring to fig. 3, fig. 3 is a schematic diagram illustrating another address matching device based on repetition weight according to an embodiment of the present invention. The address matching device based on the repetition weight described in fig. 3 is applied to an address data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 3, the address matching device based on the repetition degree weight may include:

a memory 301 storing executable program code;

a processor 302 coupled with the memory 301;

Wherein the processor 302 invokes executable program code stored in the memory 301 for performing the steps of the address matching method based on the repetition weight described in embodiment one.

Example four

The embodiment of the invention discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the steps of the address matching method based on the repetition weight described in embodiment one.

Example five

The embodiment of the invention discloses a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, wherein the computer program is operable to cause a computer to perform the steps of the address matching method based on the repetition weight described in embodiment one.

The foregoing describes certain embodiments of the present disclosure, other embodiments being within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings do not necessarily have to be in the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In this specification, the embodiments are described in a progressive manner: identical or similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, device, and non-transitory computer-readable storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The apparatus, the device, the nonvolatile computer readable storage medium and the method provided in the embodiments of the present disclosure correspond to each other, and therefore, the apparatus, the device, and the nonvolatile computer storage medium also have similar advantageous technical effects as those of the corresponding method, and since the advantageous technical effects of the method have been described in detail above, the advantageous technical effects of the corresponding apparatus, device, and nonvolatile computer storage medium are not described herein again.

In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to current method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs to "integrate" a digital system onto a PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained simply by lightly logic-programming the method flow in one of the hardware description languages described above and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logic-program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing the various functions may even be regarded as both software modules implementing the method and structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units. Of course, when implementing the present specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.

It will be appreciated by those skilled in the art that the present description may be provided as a method, system, or computer program product. Accordingly, the present specification embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description embodiments may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include both persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner: identical or similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.

Finally, it should be noted that the address matching method and apparatus based on the repetition weight disclosed in the embodiments of the present invention are merely preferred embodiments of the present invention, and are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for address matching based on repetition weight, characterized in that the method comprises:

Performing vectorization processing on the address to be matched to obtain a corresponding address vector;

Calculating similarity between the address vector and candidate address vectors corresponding to multiple candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any of the candidate addresses; the weighted vector similarity comprises the product of the similarity between the address vector and the candidate address vector and a repetition weight; the repetition weight is inversely proportional to the number of different physical objects that may be simultaneously pointed to by the same data appearance of the corresponding address vector or the candidate address vector;

Screening out a target address corresponding to the address to be matched from the multiple candidate addresses according to the weighted vector similarity;

The repetition weight includes a hierarchical repetition weight and a scene repetition weight; the hierarchical repetition weight is inversely proportional to the number of different entity objects that may be simultaneously pointed to by the same data appearance of the address hierarchy to which the address vector or part or all of the candidate address vectors belong; the scene repetition weight is inversely proportional to the number of different entity objects that may be simultaneously pointed to by the same data appearance of the scene type to which the to-be-matched address or part or all of the candidate addresses belong; the data appearance includes at least one of a data name, a data vector, a data identifier, and a data visualization pattern;

The level repetition weight corresponding to any of the address levels is calculated by the following steps:

For any of the address levels, obtaining multiple pieces of address fragment information corresponding to the address level;

Screening out, according to the physical address object corresponding to each piece of the address fragment information, a plurality of repeated fragment sets corresponding to the address level; each of the repeated fragment sets includes a plurality of pieces of address fragment information having the same data appearance but corresponding to different physical address objects;

Determining the level repetition weight corresponding to the address level according to the number of pieces of the address fragment information included in all the repeated fragment sets;

And, determining the level repetition weight corresponding to the address level according to the number of pieces of the address fragment information included in all the repeated fragment sets includes:

Calculating a statistical value of the number of pieces of the address fragment information included in all the repeated fragment sets; the statistical value includes at least one of a sum value, an average value, and a weighted average value;

Determining the level repetition weight corresponding to the address level according to the statistical value; the level repetition weight is inversely proportional to the statistical value.

2. The address matching method based on repetition weight according to claim 1, wherein the address vector comprises a plurality of address fragment vectors corresponding to a plurality of address levels corresponding to the address to be matched; and the candidate address vector comprises a plurality of candidate address fragment vectors corresponding to a plurality of address levels corresponding to the candidate address;

and the performing of vectorization processing on the address to be matched to obtain the corresponding address vector includes:

Splitting the address to be matched to obtain address fragments to be matched corresponding to multiple address levels;

Performing vectorization processing on the address fragments to be matched to obtain corresponding address fragment vectors.

3. The address matching method based on repetition weight according to claim 2, wherein the step of calculating similarity between the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database to obtain a weighted vector similarity between the address to be matched and any of the candidate addresses comprises:

For any candidate address among a plurality of candidate addresses in a preset address database, obtaining a candidate address fragment vector of candidate address fragments of a plurality of address levels corresponding to the candidate address;

Calculating a weighted vector similarity between any of the address fragment vectors corresponding to the address to be matched and the candidate address fragment vectors at the same address level; the weighted vector similarity is the product of the similarity between the address fragment vector and the candidate address fragment vector at the same level and the repetition weight;

Calculating the sum of the weighted vector similarities corresponding to at least two address fragment vectors corresponding to the address to be matched, to obtain the weighted vector similarity between the address to be matched and the candidate address.

4. The address matching method based on repetition weight according to claim 1, wherein the similarity includes cosine distance and/or Euclidean distance; and/or the hierarchical repetition weight is inversely proportional to the hierarchical refinement of the address level to which the address vector or part or all of the candidate address vectors belong in all address levels; and/or the scene type includes a randomly named scene and a non-randomly named scene; wherein the scene repetition weight corresponding to the randomly named scene is lower than the scene repetition weight corresponding to the non-randomly named scene.

5. The address matching method based on repetition weight according to claim 1, wherein the step of screening out a target address corresponding to the address to be matched from the plurality of candidate addresses according to the weighted vector similarity comprises:

Arranging the plurality of candidate addresses in descending order of the weighted vector similarity to obtain an address sequence;

Determining the first preset number of candidate addresses in the address sequence as target addresses corresponding to the address to be matched;

and/or,

Screening out, from the plurality of candidate addresses, at least one candidate address whose weighted vector similarity is greater than a preset similarity threshold, and determining it as the target address corresponding to the address to be matched.

6. A device for address matching based on repetition weight, characterized in that the device is used to perform the method for address matching based on repetition weight according to any one of claims 1 to 5, and the device comprises:

an address processing module, configured to perform vectorization processing on the address to be matched to obtain the corresponding address vector;

a similarity calculation module, configured to perform similarity calculation on the address vector and candidate address vectors corresponding to a plurality of candidate addresses in a preset address database, to obtain a weighted vector similarity between the address to be matched and any of the candidate addresses; the weighted vector similarity comprising the product of the similarity between the address vector and the candidate address vector and a repetition weight; the repetition weight being inversely proportional to the number of different physical objects that the same data appearance of the corresponding address vector or candidate address vector may simultaneously point to;

an address screening module, configured to screen out a target address corresponding to the address to be matched from the multiple candidate addresses according to the weighted vector similarity.

7. An address matching device based on repetition weight, characterized in that the device comprises:

a memory storing executable program code;

a processor coupled to the memory;

The processor calls the executable program code stored in the memory to execute the address matching method based on repetition weight according to any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program enables a computer to execute the address matching method based on repetition weight according to any one of claims 1 to 5.