CN114969041A

CN114969041A - Processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing

Info

Publication number: CN114969041A
Application number: CN202210592302.7A
Authority: CN
Inventors: 吴峰; 张朝宗; 李银生; 王红; 聂永川; 任雁; 毋鹏杰; 杨扬; 刘淼; 张义倩
Original assignee: Hebei Academy Of Science And Technology Information Hebei Academy Of Science And Technology Innovation Strategy
Current assignee: Hebei Academy Of Science And Technology Information Hebei Academy Of Science And Technology Innovation Strategy
Priority date: 2022-05-27
Filing date: 2022-05-27
Publication date: 2022-08-30
Anticipated expiration: 2042-05-27
Also published as: ZA202211776B; CN114969041B

Abstract

The invention discloses a processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing, applied in the field of big data processing. Through identity probability calculation for main and subsidiary entities, index supplementation and data merging for the same entity, extraction and storage of entity directory items, and separation of entity sub-directory items, the invention systematically solves the problems of processing and grouping main and subsidiary entities according to identity probability, merging and supplementing entity data across sources, storing entity relationships uniformly, and separating entities on demand, and provides a feasible solution for multi-source, large-scale data association operations.

Description

Processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing

Technical Field

The invention relates to the technical field of big data application, in particular to a processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing.

Background

Existing methods for identifying, extracting and storing entities from multi-source data generally collect data by source or type and then match and identify entities one by one according to the entity attributes of the data. Because they lack discrimination mechanisms such as entity bibliographic items, same scenes, and entity attribute classification and weighting, they suffer from data redundancy, non-uniform expression, low matching accuracy, low execution efficiency, and loss of identification process information. These problems are mainly embodied in the following aspects:

1) Data redundancy and non-uniform expression. In the prior art, entities of heterogeneous data are generally collected by source or type. Because the same entity is represented by varying indexes in different data, the indexes of the collected entity data are often inconsistent, so unified storage, standardized expression and external service provision cannot be achieved.

2) The entity matching accuracy is not high. The existing identification technology for entities generally carries out matching and identification according to entity attributes of data, and due to the restriction of factors such as various entity attributes and large data quantity, the problems of low matching degree, low precision and the like generally exist.

3) Entity identification is inefficient. In the prior art, entities are usually judged attribute by attribute in sequence. Because the entity attributes lack classification definitions and weight assignments, entity identification often takes a long time to compute and is affected by inconsistent attribute ordering.

4) The entity is relatively static and the data quality cannot be improved. In the prior art, when an entity is identified and extracted, a direct separation mode is generally adopted, the attribute expansion is limited, mutual correction, supplement and expansion of data are not or rarely carried out according to implicit attributes among the data, the data self-perfection cannot be realized, and the data quality cannot be effectively ensured.

5) Identification process information is lost. In the prior art, when entities are identified, usually only the attribute information of entities successfully identified as the same entity is recorded, and high-probability events arising during identification are rarely recorded, for example the case in which two entities are judged to be the same entity with high probability but cannot be confirmed as the same entity with certainty. This hinders deep mining and analysis of data relationships.

Disclosure of Invention

The invention provides a processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing, which solves problems such as identity discrimination of main and subsidiary entities and automatic merging and supplementation of data for multi-source, multi-level data, and provides a feasible solution for multi-source, large-scale data association operations.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows.

A processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing specifically comprises the following steps:

A1. extracting a main entity bibliographic item MEFS and a subsidiary entity bibliographic item SEFS from an entity bibliographic item database EFDB of a source A, extracting an application scene ES between a main entity M(m) and a subsidiary entity S(m) from an entity application scene database ESDB of the source A, extracting related entity information from an entity static database RSDB, extracting information representing single-source same entities according to the main entity and the same scene information by using a single-source same entity screening and data supplementing device, storing the information into a same entity database SEDB, and performing data supplementation;

A2. extracting related entity information from the entity static database RSDB, extracting a subsidiary entity bibliographic item SEFS from an entity bibliographic item database EFDB of a source B, extracting an application scene ES between a main entity M(m) and a subsidiary entity S(m) from an entity application scene database ESDB of the source B, extracting dynamic library entity data from an entity dynamic database RVDB, extracting same-entity data from the same entity database SEDB, judging the identity of heterogeneous entities according to rules by using a heterogeneous same entity discriminator, extracting information representing heterogeneous same entities, transmitting the information to a heterogeneous entity data supplementing device, and simultaneously storing the information into the entity dynamic database RVDB;

A3. extracting dynamic library entity data from the entity dynamic database RVDB, extracting same-entity data from the same entity database SEDB, receiving the information on heterogeneous same entities from the heterogeneous same entity discriminator, supplementing the heterogeneous entity information by using the heterogeneous entity data supplementing device according to the most-recent-time principle, and storing the supplemented heterogeneous entity information into the entity dynamic database RVDB;

A4. extracting the same entity data information from the same entity database SEDB, extracting the entity data information of the dynamic database from the entity dynamic database RVDB, utilizing an entity directory item automatic extraction generator, extracting entity directory ELS information according to an entity directory essential item ELES, and storing the entity directory ELS information into an entity directory database EDDB;

A5. extracting dynamic database entity data information from an entity dynamic database RVDB, extracting entity directory information from an entity directory database EDDB, automatically separating sub-entity information from the entity directory database EDDB by utilizing a sub-entity automatic separator according to rules to form sub-entity directory information, and storing the sub-entity directory information into the entity directory database EDDB.

In the above processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing, the working method of the single-source same entity screening and data supplementing device in step A1 is as follows:

A11. reading a single-source multi-library data set DSB from an entity static library database RSDB of a source A;

A12. reading the number N1 of libraries not yet warehoused from the entity bibliographic item database EFDB of the source A, and setting n1 = 1;

A13. reading the main entity bibliographic item MEFS of library n1, obtaining the data set DSA of the main entity bibliographic item MEFS together with its number of records I1, and setting i1 = 1;

A14. reading the i1-th record in the data set DSA and matching it against the data in the set DSB by using the unique item K of the bibliographic item data; if the matching is successful, executing step A15, and if the matching is unsuccessful, executing step A19;

A15. extracting the related information representing the identity of the single-source entity of the main entity m1 corresponding to the record i1, and writing the related information into an identical entity database SEDB;

A16. reading a related information data set DSC of the master entity m1 in the source A, which characterizes the same entity, from the same entity database SEDB;

A17. reading an affiliated entity information set DSS corresponding to a main entity m1 from an entity application scene database ESDB, and judging whether a specific affiliated entity s has the same entity or not by using the same scene SS rule; if the same entity exists, performing step A18, otherwise, performing step A19;

A18. extracting the same-entity related information of the specific affiliated entity s, and writing it into the same entity database SEDB;

A19. judging whether I1 > i1 is true; if so, setting i1 = i1 + 1 and jumping to step A14; otherwise, jumping to step A110;

A110. judging whether N1 > n1 is true; if so, setting n1 = n1 + 1 and jumping to step A13; otherwise, ending.
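
The nested loop of steps A11 to A110 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the in-memory shapes chosen for DSB, the libraries and the scene database, the field names (`K`, `main_key`, `attrs`, `name`), and the helper `same_scene` are all hypothetical and introduced for the example.

```python
def same_scene(a, b):
    """Hypothetical same-scene (SS) rule: same main entity and consistent index set."""
    return a["main_key"] == b["main_key"] and set(a["attrs"]) == set(b["attrs"])


def screen_single_source(dsb, libraries, scenes):
    """Sketch of A11-A110.

    dsb:       {unique item K: static-library record}      (assumed shape of DSB)
    libraries: list of libraries, each a list of records   (assumed shape of DSA per library)
    scenes:    {unique item K: [affiliated-entity records]} (assumed shape of ESDB)
    Returns the rows that would be written to SEDB.
    """
    sedb = []
    for n1, dsa in enumerate(libraries, start=1):       # A12 / A110: loop over libraries
        for i1, record in enumerate(dsa, start=1):      # A13 / A19: loop over records
            key = record["K"]
            if key not in dsb:                          # A14: match on the unique item K
                continue
            sedb.append({"kind": "main", "K": key, "library": n1, "record": i1})  # A15
            affiliated = scenes.get(key, [])            # A17: set DSS of this main entity
            for idx, s in enumerate(affiliated):
                # A17 / A18: s has a same entity if an earlier record shares its scene
                if any(same_scene(s, t) for t in affiliated[:idx]):
                    sedb.append({"kind": "affiliated", "K": key, "name": s["name"]})
    return sedb
```

Using `for` loops over the libraries and records replaces the explicit counters n1 and i1 and the jumps between A14, A19 and A110, while visiting the records in the same order.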

In the above processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing, the specific method by which the heterogeneous same entity discriminator judges the identity of heterogeneous entities in step A2 is as follows:

A21. reading, by entity type, the number N2 of affiliated entity types not yet warehoused from the entity bibliographic item database EFDB of a source B, and setting n2 = 1;

A22. reading the related information of the specific affiliated entity type n2, and simultaneously obtaining the warehousing threshold TH of the affiliated entity type n2 set by the system;

A23. judging whether a corresponding entity dynamic database RVDB exists or not according to the affiliated entity type n2, if so, executing a step A24, and if not, jumping to the step A214 for execution;

A24. reading, according to the affiliated entity type n2, a related information data set DSF representing same entities of the affiliated entity type n2 from the same entity database SEDB;

A25. reading a dynamic library information data set DSD from the entity dynamic database RVDB;

A26. reading the set DSG of the affiliated entity type n2 from the affiliated entity bibliographic item database EFDB of the source B, obtaining its number of records M2, and setting m2 = 1;

A27. reading the m2-th affiliated entity bibliographic item record from the set DSG;

A28. reading, according to the affiliated entity type n2 and the record m2, the specific application scene es between the affiliated entity corresponding to the record m2 and its main entity from the entity application scene database ESDB of the source B;

A29. reading, according to the affiliated entity type n2 and the record m2, the specific static library data set DSE corresponding to the record m2 from the entity static database RSDB of the source B;

A210. acquiring the set DSF from step A24, the set DSD from step A25, the record m2 from step A27, the application scene es from step A28 and the set DSE from step A29; matching within the set DSD according to set rules by using the unique item, invariant item and common item attributes of the affiliated entity bibliographic item record m2 together with the application scene es and the sets DSD, DSE and DSF; and calculating the similarity probability P(A) between entities;

A211. judging whether P (A) > TH is true, if not, jumping to the step A213 to execute, if so, writing P (A) and the information representing the entity item into the same entity database SEDB;

A212. judging whether P(A) = 100% is true; if not, jumping to step A213; if so, transmitting the record m2, the specific record item d corresponding to the set DSD, the specific record item e corresponding to the set DSE and the specific record item f corresponding to the set DSF to the heterogeneous entity data supplementing device, and starting the heterogeneous entity data supplementing device;

A213. judging whether M2 > m2 is true; if so, setting m2 = m2 + 1 and jumping to step A26; if not, executing step A214;

A214. judging whether N2 > n2 is true; if so, setting n2 = n2 + 1 and jumping to step A22; if not, ending.
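
The inner part of this procedure (steps A26 to A213, for one affiliated entity type) can be sketched as below. It is a simplified illustration: the `similarity` callable stands in for the rule-based calculation of step A210, taking the best-scoring row of DSD is an assumption about how "matching in the set DSD" is resolved, and the condition used for step A212 is assumed here to be P(A) = 100%. The function and parameter names are hypothetical.

```python
def discriminate(records, dynamic_rows, similarity, threshold):
    """Sketch of A26-A213 for one affiliated entity type.

    records:      source-B bibliographic records (set DSG)
    dynamic_rows: rows of the dynamic library (set DSD)
    similarity:   callable(record, row) -> probability in [0, 1]; stand-in for A210
    threshold:    warehousing threshold TH
    Returns (sedb_rows, to_supplement).
    """
    sedb_rows, to_supplement = [], []
    for m2, record in enumerate(records, start=1):
        # A210: find the best-matching dynamic row and its probability P(A)
        best_row, p = None, 0.0
        for row in dynamic_rows:
            score = similarity(record, row)
            if score > p:
                best_row, p = row, score
        if best_row is None or p <= threshold:    # A211: require P(A) > TH
            continue
        sedb_rows.append({"record": m2, "match": best_row, "p": p})
        if p == 1.0:                              # A212 (assumed): only certain matches
            to_supplement.append((record, best_row))
    return sedb_rows, to_supplement
```

Matches above the threshold but below certainty are still recorded with their probability, which reflects the stated aim of keeping high-probability identification events rather than discarding them.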

In the above processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing, the specific method for supplementing heterogeneous entity information in step A3 is as follows:

A31. receiving information of a record m2, a specific record item d corresponding to the set DSD, a specific record item e corresponding to the set DSE and a specific record item f corresponding to the set DSF;

A32. for the unique item, invariant item and common item attributes of the specific bibliographic item, obtaining the number N3 of attributes, and setting n3 = 1;

A33. obtaining the attribute name of the n3-th attribute;

A34. reading the data dn of the record item d corresponding to the attribute name, and reading in sequence the corresponding data of the record m2, the record item e and the record item f and comparing them with dn;

A35. judging whether dn is empty; if so, executing step A36; if not, executing step A37;

A36. supplementing dn with the most recent corresponding data from the record m2, the record item e and the record item f according to the most-recent-time principle, and recording the time stamp and source information of the supplemented data;

A37. marking the time stamp and the source information of the corresponding attribute data in the record m2, the record item e and the record item f;

A38. forming a temporary record item d', and judging whether N3 > n3 is true; if so, jumping to step A33 to process the next attribute; otherwise, executing step A39;

A39. for other attributes except the unique item, the invariable item and the common item, reading corresponding attribute data in the record item m2, the record item e and the record item f in sequence, and comparing the attribute data with the record item d;

A310. recording the time stamps and source information to form the latest temporary record item, and updating it into the entity dynamic database RVDB.
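
The attribute-by-attribute supplementation of steps A32 to A38 can be sketched as follows. This is a minimal illustration, assuming each candidate record (record m2 and record items e and f) carries one time stamp and one source label; the dictionary layout and the names `supplement`, `candidates` and `provenance` are hypothetical.

```python
def supplement(d, candidates, attributes):
    """Sketch of A32-A38: fill empty attributes of record item d.

    d:          {attr: value} for the dynamic-library record item d
    candidates: list of {"source": str, "timestamp": comparable, "data": {attr: value}}
                standing in for record m2 and record items e and f
    attributes: names of the unique, invariant and common attributes
    Returns (d_prime, provenance), where provenance maps attr -> (timestamp, source).
    """
    d_prime, provenance = dict(d), {}
    for name in attributes:                                   # A33: next attribute
        holders = [c for c in candidates if c["data"].get(name) not in (None, "")]
        if not holders:
            continue
        latest = max(holders, key=lambda c: c["timestamp"])   # most recent non-empty value
        if d_prime.get(name) in (None, ""):                   # A35 / A36: dn is empty, fill it
            d_prime[name] = latest["data"][name]
        # A36 / A37: keep the time stamp and source of the corresponding data
        provenance[name] = (latest["timestamp"], latest["source"])
    return d_prime, provenance
```

The function returns a copy (the temporary record item d') rather than modifying `d` in place, so the caller decides when to write the result back to the dynamic library.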

In the above processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing, the method for generating the entity directory information in step A4 is as follows:

A41. obtaining the number N4 of entity types according to the entity types set by the system, and setting n4 = 1;

A42. reading the entity directory item els and the entity directory essential item eles of entity type n4;

A43. reading, from the same entity database SEDB, the same-entity data set DSH of entity type n4 for which P(A) = 100%;

A44. according to the set DSH, extracting the data of the entity directory item els of entity type n4 from the entity dynamic database RVDB according to the most-recent-time principle to form a temporary data set DSI;

A45. filtering the set DSI according to the non-null principle for the entity directory essential item eles of entity type n4 to form a data subset DSJ;

A46. writing the set DSJ into the entity directory database EDDB as the entity directory ELS information of entity type n4;

A47. judging whether N4 > n4 is true; if so, setting n4 = n4 + 1 and jumping to step A42; otherwise, ending.
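
For a single entity type, steps A43 to A46 amount to keeping the most recent dynamic-library row per confirmed entity, projecting it onto the directory attributes, and dropping entries whose essential attributes are empty. The sketch below illustrates this under assumed data shapes; the row layout (`key`, `timestamp`, `data`) and the function name are hypothetical.

```python
def build_directory(dsh_keys, dynamic_rows, els, eles):
    """Sketch of A43-A46 for one entity type.

    dsh_keys:     keys of entities confirmed as the same entity (set DSH)
    dynamic_rows: list of {"key", "timestamp", "data"} rows from the dynamic library
    els:          attribute names of the entity directory item ELS
    eles:         attribute names of the essential item ELES (must be non-empty)
    """
    latest = {}
    for row in dynamic_rows:                       # A44: most recent row per entity
        if row["key"] in dsh_keys:
            kept = latest.get(row["key"])
            if kept is None or row["timestamp"] > kept["timestamp"]:
                latest[row["key"]] = row
    # A44: project onto the directory attributes to form the temporary set DSI
    dsi = [{a: row["data"].get(a) for a in els} for row in latest.values()]
    # A45: keep only entries whose essential attributes are all non-empty (set DSJ)
    return [e for e in dsi if all(e.get(a) not in (None, "") for a in eles)]
```

Here an entry with an empty "name" would be filtered out when "name" is listed in `eles`, matching the idea that a directory entry without its name-class attribute is meaningless.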

In the above processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing, the method for automatically separating sub-entities from the entity directory information in step A5 is as follows:

A51. starting, according to a user instruction, the sub-entity separation program for a specific entity n5;

A52. reading an entity separation rule r specified or preset by a user;

A53. reading a directory data set DSO of a specific entity n5 from an entity directory database EDDB, and setting a temporary data set DSP;

A54. obtaining the number I5 of records in the set DSO, and setting i5 = 1;

A55. reading the i5-th record in the set DSO, reading the corresponding dynamic library entity data in the entity dynamic database RVDB according to the information of that record, and matching; if the matching is successful, executing step A56; otherwise, executing step A57;

A56. adding the i5-th record into the data set DSP;

A57. judging whether I5 > i5 is true; if so, setting i5 = i5 + 1 and jumping to step A55; if not, executing step A58;

A58. writing the data set DSP into the entity directory database EDDB.
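
Steps A53 to A58 reduce to filtering the directory records of one entity by a separation rule. The sketch below assumes the rule r can be expressed as a predicate over a directory record and a dynamic-library row; that representation, and all names used, are hypothetical.

```python
def separate_sub_entities(dso, dynamic_rows, rule):
    """Sketch of A53-A58.

    dso:          directory records of the chosen entity (set DSO)
    dynamic_rows: rows read from the dynamic library
    rule:         callable(record, row) -> bool; stand-in for separation rule r
    Returns the temporary set DSP that would be written back to EDDB.
    """
    dsp = []
    for record in dso:                                      # A54 / A57: loop over records
        if any(rule(record, row) for row in dynamic_rows):  # A55: match against RVDB data
            dsp.append(record)                              # A56: keep the matching record
    return dsp


# Example rule (an assumption): a record is a sub-entity of the row's parent organization.
branches = separate_sub_entities(
    [{"name": "Branch A", "parent": "HQ"}, {"name": "Other", "parent": "X"}],
    [{"org": "HQ"}],
    lambda record, row: record["parent"] == row["org"],
)
```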

Due to the adoption of the technical scheme, the technical progress of the invention is as follows.

Through technical methods such as calculating the identity probability between main and subsidiary entities, supplementing indexes and merging data for the same entity, extracting and storing entity directory items, and separating entity sub-directory items, the invention systematically solves the problems of processing and grouping main and subsidiary entities according to identity probability, merging and supplementing entity data across sources, storing entity relationships uniformly, and separating entities on demand, and provides a feasible solution for multi-source, large-scale data association operations.

The invention mainly has the following notable effects.

1) The data is regular and the expression is uniform. Because the invention provides the identification, extraction and storage according to the entity entry, and the secondary processing and extraction of data according to the entity entry, compared with the prior art, the indexes are standardized and unified, the data can be regularly and uniformly stored, the entity expression is more uniform, and the use is more flexible.

2) The entity matching accuracy and the execution efficiency are improved. The invention classifies the specific attributes of entities, assigns them different weights, and combines information such as the same scene to match and extract entities. Compared with the prior art, matching is easier and more precise; fewer attributes are computed, so execution is more efficient; and problems such as contradictory or inconsistent attribute values can be effectively alleviated.

3) The data quality is improved. In the process of extracting and storing the entity data, the invention realizes the self-perfection and correction of the entity data by extracting and identifying the hidden attribute.

4) The same-entity probabilities are recorded. The invention stores and processes data separately according to the same-entity probability obtained during identification. Compared with the prior art, this improves the accuracy of data fusion, reduces the difficulty of secondary entity identification, and facilitates deep mining and data analysis of different scene applications and entity relationships.

Drawings

FIG. 1 is a schematic structural view of the present invention;

FIG. 2 is a flow chart of the present invention;

FIG. 3 is a schematic diagram of the working flow of the single-source same entity screening and data supplementing device according to the present invention;

FIG. 4 is a schematic diagram of the working flow of the heterogeneous same entity discriminator according to the present invention;

FIG. 5 is a schematic diagram of the working flow of the heterogeneous entity data supplementing device according to the present invention;

FIG. 6 is a schematic diagram of the working flow of the entity directory item automatic extraction generator according to the present invention;

FIG. 7 is a schematic diagram of the working flow of the sub-entity automatic separator according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

A processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing is applied in the field of big data processing. It provides a technical scheme that strips multi-source data entities into main and subsidiary entities, discriminates same entities according to the same scene, entity attribute classification, weights and the like, and processes and stores the results separately according to the discrimination probability, making possible different scene applications of the data, deep mining of entity relationships, and data analysis.

In actual operation, firstly, extracting information representing a single-source same entity; then, judging the information of different sources and the same entity, and performing data supplement and expansion; and finally forming an entity directory item and an entity sub-directory item.

In the present invention, the following databases are used: 1) an entity static database RSDB (Relative Static Database) for storing data of multiple libraries from the same source (single source); 2) an entity dynamic database RVDB (Relative Variety Database) for storing the integrated indexes and data of entities from different sources; 3) an entity bibliographic item database EFDB (EntityFeatureDatabase) for storing the main entity bibliographic item MEFS and its related data, the subsidiary entity bibliographic item SEFS and its related data, and the like; 4) an entity application scene database ESDB (EntitySenseDatabase) for storing the application scene ES between the main entity M(m) and the subsidiary entity S(m).

In the present invention, the terminology used includes:

1) Source (Source) S: a collection of data sets describing a particular subject, which is stable and continuous over a period of time.

2) Library (Data-Set) DS: a collection of data sets generated by a source over a certain period of time, which may be composed of one or more two-dimensional data tables.

3) Table (Table) T: a two-dimensional data table in a library.

4) Entity (Entity): a research object with relative stability and uniqueness that can be described by a group of characteristic variables. According to the dependency relationships among entities, entities are divided into main entities and subsidiary entities.

5) Main entity (MainEntity): the research entity described by all or most of the attributes in a source; generally a source has only one main entity. It is written in the format "Entity (main entity corresponding to the entity)", so the main entity is written M(m).

6) Subsidiary entity (SubordinateEntity): an entity that depends on the main entity in the source; usually it is a part of the main entity or a set of variables describing attributes of the main entity. It is written in the format "Entity (main entity corresponding to the entity)", so the subsidiary entity is written S(m).

7) Entity bibliographic item EFS (EntityFeatureStructure): a set of indexes reflecting the attributes of an entity.

8) Main entity bibliographic item MEFS (MainEntityFeatureStructure): a set of indexes reflecting the attributes of the main entity.

9) Subsidiary entity bibliographic item SEFS (SubsidiaryEntityFeatureStructure): a set of indexes reflecting the subsidiary entity and its association with the main entity; it reflects both the subsidiary entity's own attributes and the related attributes of the state of the main entity in which the subsidiary entity is located.

10) Same scene SS (SameSense): when entities are stripped, subsidiary entities in the same source are in the same scene when their indexes are consistent and their corresponding specific main entities are consistent.

To discriminate entity identity, the attributes of an entity bibliographic item are divided into unique items, invariant items and common items, wherein: the unique item K (Key) is an attribute that characterizes the uniqueness of an entity, such as an identity card number, a unified social credit code or an organization code; the invariant item UC (Unchange) is an attribute of an entity that never or seldom changes, such as the name and sex of a person entity or the unit name and address of an organization entity; the common item N (Normal) is any attribute of the entity other than the unique item K and the invariant item UC.
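
The three attribute classes can be represented with a small lookup, sketched below. The attribute names in `PERSON_SCHEMA` follow the examples in the text; the `AttrClass` enum and the `classify` helper are hypothetical names introduced only for illustration.

```python
from enum import Enum


class AttrClass(Enum):
    UNIQUE = "K"       # unique item
    INVARIANT = "UC"   # invariant item
    COMMON = "N"       # common item


# Example classification for a person entity, following the examples in the text.
PERSON_SCHEMA = {
    "id_card_number": AttrClass.UNIQUE,
    "name": AttrClass.INVARIANT,
    "sex": AttrClass.INVARIANT,
}


def classify(attribute, schema):
    """Any attribute that is neither unique nor invariant is a common item."""
    return schema.get(attribute, AttrClass.COMMON)
```

Defaulting to the common class mirrors the definition of N as every attribute other than K and UC, so only the unique and invariant attributes need to be listed per entity type.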

To provide services to external applications and to extract entity directories, entity directory items and entity directory essential items are defined, wherein: the entity directory item ELS (EntityListStructure) is a limited set of attributes, selected according to the specific application, that reflects the basic status of an entity; for example, for an "organization" entity the basic items can be set as "organization name", "unified social credit code", "address", and so on; the entity directory essential item ELES (EntityListEssentialStructure) is a limited set of attributes, selected according to the specific application, that guarantees an entity directory entry is meaningful; these are typically name-class attributes whose absence renders a specific entity meaningless, for example the "organization name" of an "organization" entity or the "name" of a "personnel" entity.

In the invention, after entities are identified, extracted and processed, the heterogeneous data are stored in the following two databases: the entity directory database EDDB (EntityDirectoryDatabase) stores entity directory information from different sources for providing services externally; the same entity database SEDB (SameEntityDatabase) stores information characterizing same entities.

The implementation of the invention depends on a plurality of modules, as shown in FIG. 1, including a single-source same entity screening and data supplementing device, a heterogeneous same entity discriminator, a heterogeneous entity data supplementing device, an entity directory item automatic extraction generator, and a sub-entity automatic separator.

The flow of the processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing is shown in FIG. 2, and the method specifically comprises the following steps.

A1. Extracting a main entity bibliographic item MEFS and a subsidiary entity bibliographic item SEFS from an entity bibliographic item database EFDB of a source A, extracting an application scene ES between a main entity M(m) and a subsidiary entity S(m) from an entity application scene database ESDB of the source A, extracting related entity information from an entity static database RSDB, extracting information representing single-source same entities according to the main entity, the same scene and the like by using a single-source same entity screening and data supplementing device, storing the information into a same entity database SEDB, and performing data supplementation.

In this step, the working method of the single-source same entity screening and data supplementing device is shown in FIG. 3 and is specifically as follows.

A18. extracting the same entity related information of the specific affiliated entity s, and writing the same entity related information into a same entity database SEDB;

A110. judging whether N1 > n1 is true; if so, setting n1 = n1 + 1 and jumping to step A13; otherwise, ending.

A2. Extracting related entity information from the entity static database RSDB, extracting a subsidiary entity bibliographic item SEFS from an entity bibliographic item database EFDB of a source B, extracting an application scene ES between a main entity M(m) and a subsidiary entity S(m) from an entity application scene database ESDB of the source B, extracting dynamic library entity data from an entity dynamic database RVDB, extracting same-entity data from the same entity database SEDB, judging the identity of heterogeneous entities according to rules by using a heterogeneous same entity discriminator, extracting information representing heterogeneous same entities, transmitting the information to a heterogeneous entity data supplementing device, and simultaneously storing the information into the entity dynamic database RVDB.

In this step, the process by which the heterogeneous same entity discriminator judges the identity of heterogeneous entities is shown in FIG. 4, and the specific method is as follows.

A21. reading, by entity type, the number N2 of affiliated entity types not yet warehoused from the entity bibliographic item database EFDB of the source B, and setting n2 = 1;

A22. reading the relevant information of the specific affiliated entity type n2, and simultaneously obtaining the warehousing threshold TH of the affiliated entity type n2 set by the system;

A23. judging whether the corresponding entity dynamic database RVDB exists or not according to the affiliated entity type n2, if so, executing the step A24, and if not, jumping to the step A214 for execution;

A26. reading a set DSG of an affiliated entity type n from an affiliated entity entry database EFDB of a source B to obtain the number M of records, wherein M is 1;

A210. acquiring set DSF information from step A24, acquiring set DSD information from step A25, acquiring record m information from step A27, acquiring application scenario es information from step A28, acquiring set DSE information from step A29, matching in the set DSD according to set rules by using unique items, invariable items and common item attributes of the record m2 of the subject entry of the affiliated entity and the application scenario es, set DSD, set DSE and set DSF information, and calculating the similarity probability P (A) among the entities;

in this embodiment: when the personnel entities are matched, aiming at the information of two personnel, if the identity card numbers are the same, P (A) is 100 percent; if the name and the mobile phone number are the same, P (A) is 100%; if the name and unit are the same, P (A) is 80%, etc.

A214. judging whether N2 > n2 is true; if true, setting n2 = n2 + 1 and jumping to step A22; if not, ending the process.

A3. Dynamic entity data is extracted from the entity dynamic database RVDB, same-entity data is extracted from the same entity database SEDB, and the heterogeneous same-entity information is received from the heterogeneous same-entity discriminator; the heterogeneous entity data supplementer then supplements the information of the heterogeneous entities according to principles such as time recency, and stores the supplemented information into the entity dynamic database RVDB.

In this step, the flow of the information supplementation of the heterogeneous entity is shown in Fig. 5, and the specific method is as follows.

A33. obtaining the attribute name of the n3-th attribute;

A38. forming a temporary record item d', and judging whether N3 > n3 is true; if true, jumping to step A33; otherwise, executing step A39;

A39. for other attributes except the unique item, the invariable item and the common item, reading corresponding attribute data in the record m2, the record item e and the record item f in sequence, and comparing the attribute data with the record item d;
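
One possible reading of the time recency principle used in step A3 is sketched below. The comparison details are not given in this passage, so the sketch makes several assumptions that are not from the source: records are plain dictionaries, each carries a hypothetical `updated_at` field, and supplementation only fills attributes that are empty in the base record, taking the value from the most recently updated candidate that has one.

```python
from datetime import date


def supplement_record(base, candidates, skip_keys=("updated_at",)):
    """Return a copy of `base` with empty attributes filled from `candidates`.

    Candidates are visited newest first, so the most recent non-empty value
    wins. Existing non-empty values in `base` are never overwritten
    (an assumption of this sketch).
    """
    merged = dict(base)
    ordered = sorted(candidates, key=lambda r: r["updated_at"], reverse=True)
    for rec in ordered:
        for key, value in rec.items():
            if key in skip_keys:
                continue  # bookkeeping fields are not copied
            if merged.get(key) in (None, "") and value not in (None, ""):
                merged[key] = value
    return merged


d = {"name": "Zhang", "unit": "", "updated_at": date(2021, 1, 1)}
m2 = {"name": "Zhang", "unit": "Institute A", "email": None, "updated_at": date(2020, 5, 1)}
e = {"name": "Zhang", "unit": "Institute B", "email": "z@example.org", "updated_at": date(2022, 3, 1)}
result = supplement_record(d, [m2, e])
```

Here `result["unit"]` is taken from record `e` because it is newer than `m2`, while `name` keeps the value already present in `d`.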

A4. Extracting the same entity data information from the same entity database SEDB, extracting the dynamic database entity data information from the entity dynamic database RVDB, utilizing an entity directory item automatic extraction generator, extracting entity directory ELS information according to an entity directory essential item ELES, and storing the entity directory ELS information into the entity directory database EDDB.

In this step, the flow for generating the entity directory information is shown in Fig. 6, and the generation method is as follows.

A43. reading from the same entity database SEDB the same-entity data set DSH of entity n for which P(A) = 100%;

A44. according to the set DSH, extracting the related data information of the entity directory entry els of the entity n from the entity dynamic database according to the latest time principle to form a temporary data set DSI;

A45. according to a data non-null principle of an entity directory essential item eles of the entity n, filtering a set DSI to form a data subset DSJ;
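
Steps A43 to A45 can be sketched as a three-stage filter. This is an illustration under stated assumptions rather than the patented implementation: the in-memory row layout, the field names `entity_id`, `p` and `updated_at`, and the function name are all hypothetical, and the essential items are assumed to be a subset of the directory items.

```python
def build_directory_subset(same_entity_rows, dynamic_rows, directory_items, essential_items):
    """Return the data subset DSJ for one entity type."""
    # A43: DSH = entities whose same-entity probability P(A) is 100%.
    dsh = {row["entity_id"] for row in same_entity_rows if row["p"] == 1.0}

    # A44: for each entity in DSH keep only the latest dynamic record,
    # projected onto the directory items, to form the temporary set DSI.
    latest = {}
    for row in dynamic_rows:
        eid = row["entity_id"]
        if eid in dsh and (eid not in latest or row["updated_at"] > latest[eid]["updated_at"]):
            latest[eid] = row
    dsi = [
        {"entity_id": eid, **{k: row.get(k) for k in directory_items}}
        for eid, row in latest.items()
    ]

    # A45: keep only records whose essential items are all non-null.
    return [r for r in dsi if all(r.get(k) not in (None, "") for k in essential_items)]


same = [{"entity_id": 1, "p": 1.0}, {"entity_id": 2, "p": 0.8}, {"entity_id": 3, "p": 1.0}]
dyn = [
    {"entity_id": 1, "name": "A", "unit": "U1", "updated_at": 1},
    {"entity_id": 1, "name": "A", "unit": "U2", "updated_at": 2},
    {"entity_id": 2, "name": "B", "unit": "U3", "updated_at": 1},
    {"entity_id": 3, "name": "", "unit": "U4", "updated_at": 1},
]
dsj = build_directory_subset(same, dyn, ["name", "unit"], ["name"])
```

In this example entity 2 is dropped at the first stage because its probability is below 100%, and entity 3 is dropped at the last stage because its essential item `name` is empty.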

In this step, the automatic separation method for the sub-entity directory information is shown in Fig. 7, and is specifically as follows.

A51. starting a sub-entity separation program of a specific entity n according to a user instruction;

A52. reading an entity separation rule r specified or preset by a user;

A53. reading a directory data set DSO of a specific entity n from an entity directory database EDDB, and setting a temporary data set DSP;

A54. obtaining the number I5 of records in the set DSO, and setting i5 to 1;

A56. adding the record n5 into a data set DSP;

A58. writing the data set DSP into the entity directory database EDDB.
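
The separation loop of steps A51 to A58 can be sketched as follows. Steps A55 and A57 do not appear in this passage, so the sketch assumes that the separation rule r acts as a yes/no test on each record and that the loop simply visits every record in DSO once; the function name and the dictionary record layout are likewise hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def separate_sub_entities(dso: Iterable[Record], rule: Callable[[Record], bool]) -> List[Record]:
    """Collect the records of DSO that satisfy rule r into the temporary set DSP."""
    dsp: List[Record] = []       # A53: temporary data set DSP
    for record in dso:           # A54 onward: visit each record of DSO in turn
        if rule(record):         # assumed check of the separation rule r
            dsp.append(record)   # A56: add the record to DSP
    return dsp                   # A58 would write DSP to EDDB; here it is returned


dso = [
    {"name": "Institute A", "region": "Hebei"},
    {"name": "Institute B", "region": "Beijing"},
    {"name": "Institute C", "region": "Hebei"},
]
dsp = separate_sub_entities(dso, lambda r: r["region"] == "Hebei")
```

Passing the rule as a callable keeps the loop independent of any particular separation criterion, which matches the text's statement that the rule may be specified by the user or preset.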

The application of the invention can realize the following functions.

1) Main and subsidiary entity bibliographic items and directory items are proposed. When entities in heterogeneous data are identified, the large number of varied data index items is screened and extracted according to the main and subsidiary entity bibliographic items, which promotes consistent indexes for representing entities and uniform data storage; the data is then processed and extracted a second time according to the entity directory items, which facilitates unified external data services and large-scale data relationship calculation.

2) Entities are matched by scene. When entity attributes of the data are used for matching and identification, a same-scene identification mechanism is introduced based on the entity application scene of the data, which reduces the difficulty and complexity of entity matching and improves its accuracy.

3) Entity attribute classification and weights are proposed. According to the attribute characteristics of the entity, the attributes of the entity bibliographic items are divided into unique items, invariable items and common items, each class is given a different weight value, and entity identification is carried out using these weights, which helps reduce entity identification computation time and resolves problems such as mutually contradictory attributes.

4) Discrimination probabilities are stored and processed separately. During entity identification, in addition to the information of entities successfully identified as the same, the probability that several entities are the same entity is also recorded, and the results are stored and processed separately according to that probability, which reduces the difficulty of secondary entity identification and facilitates deep mining and data analysis of different scene applications and entity relationships.

Claims

1. A processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing is characterized by comprising the following steps:

2. The method for processing the identity discrimination and data complementation of the multi-source main and auxiliary entities according to claim 1, wherein the working method of the single-source identity discrimination and data complementation device in step A1 is as follows:

A14. reading the i1-th record in the DSA, matching the unique item K of the entry data with the data in the DSB, if the matching is successful, executing the step A15, and if the matching is unsuccessful, executing the step A19;

3. The method for multi-source identity screening and data self-complementing of main and additional entities according to claim 2, wherein the method by which the heterogeneous same-entity discriminator judges the identity of heterogeneous entities in step A2 comprises:

A28. reading a specific application scenario es between the affiliated entity corresponding to the record m2 and the main entity from an entity application scenario database ESDB of the source B according to the affiliated entity type n2 and the record m2;

A213. judging whether M2 > m2 is true; if true, setting m2 = m2 + 1 and jumping to step A26; if not, performing step A214;

4. The method for multi-source identity screening and data self-complementing of main and subsidiary entities according to claim 3, wherein the method for information supplementation of the heterogeneous entity in step A3 comprises:

A32. aiming at the unique item, the invariable item and the common item attributes of a specific bibliographic item, obtaining the number N3 of the attributes, and setting n3 to 1;

A33. obtaining the attribute name of the n3-th attribute;

5. The method for multi-source entity identity screening and data complementation according to claim 4, wherein the method for generating the entity directory information in the step A4 comprises the following steps:

6. The method for multi-source identity screening and data complementation of main and subsidiary entities according to claim 5, wherein the automatic separation method of the sub-entity directory information in step A5 comprises the following steps:

A51. starting a sub-entity separation program of a specific entity n5 according to a user instruction;

A52. reading an entity separation rule r specified or preset by a user;

A54. obtaining the number I5 of records in the set DSO, and setting i5 to 1;

A56. adding the record n5 into a data set DSP;

A58. writing the data set DSP into the entity directory database EDDB.