CN109885693B - Rapid knowledge comparison method and system based on knowledge graph

Info

Publication number: CN109885693B
Application number: CN201910025419.5A
Authority: CN
Inventors: 李兵; 熊燚铭; 胡方家; 陈健; 赵玉琦; 陈秀清
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2019-01-11
Filing date: 2019-01-11
Publication date: 2021-08-03
Anticipated expiration: 2039-01-11
Also published as: CN109885693A

Abstract

The invention provides a rapid knowledge comparison method and system based on a knowledge graph, comprising: constructing knowledge representation units, in which the entries of each domain are split and parsed into knowledge representation units; constructing a knowledge graph, in which the knowledge representation units are stored in a graph database to form the knowledge graph, so that many-to-many graph structure relationships are formed among the domain entries; constructing the domain concepts to be compared, in which the domain concepts to be compared are determined, split and parsed into knowledge representation units, and stored in the knowledge graph, and temporary MENTION relationships that do not damage the original graph structure are established; extracting the multi-level topology of the domain concepts; and comparing the multi-level topologies, in which the weights of the topology nodes are calculated and the weighted similarity of the domain concepts is then calculated to obtain the knowledge comparison result. The invention can quickly and automatically perform knowledge comparison and classification of massive documents, supports complex comparison applications, offers high real-time performance and strong practicability, improves the accuracy of subsequent query fusion, and has important market value.

Description

Method and system for rapid knowledge comparison based on knowledge graph

Technical Field

The invention belongs to the technical field of computer knowledge comparison, and particularly relates to a knowledge fusion method in the field of knowledge graphs.

Background

Knowledge graphs describe knowledge by constructing entities and the relationships between them, so that knowledge can be exchanged, circulated and processed more easily among computers and between computers and people. At the application level, however, synonymous concepts from different sources cannot be effectively understood by a computer, and an effective technical scheme that solves the knowledge comparison problem, primarily for the purpose of knowledge fusion, is urgently needed. This patent provides a topology extraction and comparison scheme that can rapidly compare the degree of synonymy between two domain concepts.

The comparison result is used to guide knowledge comparison and classification of massive documents, for example to determine whether two concepts are the same concept, or whether one concept largely includes the other. Rapid comparison saves system resources and improves the real-time practicability of applications. In the medical domain, for example, the comparison result can support quickly and automatically determining whether a given case belongs to a given field, helping patients find the relevant department quickly.

Disclosure of Invention

To address the problems of existing knowledge comparison techniques, the invention provides a rapid comparison scheme based on topological structure.

The technical scheme provided by the invention is a fast knowledge comparison method based on a knowledge graph, which comprises the following steps,

step 1, constructing knowledge representation units, in which the entries of each domain are split and parsed into knowledge representation units; a knowledge representation unit comprises a domain node AreaNode, a classification node CategoryNode and a description node TextNode, wherein the attributes of each entry are stored in the domain node AreaNode, the classification to which the entry belongs is stored in the classification node CategoryNode, and the detailed sub-entries describing the entry are stored in the description node TextNode; after the attribute text of the domain node and the description text of the description node are segmented using a word segmentation method, a MENTION relationship is established between each mentioned domain node and the node that mentions it, wherein MENTION denotes that one node mentions a domain node;

step 2, constructing a knowledge graph, including storing all knowledge representation units obtained in step 1 into a graph database to form the knowledge graph, so that many-to-many graph structure relationships are formed among the domain entries;

step 3, constructing the domain concepts to be compared, including determining the domain concepts A and B to be compared, splitting and parsing domain concepts A and B into knowledge representation units, storing these units into the knowledge graph obtained in step 2, and establishing temporary MENTION relationships that do not damage the original graph structure;

step 4, extracting the multi-level topology of the domain concepts, including extracting the topological structures of domain concepts A and B on the knowledge graph by subgraph matching, wherein the domain node in the knowledge representation unit, together with the other domain nodes associated with its description nodes through the MENTION relationship, forms the first-level topology of the domain concept; the domain nodes with which the domain nodes of the first-level topology have a MENTION relationship, either directly or indirectly through description nodes, form the second-level topology of the domain concept; likewise, the N-level topology consists of the other domain nodes mentioned directly by the (N-1)-level topology and the other domain nodes with which it has a MENTION relationship indirectly through description nodes; nodes that have already been extracted are not extracted again;

step 5, comparing the multi-level topologies, including obtaining data graphs from the topological structures of domain concepts A and B extracted in step 4, calculating the weights of the topology nodes, and then calculating the weighted similarity α of domain concepts A and B to obtain the knowledge comparison result.

Furthermore, in step 5, the topological node weights are calculated based on the following definitions,

in the knowledge graph, a sub-network centered on node $v$ with depth $d$ is modeled as a data graph $G(v, d) = \{V(G(v, d)), E(G(v, d))\}$, wherein $V(G(v, d))$ is the set of all nodes in the data graph $G(v, d)$ and $E(G(v, d))$ is the set of edges formed by the links between those nodes; in addition, the depth of a node $v_0$ in the data graph $G(v, d)$ is defined as $D(G(v, d), v_0)$, with $v_0 \in V(G(v, d))$, so that the weight $W(G(v, d), v_0)$ of any node $v_0$ is expressed as follows,

when considering the topology jointly formed by the two sub-networks of depth $d$ centered on nodes $v_1$ and $v_2$, the weight of node $v_0$ in it is found as follows,

$$W(G(v_1, d), G(v_2, d), v_0) = W(G(v_1, d), v_0) + W(G(v_2, d), v_0)$$

wherein $G(v_1, d)$ and $G(v_2, d)$ are the data graphs modeled from the sub-networks of depth $d$ centered on nodes $v_1$ and $v_2$ respectively, and $W(G(v_1, d), v_0)$ and $W(G(v_2, d), v_0)$ are the weights of node $v_0$ in the data graphs $G(v_1, d)$ and $G(v_2, d)$.

In step 5, the weighted similarity $\alpha$ of domain concepts A and B is calculated as follows,

let the central nodes of domain concept A and domain concept B be nodes $v_1$ and $v_2$ respectively, with correspondingly expanded data graphs $G_1(d) = G(v_1, d)$ and $G_2(d) = G(v_2, d)$; let the node set of data graph $G_1(d)$ be $V_1(d) = V(G_1(d))$ and the node set of data graph $G_2(d)$ be $V_2(d) = V(G_2(d))$; the weighted similarity $\alpha$ of domain concepts A and B is found as follows:

wherein the weight $W(G_1(d), G_2(d), v_0) = W(G(v_1, d), G(v_2, d), v_0)$.

The invention also provides a system for rapidly comparing knowledge based on the knowledge graph, which comprises the following modules,

the first module is used for constructing knowledge representation units, in which the entries of each domain are split and parsed into knowledge representation units; a knowledge representation unit comprises a domain node AreaNode, a classification node CategoryNode and a description node TextNode, wherein the attributes of each entry are stored in the domain node AreaNode, the classification to which the entry belongs is stored in the classification node CategoryNode, and the detailed sub-entries describing the entry are stored in the description node TextNode; after the attribute text of the domain node and the description text of the description node are segmented using a word segmentation method, a MENTION relationship is established between each mentioned domain node and the node that mentions it, wherein MENTION denotes that one node mentions a domain node;

the second module is used for constructing the knowledge graph, including storing all knowledge representation units obtained by the first module into a graph database to form the knowledge graph, so that many-to-many graph structure relationships are formed among the domain entries;

the third module is used for constructing the domain concepts to be compared, including determining the domain concepts A and B to be compared, splitting and parsing domain concepts A and B into knowledge representation units, storing these units into the knowledge graph obtained by the second module, and establishing temporary MENTION relationships that do not damage the original graph structure;

the fourth module is used for extracting the multi-level topology of the domain concepts, including extracting the topological structures of domain concepts A and B on the knowledge graph by subgraph matching, wherein the domain node in the knowledge representation unit, together with the other domain nodes associated with its description nodes through the MENTION relationship, forms the first-level topology of the domain concept; the domain nodes with which the domain nodes of the first-level topology have a MENTION relationship, either directly or indirectly through description nodes, form the second-level topology of the domain concept; likewise, the N-level topology consists of the other domain nodes mentioned directly by the (N-1)-level topology and the other domain nodes with which it has a MENTION relationship indirectly through description nodes; nodes that have already been extracted are not extracted again;

the fifth module is used for comparing the multi-level topologies, including obtaining data graphs from the topological structures of domain concepts A and B extracted by the fourth module, calculating the weights of the topology nodes, and then calculating the weighted similarity α of domain concepts A and B to obtain the knowledge comparison result.

Furthermore, in the fifth module, the topology node weights are calculated based on the following definitions,

Furthermore, in the fifth module, the weighted similarity $\alpha$ of domain concepts A and B is calculated as follows,

let the central nodes of domain concept A and domain concept B be nodes $v_1$ and $v_2$ respectively, with corresponding data graphs $G_1(d) = G(v_1, d)$ and $G_2(d) = G(v_2, d)$; let the node set of data graph $G_1(d)$ be $V_1(d) = V(G_1(d))$ and the node set of data graph $G_2(d)$ be $V_2(d) = V(G_2(d))$; the weighted similarity $\alpha$ of domain concepts A and B is found as follows:

wherein the weight $W(G_1(d), G_2(d), v_0)$ is $W(G(v_1, d), G(v_2, d), v_0)$.

The invention has the following advantages:

The invention provides a topology extraction scheme that meets the following requirements:

1. The direct characteristics of a domain concept within its domain are effectively described.

2. The indirect characteristics of a domain concept within its domain are expressed through expansion.

3. The extraction speed remains essentially linear in the extraction depth.

The invention provides a topology comparison scheme that meets the following requirements:

1. The difference between two domain concepts is effectively described.

2. The scheme is sensitive to highly similar domain concepts.

3. Direct characteristics are clearly distinguished from indirect characteristics.

The method can therefore quickly and automatically perform knowledge comparison and classification of massive documents, and can be used for tasks requiring complex comparison, such as document filing, rapid entry classification and knowledge fusion. It replaces a large amount of manual analysis with an automatic scheme, offers high real-time performance and strong practicability, improves the precision of subsequent query fusion, and has important market value.

Drawings

Fig. 1 is a schematic diagram of a specific form of a knowledge representation unit according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a multi-level topology extraction method according to an embodiment of the present invention.

Detailed Description

The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.

The fast knowledge comparison method based on the knowledge graph provided by the embodiment of the invention comprises the following steps:

Step 1: constructing knowledge representation units. The invention splits and parses each domain entry into the knowledge representation unit shown in Fig. 1, wherein the attributes of the entry are stored in a domain node (AreaNode), the classification to which the entry belongs is stored in a classification node (CategoryNode), and the detailed sub-entries describing the entry are stored in description nodes (TextNode). After the attribute text and the description text of the description nodes are segmented using a word segmentation method, a MENTION relationship is established between each mentioned domain node and the node that mentions it.

A knowledge representation unit is a form of knowledge representation in which all units share the same overall structure and the same set of logical constraints but differ in their sub-structures; Fig. 1 shows the knowledge representation used in this patent. The units adopt a uniform overall structure and associated logic, while the semantics described by the sub-structures vary. For example, the sub-structures of a medicine may include semantic descriptions of attributes such as chemical structure, physicochemical properties, adverse reactions and precautions for use.

Domain entries are the words related to a domain together with their explanations; for example, 200,000 entries of the medical domain may be stored in a document. When each domain entry is split and parsed into the knowledge representation unit shown in Fig. 1, the complete information of the entry, including the knowledge name and description, is stored in the domain node (AreaNode), the classification to which the entry belongs is stored in the classification node (CategoryNode), and the detailed sub-entries describing the entry are stored in description nodes (TextNode). By splitting and parsing domain entries into knowledge representation units, the patent ensures that the complete information of the knowledge is not damaged, and unstructured entries are converted into structured knowledge that a computer can understand.

Classification nodes are connected to each other by the EMBRACE relationship, which represents the inclusion of one classification in another; classification nodes and domain nodes are connected by the INCLUDE relationship, which represents the inclusion of domain node entities in a classification; domain nodes and description nodes are connected by the INVOLVE relationship, which represents the inclusion of sub-entries in a domain node, and description nodes are also connected to each other by the INVOLVE relationship, which records the sub-descriptions of a description. In addition, domain nodes are connected to each other by the MENTION relationship, which records direct associations between domain nodes; description nodes and domain nodes are also connected by the MENTION relationship, which records the references that description nodes make to domain nodes. These two kinds of MENTION relationship are the main means by which connections between domain nodes are generated.

Note: the three relationships EMBRACE, INCLUDE and INVOLVE do not differ semantically; all of them represent inclusion and are used for partial-order reasoning over knowledge. They are distinguished in storage according to the node types at the two ends of the relationship, so that the computer can conveniently build an index for each relationship, which speeds up the comparison process.
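The node and relationship types described above can be sketched as a small in-memory structure. This is only an illustrative sketch: the patent stores the units in a graph database, and the `Graph` class, the `ALLOWED` table and the sample entries below are hypothetical names introduced here, not part of the original.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    AREA = "AreaNode"          # domain node: attributes of an entry
    CATEGORY = "CategoryNode"  # classification node
    TEXT = "TextNode"          # description node: a sub-entry


class RelType(Enum):
    EMBRACE = "EMBRACE"  # CategoryNode -> CategoryNode
    INCLUDE = "INCLUDE"  # CategoryNode -> AreaNode
    INVOLVE = "INVOLVE"  # AreaNode -> TextNode, TextNode -> TextNode
    MENTION = "MENTION"  # AreaNode -> AreaNode, TextNode -> AreaNode


# Allowed (source type, target type) pairs for each relationship,
# following the endpoint types named in the text.
ALLOWED = {
    RelType.EMBRACE: {(NodeType.CATEGORY, NodeType.CATEGORY)},
    RelType.INCLUDE: {(NodeType.CATEGORY, NodeType.AREA)},
    RelType.INVOLVE: {(NodeType.AREA, NodeType.TEXT), (NodeType.TEXT, NodeType.TEXT)},
    RelType.MENTION: {(NodeType.AREA, NodeType.AREA), (NodeType.TEXT, NodeType.AREA)},
}


@dataclass
class Graph:
    """Hypothetical in-memory stand-in for the graph database."""
    nodes: dict = field(default_factory=dict)  # node id -> NodeType
    edges: set = field(default_factory=set)    # (source id, RelType, target id)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, src, rel, dst):
        pair = (self.nodes[src], self.nodes[dst])
        if pair not in ALLOWED[rel]:
            raise ValueError(
                f"{rel.value} cannot connect {pair[0].value} to {pair[1].value}"
            )
        self.edges.add((src, rel, dst))


# Sample data (made up for illustration).
g = Graph()
g.add_node("Drugs", NodeType.CATEGORY)
g.add_node("Antibiotics", NodeType.CATEGORY)
g.add_node("Penicillin", NodeType.AREA)
g.add_node("Penicillin/adverse reactions", NodeType.TEXT)
g.add_edge("Drugs", RelType.EMBRACE, "Antibiotics")
g.add_edge("Antibiotics", RelType.INCLUDE, "Penicillin")
g.add_edge("Penicillin", RelType.INVOLVE, "Penicillin/adverse reactions")
```

Checking the endpoint types on insertion mirrors the note above: the three inclusion relationships differ only in which node types they join, which is what would let a store index each relationship separately.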

The domain concept knowledge representation scheme provided by the invention can completely describe the knowledge structure of the domain concept, and does not damage the structure of other domain concepts when generating relationship with other domain concepts.

Step 2: constructing the knowledge graph. All knowledge representation units obtained in step 1 are stored in a graph database to form the knowledge graph, so that many-to-many graph structure relationships are formed among the domain entries.

The knowledge representation units formed in step 1 already yield a preliminary association graph through the classification nodes and domain nodes, but at this point the relationships are not rich enough to support comparison. After the description text of the domain nodes and the description text of the description nodes are segmented by a maximum entropy model, a MENTION relationship is established between each mentioned domain node and the node that mentions it (either a domain node or a description node). After this operation, each knowledge representation unit is stored in the graph database to form the knowledge graph, in which the domain entries form many-to-many graph structure relationships, directly or indirectly, through the MENTION relationship.
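How MENTION edges might be derived from text can be sketched as follows. Note the substitution: the patent segments text with a maximum entropy model, whereas this sketch uses plain substring lookup against the known entry names, purely to keep the example self-contained. The function name `find_mentions` and the sample texts are hypothetical.

```python
def find_mentions(texts, entry_names):
    """Return MENTION edges as (mentioning node id, mentioned entry name).

    `texts` maps a node id (domain node or description node) to a pair
    (owning entry name, text). Substring lookup stands in for the word
    segmentation step; it is NOT the maximum entropy model of the patent.
    """
    edges = set()
    for node_id, (owner, text) in texts.items():
        for name in entry_names:
            # Assumption: an entry does not mention itself.
            if name != owner and name in text:
                edges.add((node_id, name))
    return edges


# Sample data (made up for illustration).
entry_names = {"aspirin", "fever", "ibuprofen"}
texts = {
    "aspirin": ("aspirin", "aspirin is used to reduce fever"),
    "aspirin/cautions": ("aspirin", "do not combine with ibuprofen"),
    "ibuprofen": ("ibuprofen", "ibuprofen relieves pain and fever"),
}
mention_edges = find_mentions(texts, entry_names)
```

Here the domain node `aspirin` mentions `fever` directly, while the description node `aspirin/cautions` mentions `ibuprofen`, giving the two kinds of MENTION edge (domain node to domain node, description node to domain node) that the text describes.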

Step 3: constructing the domain concepts to be compared. A domain concept is a knowledge representation unit, obtained by processing a domain entry through the above steps, that is associated with other units, can be correctly expressed by a computer, and does not hinder human understanding. When domain entries A and B need to be compared, they are split and parsed into the knowledge representation units shown in Fig. 1 according to step 1, then stored into the knowledge graph obtained in step 2 in the manner of step 2, and temporary MENTION relationships that do not damage the original graph structure are established. At this point the knowledge representation units are associated with other knowledge representation units and form domain concepts.
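The text does not say how the temporary relationships are kept from damaging the original graph. One possible approach, shown here only as a sketch under that assumption, is to hold temporary MENTION edges in a separate set and discard them once the comparison is finished. The class `MentionGraph` and its methods are hypothetical.

```python
from contextlib import contextmanager


class MentionGraph:
    """MENTION edges only; permanent and temporary edges are kept apart."""

    def __init__(self, permanent_edges=()):
        self.permanent = set(permanent_edges)  # edges of the stored knowledge graph
        self.temporary = set()                 # edges added for one comparison

    def mentioned_by(self, node):
        """Nodes that `node` mentions, over permanent and temporary edges."""
        return {dst for src, dst in self.permanent | self.temporary if src == node}

    @contextmanager
    def comparing(self, temp_edges):
        """Add temporary MENTION edges, then remove them when the block exits."""
        added = set(temp_edges) - self.permanent - self.temporary
        self.temporary |= added
        try:
            yield self
        finally:
            self.temporary -= added


# Sample data (made up for illustration).
graph = MentionGraph({("aspirin", "fever")})
before = set(graph.permanent)
with graph.comparing({("concept A", "aspirin"), ("concept B", "fever")}):
    inside = graph.mentioned_by("concept A")  # visible during the comparison
after = graph.mentioned_by("concept A")       # gone afterwards
```

Because the permanent set is never written to inside the `with` block, the stored graph is the same before and after the comparison, which is the property the step requires.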

Step 4: extracting the multi-level topology of the domain concepts. The topological structures of domain concepts A and B on the knowledge graph are extracted by subgraph matching.

Fig. 2 shows the multi-level topology representation of this patent. The leftmost part is the knowledge representation unit of a domain concept. The domain node in the knowledge representation unit, together with the other domain nodes associated with its description nodes through the MENTION relationship, forms the first-level topology of the domain concept. The domain nodes with which the domain nodes of the first-level topology have a MENTION relationship, either directly or indirectly through description nodes, form the second-level topology of the domain concept. Likewise, the N-level topology consists of the other domain nodes mentioned directly by the (N-1)-level topology and the other domain nodes with which it has a MENTION relationship indirectly through description nodes; nodes that have already been extracted are not extracted again.

Step 5: comparing the multi-level topologies. Data graphs are obtained from the topological structures of domain concepts A and B extracted in step 4, the weights of the topology nodes are calculated according to Definition 1, and the weighted similarity α of domain concepts A and B is then calculated according to Definition 2. α quantifies how similar domain concept A is to domain concept B. This similarity can be used in various domains, and the value of α can guide concept fusion. When used for a comparison task, α has direct meaning; when used for a classification task, all domain concepts of a given class are treated as a first-level topology, which avoids the excessive cost of pairwise concept comparison in traditional classification tasks.
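The level-by-level extraction can be sketched as a breadth-first expansion over MENTION edges. This follows one literal reading of the paragraph above (level 1 is the unit's own domain node plus the domain nodes its description nodes mention); the exact level boundaries are an assumption, as are the function names and the dictionary-based graph, which stand in for subgraph matching on a graph database.

```python
def text_nodes_of(node, involve):
    """All description nodes reachable from `node` through INVOLVE edges."""
    found, stack = set(), list(involve.get(node, ()))
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(involve.get(t, ()))
    return found


def via_descriptions(node, involve, mention):
    """Domain nodes mentioned by the description nodes of `node`."""
    out = set()
    for t in text_nodes_of(node, involve):
        out |= mention.get(t, set())
    return out


def extract_levels(root, depth, involve, mention):
    """Return a list of node sets, one per topology level (level 1 first)."""
    levels = [{root} | via_descriptions(root, involve, mention)]
    seen = set(levels[0])
    for _ in range(depth - 1):
        frontier = set()
        for node in levels[-1]:
            frontier |= mention.get(node, set())                # direct MENTION
            frontier |= via_descriptions(node, involve, mention)  # indirect MENTION
        frontier -= seen  # already extracted nodes are not extracted again
        if not frontier:
            break
        levels.append(frontier)
        seen |= frontier
    return levels


# Sample data (made up for illustration).
involve = {"A": ["A/t1"], "B": ["B/t1"]}
mention = {"A": {"B"}, "A/t1": {"C"}, "B/t1": {"D"}, "C": {"A", "E"}}
levels = extract_levels("A", 3, involve, mention)
```

Each node and edge is visited a bounded number of times per level, so the work grows with the number of levels requested rather than with the size of the whole graph.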

Definitions:

Definition 1, node weight: in the graph, a sub-network centered on node $v$ with depth $d$ can be modeled as a data graph $G(v, d) = \{V(G(v, d)), E(G(v, d))\}$, wherein $V(G(v, d))$ is the set of all nodes in the data graph $G(v, d)$ and $E(G(v, d))$ is the set of edges formed by the links between those nodes. In addition, the depth of a node $v_0$ in the data graph $G(v, d)$ is defined as $D(G(v, d), v_0)$, with $v_0 \in V(G(v, d))$, so that the weight $W(G(v, d), v_0)$ of any node $v_0$ is expressed as:

When considering the topology jointly formed by the two sub-networks of depth $d$ centered on nodes $v_1$ and $v_2$, the weight of node $v_0$ in it is found as follows:

$$W(G(v_1, d), G(v_2, d), v_0) = W(G(v_1, d), v_0) + W(G(v_2, d), v_0) \qquad \text{(Formula II)}$$

Wherein $G(v_1, d)$ and $G(v_2, d)$ are the data graphs modeled from the sub-networks of depth $d$ centered on nodes $v_1$ and $v_2$ respectively, and the weights $W(G(v_1, d), v_0)$ and $W(G(v_2, d), v_0)$ of node $v_0$ in the data graphs $G(v_1, d)$ and $G(v_2, d)$ are obtained according to Formula I.
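Formula II, which adds a node's weights in the two data graphs, can be illustrated with a short sketch. Formula I is not reproduced in this text, so the single-graph weight is passed in as a function; `placeholder_weight` below is a hypothetical stand-in chosen only so that nodes nearer the center weigh more, and is not the patent's Formula I. Treating a node that is absent from a data graph as having weight 0 there is also an assumption.

```python
def combined_weights(depths1, depths2, d, single_weight):
    """Formula II: W(G1, G2, v0) = W(G1, v0) + W(G2, v0) for every node in either graph.

    `depths1` and `depths2` map each node of a data graph to its depth in it.
    `single_weight(d, depth)` supplies the single-graph weight (Formula I in the
    patent, not reproduced here).
    """
    def w(depths, node):
        # Assumption: a node outside a data graph contributes weight 0 there.
        return single_weight(d, depths[node]) if node in depths else 0

    return {n: w(depths1, n) + w(depths2, n) for n in depths1.keys() | depths2.keys()}


def placeholder_weight(d, depth):
    """Hypothetical stand-in for Formula I: shallower nodes get larger weights."""
    return d - depth + 1


# Sample data (made up for illustration): node -> depth in each data graph.
depths1 = {"A": 1, "C": 1, "B": 2}
depths2 = {"B": 1, "C": 2, "D": 3}
weights = combined_weights(depths1, depths2, 3, placeholder_weight)
```

Nodes that lie near the center of both data graphs (here `B` and `C`) receive the largest combined weights, while a node that appears in only one graph and far from its center (`D`) receives the smallest.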

Definition 2, similarity comparison of weighted topology nodes: let the central nodes of domain concept A and domain concept B be nodes $v_1$ and $v_2$ respectively, and let the expansions of domain concept A and domain concept B give two data graphs $G_1(d) = G(v_1, d)$ and $G_2(d) = G(v_2, d)$, with the node set of data graph $G_1(d)$ being $V_1(d) = V(G_1(d))$ and the node set of data graph $G_2(d)$ being $V_2(d) = V(G_2(d))$. The weight in the graphs satisfies $W(G_1(d), G_2(d), v_0) \in (0, 2d)$, and the closer any node $v_0$ is to the root nodes of the two domain concepts (the domain nodes that represent the knowledge units), the higher its weight. The weighted similarity $\alpha$ of $G_1(d)$ to $G_2(d)$ is found as follows:

wherein the weight $W(G_1(d), G_2(d), v_0)$, that is, $W(G(v_1, d), G(v_2, d), v_0)$, is obtained according to Formula II. The data graphs $G_1(d)$ and $G_2(d)$ are expansions of the domain concepts, and $d$ is preferably 3 for comparison.

In specific implementation, the technical scheme can adopt a computer software technology to realize an automatic operation process, and can also adopt a modularized mode to provide a corresponding system. The embodiment of the invention also provides a fast knowledge comparison system based on the knowledge graph, which comprises the following modules,

and the fifth module is used for comparing the multilevel topologies, obtaining a data graph according to the topological structure of the domain concept A, B extracted by the fourth module, calculating the weight of the topological nodes, and then calculating the weighted similarity alpha of the domain concept A, B, wherein the weighted similarity alpha is used for concept fusion.

For the specific implementation of each module, refer to the corresponding step; it is not described again here.

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

Claims

1. A fast knowledge comparison method based on a knowledge graph, characterized by comprising the following steps,

2. The method for fast knowledge comparison based on knowledge-graph according to claim 1, wherein: in step 5, the topology node weights are calculated based on the following definitions,

3. The method for fast knowledge comparison based on knowledge-graph according to claim 2, characterized in that: in step 5, the weighted similarity $\alpha$ of domain concepts A and B is calculated as follows,

let the central nodes of domain concept A and domain concept B be nodes $v_1$ and $v_2$ respectively, with correspondingly expanded data graphs $G_1(d) = G(v_1, d)$ and $G_2(d) = G(v_2, d)$; let the node set of data graph $G_1(d)$ be $V_1(d) = V(G_1(d))$ and the node set of data graph $G_2(d)$ be $V_2(d) = V(G_2(d))$; the weighted similarity $\alpha$ of domain concepts A and B is found as follows:

4. A fast knowledge comparison system based on a knowledge graph, characterized by comprising the following modules,

5. The system of knowledge-graph-based rapid knowledge comparison of claim 4, wherein: in a fifth module, topology node weights are computed based on the following definitions,

6. The system of knowledge-graph-based rapid knowledge comparison of claim 5 wherein: in the fifth module, the weighted similarity α of domain concepts A and B is calculated as follows,