Disclosure of Invention
In view of the above, it is necessary to provide a community discovery method, device, and storage medium for a mineral knowledge graph that can obtain the graph accurately and in real time, with low dependence on computing power and good generalization.
In order to achieve the above object, in one aspect, an embodiment of the present application provides a community discovery method for a mineral knowledge graph, including the steps of:
acquiring a knowledge base, and performing text blocking on the knowledge base to obtain each text block;
based on each text block, obtaining an entity of a knowledge base;
constructing a first entity relationship diagram according to each entity and the relationship among the entities;
Acquiring a first distance between any two entities in each text block, and confirming a second distance between any two entities in a knowledge base based on the first distance;
obtaining the similarity of any two entities in a concept library;
Confirming the sum of the similarity and the second distance as a weight value of an edge in the first entity relation diagram, wherein the edge is formed according to any two entities;
Obtaining modularity according to the weight value of the edge;
And based on the increment of the modularity, aggregating the nodes of the first entity relation graph to obtain a plurality of communities.
In one embodiment, the step of confirming the second distance of any two entities in the knowledge base based on the first distance includes:
setting a truncation function;
and obtaining the second distance based on the truncation function and the first distance.
In one embodiment, in the step of obtaining the second distance based on the truncation function and the first distance, the second distance is obtained based on the following formula:
where WD is the second distance, d_w is the first distance, δ_c is the truncation function, and all_i is the maximum number of times any two entities occur together within the preset range.
In one embodiment, the step of obtaining the similarity of any two entities in the concept library includes:
acquiring description texts and concept libraries corresponding to any two entities;
Vectorization calculation is carried out on the description texts corresponding to any two entities, and vectorized entities are obtained;
Based on the vectorization entity, acquiring the vector of the target concept from the concept library;
Based on the vector of the vectorized entity and the target concept, the similarity of any two entities in the concept library is obtained.
In one embodiment, in the step of obtaining the similarity of any two entities in the concept library based on the vectorized entities and the vector of the target concept, the similarity is obtained based on the following formula:
where CD is the similarity; i is the category index of the target concept; all_c is the number of categories; the two entity symbols denote one and the other of the two vectorized entities, respectively; the two concept symbols denote the vectors of the target concept of category i extracted according to the first and the second vectorized entity, respectively; and W_ci is the weight of the target concept of category i.
In one embodiment, the method further comprises the steps of:
acquiring description texts corresponding to communities;
Summarizing the description text corresponding to each community to obtain community summarization.
In one embodiment, the method further comprises the steps of:
receiving a request of a user;
converting the request into a request vector;
similarity matching is carried out on the request vector and community summarization, and a target community is obtained;
And extracting the entity matched with the request vector from the target community.
In one embodiment, the step of performing text blocking on the knowledge base to obtain each text block includes:
splitting the knowledge base to obtain a plurality of sentences;
vectorizing each sentence to obtain a plurality of vectors;
based on cosine similarity of vectors of adjacent sentences, the adjacent sentences are classified into the same text block or different text blocks.
In another aspect, an embodiment of the present application provides a community discovery device for a mineral knowledge graph, including:
the blocking module is used for acquiring a knowledge base, and performing text blocking on the knowledge base to obtain each text block;
the entity extraction module is used for obtaining the entity of the knowledge base based on each text block;
the building module is used for building a first entity relation diagram according to each entity and the relation among the entities;
The distance acquisition module is used for acquiring a first distance between any two entities in each text block and confirming a second distance between any two entities in the knowledge base based on the first distance;
The similarity acquisition module is used for acquiring the similarity of any two entities in the concept library;
The weight value calculation module is used for confirming the sum of the similarity and the second distance as the weight value of the edge in the first entity relation diagram, wherein the edge is formed according to any two entities;
the modularity calculation module is used for obtaining the modularity according to the weight value of the edge;
And the aggregation module is used for aggregating the nodes of the first entity relation graph based on the increment of the modularity to obtain a plurality of communities.
In another aspect, the application provides a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of the above method when run.
One of the above technical solutions has the following advantages and beneficial effects:
the community discovery method of the mineral knowledge graph gathers discrete knowledge base documents, reduces the dependence on a high-performance large model, and improves the capability of answering questions. Compared with the means of searching weight values from prompt words in the traditional technology, the method provides a quantification mode of weight analysis, reduces performance requirements and obtains better community distinction. Finally, the aggregation of communities and the recombination of community data are realized through modularity, a logical basis is provided for the formation of the geochemical communities, and the uncertainty of large model output is reduced.
Detailed Description
In order that the application may be readily understood, a more complete description of the application will be rendered by reference to the appended drawings. Embodiments of the application are illustrated in the accompanying drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In the following description, suffixes such as "module", "component", or "unit" for representing elements are used only for facilitating the description of the present application, and are not of specific significance per se. Thus, "module" and "component" may be used in combination.
It is to be understood that in the following embodiments, "connected" is understood to mean "electrically connected", "communicatively connected", etc., if the connected circuits, modules, units, etc., have electrical or data transfer between them.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," and/or the like, specify the presence of stated features, integers, steps, operations, elements, components, or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or groups thereof.
Currently, traditional knowledge graph construction means for mineral resources rely mainly on relational databases (e.g. trade databases) or on small-scale data sets and small-scale deep learning models. However, the limitations inherent to these methods result in the following series of problems:
1. Limited cross-text entity relationship extraction: these methods have difficulty accurately capturing and establishing the relationships between core knowledge points when processing cross-paragraph, cross-sentence, or even cross-document information. This drawback greatly hampers the development of general knowledge, making it difficult to comprehensively extract key information from a large number of documents.
2. Delayed data updates: because training a deep learning model is complex and time-consuming and the construction process is complicated, data is updated slowly. A batch of data often takes a long time to go from collection, through model training, to being put into use.
3. Insufficient model generality: because the traditional methods usually adopt a small model, the application range of the model is often limited to a specific field. Now that large-scale models continue to emerge, small-scale models have difficulty meeting the requirements of constructing cross-domain, general-purpose knowledge graphs.
4. Application development challenges: once knowledge graph construction is complete, subsequent application development is also challenging. In particular, when implementing a natural-language question-answer query function, the technical limitations of conventional methods often make it difficult to develop applications that meet user needs quickly and efficiently.
In summary, the traditional mineral resource knowledge graph construction method has significant defects in aspects of cross-text relation extraction, data timeliness, model universality, application development and the like, and the community discovery method of the mineral resource knowledge graph can effectively solve the problems.
In one embodiment, as shown in fig. 1, there is provided a community discovery method for a mineral knowledge graph, including the steps of:
S110, acquiring a knowledge base, and performing text blocking on the knowledge base to obtain each text block;
In particular, the knowledge base can be obtained by using a language model to rapidly process and analyze a large number of documents related to mineral resources. The content of the knowledge base is partitioned into smaller, easily handled text blocks according to some logic or rule, which facilitates the extraction and processing of information in subsequent steps. In one example, natural paragraphs may be identified through format features and used as text blocks. In another specific example, the text blocks can be divided using a vector model, so that text is aggregated by meaning. This avoids, to a certain extent, text blocks that cut a passage off mid-meaning, and it avoids the division errors caused by the wide variety of formats, typesetting, and punctuation styles used to mark paragraphs.
S120, obtaining an entity of a knowledge base based on each text block;
Specifically, key entities are extracted from each text block. An entity may be a noun phrase, a person name, a place name, an organization name, or the like. Further, in this step the entities are extracted one by one, text block by text block, using the comprehension capability of a large language model, to obtain entity text. After extraction, the entity text is cleaned a first time by regular-expression matching and then a second time by a vector model, so that duplicate information is eliminated and the final entities are obtained. Entity extraction from the knowledge base determines the division of communities.
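The two-pass cleaning described above can be sketched as follows. This is a minimal illustration, not the method's actual implementation: the function names, the regular expression, and the 0.9 threshold are assumptions, and `difflib.SequenceMatcher` stands in for the vector model, which the text does not specify.

```python
import re
from difflib import SequenceMatcher


def regex_clean(raw: str) -> str:
    """First pass: strip punctuation, collapse whitespace, lowercase."""
    text = re.sub(r"[^\w\s-]", "", raw)
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(names: list[str], threshold: float = 0.9) -> list[str]:
    """Second pass: drop names that are near-duplicates of one already kept.

    String similarity is only a stand-in (assumption) for comparing
    embedding vectors produced by a vector model.
    """
    kept: list[str] = []
    for name in names:
        if all(SequenceMatcher(None, name, k).ratio() < threshold for k in kept):
            kept.append(name)
    return kept


raw_entities = ["Chalcopyrite ", "chalcopyrite.", "Chalcopyrites", "Quartz"]
entities = deduplicate([regex_clean(e) for e in raw_entities])
```

With these inputs the three spellings of chalcopyrite collapse into one entity, and quartz is kept as a separate one.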
S130, constructing a first entity relationship diagram according to each entity and the relationship among the entities;
In particular, the relationships between entities in a text block are analyzed; a relationship may be identified by co-occurrence, semantic similarity, or specific relationship templates. A preliminary entity relationship diagram is then constructed from the identified entities and their relationships, in which nodes represent entities and edges represent relationships.
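As an illustration of the co-occurrence option only, the sketch below builds an undirected graph in which two entities are linked if they appear in the same text block. The function name and the substring-based matching are assumptions made for the example; semantic similarity and relationship templates are not covered.

```python
from itertools import combinations


def build_entity_graph(blocks: list[str], entities: list[str]) -> dict[frozenset, int]:
    """Return {frozenset({a, b}): number of text blocks in which a and b co-occur}."""
    edges: dict[frozenset, int] = {}
    for block in blocks:
        lowered = block.lower()
        # Simple substring matching (assumption); a real system would use the
        # entity mentions produced by the extraction step.
        present = sorted(e for e in entities if e.lower() in lowered)
        for a, b in combinations(present, 2):
            key = frozenset((a, b))
            edges[key] = edges.get(key, 0) + 1
    return edges


blocks = [
    "Chalcopyrite is the main copper ore in porphyry deposits.",
    "Porphyry deposits also host molybdenite.",
]
graph = build_entity_graph(blocks, ["chalcopyrite", "porphyry", "molybdenite"])
```

Here chalcopyrite and molybdenite never share a block, so the graph contains only the two edges that pass through porphyry.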
S140, acquiring a first distance between any two entities in each text block, and confirming a second distance between any two entities in a knowledge base based on the first distance;
The first distance refers to the distance between two entities within a text block, and the second distance refers to the sum of the distances between the two entities across the knowledge base. It should be noted that the relevance between two entities increases in the following situations:
b. The two entities appear in the same article multiple times.
c. The two entities appear in the same article, in different text blocks.
d. The two entities appear in the same article, in different text blocks, and appear multiple times in each text block.
e. The two entities appear in the same article, in different text blocks, appear multiple times in each paragraph, and, within each text block, appear together in multiple sentences.
Further, the distance between any two entities may be calculated based on the word-level distance or the sentence-level distance within the text block, or may be calculated by other means known in the art.
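As one illustration of a word-level distance within a text block, the sketch below measures, for each occurrence of one entity, the number of tokens to the nearest occurrence of the other. The function name, the tokenization, and the restriction to single-word entities are assumptions made for the example.

```python
import re


def first_distances(block: str, a: str, b: str) -> list[int]:
    """Token distance from each occurrence of `a` to the nearest occurrence of `b`.

    Assumes single-word entity names; returns [] if either entity is absent.
    """
    tokens = re.findall(r"\w+", block.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    if not pos_a or not pos_b:
        return []
    return [min(abs(i - j) for j in pos_b) for i in pos_a]


block = "Gold occurs with quartz veins; gold is also found near pyrite and quartz."
distances = first_distances(block, "gold", "quartz")
```

Each value in `distances` is one first distance d_w for this pair in this block; a pair that co-occurs several times contributes several values.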
S150, obtaining the similarity of any two entities in the concept library;
In particular, in some cases two entities do not appear in the same piece of text, yet they belong to the same conceptual category, and within that category there is a certain association between them. To capture this, a specialized geoscience expertise library is introduced as the concept library. In a specific example, as shown in fig. 2, the step of obtaining the similarity of any two entities in the concept library includes: S210, acquiring the description texts corresponding to any two entities and the concept library; S220, performing vectorization calculation on the description texts corresponding to the two entities to obtain vectorized entities; S230, acquiring the vector of a target concept from the concept library based on the vectorized entities; and S240, obtaining the similarity of the two entities in the concept library based on the vectorized entities and the vector of the target concept. That is, each entity is summarized to obtain an entity summary (namely, the description text corresponding to the entity); the content of the entity summary is vectorized using an embedding model; the concepts in the concept library that are most similar to the entity summary are extracted; the concepts extracted for the two entities are compared for similarity; and finally the comparisons for all categories are combined.
Specifically, in the step of obtaining the similarity of any two entities in the concept library based on the vectorized entities and the vector of the target concept, the similarity is obtained based on the following formula:
where CD is the similarity; i is the category index of the target concept; all_c is the number of categories; the two entity symbols denote one and the other of the two vectorized entities, respectively; the two concept symbols denote the vectors of the target concept of category i extracted according to the first and the second vectorized entity, respectively; and W_ci is the weight of the target concept of category i.
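A minimal sketch of steps S230 and S240 follows. It assumes, since the formula itself is not legible here, that CD is a weighted sum over categories of the cosine similarity between the target concepts retrieved for the two entities. The function names, the toy two-dimensional vectors, and the category weights are all hypothetical.

```python
import math

Vector = list[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def nearest_concept(entity_vec: Vector, concepts: list[Vector]) -> Vector:
    """S230: pick the concept in one category that is most similar to the entity."""
    return max(concepts, key=lambda c: cosine(entity_vec, c))


def concept_similarity(e1: Vector, e2: Vector,
                       library: dict[str, list[Vector]],
                       weights: dict[str, float]) -> float:
    """S240 under the assumed form: CD = sum_i W_ci * cos(concept_i(e1), concept_i(e2))."""
    total = 0.0
    for category, concepts in library.items():
        c1 = nearest_concept(e1, concepts)
        c2 = nearest_concept(e2, concepts)
        total += weights[category] * cosine(c1, c2)
    return total


library = {"mineral": [[1.0, 0.0], [0.0, 1.0]], "process": [[1.0, 1.0]]}
weights = {"mineral": 0.7, "process": 0.3}
cd_close = concept_similarity([0.9, 0.1], [0.8, 0.3], library, weights)
cd_far = concept_similarity([0.9, 0.1], [0.1, 0.9], library, weights)
```

Two entities that map to the same concept in every category score close to the sum of the weights, while entities that map to unrelated concepts in the "mineral" category keep only the "process" contribution.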
S160, confirming the sum of similarity and the second distance as a weight value of an edge in the first entity relation diagram, wherein the edge is formed according to any two entities;
specifically, the similarity and the second distance are added to obtain a weight value of each edge in the first entity relation diagram, and the association degree between the entities is reflected.
S170, obtaining modularity according to the weight value of the edge;
The modularity is an index for measuring community division quality. It reflects the connection tightness between nodes inside the community relative to the connection tightness of nodes outside the community. In this step, the modularity of the whole entity relationship graph is calculated according to the weight value of the edge.
And S180, aggregating the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
Specifically, the nodes of the first entity relationship graph are aggregated using the modularity formula and the modularity increment formula.
Modularity:
Modularity increment:
where Σ_in represents the sum of the weights of all edges within community C, Σ_tot represents the sum of the weights of all edges pointing to nodes in community C, k_i represents the sum of the weights of all edges pointing to node i, k_i,in represents the sum of the weights of the edges between node i and the nodes in community C, and m represents the sum of the weights of all edges.
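Assuming the two formulas take the standard form used by the Louvain method, which is consistent with the symbols defined above, they can be written as follows. This is a reconstruction under that assumption rather than a quotation of the original formulas; conventions differ on whether internal edge weights are counted once or twice, which changes the factor on k_{i,in}.

```latex
Q = \sum_{C} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^{2} \right]

\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^{2} \right]
         - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^{2} - \left( \frac{k_i}{2m} \right)^{2} \right]
         = \frac{k_{i,in}}{2m} - \frac{\Sigma_{tot}\, k_i}{2m^{2}}
```

In the simplified form, the first term rewards strong links between node i and community C, while the second term penalizes moves in which node i or community C already carries a large total weight.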
In S160, the weight of each edge has been obtained, and the sum m of the weights of all edges follows from the weight of each edge. The node with the largest ΔQ among all nodes is then found and aggregated into a community. In the first step of community aggregation, each node is an independent community, so Σ_tot of such a community is the sum of the weights of the edges between that node and all other nodes.
From the ΔQ formula, whether aggregation takes place depends on two quantities: k_i,in, the weight between the target node and the initial node, and a second term that grows with k_i. The difference between these two quantities determines the chance of forming a community. Since k_i represents the sum of the weights of all edges pointing to node i, i.e., the target node, the initial node and the target node readily form a community when the weights between the target node and other nodes are small and the weight between the target node and the initial node is large enough. One round of iteration may group some of the nodes into communities. Each community is then treated as a new node, these new nodes continue to be aggregated with other communities according to the ΔQ formula, and the iteration is repeated until no new community can be formed, that is, until ΔQ is smaller than a certain threshold, which is usually 0. A community with many nodes has a larger k_i,in value, so it aggregates more easily with new nodes into a new community. In this way, community aggregation is gradually completed.
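The node-moving round described above can be sketched as follows. This is a simplified illustration under stated assumptions: ΔQ is obtained by evaluating weighted modularity before and after each candidate move instead of through a closed-form increment, only the first phase is shown (communities are not collapsed into new nodes), and the function names and example weights are hypothetical.

```python
from collections import defaultdict

Edge = tuple[str, str]


def modularity(weights: dict[Edge, float], community: dict[str, int]) -> float:
    """Weighted modularity: sum over communities of internal/m - (degree/2m)^2."""
    m = sum(weights.values())
    internal = defaultdict(float)
    degree = defaultdict(float)
    for (a, b), w in weights.items():
        degree[community[a]] += w
        degree[community[b]] += w
        if community[a] == community[b]:
            internal[community[a]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def aggregate(weights: dict[Edge, float], threshold: float = 0.0) -> dict[str, int]:
    """Move each node to the neighbouring community with the largest gain above threshold."""
    nodes = sorted({n for edge in weights for n in edge})
    community = {n: i for i, n in enumerate(nodes)}  # every node starts alone
    neighbours = defaultdict(set)
    for a, b in weights:
        neighbours[a].add(b)
        neighbours[b].add(a)
    improved = True
    while improved:
        improved = False
        for node in nodes:
            base = modularity(weights, community)
            best_gain, best_target = threshold, community[node]
            targets = {community[n] for n in neighbours[node]} - {community[node]}
            for target in sorted(targets):
                trial = dict(community)
                trial[node] = target
                gain = modularity(weights, trial) - base  # delta Q of this move
                if gain > best_gain:
                    best_gain, best_target = gain, target
            if best_target != community[node]:
                community[node] = best_target
                improved = True
    return community


# Two tightly linked triangles joined by one weak edge.
weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
           ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
           ("c", "d"): 0.1}
result = aggregate(weights)
```

Recomputing modularity for every candidate move is slow on large graphs; it is used here only because it makes the gain unambiguous. The loop stops when no move raises modularity by more than the threshold, which matches the stopping rule of ΔQ falling below a threshold that is usually 0.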
The community discovery method of the mineral knowledge graph gathers discrete knowledge base documents, reduces the dependence on a high-performance large model, and improves the capability of answering questions. Compared with the means of searching weight values from prompt words in the traditional technology, the method provides a quantification mode of weight analysis, reduces performance requirements and obtains better community distinction. Finally, the aggregation of communities and the recombination of community data are realized through modularity, a logical basis is provided for the formation of the geochemical communities, and the uncertainty of large model output is reduced.
In one embodiment, as shown in fig. 3, the step of confirming the second distance of any two entities in the knowledge base based on the first distance includes:
S310, setting a truncation function;
In particular, the purpose of the truncation function is to exclude a distance when d_w is greater than a preset threshold. Since the preceding summation already takes into account that the entities appear in different text blocks, the truncation function is needed to eliminate interference from entity pairs that are far apart in the text.
S320, obtaining the second distance based on the truncation function and the first distance.
In one embodiment, in the step of obtaining the second distance based on the truncation function and the first distance, the second distance is obtained based on the following formula:
where WD is the second distance, d_w is the first distance, δ_c is the truncation function, and all_i is the maximum number of times any two entities occur together within the preset range.
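A minimal sketch of S310 and S320 follows. It assumes that the truncation function is a 0/1 indicator and that WD sums the surviving terms over all co-occurrences of the pair. Whether each term is the raw first distance or some transformation of it cannot be determined from the text, so the `term` parameter is a hypothetical hook that defaults to the raw distance.

```python
from typing import Callable


def truncation(d_w: float, cutoff: float) -> int:
    """delta_c (assumed form): 1 if the first distance is within the cutoff, else 0."""
    return 1 if d_w <= cutoff else 0


def second_distance(first_distances: list[float], cutoff: float,
                    term: Callable[[float], float] = lambda d: d) -> float:
    """Assumed form: WD = sum over co-occurrences of delta_c(d_w) * term(d_w)."""
    return sum(truncation(d, cutoff) * term(d) for d in first_distances)


# Three co-occurrences of one entity pair; the last is too far apart to count.
wd = second_distance([3, 2, 40], cutoff=10)
wd_reciprocal = second_distance([4, 2, 40], cutoff=10, term=lambda d: 1 / d)
```

Passing a decreasing `term` such as a reciprocal makes closer co-occurrences contribute more; this is offered only as an alternative reading, not as the method's definition.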
Further, the weight W_l of an edge in the first entity relationship diagram is the sum of the similarity and the second distance: W_l = CD + WD.
In one embodiment, the method further comprises the steps of:
acquiring description texts corresponding to communities;
Summarizing the description text corresponding to each community to obtain community summarization.
Specifically, after the knowledge base is abstracted and divided into a plurality of communities, the initially aggregated communities are formed by combining entities only, and text description is lacking, so that subsequent operations such as retrieval enhancement generation and the like are difficult to perform. To solve this problem, descriptive text of each entity is extracted. On the basis, the text description of all entities in the communities is deeply summarized and integrated by utilizing the capability of the large model, so that the generalized description of each community is formed. In this way, each community has own exclusive descriptive text, and a solid foundation and convenience are provided for subsequent knowledge graph searching. The method not only enriches the information content of communities, but also greatly improves the practicability and operability of the knowledge graph.
In one embodiment, as shown in fig. 4, the method further comprises the steps of:
S410, receiving a request of a user;
S420, converting the request into a request vector;
S430, performing similarity matching between the request vector and the community summaries to obtain a target community;
S440, extracting the entities matched with the request vector in the target community.
Specifically, the aggregated communities and their corresponding entities form the core structure of the knowledge graph, in which communities serve as tree nodes and entities serve as leaf nodes, giving a clear hierarchy and clear logic. When performing a RAG (Retrieval-Augmented Generation) operation, we first convert the user's query request into a request vector using a vector model, and then match this vector against the community summaries by cosine similarity, thereby extracting the communities that are highly similar to the user's request. Next, within these selected communities, we apply similarity matching again to extract the entities that best fit the user's request.
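The two-stage matching in S430 and S440 can be sketched as follows. The vectors are toy values that stand in for the output of a vector model, and the function name, the single target community, and the `top_entities` parameter are assumptions made for the example.

```python
import math

Vector = list[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def retrieve(request_vec: Vector,
             community_summaries: dict[str, Vector],
             community_entities: dict[str, dict[str, Vector]],
             top_entities: int = 2) -> tuple[str, list[str]]:
    """Stage 1: pick the best-matching community. Stage 2: rank its entities."""
    target = max(community_summaries,
                 key=lambda c: cosine(request_vec, community_summaries[c]))
    ranked = sorted(community_entities[target].items(),
                    key=lambda item: cosine(request_vec, item[1]),
                    reverse=True)
    return target, [name for name, _ in ranked[:top_entities]]


summaries = {"copper": [1.0, 0.0, 0.0], "lithium": [0.0, 1.0, 0.0]}
entities = {
    "copper": {"chalcopyrite": [0.9, 0.1, 0.0],
               "porphyry": [0.6, 0.0, 0.8],
               "smelting": [0.2, 0.1, 0.9]},
    "lithium": {"spodumene": [0.0, 1.0, 0.0]},
}
community, matches = retrieve([0.8, 0.1, 0.1], summaries, entities)
```

Restricting the second stage to the target community keeps the entity search small, which is the practical benefit of matching against community summaries first.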
Compared to naive RAG operations, or even RAG equipped with a reranker, our approach can extract text data more broadly, significantly improving the quality and depth of a large model's answers to complex questions. In addition, the aggregation of communities not only improves the efficiency and accuracy of data retrieval, but also opens up a new line of thinking for researchers, helping them to mine the underlying patterns and internal links of their research objects. The method thus improves the application experience of the knowledge graph and brings new opportunities and possibilities to academic research and data analysis.
In one embodiment, as shown in fig. 5, the step of performing text blocking on the knowledge base to obtain each text block includes:
S510, splitting the knowledge base to obtain a plurality of sentences;
S520, vectorizing each sentence to obtain a plurality of vectors;
S530, classifying adjacent sentences into the same text block or different text blocks based on cosine similarity of vectors of the adjacent sentences.
Specifically, the knowledge base text is first split into sentences, each pair of adjacent sentences is vectorized by an embedding model, and the cosine similarity of the two vectors is calculated. A preset threshold is used to determine whether the two sentences are similar. If they are similar, the two sentences are aggregated into the same text block; if they are dissimilar, the later sentence starts a new text block. These steps are repeated until all text blocks are formed. In this way, the text blocks are aggregated according to semantics, which avoids, to a certain extent, text blocks that cut a passage off mid-meaning.
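Steps S510 to S530 can be sketched as follows. A bag-of-words count vector stands in for the embedding model, which the text does not name; the function names, the sentence-splitting rule, and the 0.2 threshold are likewise assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in (assumption) for an embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def split_into_blocks(text: str, threshold: float = 0.2) -> list[list[str]]:
    """S510: split into sentences. S520/S530: join a sentence to the current
    block if it is similar enough to the previous sentence, else start a new block."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    blocks: list[list[str]] = []
    for sentence in sentences:
        if blocks and cosine(embed(blocks[-1][-1]), embed(sentence)) >= threshold:
            blocks[-1].append(sentence)
        else:
            blocks.append([sentence])
    return blocks


text = ("Copper ore occurs in porphyry deposits. "
        "Porphyry copper deposits form near subduction zones. "
        "Lithium prices rose sharply last year.")
blocks = split_into_blocks(text)
```

The first two sentences share enough vocabulary to stay together, while the third starts a new block. With a real embedding model the threshold would need to be tuned, since dense embeddings rarely give a similarity of exactly zero.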
In one embodiment, a community discovery device for a mineral knowledge graph is provided, including:
the blocking module is used for acquiring a knowledge base, and performing text blocking on the knowledge base to obtain each text block;
the entity extraction module is used for obtaining the entity of the knowledge base based on each text block;
the building module is used for building a first entity relation diagram according to each entity and the relation among the entities;
The distance acquisition module is used for acquiring a first distance between any two entities in each text block and confirming a second distance between any two entities in the knowledge base based on the first distance;
The similarity acquisition module is used for acquiring the similarity of any two entities in the concept library;
The weight value calculation module is used for confirming the sum of the similarity and the second distance as the weight value of the edge in the first entity relation diagram, wherein the edge is formed according to any two entities;
the modularity calculation module is used for obtaining the modularity according to the weight value of the edge;
And the aggregation module is used for aggregating the nodes of the first entity relation graph based on the increment of the modularity to obtain a plurality of communities.
For the specific limitations of the community discovery device for the mineral knowledge graph, reference may be made to the limitations of the community discovery method for the mineral knowledge graph above, which are not repeated here. The modules in the community discovery device may be implemented fully or partially in software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules. It should be noted that, in the embodiments of the present application, the division of the modules is schematic and is merely a division by logical function; other divisions are possible in actual implementation.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring a knowledge base, and performing text blocking on the knowledge base to obtain each text block;
based on each text block, obtaining an entity of a knowledge base;
constructing a first entity relationship diagram according to each entity and the relationship among the entities;
Acquiring a first distance between any two entities in each text block, and confirming a second distance between any two entities in a knowledge base based on the first distance;
obtaining the similarity of any two entities in a concept library;
Confirming the sum of the similarity and the second distance as a weight value of an edge in the first entity relation diagram, wherein the edge is formed according to any two entities;
Obtaining modularity according to the weight value of the edge;
And based on the increment of the modularity, aggregating the nodes of the first entity relation graph to obtain a plurality of communities.
In one embodiment, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a knowledge base, and performing text blocking on the knowledge base to obtain each text block;
based on each text block, obtaining an entity of a knowledge base;
constructing a first entity relationship diagram according to each entity and the relationship among the entities;
Acquiring a first distance between any two entities in each text block, and confirming a second distance between any two entities in a knowledge base based on the first distance;
obtaining the similarity of any two entities in a concept library;
Confirming the sum of the similarity and the second distance as a weight value of an edge in the first entity relation diagram, wherein the edge is formed according to any two entities;
Obtaining modularity according to the weight value of the edge;
And based on the increment of the modularity, aggregating the nodes of the first entity relation graph to obtain a plurality of communities.
When the embodiment of the application is specifically implemented, the above embodiments can be referred to, and the application has corresponding technical effects.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.