CN120144780B - Community discovery method, device and storage medium for mineral knowledge graph - Google Patents

Community discovery method, device and storage medium for mineral knowledge graph

Info

Publication number
CN120144780B
Authority
CN
China
Prior art keywords
entities
distance
similarity
community
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510214034.9A
Other languages
Chinese (zh)
Other versions
CN120144780A (en
Inventor
赵汀
谭贺元
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Mineral Resources of Chinese Academy of Geological Sciences
Original Assignee
Institute of Mineral Resources of Chinese Academy of Geological Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Mineral Resources of Chinese Academy of Geological Sciences filed Critical Institute of Mineral Resources of Chinese Academy of Geological Sciences
Priority to CN202510214034.9A priority Critical patent/CN120144780B/en
Publication of CN120144780A publication Critical patent/CN120144780A/en
Application granted granted Critical
Publication of CN120144780B publication Critical patent/CN120144780B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/02 Agriculture; Fishing; Forestry; Mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Husbandry (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Mining & Mineral Resources (AREA)
  • Marine Sciences & Fisheries (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Tourism & Hospitality (AREA)
  • Computing Systems (AREA)
  • Agronomy & Crop Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a community discovery method, device, and storage medium for a mineral knowledge graph. The community discovery method aggregates discrete knowledge base documents, reducing reliance on high-performance large models and improving the ability to answer questions. Compared with the traditional practice of arriving at weight values by trial and error with prompts, the method provides a quantitative approach to weight analysis, which lowers performance requirements and yields better separation of communities. Finally, modularity is used to aggregate communities and reorganize community data, providing a logical basis for the formation of geoscience communities and reducing the uncertainty of large model output.

Description

Community discovery method, device and storage medium for mineral knowledge graph
Technical Field
The invention relates to the field of computer information processing, and in particular to a community discovery method, device and storage medium for a mineral knowledge graph.
Background
A large language model can rapidly process and analyze massive numbers of documents related to mineral resources, accurately extract key information, and provide strong support for decision makers. Specifically, a large language model can help researchers quickly sort out core data such as the distribution, reserves, and current exploitation status of mineral resources, so that a detailed knowledge graph of the mineral resources field can be constructed. Through intelligent analysis, the model can identify and distinguish the direct and indirect relationships in the knowledge graph, that is, perform community discovery, which helps decision makers grasp the overall situation and provides technical support for the intelligent management of mineral resources.
However, the inventors found that traditional methods for constructing knowledge graphs of mineral resources suffer from problems such as heavy dependence on computing power and limited applicability.
Disclosure of Invention
Based on the above, it is necessary to provide a community discovery method, device and storage medium for a mineral knowledge graph that can obtain communities accurately and in real time, with little dependence on computing power and good generalization.
To achieve the above object, in one aspect, an embodiment of the present application provides a community discovery method for a mineral knowledge graph, including the steps of:
acquiring a knowledge base, and performing text blocking on the knowledge base to obtain text blocks;
obtaining the entities of the knowledge base based on each text block;
constructing a first entity relationship graph according to the entities and the relationships among them;
acquiring a first distance between any two entities in each text block, and determining a second distance between the two entities in the knowledge base based on the first distance;
obtaining the similarity of the two entities in a concept library;
taking the sum of the similarity and the second distance as the weight value of an edge in the first entity relationship graph, wherein the edge is formed by the two entities;
obtaining the modularity according to the weight values of the edges; and
aggregating the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
In one embodiment, the step of determining the second distance of any two entities in the knowledge base based on the first distance includes:
setting a truncation function; and
obtaining the second distance based on the truncation function and the first distance.
In one embodiment, in the step of obtaining the second distance based on the truncation function and the first distance, the second distance is obtained based on the following formula:
where $WD$ is the second distance, $d_w$ is the first distance, $\delta_c$ is the truncation function, and $all_i$ is the maximum number of co-occurrences of the two entities within the preset range.
In one embodiment, the step of obtaining the similarity of any two entities in the concept library includes:
acquiring the description texts corresponding to the two entities and the concept library;
performing vectorization on the description texts corresponding to the two entities to obtain vectorized entities;
acquiring the vectors of the target concepts from the concept library based on the vectorized entities; and
obtaining the similarity of the two entities in the concept library based on the vectorized entities and the vectors of the target concepts.
In one embodiment, in the step of obtaining the similarity of any two entities in the concept library based on the vectorized entities and the vectors of the target concepts, the similarity is obtained based on the following formula:
where $CD$ is the similarity, $i$ is the category index of the target concept, and $allc$ is the number of categories. The remaining symbols denote, in order: one of the two vectorized entities; the other vectorized entity; the vector of the target concept of category $i$ extracted for the first vectorized entity; and the vector of the target concept of category $i$ extracted for the second vectorized entity. $W_{ci}$ is the weight of the target concept of category $i$.
In one embodiment, the method further comprises the steps of:
acquiring the description text corresponding to each community; and
summarizing the description text corresponding to each community to obtain community summaries.
In one embodiment, the method further comprises the steps of:
receiving a request from a user;
converting the request into a request vector;
performing similarity matching between the request vector and the community summaries to obtain a target community; and
extracting the entities that match the request vector from the target community.
In one embodiment, the step of performing text blocking on the knowledge base to obtain text blocks includes:
splitting the knowledge base to obtain a plurality of sentences;
vectorizing each sentence to obtain a plurality of vectors; and
classifying adjacent sentences into the same text block or different text blocks based on the cosine similarity of their vectors.
In another aspect, an embodiment of the present invention provides a community discovery device for a mineral knowledge graph, including:
a blocking module, used for acquiring a knowledge base and performing text blocking on the knowledge base to obtain text blocks;
an entity extraction module, used for obtaining the entities of the knowledge base based on each text block;
a building module, used for constructing a first entity relationship graph according to the entities and the relationships among them;
a distance acquisition module, used for acquiring a first distance between any two entities in each text block and determining a second distance between the two entities in the knowledge base based on the first distance;
a similarity acquisition module, used for obtaining the similarity of the two entities in the concept library;
a weight value calculation module, used for taking the sum of the similarity and the second distance as the weight value of an edge in the first entity relationship graph, wherein the edge is formed by the two entities;
a modularity calculation module, used for obtaining the modularity according to the weight values of the edges; and
an aggregation module, used for aggregating the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
In yet another aspect, the application provides a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of the above method when run.
One of the above technical solutions has the following advantages and beneficial effects:
The community discovery method for a mineral knowledge graph aggregates discrete knowledge base documents, reduces reliance on high-performance large models, and improves question-answering ability. Compared with the traditional practice of arriving at weight values by trial and error with prompts, the method provides a quantitative approach to weight analysis, which lowers performance requirements and yields better separation of communities. Finally, modularity is used to aggregate communities and reorganize community data, which provides a logical basis for the formation of geoscience communities and reduces the uncertainty of large model output.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it will be apparent to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort.
FIG. 1 is a first schematic flow chart of a community discovery method for a mineral knowledge graph in one embodiment;
FIG. 2 is a schematic flow chart of the step of obtaining the similarity of any two entities in the concept library in one embodiment;
FIG. 3 is a schematic flow chart of the step of determining the second distance of any two entities in the knowledge base based on the first distance in one embodiment;
FIG. 4 is a second schematic flow chart of a community discovery method for a mineral knowledge graph in one embodiment;
FIG. 5 is a schematic flow chart of the step of performing text blocking on the knowledge base to obtain text blocks in one embodiment.
Detailed Description
In order that the application may be readily understood, a more complete description of the application will be rendered by reference to the appended drawings. Embodiments of the application are illustrated in the accompanying drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In the following description, suffixes such as "module", "component", or "unit" for representing elements are used only for facilitating the description of the present application, and are not of specific significance per se. Thus, "module" and "component" may be used in combination.
It is to be understood that in the following embodiments, "connected" is understood to mean "electrically connected", "communicatively connected", etc., if the connected circuits, modules, units, etc., have electrical or data transfer between them.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," and/or the like, specify the presence of stated features, integers, steps, operations, elements, components, or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or groups thereof.
Currently, traditional means of constructing knowledge graphs for mineral resources rely mainly on relational databases (e.g., trade databases) or on small-scale data sets and small-scale deep learning models. The inherent limitations of these methods lead to the following problems:
1. Limited cross-text entity relationship extraction: these methods have difficulty accurately capturing and establishing relationships between core knowledge points when processing information that spans sentences, paragraphs, or documents. This drawback hampers the development of general-purpose knowledge graphs and makes it difficult to comprehensively extract key information from many documents.
2. Delayed data updates: because training a deep learning model is complex and time-consuming and the construction process is complicated, data are updated slowly. A batch of data often takes a long time to go from collection, through model training, to actual use.
3. Insufficient model generality: because traditional methods usually adopt small models, their range of application is often limited to a specific field. With large models continually emerging, small models have difficulty meeting the requirements of constructing cross-domain, general-purpose knowledge graphs.
4. Application development challenges: once knowledge graph construction is complete, subsequent application development is also difficult. In particular, when implementing natural-language question-answering queries, the technical limitations of conventional methods often make it hard to develop applications that meet user needs quickly and efficiently.
In summary, traditional methods for constructing mineral resource knowledge graphs have significant defects in cross-text relationship extraction, data timeliness, model generality, and application development. The community discovery method for a mineral knowledge graph described here can effectively address these problems.
In one embodiment, as shown in fig. 1, there is provided a community discovery method for a mineral knowledge graph, including the steps of:
S110, acquiring a knowledge base, and performing text blocking on the knowledge base to obtain text blocks;
Specifically, the knowledge base can be obtained by using a language model to rapidly process and analyze a large number of documents related to mineral resources. The content of the knowledge base is partitioned into smaller, easily handled text blocks according to some logic or rule, which facilitates the extraction and processing of information in subsequent steps. In one example, natural paragraphs may be identified from format features and used as text blocks. In another specific example, the text blocks can be divided using a vector model, so that text is aggregated according to meaning. This avoids, to a certain extent, text blocks that cut a passage off mid-meaning, and it avoids the division errors caused by the wide variety of formats, typesetting, and punctuation styles used to mark paragraphs.
S120, obtaining an entity of a knowledge base based on each text block;
Specifically, key entities are extracted from each text block. An entity may be a noun phrase, a person name, a place name, an organization name, or the like. In this step, the understanding capability of the large language model is used to extract entities block by block, yielding the entity text. After extraction, the entity text is cleaned in a first pass by regular-expression matching and then in a second pass by a vector model, so that repeated information is eliminated and the final entities are obtained. Entity extraction from the knowledge base determines how communities are divided.
S130, constructing a first entity relationship graph according to each entity and the relationships among the entities;
Specifically, the relationships between entities in a text block are analyzed; they may be identified by co-occurrence, semantic similarity, or specific relationship templates. A preliminary entity relationship graph is then constructed from the identified entities and their relationships, in which nodes represent entities and edges represent relationships.
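As an illustration of the co-occurrence option mentioned above, the following is a minimal sketch that links two entities whenever they are mentioned in the same text block. The function name `build_entity_graph`, the case-insensitive substring matching, and the sample blocks are assumptions made for this example; the description does not specify how co-occurrence is detected.

```python
from collections import defaultdict
from itertools import combinations

def build_entity_graph(blocks, entities):
    """Return {(a, b): count} for entity pairs mentioned in the same block."""
    edges = defaultdict(int)
    for block in blocks:
        text = block.lower()
        # Assumption: an entity is "present" if its name occurs as a substring.
        present = sorted(e for e in entities if e.lower() in text)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return dict(edges)

# Illustrative data only.
blocks = [
    "The Bayan Obo deposit hosts rare earth elements and iron ore.",
    "Iron ore is smelted into steel.",
]
entities = ["Bayan Obo", "rare earth elements", "iron ore", "steel"]
graph = build_entity_graph(blocks, entities)
```

Each key of `graph` is an edge of the preliminary graph; semantic similarity or relationship templates could add further edges in the same structure.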
S140, acquiring a first distance between any two entities in each text block, and confirming a second distance between any two entities in a knowledge base based on the first distance;
Here, the first distance is the distance between two entities within a given text block, and the second distance is the sum of the distances between the two entities across the knowledge base. It should be noted that the relevance of two entities increases in the following cases:
b. The two entities appear in the same article multiple times.
c. The two entities appear in the same article, in different text blocks.
d. The two entities appear in the same article, in different text blocks, and appear multiple times in each text block.
e. The two entities appear in the same article, in different text blocks, appear multiple times in each paragraph, and within each text block appear together in multiple sentences.
Further, the distance between any two entities may be calculated from the word-level or sentence-level distance within the text block, or by other means known in the art.
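A word-level first distance can be sketched as follows. This is only one possible reading: the helper names, the whitespace tokenization, and the choice of the smallest gap between mentions are assumptions for illustration, since the description leaves the exact distance measure open.

```python
def entity_positions(tokens, entity):
    """Token indices at which the entity's word sequence starts."""
    parts = entity.lower().split()
    n = len(parts)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == parts]

def first_distance(block, a, b):
    """Smallest token gap between mentions of a and b in one text block.

    Returns None when either entity is absent from the block.
    """
    tokens = [t.strip(".,;:()") for t in block.lower().split()]
    pos_a = entity_positions(tokens, a)
    pos_b = entity_positions(tokens, b)
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

block = ("Copper occurs with gold in porphyry deposits; "
         "gold is recovered as a by-product of copper.")
d_w = first_distance(block, "copper", "gold")
```

A sentence-level variant would count sentence boundaries between the two mentions instead of tokens.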
S150, obtaining the similarity of any two entities in the concept library;
Specifically, in some cases two entities do not appear in the same piece of text but both belong to a certain category of concepts, and within that category the two entities are associated to some degree. To capture this, a specialized geoscience knowledge library is introduced as the concept library. In a specific example, as shown in fig. 2, the step of obtaining the similarity of any two entities in the concept library includes: S210, acquiring the description texts corresponding to the two entities and the concept library; S220, performing vectorization on the description texts corresponding to the two entities to obtain vectorized entities; S230, acquiring the vectors of the target concepts from the concept library based on the vectorized entities; and S240, obtaining the similarity of the two entities in the concept library based on the vectorized entities and the vectors of the target concepts. That is, each entity is summarized to obtain an entity summary (the description text corresponding to the entity); the content of the entity summary is vectorized with an embedding model; the concepts in the concept library that are most similar to the entity summary are extracted; the concepts extracted for the two entities are compared for similarity; and finally the comparisons over all categories are combined.
Specifically, in the step of obtaining the similarity of any two entities in the concept library based on the vectorized entities and the vectors of the target concepts, the similarity is obtained based on the following formula:
where $CD$ is the similarity, $i$ is the category index of the target concept, and $allc$ is the number of categories. The remaining symbols denote, in order: one of the two vectorized entities; the other vectorized entity; the vector of the target concept of category $i$ extracted for the first vectorized entity; and the vector of the target concept of category $i$ extracted for the second vectorized entity. $W_{ci}$ is the weight of the target concept of category $i$.
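The steps S210 to S240 can be sketched as below. The exact formula is not available in this text, so the combination used here, a weighted sum over categories of the cosine similarity between the two extracted target concepts, is an assumption that follows the verbal description; the function names, the two-dimensional vectors, and the category names are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def target_concept(entity_vec, concept_vecs):
    """Concept vector of one category that is closest to the entity."""
    return max(concept_vecs, key=lambda c: cosine(entity_vec, c))

def concept_similarity(e1, e2, concept_library, weights):
    """Assumed form: sum over categories of W_ci * cos(concept_1, concept_2)."""
    total = 0.0
    for category, concept_vecs in concept_library.items():
        c1 = target_concept(e1, concept_vecs)
        c2 = target_concept(e2, concept_vecs)
        total += weights[category] * cosine(c1, c2)
    return total

# Toy concept library: two categories, two concept vectors each.
concept_library = {
    "deposit_type": [[1.0, 0.0], [0.0, 1.0]],
    "commodity": [[1.0, 1.0], [1.0, -1.0]],
}
weights = {"deposit_type": 0.6, "commodity": 0.4}
cd_close = concept_similarity([0.9, 0.1], [0.8, 0.3], concept_library, weights)
cd_far = concept_similarity([0.9, 0.1], [0.1, 0.9], concept_library, weights)
```

In the toy data the first pair maps to the same concept in both categories, while the second pair shares a concept only in the `commodity` category, so it scores lower.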
S160, taking the sum of the similarity and the second distance as the weight value of an edge in the first entity relationship graph, wherein the edge is formed by any two entities;
Specifically, the similarity and the second distance are added to obtain the weight value of each edge in the first entity relationship graph, which reflects the degree of association between the entities.
S170, obtaining modularity according to the weight value of the edge;
The modularity is an index for measuring community division quality. It reflects the connection tightness between nodes inside the community relative to the connection tightness of nodes outside the community. In this step, the modularity of the whole entity relationship graph is calculated according to the weight value of the edge.
S180, aggregating the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
Specifically, the nodes of the first entity relationship graph are aggregated using a modularity formula and a modularity increment formula.
Modularity:
Modularity increment:
where $\Sigma_{in}$ is the sum of the weights of all edges within community $C$, $\Sigma_{tot}$ is the sum of the weights of all edges incident to community $C$, $k_i$ is the sum of the weights of all edges incident to node $i$, $k_{i,in}$ is the sum of the weights of the edges between node $i$ and community $C$, and $m$ is the sum of the weights of all edges.
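As a hedged reference point, assuming the method uses the standard weighted modularity of Louvain-style community detection (the symbol definitions above match that formulation, but this is an assumption rather than something the text confirms), the two quantities can be written as follows. $A_{ij}$ (the weight of the edge between nodes $i$ and $j$), $c_i$ (the community of node $i$), and $\delta$ (1 when its two arguments are equal, 0 otherwise) are notation introduced here. The second line is the gain obtained when a node $i$ that currently forms its own community is moved into community $C$, with $\Sigma_{tot}$ taken as the sum of $k_j$ over the nodes $j$ of $C$.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

\Delta Q = \frac{k_{i,in}}{m} - \frac{\Sigma_{tot}\, k_i}{2 m^{2}}
```

Under this form the gain is positive only when the weight linking node $i$ to $C$ outweighs the weight expected from the total weights of $i$ and $C$ alone.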
The weight of each edge has been obtained in S160, and the sum $m$ of the weights of all edges is computed from these. Then, among all nodes, the node with the largest $\Delta Q$ is found and aggregated into a community. In the first step of community aggregation, each node is an independent community, so $\Sigma_{tot}$ is the sum of the weights of the relationships between a given node and all other nodes, namely the sum of the weights of the edges it forms with each of the other nodes:
From the $\Delta Q$ formula, whether aggregation takes place depends on two quantities: the weight $k_{i,in}$ between the target node and the initial node, and a second term that depends on $k_i$. The difference between the two determines the chance of forming a community. Since $k_i$ is the sum of the weights of all edges incident to node $i$, i.e., the target node, the initial node and the target node readily form a community when the weight of the relationships between the target node and other nodes is small and the weight of its relationship with the initial node is large enough. One round of iteration may group some of the nodes into communities. These communities are then treated as new nodes, and the aggregation continues according to the $\Delta Q$ formula between them and the other communities. The iteration is repeated until no new community can be formed, that is, until $\Delta Q$ is smaller than a certain threshold, which is usually 0. A community with many nodes has larger $k_{i,in}$ values, so it aggregates more easily with new nodes into a new community. In this way the communities gradually complete the aggregation.
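The first level of this aggregation can be sketched as follows. This is a minimal sketch assuming the standard Louvain gain for placing a removed node $i$ into community $C$, $\Delta Q = k_{i,in}/m - \Sigma_{tot} k_i / (2m^2)$, which the text does not confirm; the function name `local_moving`, the tie-breaking rule, and the sample graph are illustrative. Merging the resulting communities into new nodes and repeating is not shown.

```python
from collections import defaultdict

def local_moving(edges):
    """First-level aggregation; returns sorted lists of community members."""
    adj = defaultdict(dict)
    for (a, b), w in edges.items():
        adj[a][b] = w
        adj[b][a] = w
    m = sum(edges.values())                       # sum of all edge weights
    k = {n: sum(nbrs.values()) for n, nbrs in adj.items()}  # k_i per node
    community = {n: n for n in adj}               # each node starts alone
    sigma_tot = dict(k)                           # Sigma_tot per community
    moved = True
    while moved:
        moved = False
        for node in sorted(adj):
            current = community[node]
            sigma_tot[current] -= k[node]         # temporarily remove the node
            k_in = defaultdict(float)             # k_i,in per neighbouring community
            for nbr, w in adj[node].items():
                k_in[community[nbr]] += w

            def gain(c):
                # Assumed Louvain gain of placing the node in community c.
                return k_in[c] / m - sigma_tot[c] * k[node] / (2 * m * m)

            best = max(sorted(set(k_in) | {current}), key=gain)
            if gain(best) - gain(current) <= 1e-12:
                best = current                    # no real improvement: stay put
            sigma_tot[best] += k[node]
            if best != current:
                community[node] = best
                moved = True
    groups = defaultdict(list)
    for node, c in community.items():
        groups[c].append(node)
    return sorted(sorted(members) for members in groups.values())

# Two tightly linked triangles joined by one weak edge (illustrative weights).
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
         ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
         ("c", "d"): 0.1}
communities = local_moving(edges)
```

On the sample graph the weak edge between `c` and `d` is not enough to pull the two triangles together, so they end up as separate communities.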
The community discovery method for a mineral knowledge graph aggregates discrete knowledge base documents, reduces reliance on high-performance large models, and improves question-answering ability. Compared with the traditional practice of arriving at weight values by trial and error with prompts, the method provides a quantitative approach to weight analysis, which lowers performance requirements and yields better separation of communities. Finally, modularity is used to aggregate communities and reorganize community data, which provides a logical basis for the formation of geoscience communities and reduces the uncertainty of large model output.
In one embodiment, as shown in fig. 3, the step of determining the second distance of any two entities in the knowledge base based on the first distance includes:
S310, setting a truncation function;
Specifically, the purpose of the truncation function is to discard a distance when $d_w$ is greater than a preset threshold. Since the summation already accounts for the entities appearing in different text blocks, the truncation function is needed to eliminate interference from entity distances that span text blocks.
S320, obtaining the second distance based on the truncation function and the first distance.
In one embodiment, in the step of obtaining the second distance based on the truncation function and the first distance, the second distance is obtained based on the following formula:
where $WD$ is the second distance, $d_w$ is the first distance, $\delta_c$ is the truncation function, and $all_i$ is the maximum number of co-occurrences of the two entities within the preset range.
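The role of the truncation function can be sketched as below. The formula itself is not available in this text, so two things are assumptions here: that $\delta_c$ is a step function (1 when $d_w$ is within the threshold, 0 otherwise), and that the surviving first distances enter the sum directly, as the phrase "sum of the distances" suggests, rather than through some decreasing transform. The function names and the threshold value are illustrative.

```python
def truncation(d_w, threshold):
    """Assumed step form of delta_c: 1 keeps the distance, 0 discards it."""
    return 1 if d_w <= threshold else 0

def second_distance(first_distances, threshold):
    """Sum over co-occurrences of the first distances that pass the cut-off."""
    return sum(truncation(d, threshold) * d for d in first_distances)

# Three co-occurrences of one entity pair; the last one spans too much text.
wd = second_distance([3, 7, 40], threshold=10)
```

With the sample values, the distance of 40 exceeds the threshold and is dropped, so only 3 and 7 contribute.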
Further, the weight $W_l$ of an edge in the first entity relationship graph is the sum of the similarity and the second distance, as stated in S160: $W_l = CD + WD$.
In one embodiment, the method further comprises the steps of:
acquiring the description text corresponding to each community; and
summarizing the description text corresponding to each community to obtain community summaries.
Specifically, after the knowledge base has been abstracted and divided into a plurality of communities, the initially aggregated communities consist only of entities and lack a text description, which makes subsequent operations such as retrieval-augmented generation difficult. To solve this problem, the description text of each entity is extracted. On this basis, the large model is used to summarize and integrate the text descriptions of all entities in a community, forming a summary description of each community. In this way, each community has its own description text, which provides a foundation for subsequent searches of the knowledge graph. This enriches the information content of the communities and improves the practicability and operability of the knowledge graph.
In one embodiment, as shown in fig. 4, the method further comprises the steps of:
S410, receiving a request of a user;
S420, converting the request into a request vector;
S430, performing similarity matching between the request vector and the community summaries to obtain a target community;
S440, extracting the entities that match the request vector from the target community.
Specifically, the aggregated communities and their entities form the core structure of the knowledge graph, in which communities serve as branch nodes of a tree and entities serve as leaf nodes, giving a clear hierarchy. When performing RAG (Retrieval-Augmented Generation), the user's query is first converted into a request vector using a vector model, and this vector is then matched by cosine similarity against the summary descriptions of the communities, so that the communities most similar to the user's request are extracted. Next, within the selected communities, similarity matching is applied again to extract the entities that best fit the user's request.
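The two-stage lookup of S410 to S440 can be sketched as follows. The vectors here are hand-written stand-ins for the output of an embedding model, and the function name `retrieve`, the dictionary layout, and the sample communities are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(request_vec, communities, top_entities=2):
    """Pick the community whose summary best matches the request,
    then rank that community's entities against the same request."""
    target = max(communities,
                 key=lambda c: cosine(request_vec, c["summary_vec"]))
    ranked = sorted(target["entities"].items(),
                    key=lambda item: cosine(request_vec, item[1]),
                    reverse=True)
    return target["name"], [name for name, _ in ranked[:top_entities]]

# Toy three-dimensional vectors standing in for real embeddings.
communities = [
    {"name": "rare earths", "summary_vec": [0.9, 0.1, 0.0],
     "entities": {"Bayan Obo": [0.8, 0.2, 0.0],
                  "bastnaesite": [0.7, 0.0, 0.3]}},
    {"name": "copper", "summary_vec": [0.0, 0.2, 0.9],
     "entities": {"porphyry": [0.1, 0.1, 0.9],
                  "chalcopyrite": [0.0, 0.3, 0.8]}},
]
name, hits = retrieve([1.0, 0.1, 0.0], communities, top_entities=1)
```

Matching against community summaries first keeps the entity-level comparison confined to one community instead of the whole graph.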
Compared to naive RAG operations, or even RAGs equipped with a reorderer, our approach can extract text data more widely, significantly improving the answer quality and depth of large models when dealing with complex questions. In addition, the aggregation of communities not only enhances the efficiency and accuracy of data retrieval, but also opens up a new thinking path for researchers, helping them to deeply mine the potential laws and internal links of the research objects. The innovative method not only optimizes the application experience of the knowledge graph, but also brings brand new revenues and possibilities to the fields of academic research and data analysis.
In one embodiment, as shown in fig. 5, the step of performing text blocking on the knowledge base to obtain each text block includes:
S510, splitting the knowledge base to obtain a plurality of sentences;
S520, vectorizing each sentence to obtain a plurality of vectors;
S530, classifying adjacent sentences into the same text block or different text blocks based on cosine similarity of vectors of the adjacent sentences.
Specifically, the knowledge base text is first segmented into sentences. Each pair of adjacent sentences is vectorized by an embedding model, and the cosine similarity of the two vectors is calculated. The similarity is compared with a predetermined threshold to decide whether the two sentences are similar. If they are similar, the two sentences are aggregated into the same text block; if they are dissimilar, the later sentence starts a new text block. These steps are repeated until all sentences have been assigned to text blocks. In this way, the text blocks are aggregated according to semantics, which to some extent prevents block boundaries from interrupting the semantics.
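Steps S510 to S530 can be sketched as follows. This is a minimal sketch under stated assumptions: the sentences are already split, `embed` is a hypothetical callable standing in for the embedding model, and the default threshold of 0.8 is an arbitrary illustrative value, not one given in the application.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_sentences(
    sentences: List[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.8,
) -> List[List[str]]:
    """Group adjacent sentences into text blocks by cosine similarity."""
    if not sentences:
        return []
    # S520: vectorize every sentence once.
    vectors = [embed(s) for s in sentences]
    blocks: List[List[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        # S530: compare each sentence with the one immediately before it.
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            blocks[-1].append(sentences[i])   # similar: same text block
        else:
            blocks.append([sentences[i]])     # dissimilar: start a new block
    return blocks
```

Because only adjacent sentences are compared, a block can drift in topic over many sentences; a stricter threshold shortens the blocks.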
In one embodiment, a community discovery device for a mineral knowledge graph includes:
a blocking module, used for acquiring a knowledge base and performing text blocking on the knowledge base to obtain each text block;
an entity extraction module, used for obtaining the entities of the knowledge base based on each text block;
a construction module, used for constructing a first entity relationship graph according to each entity and the relationships among the entities;
a distance acquisition module, used for acquiring a first distance between any two entities within each text block, and confirming a second distance between any two entities in the knowledge base based on the first distance;
a similarity acquisition module, used for acquiring the similarity of any two entities in the concept library;
a weight value calculation module, used for confirming the sum of the similarity and the second distance as the weight value of an edge in the first entity relationship graph, wherein the edge is formed by any two entities;
a modularity calculation module, used for obtaining the modularity according to the weight values of the edges;
and an aggregation module, used for aggregating the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
For the specific limitations of the community discovery device for the mineral knowledge graph, reference may be made to the limitations of the community discovery method for the mineral knowledge graph above, which are not repeated here. The modules in the community discovery device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules. It should be noted that the division of modules in the embodiment of the present application is schematic and is merely a division by logical function; other divisions are possible in an actual implementation.
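The work of the weight value calculation module and the modularity calculation module can be sketched as follows. The edge weight follows the text (similarity plus second distance). The modularity shown is the standard weighted modularity, which is an assumption here and may differ in detail from the formula given for the method. Treating a missing similarity or distance as 0.0, and the `Edge` tuple layout, are also illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple

Edge = Tuple[str, str]


def edge_weights(
    similarity: Dict[Edge, float], second_distance: Dict[Edge, float]
) -> Dict[Edge, float]:
    """Edge weight = similarity + second distance for each entity pair."""
    pairs = set(similarity) | set(second_distance)
    # Assumption: a value missing for a pair is treated as 0.0.
    return {p: similarity.get(p, 0.0) + second_distance.get(p, 0.0) for p in pairs}


def modularity(weights: Dict[Edge, float], community: Dict[str, str]) -> float:
    """Standard weighted modularity of a partition of an undirected graph."""
    total = sum(weights.values())          # m: total edge weight
    if total == 0:
        return 0.0
    degree: Dict[str, float] = defaultdict(float)
    internal: Dict[str, float] = defaultdict(float)
    for (u, v), w in weights.items():
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w    # weight inside a community
    tot: Dict[str, float] = defaultdict(float)
    for node, k in degree.items():
        tot[community[node]] += k          # summed weighted degree per community
    # Q = sum_c [ internal_c / m - (tot_c / 2m)^2 ]
    return sum(internal[c] / total - (tot[c] / (2 * total)) ** 2 for c in tot)
```

Each edge is expected to appear once, with a consistent ordering of its two entities, so that the same pair is keyed identically in both input dictionaries.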
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring a knowledge base, and performing text blocking on the knowledge base to obtain each text block;
based on each text block, obtaining an entity of a knowledge base;
constructing a first entity relationship diagram according to each entity and the relationship among the entities;
acquiring a first distance between any two entities in each text block, and confirming a second distance between any two entities in the knowledge base based on the first distance;
obtaining the similarity of any two entities in a concept library;
confirming the sum of the similarity and the second distance as a weight value of an edge in the first entity relationship graph, wherein the edge is formed by any two entities;
obtaining the modularity according to the weight values of the edges;
and aggregating the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
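The final aggregation step can be sketched as a greedy local-moving loop in the style of the Louvain method. This is an assumed illustration of aggregation "based on the increment of the modularity", not necessarily the procedure of the application. For brevity it recomputes the standard weighted modularity after each tentative move rather than using an incremental gain formula, so it suits only small graphs.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def modularity(weights: Dict[Edge, float], community: Dict[str, str]) -> float:
    """Standard weighted modularity (assumed form)."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    internal: Dict[str, float] = defaultdict(float)
    tot: Dict[str, float] = defaultdict(float)
    for (u, v), w in weights.items():
        tot[community[u]] += w
        tot[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(internal[c] / total - (tot[c] / (2 * total)) ** 2 for c in tot)


def aggregate(weights: Dict[Edge, float]) -> Dict[str, str]:
    """Move nodes between neighboring communities while modularity increases."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for u, v in weights:
        neighbors[u].add(v)
        neighbors[v].add(u)
    community = {node: node for node in neighbors}  # start: one node per community
    improved = True
    while improved:
        improved = False
        for node in sorted(neighbors):
            current = community[node]
            best, best_q = current, modularity(weights, community)
            targets = {community[n] for n in neighbors[node]} - {current}
            for target in sorted(targets):
                community[node] = target            # tentative move
                q = modularity(weights, community)
                if q > best_q + 1e-12:              # keep only a real increment
                    best, best_q = target, q
            community[node] = best
            if best != current:
                improved = True
    return community
```

For example, two tightly linked groups of three entities joined by one weak edge end up as two communities.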
In one embodiment, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a knowledge base, and performing text blocking on the knowledge base to obtain each text block;
based on each text block, obtaining an entity of a knowledge base;
constructing a first entity relationship diagram according to each entity and the relationship among the entities;
acquiring a first distance between any two entities in each text block, and confirming a second distance between any two entities in the knowledge base based on the first distance;
obtaining the similarity of any two entities in a concept library;
confirming the sum of the similarity and the second distance as a weight value of an edge in the first entity relationship graph, wherein the edge is formed by any two entities;
obtaining the modularity according to the weight values of the edges;
and aggregating the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
For the specific implementation of this embodiment of the application, reference may be made to the embodiments above, and the corresponding technical effects are obtained.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A community discovery method for a mineral knowledge graph, comprising:
acquiring a knowledge base, and performing text blocking on the knowledge base to obtain text blocks;
obtaining entities of the knowledge base based on each of the text blocks;
constructing a first entity relationship graph according to the entities and the relationships between the entities;
obtaining a first distance between any two of the entities within each of the text blocks, and determining a second distance between any two of the entities in the knowledge base based on the first distance;
obtaining the similarity of any two of the entities in a concept library; wherein the step of obtaining the similarity of any two of the entities in the concept library comprises: obtaining the description texts corresponding to any two of the entities and the concept library; performing vectorization calculation on the description texts corresponding to any two of the entities to obtain vectorized entities; obtaining vectors of target concepts from the concept library based on the vectorized entities; and obtaining the similarity of any two of the entities in the concept library based on the vectorized entities and the vectors of the target concepts;
determining the sum of the similarity and the second distance as the weight value of an edge in the first entity relationship graph; wherein the edge is formed by any two of the entities;
obtaining a modularity according to the weight values of the edges;
aggregating the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
2. The community discovery method for a mineral knowledge graph according to claim 1, wherein the step of determining the second distance between any two of the entities in the knowledge base based on the first distance comprises:
setting a truncation function;
obtaining the second distance based on the truncation function and the first distance.
3. The community discovery method for a mineral knowledge graph according to claim 2, wherein, in the step of obtaining the second distance based on the truncation function and the first distance, the second distance is obtained based on the following formula:
[formula not reproduced in the source text]
wherein WD is the second distance; dw is the first distance; δc is the truncation function; and alli is the maximum number of times any two of the entities appear simultaneously within a preset range.
4. The community discovery method for a mineral knowledge graph according to claim 1, wherein, in the step of obtaining the similarity of any two of the entities in the concept library based on the vectorized entities and the vectors of the target concepts, the similarity is obtained based on the following formula:
[formula not reproduced in the source text]
wherein CD is the similarity; i is the category code of the target concept; allc is the number of categories; the four vector symbols of the formula, which are not reproduced in the source text, denote respectively one of the two vectorized entities, the other of the two vectorized entities, the vector of the target concept of category i extracted according to the one vectorized entity, and the vector of the target concept of category i extracted according to the other vectorized entity; and Wci is the weight of the target concept with category code i.
5. The community discovery method for a mineral knowledge graph according to claim 1, further comprising the steps of:
obtaining the description text corresponding to each of the communities;
summarizing the description text corresponding to each of the communities to obtain a community generalization.
6. The community discovery method for a mineral knowledge graph according to claim 5, further comprising the steps of:
receiving a request of a user;
converting the request into a request vector;
performing similarity matching on the request vector and the community generalization to obtain a target community;
extracting, in the target community, the entities matched with the request vector.
7. The community discovery method for a mineral knowledge graph according to claim 1, wherein the step of performing text blocking on the knowledge base to obtain text blocks comprises:
splitting the knowledge base to obtain a plurality of sentences;
vectorizing each of the sentences to obtain a plurality of vectors;
classifying adjacent sentences into the same text block or different text blocks based on the cosine similarity of the vectors of the adjacent sentences.
8. A community discovery device for a mineral knowledge graph, comprising:
a blocking module, configured to acquire a knowledge base and perform text blocking on the knowledge base to obtain text blocks;
an entity extraction module, configured to obtain entities of the knowledge base based on each of the text blocks;
a construction module, configured to construct a first entity relationship graph according to the entities and the relationships between the entities;
a distance acquisition module, configured to obtain a first distance between any two of the entities within each of the text blocks, and determine a second distance between any two of the entities in the knowledge base based on the first distance;
a similarity acquisition module, configured to obtain the similarity of any two of the entities in a concept library; further configured to obtain the description texts corresponding to any two of the entities and the concept library, and perform vectorization calculation on the description texts corresponding to any two of the entities to obtain vectorized entities; further configured to obtain vectors of target concepts from the concept library based on the vectorized entities; and further configured to obtain the similarity of any two of the entities in the concept library based on the vectorized entities and the vectors of the target concepts;
a weight value calculation module, configured to determine the sum of the similarity and the second distance as the weight value of an edge in the first entity relationship graph; wherein the edge is formed by any two of the entities;
a modularity calculation module, configured to obtain a modularity according to the weight values of the edges;
an aggregation module, configured to aggregate the nodes of the first entity relationship graph based on the increment of the modularity to obtain a plurality of communities.
9. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program is configured to execute, when run, the steps of the method according to any one of claims 1 to 7.
CN202510214034.9A 2025-02-26 2025-02-26 Community discovery method, device and storage medium for mineral knowledge graph Active CN120144780B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510214034.9A CN120144780B (en) 2025-02-26 2025-02-26 Community discovery method, device and storage medium for mineral knowledge graph

Publications (2)

Publication Number Publication Date
CN120144780A CN120144780A (en) 2025-06-13
CN120144780B true CN120144780B (en) 2025-09-19

Family

ID=95959691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510214034.9A Active CN120144780B (en) 2025-02-26 2025-02-26 Community discovery method, device and storage medium for mineral knowledge graph

Country Status (1)

Country Link
CN (1) CN120144780B (en)

Also Published As

Publication number Publication date
CN120144780A (en) 2025-06-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant