Disclosure of Invention
The application aims to provide a text question-answering method, system, electronic device and medium based on semantic retrieval, which can improve the accuracy and efficiency of retrieving information for global problems, thereby improving the accuracy and efficiency of text question answering.
In a first aspect, an embodiment of the present application provides a text question-answering method based on semantic retrieval, where the method includes:
acquiring a problem to be solved;
constructing a hierarchical knowledge graph with highly cohesive communities;
judging whether the problem to be solved belongs to a local problem or a global problem;
if the problem to be solved belongs to a local problem, carrying out a local search through a local search module to obtain a plurality of answer texts similar to the problem to be solved;
if the problem to be solved belongs to a global problem, carrying out a global search in the hierarchical knowledge graph through a global search module to obtain a plurality of communities related to the problem to be solved, using the community descriptions as context, and inputting the context into a large language model to generate an answer text.
Compared with the prior art, the first aspect of the application has the following beneficial effects:
The method comprises: obtaining a problem to be solved; constructing a hierarchical knowledge graph with highly cohesive communities; judging whether the problem to be solved belongs to a local problem or a global problem; if it belongs to a local problem, carrying out a local search through a local search module to obtain a plurality of answer texts similar to the problem to be solved; and if it belongs to a global problem, carrying out a global search in the hierarchical knowledge graph through a global search module to obtain a plurality of communities related to the problem to be solved, using the community descriptions as context, and inputting the context into a large language model to generate an answer text. Therefore, by judging whether the problem to be solved belongs to a local problem or a global problem, answer texts are retrieved in different ways for different types of problems. In particular, the global search carried out in the hierarchical knowledge graph through the global search module is efficient in both time and computation cost, so the capability of handling complex problems can be remarkably improved without remarkably increasing the complexity of the system, and the accuracy and efficiency of retrieving information for global problems can be improved, thereby improving the accuracy and efficiency of text question answering.
In some embodiments, the determining whether the problem to be solved belongs to a local problem or a global problem includes:
constructing a similarity distribution between the problem to be solved and a plurality of text blocks;
calculating the ratio between the peak of the similarity distribution and the secondary peak of the similarity distribution to obtain a peak intensity;
calculating the second derivative of the similarity distribution at the peak point to obtain a distribution sharpness;
and judging whether the problem to be solved belongs to a local problem or a global problem based on the peak intensity and the distribution sharpness.
In some embodiments, the constructing a similarity distribution between the questions to be solved and a plurality of text blocks includes:
$$p(s) = \frac{1}{n\,h}\sum_{i=1}^{n} K\!\left(\frac{s - s_i}{h}\right);$$
Wherein $p(s)$ represents the similarity distribution, $n$ represents the number of text blocks and is a positive integer, $h$ represents the bandwidth parameter, $s$ represents a similarity score between the question to be answered and any text block, $s_i$ represents the similarity score between the question to be answered and the $i$-th text block, and $K(\cdot)$ represents a kernel function.
In some embodiments, the calculating the second derivative of the similarity distribution at the peak point to obtain the distribution sharpness comprises:
$$C = \bigl|\,p''(s^{*})\,\bigr|;$$
$$p''(s^{*}) \approx \frac{p(s_{i+1}) - 2\,p(s_i) + p(s_{i-1})}{(\Delta s)^{2}},\quad s_i = s^{*};$$
Wherein $C$ represents the distribution sharpness, $p(s)$ represents the similarity distribution, $s^{*}$ represents the peak point, $p'(s^{*})$ represents the first derivative of the similarity distribution at the peak point (which vanishes at the peak), $p''(s^{*})$ represents the second derivative of the similarity distribution at the peak point, $s_{i+1}$ represents the similarity score between the question to be answered and the $(i+1)$-th text block, and $s_{i-1}$ represents the similarity score between the question to be answered and the $(i-1)$-th text block.
In some embodiments, the determining, based on the peak intensity and the distribution sharpness, whether the problem to be solved belongs to a local problem or a global problem includes:
presetting a first threshold value and a second threshold value;
if the peak intensity is greater than or equal to the first threshold value and the distribution sharpness is greater than or equal to the second threshold value, judging that the problem to be solved belongs to a local problem;
and if the peak intensity is smaller than the first threshold value or the distribution sharpness is smaller than the second threshold value, judging that the problem to be solved belongs to a global problem.
In some embodiments, the performing global search in the hierarchical knowledge graph by using a global search module to obtain a plurality of communities related to the to-be-solved problem includes:
In the hierarchical knowledge graph, each node corresponds to one community;
Calculating the similarity between the to-be-solved problem and each node, and taking the similarity as an initial weight of each node;
Calculating the weighted community weight according to the initial weight;
and selecting a plurality of communities related to the to-be-solved problem based on the weighted community weights.
In some embodiments, the calculating the weighted community weight according to the initial weight includes:
$$w_p^{\text{new}} = \alpha\, w_p + (1-\alpha)\sum_{j=1}^{m}\beta_j\, w_{c_j};$$
Wherein $w_p^{\text{new}}$ represents the weighted community weight, $\alpha$ represents a coefficient for balancing direct similarity with inherited weights, $w_p$ represents the weight of the parent node, $w_{c_j}$ represents the weight of the $j$-th child node, $\beta_j$ represents a coefficient for adjusting the contribution of each child node to the parent node weight, and $m$ represents the number of child nodes.
In a second aspect, an embodiment of the present application further provides a text question-answering system based on semantic retrieval, where the system includes:
The data acquisition unit is used for acquiring the to-be-solved problem;
the map construction unit is used for constructing a hierarchical knowledge map with a high cohesive community;
the problem judging unit is used for judging whether the problem to be solved belongs to a local problem or a global problem;
the local search unit is used for carrying out a local search through the local search module if the problem to be solved belongs to a local problem, so as to obtain a plurality of answer texts similar to the problem to be solved;
and the global retrieval unit is used for carrying out a global search in the hierarchical knowledge graph through the global retrieval module if the problem to be solved belongs to a global problem, obtaining a plurality of communities related to the problem to be solved, using the community descriptions as context, and inputting the context into a large language model to generate an answer text.
In a third aspect, an embodiment of the present application further provides an electronic device, including at least one control processor and a memory communicatively connected to the at least one control processor, where the memory stores instructions executable by the at least one control processor, where the instructions are executable by the at least one control processor to enable the at least one control processor to perform a text question-answering method based on semantic retrieval as described above.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform a text question-answering method based on semantic retrieval as described above.
It is to be understood that the advantages of the second to fourth aspects compared with the related art are the same as those of the first aspect compared with the related art, and reference may be made to the related description in the first aspect, which is not repeated herein.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
In the description of the present application, the description of first, second, etc. is for the purpose of distinguishing between technical features only and should not be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated or implicitly indicating the precedence of the technical features indicated.
It should also be appreciated that references to "one embodiment" or "some embodiments" or the like described in the specification of an embodiment of the present application mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be determined reasonably by a person skilled in the art in combination with the specific content of the technical solution.
First, several terms involved in the present application are explained:
Retrieval-augmented generation (RAG): this technique was proposed by Facebook AI Research. The method relies on semantic similarity retrieval: by calculating the similarity between the question and the embedding vectors of the text blocks in the knowledge base, the most relevant Top-K text blocks are selected as the basis for answer generation. This approach works well when dealing with local problems, but often fails to provide sufficient context information when dealing with global problems that require integration of multiple knowledge segments.
Large language model (LLM): a deep learning model trained on large amounts of text data, so that the model can generate natural language text or understand the meaning of language text. By training on vast datasets, these models can provide in-depth knowledge about various topics and produce language. The key idea is to learn the patterns and structure of natural language through large-scale unsupervised training, so as to simulate the human process of language cognition and generation to a certain extent.
GraphRAG: a new retrieval method that uses a knowledge graph together with vector retrieval on top of the basic RAG architecture, so that it can integrate and understand diverse knowledge and provide a broader, more comprehensive view of the data. GraphRAG was proposed by Microsoft Research; the method further extends the use of graph structures and constructs a knowledge graph with communities. When retrieving knowledge, GraphRAG relies on LLMs to sequentially summarize and generate community contents. This method can handle global problems, but its cost is high, and the final retrieval result still "flattens" the hierarchical structure into a linear structure, so the potential of the hierarchical structure cannot be fully utilized.
Natural language processing: an important research direction in the field of artificial intelligence. It integrates knowledge from multiple disciplines such as linguistics, computer science, machine learning, mathematics and cognitive psychology, and is an interdisciplinary field combining computer science, artificial intelligence and linguistics. It comprises two main aspects, natural language understanding and natural language generation, and its research content covers multiple levels such as characters, words, phrases, sentences, paragraphs and chapters. It is a bridge for communication between machine language and human language.
Knowledge base question answering (Knowledge-Based Question Answering, KBQA) is an important research direction in the field of Natural Language Processing (NLP), aimed at answering a user's questions by extracting information from a structured knowledge base or unstructured text. With the rapid development of large language models (Large Language Models, LLMs) such as GPT, LLaMA and Gemini, these models demonstrate powerful few-shot learning capabilities and can generate accurate answers on the basis of small amounts of contextual information. However, LLMs still face challenges in practical applications, especially their limited context windows, which limit their application in a wider range of scenarios.
To overcome these limitations, retrieval-augmented generation (Retrieval-Augmented Generation, RAG) techniques have evolved. By combining an external knowledge base with LLMs, RAG enables the model to access and utilize knowledge beyond its context window when generating answers, without additional fine-tuning. The conventional RAG method assumes that the required information is concentrated in a specific area of the knowledge base and retrieves the relevant information by selecting the Top-K text blocks with the highest semantic similarity to the question. This approach is effective for local questions (Local questions) that only need to focus on a particular portion of the knowledge base, but for global questions (Global questions) that need to integrate information from multiple portions of the knowledge base, such as query-focused summarization (Query-Focused Summarization, QFS), the retrieved information is not accurate enough.
In order to solve the problem that, in the prior art, the information retrieved for global problems is not accurate enough, the application provides a text question-answering method, system, electronic device and medium based on semantic retrieval.
Referring to fig. 1, an embodiment of the present application provides a text question-answering method based on semantic retrieval, which includes the following steps:
Step S100, obtaining a problem to be solved;
Step S200, constructing a hierarchical knowledge graph with highly cohesive communities;
Step S300, judging whether the problem to be solved belongs to a local problem or a global problem;
Step S400, if the problem to be solved belongs to a local problem, carrying out a local search through a local search module to obtain a plurality of answer texts similar to the problem to be solved;
Step S500, if the problem to be solved belongs to a global problem, carrying out a global search in the hierarchical knowledge graph through a global search module to obtain a plurality of communities related to the problem to be solved, using the community descriptions as context, and inputting the context into a large language model to generate an answer text.
In this embodiment, a hierarchical knowledge graph with highly cohesive communities is constructed, and whether the problem to be solved belongs to a local problem or a global problem is judged. If it belongs to a local problem, a local search is carried out through the local search module to obtain a plurality of answer texts similar to the problem to be solved. If it belongs to a global problem, a global search is carried out in the hierarchical knowledge graph through the global search module to obtain a plurality of communities related to the problem to be solved, and the community descriptions are used as context and input into a large language model to generate an answer text. Therefore, by judging whether the problem to be solved belongs to a local problem or a global problem, answer texts are retrieved in different ways for different types of problems. In particular, the global search carried out in the hierarchical knowledge graph through the global search module is efficient in both time and computation cost, so the capability of handling complex problems can be remarkably improved without remarkably increasing the complexity of the system, and the accuracy and efficiency of retrieving information for global problems can be improved, thereby improving the accuracy and efficiency of text question answering.
The hierarchical knowledge graph with highly cohesive communities can be constructed by merging entities with the same name and type across sub-graphs and merging relationships with the same source entity and target entity, so that relationships crossing sub-graphs are created. Finally, a hierarchical knowledge graph with highly cohesive communities is generated, which can reflect the global structure of the source document more accurately.
Carrying out a local search through the local search module to obtain a plurality of answer texts similar to the problem to be solved may directly adopt the most classical K-nearest-neighbors (K-Nearest Neighbors, KNN) retrieval method: the local retrieval module calculates the similarity between the problem to be solved and the texts by means of a similarity calculation, so as to obtain a plurality of answer texts similar to the problem to be solved.
In some embodiments, determining whether the problem to be solved belongs to a local problem or a global problem includes:
constructing similarity distribution between the to-be-solved problem and a plurality of text blocks;
Calculating the ratio between the peak value of the similarity distribution and the secondary peak value of the similarity distribution to obtain peak value intensity;
calculating a second derivative of the similarity distribution at the peak value point to obtain the distribution sharpness;
and judging whether the problem to be solved belongs to a local problem or a global problem based on the peak intensity and the distribution sharpness.
In this embodiment, a similarity distribution between the problem to be solved and a plurality of text blocks is constructed; the ratio between the peak of the similarity distribution and its secondary peak is calculated to obtain the peak intensity; the second derivative of the similarity distribution at the peak point is calculated to obtain the distribution sharpness; and whether the problem to be solved belongs to a local problem or a global problem is judged based on the peak intensity and the distribution sharpness. The similarity distribution is constructed by calculating the semantic similarity between the problem to be solved and each text block, and the similarity distributions of local problems and global problems differ significantly. Therefore, by analyzing the peak intensity and the distribution sharpness, whether the problem to be solved belongs to a local problem or a global problem can be accurately judged, which lays a good data foundation for later retrieval of accurate information.
In some embodiments, constructing a similarity distribution between a question to be solved and a plurality of text blocks includes:
$$p(s) = \frac{1}{n\,h}\sum_{i=1}^{n} K\!\left(\frac{s - s_i}{h}\right);$$
Wherein $p(s)$ represents the similarity distribution, $n$ represents the number of text blocks and is a positive integer, $h$ represents the bandwidth parameter, $s$ represents a similarity score between the question to be answered and any text block, $s_i$ represents the similarity score between the question to be answered and the $i$-th text block, and $K(\cdot)$ represents a kernel function.
In some embodiments, calculating the second derivative of the similarity distribution at the peak point to obtain the distribution sharpness includes:
$$C = \bigl|\,p''(s^{*})\,\bigr|;$$
$$p''(s^{*}) \approx \frac{p(s_{i+1}) - 2\,p(s_i) + p(s_{i-1})}{(\Delta s)^{2}},\quad s_i = s^{*};$$
Wherein $C$ represents the distribution sharpness, $p(s)$ represents the similarity distribution, $s^{*}$ represents the peak point, $p'(s^{*})$ represents the first derivative of the similarity distribution at the peak point (which vanishes at the peak), $p''(s^{*})$ represents the second derivative of the similarity distribution at the peak point, $s_{i+1}$ represents the similarity score between the question to be answered and the $(i+1)$-th text block, and $s_{i-1}$ represents the similarity score between the question to be answered and the $(i-1)$-th text block.
In some embodiments, determining whether the problem to be solved belongs to a local problem or a global problem based on peak intensity and distribution sharpness comprises:
presetting a first threshold value and a second threshold value;
if the peak intensity is greater than or equal to a first threshold value and the distribution sharpness is greater than or equal to a second threshold value, judging that the problem to be solved belongs to a local problem;
if the peak intensity is smaller than the first threshold value or the distribution sharpness is smaller than the second threshold value, judging that the problem to be solved belongs to the global problem.
In this embodiment, a first threshold and a second threshold are preset, if the peak value intensity is greater than or equal to the first threshold and the distribution sharpness is greater than or equal to the second threshold, the problem to be solved is judged to belong to the local problem, and if the peak value intensity is less than the first threshold or the distribution sharpness is less than the second threshold, the problem to be solved is judged to belong to the global problem. Therefore, whether the problem to be solved belongs to the local problem or the global problem is judged by comprehensively considering the peak intensity and the distribution sharpness, and the problem to be solved belongs to the local problem or the global problem can be accurately judged, so that a good data basis is laid for searching accurate retrieval information in the later period.
In some embodiments, global searching is performed in the hierarchical knowledge graph by a global searching module to obtain a plurality of communities related to the to-be-solved problem, including:
In the hierarchical knowledge graph, each node corresponds to one community;
calculating the similarity between the problem to be solved and each node, and taking the similarity as an initial weight of each node;
Calculating the weighted community weight according to the initial weight;
Based on the weighted community weights, selecting a plurality of communities related to the to-be-solved problem.
In the embodiment, each node corresponds to one community in a hierarchical knowledge graph, similarity between a problem to be solved and each node is calculated and used as an initial weight of each node, weighted community weights are calculated according to the initial weights, and a plurality of communities related to the problem to be solved are selected based on the weighted community weights. Therefore, the global search module performs global search in the hierarchical knowledge graph, so that the method has higher efficiency in time and calculation cost, and the processing capacity of complex problems can be remarkably improved under the condition of not remarkably increasing the complexity of the system, thereby improving the accuracy and the search efficiency of search information corresponding to the global problems.
In some embodiments, calculating the weighted community weights from the initial weights includes:
$$w_p^{\text{new}} = \alpha\, w_p + (1-\alpha)\sum_{j=1}^{m}\beta_j\, w_{c_j};$$
Wherein $w_p^{\text{new}}$ represents the weighted community weight, $\alpha$ represents a coefficient for balancing direct similarity with inherited weights, $w_p$ represents the weight of the parent node, $w_{c_j}$ represents the weight of the $j$-th child node, $\beta_j$ represents a coefficient for adjusting the contribution of each child node to the parent node weight, and $m$ represents the number of child nodes.
For ease of understanding by those skilled in the art, a set of preferred embodiments is provided below:
Knowledge base question answering (Knowledge-Based Question Answering, KBQA) is an important research direction in the field of Natural Language Processing (NLP), aimed at answering a user's questions by extracting information from a structured knowledge base or unstructured text. With the rapid development of large language models (Large Language Models, LLMs) such as GPT, LLaMA and Gemini, these models demonstrate powerful few-shot learning capabilities and can generate accurate answers on the basis of small amounts of contextual information. However, LLMs still face challenges in practical applications, especially their limited context windows, which limit their application in a wider range of scenarios.
To overcome these limitations, retrieval-augmented generation (Retrieval-Augmented Generation, RAG) techniques have evolved. By combining an external knowledge base with LLMs, RAG enables the model to access and utilize knowledge beyond its context window when generating answers, without additional fine-tuning. The conventional RAG method assumes that the required information is concentrated in a specific area of the knowledge base and retrieves the relevant information by selecting the Top-K text blocks with the highest semantic similarity to the question. This approach is effective for local questions (Local questions) that only need to focus on a particular portion of the knowledge base, but does not perform well for global questions (Global questions) that need to integrate information from multiple portions of the knowledge base, such as query-focused summarization (Query-Focused Summarization, QFS). For example, the following limitations remain when existing RAG methods deal with local or global problems:
1. Unbalanced handling of local and global problems: existing methods can only retrieve either local or global information, and a general and efficient information retrieval paradigm that handles both types of problems simultaneously is lacking.
2. High computation cost: some methods (such as GraphRAG) rely on an LLM to carry out multi-round summary generation when retrieving information, resulting in higher question-answering cost and longer response time.
3. In the hierarchical retrieval process, existing methods often flatten the hierarchical structure into a linear structure, so that high-level semantic information cannot be distinguished or the information is over-generalized, and the advantages of the hierarchical structure cannot be fully utilized.
To solve the limitations of the prior art, this embodiment proposes a simple, general and efficient RAG retrieval paradigm, which can uniformly retrieve information required by Local (Local) and Global (Global) problems. Compared with the prior art, the main difference of the technical scheme of the embodiment is that:
1. Problem classification algorithm: this embodiment provides a problem classification algorithm based on the semantic similarity distribution, which can accurately judge the problem type without relying on LLMs.
2. Hierarchical graph structure: this embodiment introduces a hierarchical graph structure to represent the original knowledge base, so that the global and local relations of knowledge can be better captured.
3. Hierarchical retrieval strategy: this embodiment proposes a hierarchical retrieval strategy that minimizes information loss while remaining efficient in time and cost, applicable to both local and global problems.
Through the innovation, the method and the device can search more accurate and comprehensive information while keeping lower calculation cost, and are suitable for wide knowledge-intensive question-answering tasks. Referring to fig. 2, the technical solution of this embodiment specifically includes the following:
1. Graph construction module.
The method of this embodiment first improves the knowledge graph construction technique of GraphRAG. Similar to GraphRAG, this embodiment extracts entities and relationships from the source document. Each entity contains a name, a description and a type (from a predefined set of types), while each relationship includes a source entity, a target entity, a brief description, and a score representing the strength of the relationship. From these components, this embodiment constructs a weighted undirected knowledge graph for capturing the structural semantics of the source document. This embodiment also detects hierarchical communities (hierarchical communities) in the graph and generates a summary of each community according to the descriptions of the entities and relationships within the community. However, this embodiment introduces a key improvement to address an important limitation of GraphRAG in practical applications.
In an actual RAG scenario, the source document typically exceeds the context window limit of the LLM, so the document must be processed in chunks. This results in a plurality of sub-graphs, each representing only a portion of the document. If these sub-graphs are not integrated, there will be a lack of global consistency between them, resulting in fragmented community representations. In order to solve this problem, this embodiment proposes a sub-graph integration method. First, entities with the same name and type in different sub-graphs are merged, and a unified description is generated for each merged entity by the LLM, so as to ensure consistency and integrity. Second, relationships with the same source and target entities are likewise merged, and the LLM is used to generate an integrated description and strength score. Through this integration process, the embodiment can create relationships crossing sub-graphs and finally generate a hierarchical knowledge graph with highly cohesive communities, so that the global structure of the source document is reflected more accurately.
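For illustration only, the following is a minimal sketch of the sub-graph integration step described above. The data structures, function names and the `summarize_with_llm` helper are hypothetical and not part of the claimed method; entities are merged by (name, type) and relationships by (source, target), with merged strengths averaged as one possible choice.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    type: str
    description: str

@dataclass
class Relationship:
    source: str
    target: str
    description: str
    strength: float

def summarize_with_llm(texts):
    # Placeholder for an LLM call that fuses several descriptions into one;
    # here we simply concatenate them.
    return " ".join(texts)

def integrate_subgraphs(subgraphs):
    """Merge entities with the same (name, type) and relationships with the
    same (source, target) across a list of (entities, relationships) sub-graphs."""
    entity_groups = defaultdict(list)
    relation_groups = defaultdict(list)
    for entities, relationships in subgraphs:
        for e in entities:
            entity_groups[(e.name, e.type)].append(e)
        for r in relationships:
            relation_groups[(r.source, r.target)].append(r)

    merged_entities = [
        Entity(name, etype, summarize_with_llm([e.description for e in group]))
        for (name, etype), group in entity_groups.items()
    ]
    merged_relationships = [
        Relationship(src, dst,
                     summarize_with_llm([r.description for r in group]),
                     sum(r.strength for r in group) / len(group))  # averaged strength
        for (src, dst), group in relation_groups.items()
    ]
    return merged_entities, merged_relationships
```

The merged entities and relationships can then be passed to the community detection and summarization steps described above.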
2. Gating module.
The core function of the gating module is to judge whether the current problem (i.e. the problem to be solved, i.e. the user problem) belongs to a local problem or a global problem, so as to guide the selection of the subsequent retrieval module. The implementation is based on the following principle:
The embodiment first divides the original text into several text blocks and calculates the semantic similarity between the question and each text block. For a local question, the answer is usually focused on a specific area of the text, and therefore a significant peak will appear in the similarity distribution, indicating that the question is highly relevant to a certain block of text. For global problems, the required information tends to be scattered over multiple parts of the text, so the similarity distribution is relatively smooth and has no distinct peaks.
By analyzing the morphological characteristics of the similarity distribution, the gating module can effectively distinguish the type of problem: if the similarity distribution has a significant peak, the problem is judged to be a local problem; if the similarity distribution is relatively smooth, it is judged to be a global problem. This mechanism provides a reliable basis for the selection of the subsequent retrieval module, so that the overall performance of the question-answering system is optimized. The theoretical justification is as follows:
The present embodiment first models the similarity distribution using Kernel Density Estimation (KDE) to more accurately characterize the distributions of local and global problems. Given a question $q$ and $n$ text blocks $\{d_1, \dots, d_n\}$ whose similarity scores to the question are $\{s_1, \dots, s_n\}$, the similarity distribution is modeled as:
$$p(s) = \frac{1}{n\,h}\sum_{i=1}^{n} K\!\left(\frac{s - s_i}{h}\right) \qquad (1);$$
Wherein $p(s)$ represents the similarity distribution, $K(\cdot)$ represents a kernel function satisfying $\int K(u)\,du = 1$, $h$ represents the bandwidth parameter and controls the degree of smoothing, $s$ represents a similarity score between the question and any text block, and $s_i$ represents the similarity score between the question to be answered and the $i$-th text block. Based on the above kernel-density-estimated similarity distribution $p(s)$, the following two statistics, peak intensity and distribution curvature, are defined:
1. Peak intensity (Peak Strength): the ratio of the peak of the similarity distribution $p(s)$ to its secondary peak; the larger the ratio, the more concentrated the distribution:
$$PS = \frac{p(s_{\text{peak}})}{p(s_{\text{second}})} \qquad (2);$$
Wherein $p(s_{\text{peak}})$ represents the peak of the similarity distribution $p(s)$, and $p(s_{\text{second}})$ represents its secondary peak.
2. Distribution curvature (Curvature): the second derivative of the similarity distribution $p(s)$ at the peak point $s^{*}$, used to quantify the sharpness of the distribution:
$$C = \bigl|\,p''(s^{*})\,\bigr| \qquad (3);$$
The second derivative is then approximated using the central difference method:
$$p''(s^{*}) \approx \frac{p(s_{i+1}) - 2\,p(s_i) + p(s_{i-1})}{(\Delta s)^{2}},\quad s_i = s^{*} \qquad (4);$$
Wherein $p'(s^{*})$ represents the first derivative of the similarity distribution at the peak point (which vanishes at $s^{*}$), $p''(s^{*})$ represents the second derivative of the similarity distribution at the peak point, $s_{i+1}$ represents the similarity score between the question to be answered and the $(i+1)$-th text block, and $s_{i-1}$ represents the similarity score between the question to be answered and the $(i-1)$-th text block. The present embodiment continues to assume that the distribution characteristics of the two types of problems satisfy, for local problems: $PS \ge \tau_1$ and $C \ge \tau_2$; and for global problems: $PS < \tau_1$ or $C < \tau_2$; wherein the first threshold $\tau_1$ and the second threshold $\tau_2$ are preset thresholds determined by statistical learning.
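As an illustrative aid only, the sketch below shows one way the above gating statistics might be computed. It assumes a Gaussian kernel, a uniform evaluation grid for $p(s)$, and placeholder threshold values; none of these specific choices are prescribed by the embodiment.

```python
import numpy as np

def gaussian_kde(scores, h, grid):
    """Kernel density estimate p(s) of the similarity scores on a grid (Eq. (1))."""
    scores = np.asarray(scores, dtype=float)[:, None]      # shape (n, 1)
    u = (grid[None, :] - scores) / h                        # shape (n, g)
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)        # Gaussian kernel
    return k.sum(axis=0) / (scores.shape[0] * h)

def classify_question(scores, h=0.05, tau1=2.0, tau2=5.0):
    """Return 'local' or 'global' from the peak intensity and curvature of p(s)."""
    grid = np.linspace(0.0, 1.0, 201)
    p = gaussian_kde(scores, h, grid)

    # Interior local maxima of p(s); fall back to the global maximum if none.
    peaks = [i for i in range(1, len(p) - 1) if p[i] >= p[i - 1] and p[i] >= p[i + 1]]
    if not peaks:
        peaks = [int(np.argmax(p[1:-1])) + 1]
    peaks.sort(key=lambda i: p[i], reverse=True)
    i_star = peaks[0]

    # Peak intensity: ratio of the main peak to the secondary peak (Eq. (2)).
    second = p[peaks[1]] if len(peaks) > 1 else 1e-9
    ps = p[i_star] / max(second, 1e-9)

    # Curvature: |second derivative| at the peak via central differences (Eqs. (3)-(4)).
    ds = grid[1] - grid[0]
    curvature = abs((p[i_star + 1] - 2.0 * p[i_star] + p[i_star - 1]) / ds ** 2)

    return "local" if ps >= tau1 and curvature >= tau2 else "global"

# Example usage with a toy list of similarity scores:
print(classify_question([0.92, 0.31, 0.28, 0.30, 0.27]))
```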
The theorem is demonstrated below:
Theorem: if the answers to local questions are concentrated in a single text block while the information for global questions is uniformly distributed, then the peak intensity $PS$ and the curvature $C$ can effectively distinguish the two types of questions.
Proof:
For a local question, there exists a unique $i^{*}$ such that $s_{i^{*}}$ is significantly higher than the other scores; the KDE $p(s)$ then exhibits a sharp peak at $s_{i^{*}}$, so $PS$ is large (the denominator, i.e. the secondary peak, approaches 0) and $C$ is large (large curvature, sharp distribution). For a global question, the scores $s_i$ are evenly distributed and the KDE $p(s)$ is nearly flat, so $PS$ is small (no distinct peak) and $C$ is small (small curvature, smooth distribution).
Conclusion: $PS$ and $C$ differ significantly between the two types of questions, and by thresholding it can be guaranteed that:
$$P(\text{correct}) \ge 1 - \varepsilon \qquad (5);$$
Wherein $P(\text{correct})$ represents the probability of correctly classifying a question as a global question or a local question, and $\varepsilon$ is a small constant determined by the chosen thresholds.
Supplementary proof:
1. Bandwidth selection.
The bandwidth $h$ is critical to the KDE modeling; the method of this embodiment selects it using the Silverman criterion:
$$h = 1.06\,\hat{\sigma}\,n^{-1/5} \qquad (6);$$
Wherein $\hat{\sigma}$ is the standard deviation of the samples.
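A one-line sketch of this bandwidth choice; the 1.06 constant is the usual Gaussian rule-of-thumb form of Silverman's criterion and is an assumption of this sketch, not a value stated by the embodiment.

```python
import numpy as np

def silverman_bandwidth(scores):
    # h = 1.06 * sigma_hat * n^(-1/5), rule-of-thumb form of Silverman's criterion
    scores = np.asarray(scores, dtype=float)
    return 1.06 * scores.std(ddof=1) * len(scores) ** (-1 / 5)
```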
2. Noise robustness.
The similarity distribution in an actual scene may contain noise. The signal to noise ratio is defined as follows:
$$SNR = \frac{s_{\max}}{\sigma_s} \qquad (7);$$
Wherein $s_{\max}$ represents the highest semantic similarity score between the question $q$ and all text blocks $d_i$, i.e., the signal strength; it reflects the degree of matching between the question and the most relevant text block, and a larger value indicates a stronger relevance of the question to some text block. $\sigma_s$ represents the standard deviation of the semantic similarity scores and measures the degree of dispersion of the similarity scores, i.e., the noise intensity; it is obtained by taking the square root of the mean of the squared deviations of each similarity score $s_i$ from the mean $\bar{s}$. The larger $\sigma_s$ is, the more scattered the similarity distribution and the stronger the noise; the smaller it is, the more concentrated the similarity distribution and the weaker the noise. By calculation, the signal-to-noise ratio of the method of this embodiment is sufficiently high that the impact of noise on the threshold decision is negligible.
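An illustrative computation of this signal-to-noise ratio under the above definitions; the function name is ours.

```python
import numpy as np

def similarity_snr(scores):
    """SNR = max similarity / standard deviation of the similarity scores (Eq. (7))."""
    scores = np.asarray(scores, dtype=float)
    return scores.max() / scores.std()

print(similarity_snr([0.92, 0.31, 0.28, 0.30, 0.27]))  # one clear match gives a high SNR
```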
3. Local retrieval module.
The core task of this module is to quickly locate the local information most relevant to the question from the knowledge base to support subsequent answer generation. The most classical retrieval method based on the K-nearest-neighbors (K-Nearest Neighbors, KNN) algorithm is directly adopted here. The core idea of the KNN algorithm is to select the most relevant $K$ text blocks as the retrieval result by calculating the semantic similarity between the question and the text blocks in the knowledge base. Specifically, given a question $q$ and text blocks $\{d_1, \dots, d_n\}$ in the knowledge base, the question and text blocks are first encoded into semantic vectors $v_q$ and $v_{d_i}$ by a pre-trained embedding model (Embedding); the cosine similarity between the question vector and each text block vector is then calculated:
$$\text{sim}(q, d_i) = \frac{v_q \cdot v_{d_i}}{\lVert v_q \rVert\,\lVert v_{d_i} \rVert} \qquad (8);$$
Finally, the $K$ text blocks with the highest similarity are selected as the retrieval result. The KNN algorithm can match the question with text segments at a fine granularity, thereby locating relevant information more accurately. The method is simple and efficient, and is suitable for answering local questions whose answers are concentrated in a local area of the knowledge base.
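For illustration, a minimal sketch of such Top-K cosine-similarity retrieval; the embedding model is abstracted behind an `embed` callable supplied by the caller, and nothing here is specific to a particular library.

```python
import numpy as np

def cosine_similarity(a, b):
    # sim(q, d_i) = (v_q . v_d) / (||v_q|| * ||v_d||), cf. Eq. (8)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_retrieve(question, text_blocks, embed, k=5):
    """Return the k text blocks most similar to the question."""
    q_vec = embed(question)
    scored = [(cosine_similarity(q_vec, embed(block)), block) for block in text_blocks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for _, block in scored[:k]]

# Usage (with any embedding function mapping text -> 1-D numpy array):
# top_blocks = knn_retrieve("What does the contract stipulate?", blocks, embed=my_model.encode, k=5)
```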
4. Global retrieval module.
The core task of the global retrieval module is to integrate information from multiple parts of the hierarchical knowledge graph so as to answer global questions that require comprehensive knowledge. Traditional retrieval methods based on cosine similarity have limitations in dealing with global problems, because the similarity scores of higher-level nodes (representing abstract concepts) are typically low, which may cause them to be ignored; relying only on parent-child relationships, on the other hand, may omit relevant information. In order to solve this problem, this embodiment proposes a weighted retrieval mechanism, which combines the position information of the nodes in the hierarchical structure with the cosine similarity, so as to more comprehensively capture the requirements of global problems. In the hierarchical tree structure built by the Graph Indexer, each node corresponds to a community (community). This embodiment first calculates an initial weight for each node based on cosine similarity:
$$w_i = \text{sim}(v_q, v_{c_i}) = \frac{v_q \cdot v_{c_i}}{\lVert v_q \rVert\,\lVert v_{c_i} \rVert} \qquad (9);$$
Wherein $v_q$ is the embedding vector of the question and $v_{c_i}$ is the embedding vector of the node (community) description. In order to integrate hierarchical context information, this embodiment updates the weight of each parent node; wherein $w_p$ is the weight of the parent node, $w_{c_j}$ are the weights of its child nodes, $m$ is the number of child nodes, $\beta_j$ is used to adjust the contribution of each child node to the parent node weight, and $\alpha$ (typically set to 0.3) is used to balance direct similarity with inherited weights:
$$w_p^{\text{new}} = \alpha\, w_p + (1-\alpha)\sum_{j=1}^{m}\beta_j\, w_{c_j} \qquad (10);$$
Based on the weighted community weights, this embodiment selects the Top-K communities with the highest weights and inputs their descriptions as context into the LLM to generate sub-answers. Compared with directly using the original descriptions, the sub-answers can reduce noise and optimize the use of the context window. These sub-answers serve as the retrieved information, providing high-quality knowledge support for subsequent answer generation.
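Purely as an illustration of the weighted hierarchical retrieval described above, the sketch below propagates child weights to parents bottom-up and returns the highest-weighted communities; the tree representation, the uniform choice of beta_j = 1/m and alpha = 0.3 are assumptions of this sketch rather than requirements of the embodiment.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_retrieve(q_vec, communities, children, embeddings, alpha=0.3, top_k=5):
    """Rank communities of a hierarchical graph for a global question.

    communities: list of community ids, ordered so that children appear before parents
    children:    dict mapping a community id to the list of its child ids
    embeddings:  dict mapping a community id to the embedding of its description
    """
    # Initial weight of every node: cosine similarity to the question (Eq. (9)).
    weights = {c: cosine(q_vec, embeddings[c]) for c in communities}

    # Bottom-up update of parent weights (Eq. (10)), with beta_j = 1/m assumed.
    for c in communities:
        kids = children.get(c, [])
        if kids:
            inherited = sum(weights[k] for k in kids) / len(kids)
            weights[c] = alpha * weights[c] + (1 - alpha) * inherited

    return sorted(communities, key=lambda c: weights[c], reverse=True)[:top_k]
```

The descriptions of the selected communities (or the sub-answers generated from them) can then be concatenated as the context passed to the LLM.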
Compared with the prior art, the technical scheme of the embodiment has the following advantages:
The retrieval method provided by this embodiment remarkably improves the performance of the question-answering system when processing local and global problems by combining the hierarchical knowledge graph with the weighted retrieval mechanism. For local problems, the traditional KNN retrieval method can rapidly locate the text blocks most relevant to the problem, ensuring the accuracy of the answers and the efficiency of retrieval. For global problems, the hierarchical weighted retrieval mechanism avoids the limitation of a single similarity measurement by integrating the abstract information of high-level nodes and the detailed information of bottom-level nodes, so that key information scattered over multiple parts of the knowledge base is captured more comprehensively. In addition, by generating sub-answers rather than directly using the original descriptions, noise is further reduced and utilization of the context window is optimized.
The core advantage of the method of the present embodiment is its versatility and efficiency. Through simple similarity distribution analysis, the gating module can automatically judge the problem type and flexibly select local retrieval or global retrieval strategies without relying on a complex Large Language Model (LLM) agent. Meanwhile, the hierarchical weighted retrieval mechanism has higher efficiency in time and calculation cost, and can remarkably improve the processing capacity of complex problems under the condition of not remarkably increasing the complexity of the system. Experimental results show that the method of the embodiment is superior to the existing most advanced method in various knowledge-intensive question-answering tasks, and reliable technical support is provided for practical application.
Referring to fig. 3, the embodiment of the present application further provides a text question-answering system based on semantic retrieval, which includes a data acquisition unit 100, a graph construction unit 200, a question judgment unit 300, a local retrieval unit 400, and a global retrieval unit 500, wherein:
A data acquisition unit 100 for acquiring a problem to be solved;
The map construction unit 200 is used for constructing a hierarchical knowledge map with a highly cohesive community;
a problem judging unit 300 for judging whether the problem to be solved belongs to a local problem or a global problem;
The local retrieval unit 400 is configured to perform a local search through the local search module if the problem to be solved belongs to a local problem, so as to obtain a plurality of answer texts similar to the problem to be solved;
The global retrieval unit 500 is configured to perform a global search in the hierarchical knowledge graph through the global retrieval module if the problem to be solved belongs to a global problem, obtain a plurality of communities related to the problem to be solved, use the community descriptions as context, and input the context into the large language model to generate an answer text.
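For ease of understanding, a minimal sketch of how these units could be wired together is given below; the class and method names are illustrative only, and the gating, local-retrieval and global-retrieval callables stand for the modules described above.

```python
class SemanticQASystem:
    """Illustrative wiring of the acquisition, graph construction, gating and retrieval units."""

    def __init__(self, graph_builder, gate, local_retriever, global_retriever, llm):
        self.graph = graph_builder()          # hierarchical knowledge graph with communities
        self.gate = gate                      # classifies a question as 'local' or 'global'
        self.local_retriever = local_retriever
        self.global_retriever = global_retriever
        self.llm = llm

    def answer(self, question):
        if self.gate(question) == "local":
            # Local question: KNN retrieval of the most similar text blocks.
            context = self.local_retriever(question)
        else:
            # Global question: weighted retrieval of relevant communities in the graph.
            context = self.global_retriever(question, self.graph)
        return self.llm(question=question, context=context)
```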
In some embodiments, the problem determination unit 300 may be specifically configured to:
constructing similarity distribution between the to-be-solved problem and a plurality of text blocks;
Calculating the ratio between the peak value of the similarity distribution and the secondary peak value of the similarity distribution to obtain peak value intensity;
calculating a second derivative of the similarity distribution at the peak value point to obtain the distribution sharpness;
and judging whether the problem to be solved belongs to a local problem or a global problem based on the peak intensity and the distribution sharpness.
In some embodiments, the problem determination unit 300 may be specifically configured to:
$$p(s) = \frac{1}{n\,h}\sum_{i=1}^{n} K\!\left(\frac{s - s_i}{h}\right);$$
Wherein $p(s)$ represents the similarity distribution, $n$ represents the number of text blocks and is a positive integer, $h$ represents the bandwidth parameter, $s$ represents a similarity score between the question to be answered and any text block, $s_i$ represents the similarity score between the question to be answered and the $i$-th text block, and $K(\cdot)$ represents a kernel function.
In some embodiments, the problem determination unit 300 may be specifically configured to:
$$C = \bigl|\,p''(s^{*})\,\bigr|;$$
$$p''(s^{*}) \approx \frac{p(s_{i+1}) - 2\,p(s_i) + p(s_{i-1})}{(\Delta s)^{2}},\quad s_i = s^{*};$$
Wherein $C$ represents the distribution sharpness, $p(s)$ represents the similarity distribution, $s^{*}$ represents the peak point, $p''(s^{*})$ represents the second derivative of the similarity distribution at the peak point, $s_{i+1}$ represents the similarity score between the question to be answered and the $(i+1)$-th text block, and $s_{i-1}$ represents the similarity score between the question to be answered and the $(i-1)$-th text block.
In some embodiments, the problem determination unit 300 may be specifically configured to:
presetting a first threshold value and a second threshold value;
if the peak intensity is greater than or equal to a first threshold value and the distribution sharpness is greater than or equal to a second threshold value, judging that the problem to be solved belongs to a local problem;
if the peak intensity is smaller than the first threshold value or the distribution sharpness is smaller than the second threshold value, judging that the problem to be solved belongs to the global problem.
In some embodiments, the global retrieval unit 500 may be specifically configured to:
In the hierarchical knowledge graph, each node corresponds to one community;
calculating the similarity between the problem to be solved and each node, and taking the similarity as an initial weight of each node;
Calculating the weighted community weight according to the initial weight;
Based on the weighted community weights, selecting a plurality of communities related to the to-be-solved problem.
In some embodiments, the global retrieval unit 500 may be specifically configured to:
$$w_p^{\text{new}} = \alpha\, w_p + (1-\alpha)\sum_{j=1}^{m}\beta_j\, w_{c_j};$$
Wherein $w_p^{\text{new}}$ represents the weighted community weight, $\alpha$ represents a coefficient for balancing direct similarity with inherited weights, $w_p$ represents the weight of the parent node, $w_{c_j}$ represents the weight of the $j$-th child node, $\beta_j$ represents a coefficient for adjusting the contribution of each child node to the parent node weight, and $m$ represents the number of child nodes.
It should be noted that, since a text question-answering system based on semantic search in the present embodiment and the above-mentioned text question-answering method based on semantic search are based on the same inventive concept, the corresponding content in the method embodiment is also applicable to the present system embodiment, and will not be described in detail here.
Referring to fig. 4, the embodiment of the present application further provides an electronic device, where the electronic device includes:
At least one memory;
At least one processor;
at least one program;
the programs are stored in the memory, and the processor executes the at least one program to implement the text question-answering method based on semantic retrieval described above.
The electronic device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a vehicle-mounted computer, and the like.
The electronic device according to the embodiment of the application is described in detail below.
The processor 1600 may be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., and is used for executing related programs to implement the technical solutions provided by the embodiments of the present disclosure;
The memory 1700 may be implemented in the form of read-only memory (Read Only Memory, ROM), static storage, dynamic storage, or random access memory (Random Access Memory, RAM). The memory 1700 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present disclosure are implemented in software or firmware, the related program code is stored in the memory 1700, and the processor 1600 invokes it to perform the text question-answering method based on semantic retrieval of the embodiments of the present disclosure.
An input/output interface 1800 for implementing information input and output;
The communication interface 1900 is used for realizing communication interaction between the device and other devices, and can realize communication in a wired manner (such as USB, network cable, etc.), or can realize communication in a wireless manner (such as mobile network, WIFI, bluetooth, etc.);
Bus 2000, which transfers information between the various components of the device (e.g., processor 1600, memory 1700, input/output interface 1800, and communication interface 1900);
wherein processor 1600, memory 1700, input/output interface 1800, and communication interface 1900 enable communication connections within the device between each other via bus 2000.
The disclosed embodiments also provide a storage medium that is a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the above-described text question-answering method based on semantic retrieval.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present disclosure are for more clearly describing the technical solutions of the embodiments of the present disclosure, and do not constitute a limitation on the technical solutions provided by the embodiments of the present disclosure, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not limit the embodiments of the present disclosure, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" is used to describe an association relationship of an associated object, and indicates that three relationships may exist, for example, "a and/or B" may indicate that only a exists, only B exists, and three cases of a and B exist simultaneously, where a and B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one of a, b or c may represent a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions for causing an electronic device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The storage medium includes various media capable of storing programs, such as a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
The embodiments of the present application have been described in detail with reference to the accompanying drawings, but the present application is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present application.