CN119202209A

CN119202209A - A method to improve the accuracy of knowledge question answering based on semantic elements

Info

Publication number: CN119202209A
Application number: CN202411713153.0A
Authority: CN
Inventors: 鲍钟峻; 黄升; 吴名朝; 刘军华; 覃国权
Original assignee: Whale Cloud Technology Co Ltd
Current assignee: Whale Cloud Technology Co Ltd
Priority date: 2024-11-27
Filing date: 2024-11-27
Publication date: 2024-12-27
Anticipated expiration: 2044-11-27
Also published as: CN119202209B

Abstract

The invention provides a method for improving knowledge question-answering accuracy based on semantic elements. The method comprises: extracting information from text data to construct semantic nodes; performing word segmentation on the question text input by a user to obtain a word sequence and form a semantic element representation of the question; applying a gold investment algorithm to perform knowledge retrieval in the knowledge recall stage; matching candidate answers against the semantic elements of the user question, then scoring and ranking them by keyword; constructing a prompt for a large model from the query intent of the user question and the ranked candidate answers; calling the large model to generate a final answer; and presenting the output answer in a designated format. By integrating semantic analysis, knowledge recall, natural language processing and related technologies, the method markedly improves the intelligence level and response accuracy of the knowledge question-answering system and provides higher-quality service to users.

Description

Knowledge question-answering accuracy improving method based on semantic elements

Technical Field

The invention relates to the field of natural language processing and intelligent question-answering systems, in particular to a knowledge question-answering accuracy improving method based on semantic elements.

Background

With the rapid development of large model technology, traditional IT applications are gradually being transformed into intelligent applications, such as data operation and maintenance question answering, self-service handling question answering, and intelligent access question answering. However, these intelligent question-answering applications face significant challenges at scale, especially in reconciling high accuracy, high performance and user interaction. The main problems of the prior art include:

First, organizing knowledge to a high standard is difficult, so intelligent question-answering systems struggle to manage large numbers of complex knowledge documents. Current knowledge organization methods fall into two types. One is low-cost document slicing with classified storage, which cannot manage domain knowledge effectively. The other is specialized organization based on a knowledge graph, which is costly to construct and requires continuous maintenance. As a result, knowledge organization lacks flexibility and adaptability, which impairs the understanding of subsequent user questions and the recall of knowledge.

Second, the accuracy of user intent understanding is low. Traditional intent recognition relies on manually enumerating user query scenarios, which incurs high maintenance cost, cannot exhaust all user intents, and suffers from conflicts between trigger patterns. This limitation degrades the user experience and increases the management burden on the system.

Third, efficient recall of knowledge is also problematic. Existing recall techniques typically rely on multiple parallel recall paths, which may return large amounts of information irrelevant to the user query, wasting system resources and reducing recall efficiency. The inefficiency of this process directly affects the speed and accuracy with which the user can obtain information.

In addition, knowledge ranking evaluation is not accurate enough. Existing ranking models usually score candidates according to their pre-training corpus, so ranking quality is poor for new knowledge or specialized fields, and irrelevant information may be displayed first, which impairs the effectiveness of user decisions.

Finally, the stability and pertinence of answer generation are insufficient. The answer formats generated by different large models for the same question are different, and consistency and adaptability of output cannot be ensured. This makes users often face the result of not meeting the demand when they seek an accurate answer, thereby affecting the response speed and processing efficiency of the overall service.

The problems can cause the accuracy and user satisfaction of the intelligent question-answering system in the practical application of the user to be greatly reduced, so that the efficiency of enterprises in the aspects of service response, information acquisition, decision support and the like is influenced, and the popularization and development of intelligent application are further restricted. Therefore, a novel method is needed to effectively solve the above technical problems, so as to improve the overall performance and user experience of the intelligent question-answering system.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a knowledge question-answering accuracy rate improving method based on semantic elements. The answer obtained by the user is accurate, accords with natural language expression habit, enhances interaction experience and reduces understanding difficulty of the user.

In order to achieve the above purpose, the invention provides a knowledge question-answering accuracy improving method based on semantic elements, which comprises the following steps:

Step 1, extracting key information from text data in related fields, constructing semantic nodes comprising entities, attributes and relations, forming a semantic element library, organizing the semantic element library through four layers of fields, classifications, carriers and knowledge, and supporting multi-level knowledge representation;

Step 2, performing word segmentation on the question text input by the user to obtain a word sequence, using a word segmentation tool in combination with a domain knowledge base;

Step 3, in the knowledge recall stage, applying a gold investment algorithm to perform knowledge retrieval, wherein the gold investment algorithm is as follows:

The method comprises the steps of regarding semantic elements as keywords (investors), regarding knowledge content in a knowledge base as channels (producers), distributing initial funds for each keyword, and simulating investment behaviors of each keyword;

the investment priority and the profit mechanism of the keywords are defined according to the principle that the number of channels is inversely proportional to the return on investment.

Circulating all keywords, searching one by one, and recording the benefits (the quality and the relativity of the searched knowledge) of each round of searching;

The keyword with the highest return rate is preferentially selected in each round of search to carry out the next search, so that the semantic element requirements in the user problems are met;

And stopping searching when the benefits of all the keywords reach the expected or no new benefits, and forming a candidate answer set.

Step 4, matching the candidate answers with semantic elements of the user questions, scoring and sorting according to the relevance of keywords, context information, semantic coverage rate and the like, selecting a plurality of most relevant answers through a hierarchical sorting mechanism, and simplifying;

Step 5, constructing a prompt for the large model according to the query intention of the user question and the ranked candidate answers, calling the large model to generate a final answer, and presenting the output answer in a specified format to ensure that the content meets the user's requirement.

Further, the step 1 specifically comprises the following steps:

Step 1.1, collecting text data in the field, including structured data and unstructured data;

Step 1.2, performing text preprocessing, performing format conversion on the collected data, and removing nonsensical characters, special symbols and noise information;

step 1.3, word segmentation is carried out on text data by using a word segmentation tool, and word segmentation results are optimized by combining a field word stock, so that the accuracy of word segmentation is ensured;

Step 1.4, extracting semantic nodes including entities (such as concept names), attributes (such as description fields) and relationships (such as logical associations between entities);

Step 1.5, according to semantic nodes, classifying the knowledge content according to four layers of 'fields', 'classifications', 'carriers' and 'knowledge';

step 1.6, constructing a semantic element library, combining semantic nodes with a hierarchical structure to form a multi-level knowledge representation system;

Step 1.7, periodically updating the semantic element library, and re-extracting the nodes and the relations according to newly added domain knowledge to ensure the timeliness and the accuracy of the library content.
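The four-layer organization of steps 1.5 and 1.6 can be pictured as a nested index. The following is a minimal sketch, not the patented implementation: the class names `SemanticNode` and `SemanticElementLibrary`, the methods `add` and `find`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticNode:
    """One semantic node: an entity with its attributes and relations."""
    entity: str
    attributes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (relation name, target entity)


@dataclass
class SemanticElementLibrary:
    """Nodes indexed by four layers: field > classification > carrier > knowledge."""
    tree: dict = field(default_factory=dict)

    def add(self, domain, classification, carrier, knowledge_id, node):
        layer = self.tree.setdefault(domain, {}).setdefault(classification, {})
        layer.setdefault(carrier, {}).setdefault(knowledge_id, []).append(node)

    def find(self, entity):
        """Return every four-layer path whose knowledge item mentions the entity."""
        hits = []
        for domain, classes in self.tree.items():
            for classification, carriers in classes.items():
                for carrier, items in carriers.items():
                    for knowledge_id, nodes in items.items():
                        if any(n.entity == entity for n in nodes):
                            hits.append((domain, classification, carrier, knowledge_id))
        return hits


# Hypothetical sample entry for an operations knowledge field.
library = SemanticElementLibrary()
library.add("operations", "interface", "billing-interface.docx", "block-1",
            SemanticNode("billing interface", {"source system": "CRM"},
                         [("feeds", "daily report")]))
```

Under this reading, the periodic update of step 1.7 amounts to calling `add` again for nodes extracted from newly added documents.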

Further, the step 2 specifically comprises the following steps:

step 2.1, receiving an original problem text input by a user, cleaning the original problem text, and removing nonsensical characters and format symbols;

step 2.2, word segmentation is carried out on the problem text by using a word segmentation tool, a word sequence is generated, and a word segmentation result is optimized by combining domain knowledge;

step 2.3, marking the parts of speech of the segmented text, and identifying grammar roles (such as nouns and verbs) of each word;

Step 2.4, generating a dependency graph by using a syntactic dependency analysis tool, and extracting syntactic dependency among words;

Step 2.5, extracting semantic elements of the question according to the analysis result, wherein the semantic elements comprise entities (specific objects involved in the question), attributes (modifying components in the question) and query intentions (operations required by the user);

Step 2.6, matching the extracted semantic elements with the semantic element library, and marking the semantic category of each element to form a structured semantic representation.
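Steps 2.2, 2.5 and 2.6 can be illustrated with a small dictionary-driven sketch. It swaps in forward maximum matching for the unspecified word segmentation tool and omits part-of-speech tagging and dependency analysis (steps 2.3 and 2.4); the lexicon, its category labels and the function names are assumptions made for illustration.

```python
def segment(text, lexicon, max_len=6):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def extract_elements(text, lexicon):
    """Label each segmented word with its category from the lexicon."""
    elements = {"entity": [], "attribute": [], "intent": [], "unknown": []}
    for word in segment(text, lexicon):
        elements[lexicon.get(word, "unknown")].append(word)
    return elements


# Hypothetical domain lexicon mapping terms to semantic categories.
lexicon = {"接口": "entity", "来源系统": "attribute", "查询": "intent"}
elements = extract_elements("查询接口的来源系统", lexicon)
```

Words left in the `unknown` list are the ones a real system would pass to further analysis or to a clarifying inquiry.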

Further, the step 3 is specifically as follows:

Step 3.1, initializing the retrieval process by decomposing the semantic elements of the question into a keyword set and mapping the keyword set to the corresponding keywords in the semantic element library;

step 3.2, defining search variables, taking keywords as 'investors', taking contents in a knowledge base as 'channels', and distributing initial funds for each keyword;

Step 3.3, defining investment and income rules, setting keyword retrieval priority, and preferentially processing keywords with higher channel income;

Step 3.4, circularly searching each keyword, and recording the searched knowledge content and benefits, including the relativity and coverage of the knowledge;

step 3.5, according to the return rate of each round of retrieval, the keyword with the highest return rate is preferentially selected for the next retrieval;

step 3.6, stopping searching when all keyword searching is completed or no new benefits exist, and forming a candidate answer set;

Step 3.7, preliminarily screening the contents of the candidate answer set, removing knowledge blocks that are repeated or of low relevance, and ensuring the quality of the recalled content.
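The text gives no formulas for funds, returns or stopping, so the following is only one possible reading of steps 3.1 to 3.6. It assumes that the initial return rate of a keyword is the reciprocal of its channel count, that one unit of funds buys one retrieval round of at most `per_round` blocks, and that the observed return is the share of new blocks gained. The function name, parameters and sample index are hypothetical.

```python
def gold_investment_recall(keywords, index, initial_funds=2, per_round=2):
    """Recall knowledge block ids; `index` maps keyword -> block ids (its channels)."""
    # Keywords without any channel cannot invest and are skipped.
    funds = {k: initial_funds for k in keywords if index.get(k)}
    # Assumed return model: fewer channels means a higher expected return.
    rate = {k: 1.0 / len(index[k]) for k in funds}
    recalled, log = [], []
    while any(f > 0 for f in funds.values()):
        # Invest with the funded keyword that currently has the highest return rate.
        best = max((k for k in funds if funds[k] > 0), key=lambda k: rate[k])
        funds[best] -= 1
        fresh = [b for b in index[best] if b not in recalled][:per_round]
        recalled.extend(fresh)
        log.append((best, fresh))  # record the gain of this round
        rate[best] = len(fresh) / len(index[best])
        if not fresh:  # no new gain: the keyword stops investing
            funds[best] = 0
    return recalled, log


# Hypothetical inverted index from keywords to knowledge block ids.
index = {
    "interface": ["b1", "b2", "b3", "b4"],
    "source system": ["b2", "b5"],
    "model": [],
}
keywords = ["interface", "source system", "model"]
recalled, log = gold_investment_recall(keywords, index, initial_funds=1)
```

With one unit of funds per keyword, the rarer keyword "source system" is retrieved first and the budget runs out before block b4 is reached, which illustrates how the funds cap recall cost. The screening of step 3.7 would then run on `recalled`.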

Further, the step 4 is specifically as follows:

step 4.1, receiving a candidate answer set for knowledge recall, and matching the answer with semantic elements of the user questions;

step 4.2, calculating the correlation between the answers and the questions, including keyword matching degree, semantic similarity and context consistency;

Step 4.3, scoring and grading the candidate answers according to the intention of the user question, and marking the priority of each answer;

Step 4.4, grading comparison is carried out on answers by using a four-level semantic system, and matching degrees of entities, attributes, relations and semantic structures are calculated in sequence;

step 4.5, sorting the candidate answers according to the comprehensive scores, and selecting the appointed number with the highest score as a final result;

Step 4.6, simplifying the ranked answers, retaining necessary context information and removing redundant content.
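Steps 4.2 to 4.5 can be sketched as a weighted coverage score followed by a top-N cut. The weights, the coverage measure and the data layout below are illustrative assumptions; the text does not fix how the matching degrees are combined.

```python
# Assumed weights for the element kinds compared in step 4.4.
WEIGHTS = {"entity": 0.5, "attribute": 0.3, "relation": 0.2}


def coverage(question_items, answer_items):
    """Share of the question's items that also appear in the answer."""
    wanted = set(question_items)
    if not wanted:
        return 0.0
    return len(wanted & set(answer_items)) / len(wanted)


def score(question, answer):
    """Weighted sum of per-kind coverage between question and answer elements."""
    return sum(weight * coverage(question.get(kind, []), answer.get(kind, []))
               for kind, weight in WEIGHTS.items())


def rank(question, candidates, top_n=2):
    """Sort candidates by composite score and keep the top_n best."""
    scored = [(score(question, c["elements"]), c["id"]) for c in candidates]
    scored.sort(key=lambda pair: -pair[0])
    return scored[:top_n]


# Hypothetical question elements and candidate answers.
question = {"entity": ["interface"], "attribute": ["source system"]}
candidates = [
    {"id": "b1", "elements": {"entity": ["interface"], "attribute": ["owner"]}},
    {"id": "b2", "elements": {"entity": ["interface"], "attribute": ["source system"]}},
    {"id": "b5", "elements": {"entity": ["report"], "attribute": ["source system"]}},
]
best = rank(question, candidates)
```

Block b2 matches both the entity and the attribute and is ranked first; b5 matches only the attribute and falls outside the top two.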

Further, the step 5 is specifically as follows:

Step 5.1, generating a prompt word for large model call according to the query intention of the user question and the ordered candidate answers;

Step 5.2, formatting the prompt word to ensure that the prompt word meets the input requirement of a large model (such as JSON structure or natural language description);

step 5.3, calling a large model to generate answers, and reorganizing key information in the candidate answers into complete natural language answers;

Step 5.4, carrying out grammar checking and consistency verification on the generated answers to ensure smooth language expression and accurate content;

Step 5.5, further processing the answer (such as list format, form display or paragraph description) according to the format requirement of the user problem;

Step 5.6, outputting the final answer, feeding it back to the user, and recording the answer generation process for subsequent optimization.

Further, user feedback is also included, specifically as follows:

Step 6.1, before the answer is generated, receiving supplementary input from the user, and re-splitting the semantic elements and re-running knowledge recall to ensure accurate understanding of the user's intention;

step 6.2, after providing the answer, the system allows the user to feed back the satisfaction and accuracy of the answer;

Step 6.3, collecting and analyzing user feedback data, and identifying common problems and weak links of the system;

Step 6.4, adjusting the question analysis method, the knowledge recall strategy and the answer generation mechanism according to the feedback result, and continuously optimizing system performance.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention provides a knowledge question-answering accuracy rate improving method based on semantic elements, which can systematically extract semantic dimensions, entities, attributes and relations in a specific field to form multi-level knowledge representation by constructing a semantic element library. The structured knowledge representation enables the system to comprehensively understand knowledge in the field, and improves accessibility and utilization efficiency of information.

2. The invention provides a knowledge question-answering accuracy rate improving method based on semantic elements, which can accurately identify key information in a problem by adopting a semantic element splitting technology in the analysis process of the user problem. The refined processing mode ensures that the system can better understand the query intention of the user, reduces ambiguity and improves the accuracy of problem identification.

3. The invention provides a knowledge question and answer accuracy rate improving method based on semantic elements, which is used for carrying out knowledge recall by applying a gold investment algorithm and can efficiently extract information related to user problems from a large number of knowledge blocks. Through continuous iterative optimization, the system can learn and improve the recall quality, thereby ensuring the timeliness and the relevance of the answer.

4. The invention provides a knowledge question-answering accuracy rate improving method based on semantic elements, wherein a multi-level weight comparison mechanism can accurately evaluate the advantages and disadvantages of candidate answers by comprehensively considering the similarity and the context correlation between the candidate answers and user questions. The method ensures that the answer returned finally is not only matched with the question, but also meets the specific requirements and the background of the user, and the method also has the capability of flexibly coping with various questions, and the system can be quickly adapted to the questions with different types and formats by constructing a question template. The characteristic enables the system to be more robust when processing diversified user queries, and meets wider application scenes.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the drawings needed in the embodiments or the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a semantic element based question-answering process;

FIG. 2 is a schematic diagram of automatic construction of a semantic element library;

FIG. 3 is a problem understanding flowchart;

FIG. 4 is a diagram of a golden recall algorithm;

FIG. 5 is a knowledge assessment algorithm diagram;

FIG. 6 is a schematic diagram of a semantic element construction flow;

FIG. 7 is a schematic diagram of a knowledge semantic base hierarchy;

FIG. 8 is a schematic diagram of document conversion and element recognition;

FIG. 9 is a schematic diagram of a Chinese semantic clause flow;

FIG. 10 is a schematic diagram of an intra-sentence semantic disambiguation flow;

FIG. 11 is a pre-complement semantic dependency analysis diagram;

FIG. 12 is a post-completion semantic dependency analysis diagram;

FIG. 13 is a schematic diagram of a semantic element extraction flow;

FIG. 14 is a schematic diagram of question understanding based on semantic elements.

Detailed Description

The technical solution of the present invention will be more clearly and completely explained by the description of the preferred embodiments of the present invention with reference to the accompanying drawings.

The invention provides a knowledge organization and knowledge recall method based on key semantic elements for a large-model knowledge question-answering scene, and the method can reduce the sorting cost of document knowledge and improve the accuracy of the large-model knowledge question-answering. The realization of the whole technical scheme comprises five steps of semantic element library construction, question understanding engineering, knowledge recall engineering, knowledge evaluation engineering and answer generation engineering, wherein the relation among the steps is shown in figure 1;

S1, automatic construction of the semantic element library

As shown in fig. 2, semantic elements are basic units of expression in natural language, which may be words, phrases, or sentences, used to convey specific information. In Natural Language Processing (NLP), the recognition and representation of semantic elements is the basis for tasks such as text analysis, information extraction, and knowledge representation. The automatic construction mechanism of the semantic element library comprises five steps of semantic dimension setting, document uploading, document slicing, syntactic dependency analysis and semantic element extraction, and finally the semantic elements stored are combined and expressed by using four layers of dimension, entity, modifier and relationship.

Question understanding engineering. As shown in fig. 3, understanding a question mirrors human reading comprehension: a reader first recognizes each word and sentence, then derives the meaning from the words read. On meeting an unrecognized word, the reader consults a dictionary or asks an experienced person. The invention understands a question in the same way: physical clauses are first split at punctuation marks, logical clauses are then split at logical connectives, elements are then split using the semantic element library, and finally the remaining part is split by natural language word segmentation. For any segmented word that the element library cannot interpret, a clarifying inquiry is required.

Knowledge recall engineering. As shown in fig. 4, to solve the slow recall performance and low accuracy of the prior art, the invention proposes a gold investment algorithm for knowledge recall. The algorithm addresses both the complexity of information retrieval and the complexity of integrating and filtering the retrieved information. Its core idea is to simulate the relationship between investment and channels in order to solve problems of traditional knowledge retrieval such as the inability to specify keyword weights and low recall efficiency. Through this investment-and-return analogy, the algorithm optimizes the keyword retrieval process and improves the efficiency and accuracy of knowledge recall. Specifically, each search keyword is regarded as an "enterprise" (i.e., an investor), and the retrieval channels in the knowledge base are regarded as "channels" (i.e., producers). Keywords perform knowledge retrieval by investing in channels, and the investment return of each round determines which keyword is retrieved next, finally achieving efficient and accurate knowledge recall. The platform grants each enterprise (semantic element) some starting funds (recall opportunities) and drives it to invest (retrieve) those funds sensibly to create more value, so that the knowledge required by knowledge engineering can be recalled at low cost within one or several iterations.

Knowledge evaluation engineering. As shown in fig. 5, to solve the problem that traditional ranking models are a black box to the user, the invention constructs a multi-level semantic weight comparison algorithm based on semantic elements. First, key elements are extracted from each knowledge block with the semantic element extraction tool of S1. Second, a word segmentation intersection algorithm computes the intersection of segmented words and its proportion; in this first-round evaluation, entity elements within a dimension are weighted several times higher than the dimension elements themselves. When first-round weights are equal, the semantic elements of the previous level are introduced for a second round of ranking. When the weights are still equal, the semantic elements of the parent directory are introduced, finally yielding the closest-matching order.
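The tie-breaking order described above maps naturally onto a tuple sort key. The sketch below is an illustration under assumptions: the field names `tokens`, `parent_tokens` and `directory_tokens` are hypothetical, and the higher weighting of entity elements over dimension elements is left out for brevity.

```python
def overlap_ratio(question_tokens, block_tokens):
    """Share of question tokens that also occur in the block (intersection ratio)."""
    wanted = set(question_tokens)
    if not wanted:
        return 0.0
    return len(wanted & set(block_tokens)) / len(wanted)


def rank_blocks(question_tokens, blocks):
    """Sort by the block's own overlap, then parent level, then parent directory."""
    def key(block):
        return (
            overlap_ratio(question_tokens, block["tokens"]),            # first round
            overlap_ratio(question_tokens, block["parent_tokens"]),     # first tie-break
            overlap_ratio(question_tokens, block["directory_tokens"]),  # second tie-break
        )
    return sorted(blocks, key=key, reverse=True)


# Hypothetical knowledge blocks with tokens from three levels.
question_tokens = ["interface", "source system"]
blocks = [
    {"id": "A", "tokens": ["interface"], "parent_tokens": ["report"],
     "directory_tokens": ["interface"]},
    {"id": "B", "tokens": ["interface"], "parent_tokens": ["source system"],
     "directory_tokens": []},
    {"id": "C", "tokens": ["interface", "source system"], "parent_tokens": [],
     "directory_tokens": []},
]
ordered = [b["id"] for b in rank_blocks(question_tokens, blocks)]
```

Blocks A and B tie in the first round, and the parent-level elements decide between them. Because every component of the key is a visible ratio, the resulting order can be explained to the user, which is the point of avoiding a black-box ranker.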

Answer generation engineering. To solve the problem that traditional large models understand raw source text poorly and generate unsuitable answers, the invention discloses an expression format that combines Markdown and JSON structures. For each knowledge block, the source of the passage is marked with #, the ranking score is marked with (), and the semantic elements that link the current knowledge block to the question are marked with []; each individual knowledge block is wrapped in <data></data>. Fixing this combined format and fine-tuning the large model together improve the model's understanding of the knowledge.
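A prompt using the four markers (#, (), [], and the data tags) might be assembled as follows. The order of the markers on the header line, the surrounding instruction text and the function names are assumptions; the text names only the markers themselves.

```python
def format_block(block):
    """Render one knowledge block: # source, (score), [elements], wrapped in data tags."""
    elements = ", ".join(block["elements"])
    header = f"# {block['source']} ({block['score']:.2f}) [{elements}]"
    return f"<data>\n{header}\n{block['text']}\n</data>"


def build_prompt(question, intent, blocks):
    """Assemble the large-model prompt from the question, intent and ranked blocks."""
    parts = [f"Question: {question}", f"Query intent: {intent}", "Knowledge:"]
    parts += [format_block(b) for b in blocks]
    parts.append("Answer using only the knowledge above.")
    return "\n".join(parts)


# Hypothetical ranked knowledge block.
ranked_blocks = [{
    "source": "ops-handbook.md",
    "score": 0.8,
    "elements": ["interface", "source system"],
    "text": "The billing interface reads from CRM.",
}]
prompt = build_prompt("Which source system feeds the billing interface?",
                      "query attribute", ranked_blocks)
```

Keeping the layout fixed across calls gives a fine-tuned model a stable structure to rely on, which is the stated purpose of the fixed combination.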

As a specific embodiment, the following is specific:

Constructing the semantic element library. Text corpora of a specific field, including books, papers, web pages and question-answer records, are collected so that a broad range of domain knowledge is covered. Text preprocessing, including denoising, word segmentation and part-of-speech tagging, is applied to the collected corpora so that the text structure can be identified clearly. On this basis, the entities in the text are identified, and their types, attributes and relations are extracted to form preliminary semantic dimensions. The identified entities and their relations are integrated into the semantic element library to build a multi-level knowledge representation comprising a concept layer, an entity layer and a relation layer.

Splitting the user question into semantic elements. The question text input by the user is received and cleaned to remove meaningless characters. Word segmentation is applied to the question text to generate a word sequence. Part-of-speech tagging identifies the part of speech of each word in the sequence to determine its grammatical role. The entities, attributes and query intent in the question text are extracted to form structured data, and their categories are marked.

Performing knowledge recall with the gold investment algorithm. Candidate knowledge blocks are retrieved from the semantic element library based on the query intent of the user question. Features are extracted from each candidate knowledge block to generate a feature vector. The similarity score between each candidate knowledge block and the user question is calculated from the feature vector, the scores are ranked, and the knowledge blocks with the highest scores are selected as candidate answers.

Performing multi-level weight comparison on the candidate answers based on the constructed semantic elements. The similarity between each candidate answer and the user question is evaluated preliminarily to obtain a similarity value. The context information of each candidate answer is extracted, and a context relevance score is calculated. The similarity value and the context relevance score are combined by weighting to produce a weighted score, and the optimal candidate answer is selected by that score to ensure the relevance and accuracy of the answer.

Generating a natural language answer according to the query intent of the user question and its related semantic elements. A logical framework for the answer is built from the query intent and the related semantic elements. Information is extracted from the selected knowledge blocks and integrated into the answer content. The generated answer is checked for grammar to ensure that it conforms to language norms, and the final answer is returned to the user.

Through the steps, the accuracy of knowledge question and answer can be effectively improved, and the system can more accurately understand and answer the questions of the user.

As an optional embodiment,

The semantics of a natural sentence consists of two parts, namely internal semantics and external semantics. The external semantic environment refers to the knowledge base directory, the document title, the in-document directory and the natural paragraph in which the sentence is located. The internal semantic environment refers to the semantic elements that make up the sentence, including words, phrases, and the like. The internal and external semantic environments of the sentence are used together to extract multi-level semantic elements, providing a retrieval basis for subsequent knowledge recall. Semantic element construction comprises eight steps: semantic extraction rule setting, document element identification, semantic clause splitting, local semantic disambiguation, local semantic completion, keyword semantic element extraction, entity semantic element extraction and fact semantic element extraction, as shown in fig. 6.

Semantic extraction rule setting. As shown in fig. 7, the rules comprise two parts. The first is the knowledge domain classification rule: knowledge bases and knowledge base directories are set up, and keyword labels are assigned to each knowledge base and directory. For example, an operation and maintenance knowledge base has directories such as interfaces, reports, indexes and labels, and each directory is described by keywords that capture its semantic features; the interface directory, for instance, carries keywords such as model and source system. The second is the semantic inheritance rule: knowledge blocks cut from a document inherit, from top to bottom, the semantic features of the knowledge base, the directory and the document. This step yields a basic knowledge semantic system made up of knowledge bases, directories and their features.
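The semantic inheritance rule can be read as a top-down merge of keyword labels. A minimal sketch follows; the function name and the sample keywords are hypothetical, and it assumes that duplicates keep their first (highest-level) position.

```python
def inherit_keywords(knowledge_base, directory, document, block_keywords):
    """Merge keyword labels top-down, keeping the first occurrence of each keyword."""
    merged = []
    for level in (knowledge_base, directory, document, block_keywords):
        for keyword in level:
            if keyword not in merged:
                merged.append(keyword)
    return merged


# Hypothetical labels at each level of an operations knowledge base.
block_features = inherit_keywords(
    ["operations"],                            # knowledge base
    ["interface", "model", "source system"],   # directory
    ["billing"],                               # document
    ["model", "timeout"],                      # the knowledge block itself
)
```

A block that mentions only "model" and "timeout" thereby becomes retrievable through the "interface" and "operations" keywords of its ancestors.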

Document conversion and element identification. As shown in fig. 8, doc or pdf files are converted to Markdown format using LlamaIndex or a similar document processing tool. On top of the document processing tool, the invention adds a directory generation enhancement method based on sequence rule identification. That is, when sequence markers such as 1, 2, 3 or 1), 2), 3) are encountered in the content under any directory, the platform splits the text using rules that are configurable in the background; the rule templates cover the sequence numbers commonly used in Word documents. This step yields a document directory tree containing the title, multi-level directories and document elements (natural paragraphs, pictures, tables).
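Sequence rule identification can be sketched with a small table of regular expressions. The two patterns and their names below are assumptions standing in for the configurable rule templates, which the text does not enumerate.

```python
import re

# Assumed rule templates; a real deployment would load these from configuration.
RULES = [
    # "1. Title" or "1、Title"; the lookahead rejects decimals such as "1.5".
    ("numbered", re.compile(r"^\s*(\d+)[.、](?!\d)\s*(\S.*)$")),
    # "1) Title" or "1）Title"
    ("paren", re.compile(r"^\s*(\d+)[)）]\s*(\S.*)$")),
]


def detect_headings(lines):
    """Return (rule name, sequence number, title) for each line that matches a rule."""
    headings = []
    for line in lines:
        for name, pattern in RULES:
            match = pattern.match(line)
            if match:
                headings.append((name, int(match.group(1)), match.group(2).strip()))
                break
    return headings


sample_lines = [
    "1. Overview",
    "Plain paragraph.",
    "1) First case",
    "2) Second case",
    "1.5 GB is required",
]
headings = detect_headings(sample_lines)
```

The detected headings can then be nested by rule type to extend the directory tree that the conversion tool produced.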

Chinese semantic clauses. As shown in fig. 9, HanLP or a similar tool splits Chinese text into clauses using punctuation rules. The Chinese punctuation marks used by the tool include the period (。), the exclamation mark (！) and the question mark (？). First, symbol content inside a text box is not split into semantic clauses. Second, a similarity coefficient is set for over-long sentences: the sentence is split at separable punctuation marks (commas), and semantic similarity is then used to judge the resulting parts recursively.
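The punctuation rules can be sketched with the standard `re` module instead of HanLP. The length limit `max_len` is an assumed parameter, and the recursive semantic-similarity judgment for over-long sentences is not implemented here; only the comma split is shown.

```python
import re


def split_sentences(text, max_len=20):
    """Split at 。！？ and split over-long sentences again at Chinese commas."""
    # Physical clauses: cut right after each sentence-ending mark.
    sentences = [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    result = []
    for sentence in sentences:
        if len(sentence) <= max_len:
            result.append(sentence)
        else:
            # Over-long sentence: cut right after each comma.
            result.extend(p for p in re.split(r"(?<=，)", sentence) if p.strip())
    return result


clauses = split_sentences("接口已上线。请问来源系统是什么？")
```

The lookbehind keeps each punctuation mark attached to the clause it ends, so the clauses can be rejoined without loss.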

Intra-sentence semantic disambiguation. As shown in fig. 10, the first N natural paragraphs of the article, of the current directory and of the upper-level directory are traversed to search for declarations of references. For example, the beginning of an article about a company may state that a certain company "is hereinafter referred to as the Company". This information is extracted from the text using prompt engineering for the large model. Before each extraction, the cache is checked; if the information has already been extracted, the large model is not called. This step yields the referents of the nouns used as references. jieba or a similar tool then loads the seed lexicon preset in the knowledge base and performs word segmentation to obtain a word segmentation array. A reference word library is provided, consisting of common reference words (e.g. "you", "me", "he", "it") and reference words specific to the document. The reference word library is compared with the word segmentation array; if a reference word is found, a large-model prompt is written to obtain the target word that it refers to, and the referent is inserted into the original text as a supplement. For example, for the sentence "a company does not hold shares of the Company", the enhanced sentence is "a company does not hold shares of the Company [full name]".
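The comparison of the reference word library with the word segmentation array can be sketched as below. The `resolve` callable is a hypothetical stand-in for the large-model prompt, and caching per reference word is an adaptation of the cache check described above; the company name in the example is invented for illustration.

```python
def complete_references(tokens, reference_words, resolve, cache):
    """Insert the resolved referent in brackets after each reference word."""
    output = []
    for token in tokens:
        output.append(token)
        if token in reference_words:
            if token not in cache:  # call the resolver at most once per word
                cache[token] = resolve(token)
            output.append(f"[{cache[token]}]")
    return output


calls = []


def fake_resolve(word):
    """Stand-in for the large-model call; records how often it is invoked."""
    calls.append(word)
    return "Example Co., Ltd."


tokens = ["the Company", "hired", "staff", "and", "the Company", "grew"]
enhanced = complete_references(tokens, {"the Company", "it"}, fake_resolve, {})
```

Although the reference word occurs twice, the resolver is invoked only once, which mirrors the goal of avoiding repeated large-model calls.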

Semantic element completion. As shown in fig. 11 and 12, a complete sentence is a description of a particular object or event. Sentences often lack semantic integrity because the key central entity is missing. For example, the sentence "date of employment: June 3, 2017" is semantically incomplete; the complete sentence should be "the date of employment of [subject] is June 3, 2017". The invention first preprocesses the sentence according to sentence denoising rules, which include replacing interfering symbols in the sentence with spaces and converting time expressions in specific formats. HanLP or a similar tool then performs semantic structure analysis to obtain a semantic structure tree. The tree is checked for a subject role and an object role. If neither is present, a large-model prompt, combined with the context (the first N natural paragraphs), supplies the key subject or object. In the example above, the completed sentence is "the date of employment of [subject] is June 3, 2017".

Semantic element extraction: as shown in fig. 13, a stop-word library is first created manually; the stop words include "have", "bar", and the like. The stop words in the sentence are replaced with spaces. The HanLP tool is then set to load the domain knowledge base terms, and HanLP or a similar named entity recognition tool is used to perform named entity recognition, giving a named entity set. Semantic analysis is then performed with HanLP or a similar semantic analysis tool to extract the relations between the subject roles and object roles of the named entities, giving a fact semantic element set. Next, HanLP or a similar word segmentation tool is used to segment the sentence, and the resulting tokens are merged with the tokens of the parent directory to obtain the keyword semantic elements. Finally, the entity semantic elements, keyword semantic elements and fact semantic elements are stored to form the semantic element library.
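One way to picture the stored result is a small record holding the three element types. The sketch below assumes that tokens, entities and facts have already been produced by an external tool such as HanLP; `SemanticElements` and `build_elements` are illustrative names, not part of the described system.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class SemanticElements:
    entities: Set[str] = field(default_factory=set)
    facts: Set[Fact] = field(default_factory=set)
    keywords: Set[str] = field(default_factory=set)


def build_elements(tokens: List[str], entities: Iterable[str], facts: Iterable[Fact],
                   parent_keywords: Iterable[str], stop_words: Set[str]) -> SemanticElements:
    """Drop stop words, then merge sentence tokens with the parent directory's keywords."""
    kept = [t for t in tokens if t not in stop_words]
    return SemanticElements(
        entities=set(entities),
        facts=set(facts),
        keywords=set(kept) | set(parent_keywords),
    )
```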

S2, question understanding based on semantic elements

As shown in fig. 14, when humans read and understand, they first simplify a complex question, then read each sub-question word by word to grasp the meaning of each word, and then reason. The invention simulates part of this reading behavior: a question is first parsed and decomposed into several atomic questions using HanLP or a similar tool, each atomic question is then decomposed into entity semantic elements, fact semantic elements and keyword semantic elements, and an understanding threshold is set. If the degree of understanding of any element exceeds the set threshold, the platform is considered to have understood the user's question. If the specified range is not reached, the platform initiates an active inquiry to clarify the words it could not understand. This is iterated until the user's question is understood.

1) Question decomposition: first, HanLP or a similar tool is used to split the question into physical clauses at punctuation marks (commas, periods, exclamation marks, line breaks, etc.); syntactic dependency analysis is then performed on each clause with HanLP or a similar tool, which yields a syntactic dependency tree. Each atomic question is checked for connectives such as "and" and "than" together with the nouns they connect; if present, the connective and the content immediately before and after it are removed, and each connected noun is substituted in turn to obtain two decomposed sentences.

2) Element decomposition: a stop-word library is built before use. The stop words in the sentence are replaced with spaces. The HanLP tool is then set to load the domain knowledge base terms, and HanLP or a similar named entity recognition tool is used to perform named entity recognition, giving a named entity set. Semantic analysis is then performed with HanLP or a similar semantic analysis tool to extract the relations between the subject roles and object roles of the named entities, giving a fact semantic element set. Next, HanLP or a similar word segmentation tool is used to segment the sentence, and the resulting tokens are merged with the tokens of the parent directory to obtain the keyword semantic elements. The result is the entity semantic elements, keyword semantic elements and fact semantic elements.

3) Question understanding: according to the configured question understanding rules, the entity semantic elements, keyword semantic elements and fact semantic elements are each compared against the original question to measure coverage. Semantic elements are compared first; when they cannot satisfy the comparison, synonyms of the semantic words are used; when synonyms still cannot be matched, recommended words whose semantic similarity exceeds a specified threshold are used. If the coverage is greater than the specified coefficient, question understanding ends. Otherwise the platform initiates an active inquiry, and after the user supplements the question, the large model + prompt mode is used to supplement the understanding of the isolated semantic element points, until the question coverage exceeds the specified threshold. When the number of iterations exceeds the configured maximum and the understanding requirement still cannot be met, the platform directly refuses to answer the question.
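The connective split in 1) and the coverage loop in 3) can be sketched together. The sketch makes several simplifying assumptions: questions are already tokenized, the connective list uses English stand-ins, the nouns joined by a connective are the tokens directly adjacent to it, and the similarity-based recommended-word fallback is omitted. `ask_user` is a hypothetical callback standing in for the active inquiry.

```python
from typing import Callable, Dict, List, Set

CONNECTIVES = {"and", "than"}  # assumed English stand-ins for the connectives


def split_on_connective(tokens: List[str]) -> List[List[str]]:
    """Turn 'X <connective> Y rest' into 'X rest' and 'Y rest'."""
    for i, tok in enumerate(tokens):
        if tok in CONNECTIVES and 0 < i < len(tokens) - 1:
            head, tail = tokens[:i - 1], tokens[i + 2:]
            return [head + [tokens[i - 1]] + tail, head + [tokens[i + 1]] + tail]
    return [tokens]  # no connective: already atomic


def coverage(tokens: List[str], elements: Set[str], synonyms: Dict[str, Set[str]]) -> float:
    """Fraction of question tokens matched by an element or one of its synonyms."""
    if not tokens:
        return 0.0
    known = set(elements)
    for e in elements:
        known |= synonyms.get(e, set())
    return sum(1 for t in tokens if t in known) / len(tokens)


def understand(tokens: List[str], elements: Set[str], synonyms: Dict[str, Set[str]],
               threshold: float, max_rounds: int,
               ask_user: Callable[[List[str], Set[str]], Set[str]]) -> str:
    """Check coverage, ask for clarification on failure, refuse after max_rounds."""
    known = set(elements)
    for _ in range(max_rounds):
        if coverage(tokens, known, synonyms) > threshold:
            return "understood"
        known |= ask_user(tokens, known)  # active inquiry adds clarified elements
    return "understood" if coverage(tokens, known, synonyms) > threshold else "refused"
```

Measuring coverage per token is only one possible reading of "coverage of the original question"; a character-level or weighted measure would fit the description equally well.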

S3, knowledge recall based on semantic elements

As follows from S1 and S2, the knowledge stored in the knowledge base has three index types: the original sentence index, the keyword index and the fact map index. These three indexes correspond to sentence semantic retrieval, key information retrieval and multi-hop reasoning retrieval respectively. Sentence semantic retrieval targets knowledge organized so that the answer can be found directly in the text, as in an FAQ. Multi-hop reasoning retrieval handles multi-hop questions, such as "the income of hospitals in which a certain equity investment fund has indirectly invested". All three retrieval modes are based on semantic elements and are implemented as follows:

1) Recall based on sentence semantics: compared with the traditional RAG semantic recall method, the improvement lies mainly in enhancing the semantic context of the original text. On the basis of the line index, the text is organized in markdown, and keyword semantic index information for the document, the directory and the knowledge base to which it belongs is added. An m3e vector model or a similar model is then used for vector embedding, and cosine similarity is used for comparison. The top N pieces of knowledge are recalled according to the threshold set by the system.
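The enrichment and similarity steps can be sketched as below. The markdown layout produced by `enrich` is an assumed format, and the vectors are plain Python sequences standing in for embeddings from m3e or a similar model; no embedding library is called here.

```python
import math
from typing import List, Sequence, Tuple


def enrich(sentence: str, document: str, directory: str, keywords: List[str]) -> str:
    """Wrap a sentence with document, directory and keyword context (assumed layout)."""
    return f"# {document}\n## {directory}\nKeywords: {', '.join(keywords)}\n\n{sentence}"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_top_n(query_vec: Sequence[float], chunks: List[Tuple[str, Sequence[float]]],
                 n: int, threshold: float) -> List[Tuple[float, str]]:
    """Keep chunks whose similarity reaches the threshold, best first, at most n."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:n]
```

In this sketch the enriched text, not the bare sentence, would be what gets embedded, so that document and directory keywords influence the similarity score.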

2) Recall based on the keyword gold investment algorithm: traditional knowledge retrieval mainly uses the BM25 algorithm, which cannot assign weights to individual keywords. The gold investment recall algorithm consists of the following 6 steps.

Step 1, establishing the two key variables of keyword retrieval: investment and channel. Here keyword semantic elements (i.e. the terms the question is concerned with) are modeled as enterprises, and the knowledge content to be retrieved is modeled as channels.

Step 2, giving each enterprise two identities: investor and producer. As an investor, an enterprise corresponds to the knowledge searches performed with a key element; as a producer, it corresponds to the valid knowledge content that each search brings back.

Step 3, setting the inverse relation between channels and production: each enterprise has several channels, which can be used for investment and for production promotion. The more investment channels an enterprise controls, the smaller the funds per channel and the weaker its investment intensity; conversely, the fewer the channels, the stronger its investment intensity. The more channels an enterprise has, the stronger its ability to realize returns.

Step 4, looping over all keywords to perform information retrieval, so that each enterprise collects the income of its previous investment from its channels and uses it as the reserve fund for the current investment. The overall effective income is then calculated, which covers two aspects: whether every enterprise obtained income, and the total amount of income.

Step 5, selecting the enterprise with the greatest investment intensity and immediately executing its investment (enterprises that have already invested are excluded; each has only one investment opportunity). If no suitable company can be found, the search counts as a failed investment, and its cost is recorded as well.

The implementation of the above 6 steps is illustrated as follows. Suppose the user question is "What is the revenue gap between entity A and entity B in a certain year?". It is first decomposed into two questions of the form "What is entity A's revenue gap in a certain year?", one for each entity, and then further decomposed into four semantic elements: entity A, entity B, a certain year and revenue gap. Here "entity A" and "entity B" are likened to channels, and each search is likened to an investment. Suppose entity A is used for information retrieval; if the retrieved result contains the three elements entity A, entity B and the year, the return on investment is considered high. The next search skips these three key elements and starts directly from "revenue gap", and so on until retrieval is complete.
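The selection-and-skip behavior of the steps above can be sketched in a few lines. This is an interpretation under stated assumptions, not the patented algorithm: investment intensity is taken to be 1 / (number of channels), a keyword's channels are the knowledge chunks that contain it, and a keyword is skipped once it appears in every chunk returned by an earlier search. The funds and income bookkeeping of steps 2 to 4 is left out, and `gold_investment_recall` is a hypothetical name.

```python
from typing import Dict, List, Set, Tuple


def gold_investment_recall(keywords: List[str],
                           chunks: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    """Return (recalled chunk ids, keywords actually searched).

    chunks maps a chunk id to the set of keywords that chunk contains.
    """
    # Channels of a keyword: the chunks that mention it.
    channels = {k: [cid for cid, kws in chunks.items() if k in kws] for k in keywords}

    def intensity(k: str) -> float:
        # Assumed inverse relation: fewer channels means stronger investment intensity.
        return 1.0 / len(channels[k]) if channels[k] else 0.0

    pending = set(keywords)
    recalled: List[str] = []
    searched: List[str] = []
    while pending:
        investor = max(sorted(pending), key=intensity)  # sorted() makes ties deterministic
        pending.discard(investor)                       # each keyword invests only once
        searched.append(investor)
        hits = channels[investor]
        if not hits:
            continue                                    # failed investment: nothing retrieved
        for cid in hits:
            if cid not in recalled:
                recalled.append(cid)
        # Return on investment: keywords present in every hit need no search of their own.
        pending -= set.intersection(*(chunks[cid] for cid in hits))
    return recalled, searched
```

With four keywords where one chunk already contains three of them, only two searches run instead of four, which mirrors the "skip the three key elements" behavior in the example.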

3) Recall based on fact maps: a fact map is organized as subject-relationship-object. The fact map obtained by decomposing the user's question is generalized with synonyms to obtain a new batch of facts. The top N records are then recalled from those that match fully, match on the subject, or match on the object.
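A minimal sketch of synonym generalization and tiered matching follows. The priority order (full match, then subject match, then object match) is an assumption drawn from the order in which the text lists them; `generalize` and `recall_facts` are illustrative names.

```python
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)


def generalize(fact: Fact, synonyms: Dict[str, Set[str]]) -> Set[Fact]:
    """Expand a fact by substituting synonyms for its subject and object."""
    s, r, o = fact
    subjects = {s} | synonyms.get(s, set())
    objects = {o} | synonyms.get(o, set())
    return {(a, r, b) for a in subjects for b in objects}


def recall_facts(query: Fact, store: List[Fact],
                 synonyms: Dict[str, Set[str]], n: int) -> List[Fact]:
    """Return up to n stored facts: full matches first, then subject, then object matches."""
    variants = generalize(query, synonyms)
    subjects = {v[0] for v in variants}
    objects = {v[2] for v in variants}

    def tier(f: Fact) -> int:
        if f in variants:
            return 0
        if f[0] in subjects:
            return 1
        if f[2] in objects:
            return 2
        return 3  # no match

    matched = [f for f in store if tier(f) < 3]
    matched.sort(key=tier)  # stable sort keeps store order within a tier
    return matched[:n]
```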

S4, knowledge rearrangement based on semantic elements

Knowledge rearrangement based on semantic elements mainly solves the problem of ranking, merging and filtering the combined results of the different recall modes. Compared with the traditional ranking-model scheme, the knowledge rearrangement of the invention ranks results by jointly considering factors such as entities, keywords and in-domain semantic features.

1) One-layer ranking: the keyword element set obtained by splitting the question is compared with the keyword element set of the knowledge block to be retrieved. The ranking score formula is:

One-layer ranking score = coefficient × number of entity intersections + number of keyword intersections / number of question keywords × 1

2) Two-layer ranking: building on step 1), the keyword set is further compared with the semantic feature keywords of the document directory in which the knowledge block is located. The score formula is:

Two-layer ranking score = coefficient × number of entity intersections + number of keyword intersections / number of question keywords × 0.1

3) Three-layer ranking: building on step 2), the keyword set is further compared with the semantic feature keywords of the knowledge base directory in which the knowledge block is located. The score formula is:

Three-layer ranking score = coefficient × number of entity intersections + number of keyword intersections / number of question keywords × 0.01

Composite score = one-layer ranking score + two-layer ranking score + three-layer ranking score
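The three-layer score can be computed as in the sketch below. It reads the layer weights as 1, 0.1 and 0.01 and applies the weight only to the keyword ratio, following ordinary operator precedence in the formulas above; the value of `coefficient` is not given in the text, so the default of 1.0 is an assumption.

```python
from typing import Sequence, Set, Tuple

# Weights for the knowledge block, its document directory, and its knowledge base directory.
LAYER_WEIGHTS = (1.0, 0.1, 0.01)


def layer_score(q_entities: Set[str], q_keywords: Set[str],
                t_entities: Set[str], t_keywords: Set[str],
                coefficient: float, weight: float) -> float:
    """coefficient * |entity overlap| + |keyword overlap| / |question keywords| * weight"""
    entity_hits = len(q_entities & t_entities)
    ratio = len(q_keywords & t_keywords) / len(q_keywords) if q_keywords else 0.0
    return coefficient * entity_hits + ratio * weight


def composite_score(q_entities: Set[str], q_keywords: Set[str],
                    layers: Sequence[Tuple[Set[str], Set[str]]],
                    coefficient: float = 1.0) -> float:
    """Sum the layer scores; layers holds (entities, keywords) for each of the three levels."""
    return sum(
        layer_score(q_entities, q_keywords, ents, kws, coefficient, w)
        for (ents, kws), w in zip(layers, LAYER_WEIGHTS)
    )
```

Because each lower layer is weighted ten times less than the one above it, directory-level keyword matches can break ties between knowledge blocks but rarely outweigh a direct match in the block itself.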

S5, answer generation based on semantic matching

In answer generation, the recall of sample questions is optimized mainly on the basis of semantic elements. The most suitable prompt-engineering samples for the large model are recalled by combining semantic recall with keyword feature recall. 1) Setting the prompt-engineering sample rules: keywords are generated for the answer formats, including markdown, list, attachment catalog and the like. 2) Semantic recall: the top N samples with the highest semantic relevance to the user's question are matched. 3) Secondary ranking using keyword features: a comprehensive ranking score is calculated with the ranking method of step S4, yielding the prompt sample questions ordered by ranking score.

The invention applies a production-line approach, engineering question answering into semantic element library construction, question understanding, knowledge recall, ranking evaluation and answer generation. After the user has posed a question, the parts work as follows. The semantic element library construction part can be applied to a knowledge base in any field, organizing and representing it with four layers of semantic elements: field, classification, carrier and knowledge. The question understanding part decomposes the user's question on the basis of the semantic element library to obtain a clear semantic element representation. Knowledge recall takes the semantic elements produced by the previous stage and applies the algorithm rules provided by that part to ensure that the relevant knowledge is retrieved and recalled. Ranking evaluation compares and ranks the retrieved content step by step against the four-level semantic system and finally returns the specified number of most suitable records. Finally, matching prompt words are provided according to the user's question, and a large model is used to organize and generate the answer. The effects are shown in Table 1:

TABLE 1

The above detailed description is merely illustrative of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Various modifications, substitutions and improvements of the technical scheme of the present invention will be apparent to those skilled in the art from the description and drawings provided herein without departing from the spirit and scope of the invention. The scope of the invention is defined by the claims.

Claims

1. The knowledge question-answering accuracy rate improving method based on the semantic elements is characterized by comprising the following steps of:

Step 1, extracting information from text data, constructing semantic nodes comprising entities, attributes and relations, forming a semantic element library, organizing the semantic element library through four layers of fields, classifications, carriers and knowledge, and supporting multi-level knowledge representation;

Step 2, word segmentation processing is carried out on the problem text input by the user, a word sequence is obtained, syntactic dependency analysis is carried out by using a word segmentation tool in combination with a domain knowledge base, key semantic dependencies are extracted, and semantic element representations of the problem are formed;

Step 3, in the knowledge recall stage, applying a golden investment algorithm to realize knowledge retrieval, which comprises: taking semantic elements as keywords, taking knowledge content in the knowledge base as channels, distributing initial funds to each keyword, and simulating keyword investment behaviors;

Defining investment priority and profit mechanism of the keywords according to the principle that the number of channels is inversely proportional to the return on investment;

Looping over all keywords, searching one by one, and recording the benefits of each round of searching;

The keyword with the highest return rate is preferentially selected in each round of search to carry out the next search;

Stopping searching when the benefits of all the keywords reach the expected or no new benefits, and forming a candidate answer set;

Step 4, matching the candidate answers with semantic elements of the user question, scoring and sorting according to the keywords, selecting a plurality of most relevant answers through a hierarchical sorting mechanism, and simplifying;

Step 5, constructing a prompt word of the large model according to the query intention of the user question and the sorted candidate answers, calling the large model to generate a final answer, and presenting the output answer according to a specified format.

2. The knowledge question-answering accuracy improving method based on semantic elements according to claim 1, wherein the step 1 is specifically as follows:

Step 1.4, extracting semantic nodes comprising entities, attributes and relations;

3. The knowledge question-answering accuracy improving method based on semantic elements according to claim 1, wherein the step 2 is specifically as follows:

Step 2.3, marking the parts of speech of the segmented text, and identifying the grammar role of each word;

Step 2.5, extracting semantic elements of the problem according to the analysis result, wherein the semantic elements comprise entities, attributes and query intents;

4. The knowledge question-answering accuracy improving method based on semantic elements according to claim 1, wherein the step 3 is specifically as follows:

Step 3.3, defining investment and income rules, setting keyword retrieval priority, and preferentially processing keywords with high channel income;

Step 3.7, carrying out preliminary screening on the contents in the candidate answer set, and removing knowledge blocks that have low correlation or are repeated.

5. The knowledge question-answering accuracy improving method based on semantic elements according to claim 1, wherein the step 4 is specifically as follows:

Step 4.6, simplifying the sorted answers, retaining context information and removing redundant content.

6. The knowledge question-answering accuracy improving method based on semantic elements according to claim 1, wherein the step 5 is specifically as follows:

Step 5.2, formatting the prompt word;

Step 5.4, carrying out grammar checking and consistency verification on the generated answers;

Step 5.5, processing the answer according to the format requirement of the user problem;

7. The knowledge question-answering accuracy improvement method based on semantic elements according to claim 1, further comprising user feedback, specifically comprising:

Step 6.1, before generating the answer, receiving the supplement answer of the user, and re-splitting the semantic elements and recalling the knowledge;

Step 6.2, after providing the answer, allowing the user to feed back the satisfaction and accuracy of the answer;