Intelligent prospecting question-answering system based on LLM and RAG
Technical Field
The invention relates to the technical field of prospecting and large models, in particular to an intelligent prospecting question-answering system based on LLM and RAG.
Background
Existing general-purpose large language models, such as ChatGPT, ChatGLM and Qwen, are built on the Transformer architecture, a deep neural network that incorporates a self-attention mechanism. Pre-trained on massive amounts of data, these models exhibit excellent language generation capabilities. In the earth sciences there are already GeoGPT for geospatial data collection, processing and analysis and a large model for meteorological research, but the field of prospecting currently has no mature model. Owing to the limitations of their training data, general-purpose models show deviations or errors when handling input tasks, maintaining the contextual consistency of their output and staying consistent with real-world facts; these errors are the so-called "hallucinations".
There are two common methods for building a large language model for a vertical domain. The first is fine-tuning (shown in figure 1): a pre-trained model is trained a second time on data from the target field. This method is costly and time-consuming, requires high-end hardware and has poor plasticity, so the model cannot be updated quickly as the data changes. The second combines retrieval augmentation over an external domain knowledge base with prompt engineering for a general-purpose large language model (shown in figure 2). This method has low hardware requirements, is accurate and practical, and can be updated and iterated rapidly as the data changes. Retrieval augmentation means that a knowledge base query is triggered by the content of the question and the query results are supplied to a general-purpose large model as additional knowledge, so that a better answer is generated. Common retrieval augmentation schemes include recommending knowledge base data through a purpose-built trigger mechanism, building database search indexes, applying text similarity algorithms, and re-ranking query results with machine learning, deep learning or ensemble learning algorithms.
The key technologies for building a vertical-domain knowledge base are data collection, data cleaning and data storage, and the authenticity, accuracy and specificity of the data are particularly important. At present there is no unified standard for these technologies, particularly in vertical disciplines. Some knowledge bases are built on graph databases such as Neo4j, and others on relational databases or vector databases. The official documentation of the open-source vector database Milvus describes how to create and manage a vector database, but not how to clean massive text data or build data templates from it. A knowledge base created directly from raw data lacks knowledge emphasis and context, which makes later queries inaccurate and thereby degrades the answers.
Moreover, once the professional knowledge has been stored, the most important task is to establish an accurate match between the user's question and the knowledge in the database. This match directly determines the accuracy of the answer and is therefore the most important factor in avoiding hallucinations of a general-purpose large language model. Conventional methods include recommending knowledge base data through a purpose-built trigger mechanism, building database query indexes, applying text similarity algorithms, and re-ranking query results with machine learning, deep learning or ensemble learning algorithms. All of these schemes operate outside the data. Once the data has been stored in the knowledge base, the knowledge text carries no key information labels, so the query results can be inaccurate at the level of the data itself, which weakens the relevance between the question and the retrieval results.
Disclosure of Invention
The invention aims to provide an intelligent prospecting question-answering method based on LLM and RAG that solves the hallucination problem of existing general-purpose large language models in fields of professional knowledge, and further to provide an intelligent prospecting question-answering system based on LLM and RAG and a computer medium.
In order to solve the technical problems, the technical scheme of the invention is as follows:
The first aspect of the invention provides an intelligent prospecting question-answering method based on LLM and RAG, which comprises the following steps:
extracting the text content of documents in the field of geological deposits;
performing sentence word segmentation on the text content, and performing keyword recognition on the segmentation result, to obtain text content containing keyword information labels;
obtaining, according to preset keyword weights, text embedding vectors containing the keyword information labels and their corresponding weights;
storing the text embedding vectors in a vector database to obtain a knowledge base for the field of geological deposits;
performing sentence word segmentation and keyword recognition on a question text input to the large language model, and vectorizing the recognized question text;
querying the knowledge base for the knowledge text most relevant to the question text according to a vector similarity metric, to obtain a prompt;
outputting, by the large language model, an answer to the question according to the question text and the prompt.
Further, extracting the text content of documents in the field of geological deposits comprises:
extracting long texts from the documents in the field of geological deposits;
dividing the extracted long text into paragraphs of a specified size using the RecursiveCharacterTextSplitter class of the langchain library, removing non-printable characters, page numbers, headers and footers using regular expressions, and replacing runs of whitespace with a single space, to obtain the text content of the documents in the field of geological deposits.
Further, performing sentence word segmentation on the text content and performing keyword recognition on the segmentation result to obtain text content containing keyword information labels comprises:
performing sentence word segmentation on the text content using a pre-trained sentence word segmentation model, and performing keyword recognition on the segmentation result using a pre-trained keyword recognition model, to obtain the text content containing keyword information labels.
Further, the pre-trained sentence word segmentation model is obtained as follows:
taking a BERT model as the base model and training it with first preset label samples to obtain the pre-trained sentence word segmentation model, wherein the first preset label samples are created according to the description levels of the geological deposit field, the description levels comprise deposit type, geological structure, mineral and rock assemblage, geophysical prospecting anomaly, mineralization type and mineralization age, and each first preset label sample comprises a text and its corresponding word segmentation.
Further, the pre-trained keyword recognition model is obtained as follows:
taking a BERT model as the base model and training it with second preset label samples to obtain the pre-trained keyword recognition model, wherein the second preset label samples are created according to a corpus of common keywords in the geological deposit field, the corpus comprises 7 categories, namely mineral name, place name, person name, time name, stratum name, structure name and other nouns, and each second preset label sample comprises a keyword and its corresponding category.
Further, the preset keyword weights are assigned as follows:
each keyword is given a weight according to its category, wherein keywords in the mineral name, time name, stratum name and structure name categories have a weight of 0.2, keywords in the person name category have a weight of 0.1, and keywords in the place name and other noun categories have a weight of 0.05.
Further, obtaining text embedding vectors containing keyword information labels according to the preset keyword weights comprises:
assigning a weight to each keyword according to its category, and then performing word embedding through a BERT model, to obtain the text embedding vectors containing the keyword information labels.
Further, performing sentence word segmentation and keyword recognition on the question text input to the large language model comprises:
performing sentence word segmentation on the question text using the pre-trained sentence word segmentation model, and performing keyword recognition on the segmentation result of the question text using the pre-trained keyword recognition model.
The second aspect of the invention provides an intelligent prospecting question-answering system based on LLM and RAG, which comprises:
an extraction module, used for extracting the text content of documents in the field of geological deposits;
a first keyword recognition module, used for performing sentence word segmentation on the text content and performing keyword recognition on the segmentation result to obtain text content containing keyword information labels;
a word embedding module, used for obtaining, according to preset keyword weights, text embedding vectors containing the keyword information labels and their corresponding weights;
a knowledge base module, used for storing the text embedding vectors in a vector database to obtain a knowledge base for the field of geological deposits;
a second keyword recognition module, used for performing sentence word segmentation and keyword recognition on a question text input to the large language model and vectorizing the recognized question text;
a prompt module, used for querying the knowledge base for the knowledge text most relevant to the question text according to a vector similarity metric, to obtain a prompt;
an output module, used for outputting, by means of the large language model, an answer to the question according to the question text and the prompt.
A third aspect of the present invention provides a computer medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the intelligent prospecting question-answering method based on LLM and RAG.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
The invention starts from inside the data: it processes the context information and the keywords in the data and stores them in a vector database to obtain a knowledge base for the field of geological deposits. This achieves a good match between user questions and the knowledge base, improves retrieval accuracy and avoids the hallucination problem of general-purpose large language models in fields of professional knowledge.
Drawings
FIG. 1 is an architecture diagram of a prior-art vertical-domain large language model built by fine-tuning;
FIG. 2 is an architecture diagram of a prior-art vertical-domain large language model built by retrieval augmentation with an external domain knowledge base;
FIG. 3 is a schematic flow chart of an intelligent prospecting question-answering method based on LLM and RAG according to an embodiment of the present invention;
FIG. 4 is a schematic illustration of extracting the text content of a document in the field of geological deposits according to an embodiment of the present invention;
FIG. 5 is a model parameter diagram of the pre-trained sentence word segmentation model according to an embodiment of the present invention;
FIG. 6 is a model parameter diagram of the pre-trained keyword recognition model according to an embodiment of the present invention;
FIG. 7 is a question-answering flow chart of the intelligent prospecting question-answering method based on LLM and RAG according to an embodiment of the present invention;
FIG. 8 is a chart of the answer evaluation metrics of the large language models according to an embodiment of the present invention;
FIG. 9 is a schematic block diagram of an intelligent prospecting question-answering system based on LLM and RAG according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
It will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment of the invention provides an intelligent prospecting question-answering method based on LLM and RAG which, as shown in figure 3, comprises the following steps:
extracting the text content of documents in the field of geological deposits;
performing sentence word segmentation on the text content, and performing keyword recognition on the segmentation result, to obtain text content containing keyword information labels;
obtaining, according to preset keyword weights, text embedding vectors containing the keyword information labels and their corresponding weights;
storing the text embedding vectors in a vector database to obtain a knowledge base for the field of geological deposits;
performing sentence word segmentation and keyword recognition on a question text input to the large language model, and vectorizing the recognized question text;
querying the knowledge base for the knowledge text most relevant to the question text according to a vector similarity metric, to obtain a prompt;
outputting, by the large language model, an answer to the question according to the question text and the prompt.
In a further embodiment, extracting the text content of documents in the field of geological deposits comprises:
extracting long texts from the documents in the field of geological deposits;
dividing the extracted long text into paragraphs of a specified size using the RecursiveCharacterTextSplitter class of the langchain library; removing non-printable characters, page numbers, headers, footers and other useless information using regular expressions; and replacing runs of whitespace with a single space, which reduces redundant whitespace and makes the text more compact and easier to process. This processing yields the text content of the documents in the field of geological deposits. As shown in fig. 4, the extraction clearly retains the main content of the document and prepares the data for the subsequent construction of the knowledge base.
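The chunk-and-clean step described above can be sketched as follows. This is a minimal illustration using only the Python standard library: `split_into_chunks` is a simplified stand-in for langchain's RecursiveCharacterTextSplitter, and the regular expressions (in particular the `Page N` pattern for page numbers) are assumptions, since the actual header, footer and page-number formats depend on the source documents.

```python
import re

# Hypothetical patterns; real page-number and header/footer formats vary by document.
PAGE_NUMBER = re.compile(r"^\s*(?:Page\s+)?\d+\s*$", re.MULTILINE)
NON_PRINTABLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
WHITESPACE = re.compile(r"\s+")


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly chunk_size characters.

    Simplified stand-in for a recursive splitter: an oversized paragraph
    is kept whole rather than split further.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) + 2 > chunk_size:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks


def clean_chunk(chunk: str) -> str:
    """Remove non-printable characters and page-number lines, collapse whitespace."""
    chunk = NON_PRINTABLE.sub("", chunk)
    chunk = PAGE_NUMBER.sub("", chunk)
    return WHITESPACE.sub(" ", chunk).strip()


def extract_text(text: str, chunk_size: int = 500) -> list[str]:
    """Split first, then clean each chunk, dropping chunks that end up empty."""
    cleaned = (clean_chunk(chunk) for chunk in split_into_chunks(text, chunk_size))
    return [chunk for chunk in cleaned if chunk]
```

The order (split, strip debris, collapse whitespace) follows the order given in the text; in practice the patterns would be tuned to the layout of the documents being ingested.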
In a further embodiment, performing sentence word segmentation on the text content and performing keyword recognition on the segmentation result to obtain keywords comprises:
Because the information in a sentence is often carried by its keywords, finding the keywords is an important task. Sentence word segmentation is performed on the text content using a pre-trained sentence word segmentation model, and keyword recognition is performed on the segmentation result using a pre-trained keyword recognition model, to obtain the keywords.
In a further embodiment, the pre-trained sentence word segmentation model is obtained as follows:
taking a BERT model as the base model and training it with first preset label samples to obtain the pre-trained sentence word segmentation model, wherein the first preset label samples are created according to the description levels of the geological deposit field, the description levels comprise deposit type, geological structure, mineral and rock assemblage, geophysical prospecting anomaly, mineralization type and mineralization age, and each first preset label sample comprises a text and its corresponding word segmentation.
In this embodiment, a BERT model is used as the base model and is fine-tuned with the first preset label samples to obtain the sentence word segmentation model. Taking full account of information such as deposit type, geological structure, mineral and rock assemblage, geophysical prospecting anomaly, mineralization type and mineralization age, a total of 900 label samples (see table 1) were created according to the description levels commonly used in deposit science and used to fine-tune the sentence word segmentation model. Because the bidirectional Transformer architecture of BERT is highly effective at context understanding and word segmentation, the fine-tuned model achieves a word segmentation accuracy above 99%; the model parameters are shown in fig. 5.
TABLE 1 Training data for the sentence word segmentation model
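As an illustration of how a label sample consisting of a text and its word segmentation might be encoded for fine-tuning a token-level model, the sketch below converts a segmented sentence into per-character tags. The B/M/E/S tag scheme (begin, middle, end, single) and the example phrase are assumptions made for illustration; the source does not state which encoding was used.

```python
def segmentation_to_tags(words: list[str]) -> list[tuple[str, str]]:
    """Convert a segmented sentence into (character, tag) pairs.

    Tags: B = first character of a word, M = middle, E = last,
    S = single-character word. Empty strings are skipped.
    """
    tagged = []
    for word in words:
        if not word:
            continue
        if len(word) == 1:
            tagged.append((word, "S"))
            continue
        tagged.append((word[0], "B"))
        tagged.extend((char, "M") for char in word[1:-1])
        tagged.append((word[-1], "E"))
    return tagged


# Hypothetical sample: "斑岩型" (porphyry-type), "铜矿床" (copper deposit),
# "的" (of), "成矿时代" (mineralization age).
sample_words = ["斑岩型", "铜矿床", "的", "成矿时代"]
sample_tags = segmentation_to_tags(sample_words)
```

Each (character, tag) pair then serves as one training target for a token classification head on top of the base model.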
In a further embodiment, the pre-trained keyword recognition model is obtained as follows:
taking a BERT model as the base model and training it with second preset label samples to obtain the pre-trained keyword recognition model, wherein the second preset label samples are created according to a corpus of common keywords in the geological deposit field, the corpus comprises 7 categories, namely mineral name, place name, person name, time name, stratum name, structure name and other nouns, and each second preset label sample comprises a keyword and its corresponding category.
In this embodiment, a corpus of 1400 keywords common in the geological field is used (see table 2). Because questions and answers are generally structured around who, where, when, what was done (or happened) and why, the keywords are divided into the 7 categories in table 2. The keyword recognition model is then obtained by fine-tuning the BERT model; its recognition accuracy is above 95%, and the other training parameters are shown in fig. 6.
TABLE 2 Training data for the keyword recognition model
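The label samples described above, each a keyword paired with one of the seven categories, could be represented as follows before fine-tuning a classifier. This is only a sketch: the enum, the three example keywords and the integer class ids are hypothetical and are not taken from the training data.

```python
from enum import Enum


class KeywordCategory(Enum):
    """The seven keyword categories named in the text."""
    MINERAL = "mineral name"
    PLACE = "place name"
    PERSON = "person name"
    TIME = "time name"
    STRATUM = "stratum name"
    STRUCTURE = "structure name"
    OTHER = "other noun"


# Hypothetical label samples: (keyword, category).
SAMPLES = [
    ("chalcopyrite", KeywordCategory.MINERAL),
    ("Jurassic", KeywordCategory.TIME),
    ("fault zone", KeywordCategory.STRUCTURE),
]

# Class ids follow the definition order of the enum members.
LABEL2ID = {category: index for index, category in enumerate(KeywordCategory)}
ID2LABEL = {index: category for category, index in LABEL2ID.items()}


def encode_samples(samples):
    """Turn (keyword, category) pairs into (keyword, class id) training pairs."""
    return [(keyword, LABEL2ID[category]) for keyword, category in samples]
```

A classification head with seven outputs would be trained on pairs of this form; `ID2LABEL` maps a predicted class id back to its category.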
In a further embodiment, the preset keyword weights are assigned as follows:
each keyword is given a weight according to its category, wherein keywords in the mineral name, time name, stratum name and structure name categories have a weight of 0.2, keywords in the person name category have a weight of 0.1, and keywords in the place name and other noun categories have a weight of 0.05, as shown in table 3.
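TABLE 3 Keyword categories and their weights
Category          Weight
Mineral name      0.2
Time name         0.2
Stratum name      0.2
Structure name    0.2
Person name       0.1
Place name        0.05
Other nouns       0.05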
In a further embodiment, obtaining text embedding vectors containing keyword information labels according to the preset keyword weights comprises:
assigning a weight to each keyword according to its category, and then performing word embedding through a BERT model, to obtain the text embedding vectors containing the keyword information labels.
With this processing, the embedding vector contains not only the context information but also the keyword information. The vectors are then stored in the vector database, and the resulting knowledge base for the field of geological deposits achieves a good match between user questions and the knowledge base, improves retrieval accuracy and avoids the hallucination problem of general-purpose large language models in fields of professional knowledge.
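One possible reading of the weighting step is sketched below. The category weights are those stated in the text; everything else is an assumption. In particular, `toy_embed` is a deterministic hash-based stand-in for a BERT encoder, and adding `weight * keyword_embedding` to the text embedding is only one plausible way of combining them, since the source does not specify how the weights enter the vector.

```python
import hashlib
import math

# Category weights as stated in the text.
CATEGORY_WEIGHTS = {
    "mineral name": 0.2,
    "time name": 0.2,
    "stratum name": 0.2,
    "structure name": 0.2,
    "person name": 0.1,
    "place name": 0.05,
    "other noun": 0.05,
}


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in for a BERT encoder (dim must be <= 32)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def weighted_text_embedding(text, keywords, dim=8):
    """Add weight * keyword embedding to the text embedding, then L2-normalise.

    keywords is a list of (keyword, category) pairs from keyword recognition.
    """
    vector = toy_embed(text, dim)
    for keyword, category in keywords:
        weight = CATEGORY_WEIGHTS[category]
        keyword_vector = toy_embed(keyword, dim)
        vector = [v + weight * k for v, k in zip(vector, keyword_vector)]
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]
```

Under this assumption, keywords in high-weight categories shift the stored vector more strongly, so a question containing the same keywords lands closer to the matching knowledge text.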
In a further embodiment, performing sentence word segmentation and keyword recognition on the question text input to the large language model comprises:
performing sentence word segmentation on the question text using the pre-trained sentence word segmentation model, and performing keyword recognition on the segmentation result of the question text using the pre-trained keyword recognition model.
In this embodiment, when a user asks a question, word segmentation and keyword recognition are likewise performed on the question, and the knowledge base is then queried according to the vector similarity metric. The specific flow is shown in fig. 7.
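The query step can be sketched as a nearest-neighbour search followed by prompt assembly. Cosine similarity is assumed here as the vector metric, the in-memory list stands in for the vector database, and the prompt wording is hypothetical; none of these details is fixed by the source.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, knowledge_base, top_k=1):
    """Return the top_k knowledge texts ranked by similarity to the query.

    knowledge_base is a list of (text, vector) pairs.
    """
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine_similarity(query_vector, entry[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved knowledge texts and the question into one prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Reference knowledge:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The string returned by `build_prompt` is what would be passed to the large language model together with the question.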
In a further embodiment, the method of this embodiment can be developed into a human-computer interactive intelligent prospecting robot based on a microservice architecture.
In a specific embodiment, the constructed vertical-domain large model for prospecting was verified by question answering and compared with representative current large language models that have real-time networking functions, such as ChatGPT, Kimi and similar models. 300 questions from the field of prospecting in key mining areas were used, and 300 reference answers were written according to expert guidance. The questions were then put to each model and to the intelligent prospecting robot, and the answers were evaluated with the BERTScore metric. The results are shown in fig. 8. The comparison shows that the intelligent prospecting question-answering method of this embodiment scores clearly higher than the other models on the two metrics Precision and F1, and also holds an advantage on Recall. Judging from the answers, the method effectively avoids hallucinations when answering professional questions, whereas the answers of the general-purpose large language models mix real data with hallucinated data.
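For reference, the Precision, Recall and F1 values reported by BERTScore are computed by greedy matching of token embeddings between a candidate answer and a reference answer. The sketch below shows that computation on pre-computed token vectors; it assumes the embeddings are supplied by the caller, omits the optional idf weighting, and is not the evaluation code used in this embodiment.

```python
import math


def _normalise(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def greedy_bert_score(candidate, reference):
    """Return (precision, recall, f1) from token embeddings.

    candidate and reference are non-empty lists of token vectors.
    Precision averages, over candidate tokens, the best cosine similarity
    to any reference token; recall does the same over reference tokens.
    """
    cand = [_normalise(v) for v in candidate]
    ref = [_normalise(v) for v in reference]
    sim = [[sum(a * b for a, b in zip(c, r)) for r in ref] for c in cand]
    precision = sum(max(row) for row in sim) / len(cand)
    recall = sum(max(sim[i][j] for i in range(len(cand)))
                 for j in range(len(ref))) / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A candidate that adds unsupported tokens lowers Precision, while one that omits reference content lowers Recall, which is why the two metrics are reported separately.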
Example 2
This embodiment of the invention provides an intelligent prospecting question-answering system based on LLM and RAG, which implements the question-answering method of embodiment 1 and, as shown in figure 9, comprises:
an extraction module, used for extracting the text content of documents in the field of geological deposits;
a first keyword recognition module, used for performing sentence word segmentation on the text content and performing keyword recognition on the segmentation result to obtain text content containing keyword information labels;
a word embedding module, used for obtaining, according to preset keyword weights, text embedding vectors containing the keyword information labels and their corresponding weights;
a knowledge base module, used for storing the text embedding vectors in a vector database to obtain a knowledge base for the field of geological deposits;
a second keyword recognition module, used for performing sentence word segmentation and keyword recognition on a question text input to the large language model and vectorizing the recognized question text;
a prompt module, used for querying the knowledge base for the knowledge text most relevant to the question text according to a vector similarity metric, to obtain a prompt;
an output module, used for outputting, by means of the large language model, an answer to the question according to the question text and the prompt.
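The way the modules listed above hand data to one another can be sketched as a single class whose fields are pluggable callables. All names here are hypothetical, the in-memory list stands in for the vector database, and a plain dot product is assumed as the vector metric; the sketch shows the data flow only, not an implementation of the individual modules.

```python
from dataclasses import dataclass, field
from typing import Callable

Vector = list[float]
Keywords = list[tuple[str, str]]  # (keyword, category) pairs


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


@dataclass
class ProspectingQASystem:
    extract: Callable[[str], list[str]]       # extraction module
    recognise: Callable[[str], Keywords]      # first and second keyword recognition modules
    embed: Callable[[str, Keywords], Vector]  # word embedding module
    generate: Callable[[str, str], str]       # output module (wraps the LLM)
    knowledge_base: list[tuple[str, Vector]] = field(default_factory=list)

    def ingest(self, document: str) -> None:
        """Knowledge base module: store each passage with its embedding."""
        for passage in self.extract(document):
            keywords = self.recognise(passage)
            self.knowledge_base.append((passage, self.embed(passage, keywords)))

    def answer(self, question: str) -> str:
        """Prompt module plus output module: retrieve, then generate."""
        if not self.knowledge_base:
            raise ValueError("knowledge base is empty")
        query = self.embed(question, self.recognise(question))
        passage, _ = max(self.knowledge_base, key=lambda item: _dot(query, item[1]))
        return self.generate(question, passage)
```

Because the same `recognise` and `embed` callables are applied to both documents and questions, the stored vectors and the query vector live in the same space, which is what makes the similarity comparison meaningful.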
Example 3
The present embodiment provides a computer medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the intelligent prospecting question-answering method based on LLM and RAG described in embodiment 1.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
It is to be understood that the above examples of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.