CN119938814A - An intelligent mineral prospecting question-answering system based on LLM and RAG - Google Patents

Info

Publication number
CN119938814A
Authority
CN
China
Prior art keywords
text
question
keyword
model
geological
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411689535.4A
Other languages
Chinese (zh)
Inventor
朱彪彪
周永章
于新慧
王郑哲
何陆灏
马建华
牛露佳
刘蕾
张玙情
帕拉特·肯节伯
张灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202411689535.4A
Publication of CN119938814A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an intelligent mineral prospecting question-answering system based on LLM and RAG. The system performs the following steps: extracting the text content of literature in the field of geological mineral deposits; performing sentence segmentation on the text content and keyword recognition on the segmentation result to obtain text content annotated with keyword information; obtaining, according to preset keyword weights, text embedding vectors containing the keyword annotations and the corresponding weights; storing the text embedding vectors in a vector database to obtain a knowledge base for the field of geological mineral deposits; performing sentence segmentation and keyword recognition on a question text input to a large language model, and then vectorizing the recognized question text; querying the knowledge base, according to a vector metric index, for the knowledge text most relevant to the question text to obtain a prompt; and outputting, by the large language model, a reply to the question according to the question text and the prompt. The invention avoids the hallucination problem of general-purpose large language models in specialized knowledge domains.

Description

Intelligent prospecting question-answering system based on LLM and RAG
Technical Field
The invention relates to the technical fields of mineral prospecting and large models, and in particular to an intelligent prospecting question-answering system based on LLM and RAG.
Background
Existing general-purpose large language models, such as ChatGPT, ChatGLM and Qwen, are built on the Transformer architecture, a deep-learning neural network that incorporates a self-attention mechanism. Pre-trained on massive amounts of data, these models exhibit excellent language generation capabilities. In the earth sciences there are currently GeoGPT for geospatial data collection, processing and analysis and the "antique" model for meteorological research, but the field of mineral prospecting still has no mature model. Because of limitations in their training data, general models show deviations or errors in handling input tasks, in keeping the output consistent with its context, and in staying consistent with real-world facts: the so-called "hallucinations".
There are two common ways to build a large language model for a vertical domain. The first is Fine-tuning: on the basis of a pre-trained model, a second round of training is carried out with data from the research field (as shown in FIG. 1). This method takes a long time, is costly, demands high-end hardware, and is inflexible, so the model cannot be updated quickly as the data are updated. The second combines retrieval augmentation with prompt engineering for a general-purpose large language model by attaching an external knowledge base for the specialized field (as shown in FIG. 2). This construction method places low demands on hardware, is accurate and practical, and can be updated and iterated rapidly as the data are updated. Retrieval augmentation means that a knowledge base query is triggered by the content of the question and the query result is supplied to the general-purpose large model as additional knowledge, so that a better answer is generated. Many retrieval augmentation schemes are in common use, such as recommending knowledge base data through a purpose-built trigger mechanism, building database query indexes, applying text similarity algorithms, and re-ranking query results with machine learning, deep learning or ensemble learning algorithms.
The most important techniques in building a vertical-domain knowledge base are data collection, data cleaning and data storage; the authenticity, accuracy and specificity of the data are particularly important. At present there is no unified standard for these techniques, particularly in vertical disciplines. Some knowledge bases are built with graph databases such as Neo4j, and others with relational databases or vector databases. The open-source vector database Milvus officially provides methods for creating and managing a vector database, but none for cleaning massive text data or building data templates from it. A knowledge base created directly from raw data lacks knowledge emphasis or context, which makes later queries inaccurate and thereby degrades the answers.
Meanwhile, once the specialized knowledge has been stored, the most important task is to match the user's question accurately to the knowledge in the database. This matching directly determines the accuracy of the answer and is therefore the most important factor in avoiding hallucinations of a general-purpose large language model. The conventional methods, namely recommending knowledge base data through a purpose-built trigger mechanism, building database query indexes, applying text similarity algorithms, and re-ranking query results with machine learning, deep learning or ensemble learning algorithms, all operate outside the data. Once the data are stored in the knowledge base, the knowledge text carries no key-information annotations, so query results can be inaccurate at the level of the data itself, which weakens the relevance between questions and retrieval results.
Disclosure of Invention
The invention aims to provide an intelligent prospecting question-answering method based on LLM and RAG that solves the hallucination problem of existing general-purpose large language models in specialized knowledge domains, and further to provide an intelligent prospecting question-answering system based on LLM and RAG and a computer medium.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A first aspect of the invention provides an intelligent prospecting question-answering method based on LLM and RAG, comprising the following steps:
extracting the text content of literature in the field of geological mineral deposits;
performing sentence segmentation on the text content, and performing keyword recognition on the segmentation result, to obtain text content annotated with keyword information;
obtaining, according to preset keyword weights, text embedding vectors containing the keyword annotations and the corresponding weights;
storing the text embedding vectors in a vector database to obtain a knowledge base for the field of geological mineral deposits;
performing sentence segmentation and keyword recognition on a question text input to a large language model, and then vectorizing the recognized question text;
querying the knowledge base, according to a vector metric index, for the knowledge text most relevant to the question text, to obtain a prompt;
outputting, by the large language model, a reply to the question according to the question text and the prompt.
Further, extracting the text content of literature in the field of geological mineral deposits comprises:
extracting the long texts in the literature in the field of geological mineral deposits;
splitting the extracted long texts into passages of a specified size with the RecursiveCharacterTextSplitter feature of the langchain library, then removing non-printing characters, page numbers, headers and footers with regular expressions, and then replacing redundant whitespace with a single space, to obtain the text content of the literature in the field of geological mineral deposits.
Further, performing sentence segmentation on the text content and keyword recognition on the segmentation result to obtain text content annotated with keyword information comprises:
performing sentence segmentation on the text content with a pre-trained sentence segmentation model, and performing keyword recognition on the segmentation result with a pre-trained keyword recognition model, to obtain the text content annotated with keyword information.
Further, the pre-trained sentence segmentation model is obtained as follows:
a Bert model is used as the base model and is trained with first preset label samples to obtain the pre-trained sentence segmentation model, wherein the first preset label samples are created according to the descriptive aspects of the field of geological mineral deposits, which include deposit type, geological structure, mineral and rock assemblage, geophysical and geochemical anomalies, mineralization type and mineralization time, and each first preset label sample comprises a text and its corresponding segmentation.
Further, the pre-trained keyword recognition model is obtained as follows:
a Bert model is used as the base model and is trained with second preset label samples to obtain the pre-trained keyword recognition model, wherein the second preset label samples are created from a corpus of keywords common in the field of geological mineral deposits, the corpus comprises 7 categories, namely mine names, place names, person names, time names, stratum names, structure names and other nouns, and each second preset label sample comprises a keyword and its corresponding category.
Further, the preset keyword weights are assigned as follows:
different weights are given to keywords according to their categories, wherein keywords in the categories mine name, time name, stratum name and structure name have a weight of 0.2, keywords in the category person name have a weight of 0.1, and keywords in the categories place name and other nouns have a weight of 0.05.
Further, obtaining the text embedding vectors containing the keyword annotations according to the preset keyword weights comprises:
assigning different weights to the keywords according to their categories, and then performing word embedding with a Bert model, to obtain the text embedding vectors containing the keyword annotations.
Further, performing sentence segmentation and keyword recognition on the question text input to the large language model comprises:
performing sentence segmentation on the question text with the pre-trained sentence segmentation model, and performing keyword recognition on the segmentation result of the question text with the pre-trained keyword recognition model.
A second aspect of the invention provides an intelligent prospecting question-answering system based on LLM and RAG, comprising:
an extraction module, configured to extract the text content of literature in the field of geological mineral deposits;
a first keyword recognition module, configured to perform sentence segmentation on the text content and keyword recognition on the segmentation result, to obtain text content annotated with keyword information;
a word embedding module, configured to obtain, according to preset keyword weights, text embedding vectors containing the keyword annotations and the corresponding weights;
a knowledge base module, configured to store the text embedding vectors in a vector database to obtain a knowledge base for the field of geological mineral deposits;
a second keyword recognition module, configured to perform sentence segmentation and keyword recognition on a question text input to a large language model, and then to vectorize the recognized question text;
a prompt module, configured to query the knowledge base, according to a vector metric index, for the knowledge text most relevant to the question text, to obtain a prompt;
an output module, configured to output, by means of the large language model, a reply to the question according to the question text and the prompt.
A third aspect of the invention provides a computer medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the intelligent prospecting question-answering method based on LLM and RAG.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
The invention starts from the data itself: it processes the context information and the keywords in the data and stores them in a vector database to obtain the knowledge base for the field of geological mineral deposits. This achieves a good match between user questions and the knowledge base, improves retrieval accuracy, and avoids the hallucination problem of general-purpose large language models in specialized knowledge domains.
Drawings
FIG. 1 is an architecture diagram of a prior-art vertical-domain large language model built by fine-tuning;
FIG. 2 is an architecture diagram of another prior-art approach to building a vertical-domain large language model;
FIG. 3 is a schematic flow chart of the intelligent prospecting question-answering method based on LLM and RAG according to an embodiment of the present invention;
FIG. 4 is a schematic illustration of extracting the text content of literature in the field of geological mineral deposits, provided by an embodiment of the invention;
FIG. 5 is a model parameter diagram of the pre-trained sentence segmentation model provided by an embodiment of the present invention;
FIG. 6 is a model parameter diagram of the pre-trained keyword recognition model provided by an embodiment of the present invention;
FIG. 7 is a question-answering flow chart of the intelligent prospecting question-answering method based on LLM and RAG according to an embodiment of the present invention;
FIG. 8 is a chart of the answer evaluation metrics of the large language models, provided by an embodiment of the present invention;
FIG. 9 is a schematic block diagram of the intelligent prospecting question-answering system based on LLM and RAG according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
It will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment of the invention provides an intelligent prospecting question-answering method based on LLM and RAG which, as shown in FIG. 3, comprises the following steps:
extracting the text content of literature in the field of geological mineral deposits;
performing sentence segmentation on the text content, and performing keyword recognition on the segmentation result, to obtain text content annotated with keyword information;
obtaining, according to preset keyword weights, text embedding vectors containing the keyword annotations and the corresponding weights;
storing the text embedding vectors in a vector database to obtain a knowledge base for the field of geological mineral deposits;
performing sentence segmentation and keyword recognition on a question text input to a large language model, and then vectorizing the recognized question text;
querying the knowledge base, according to a vector metric index, for the knowledge text most relevant to the question text, to obtain a prompt;
outputting, by the large language model, a reply to the question according to the question text and the prompt.
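The last two steps above, turning the retrieved knowledge text into a prompt and passing it to the large language model together with the question, can be sketched as follows. This is a minimal sketch only: the template wording, the function name `build_prompt` and the sample texts are assumptions made for illustration, since the prompt template itself is not disclosed here.

```python
def build_prompt(question, knowledge_texts):
    """Combine the question with the retrieved knowledge texts into one prompt.

    The template wording is hypothetical; any instruction that tells the
    model to rely on the retrieved material would serve the same purpose.
    """
    context = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(knowledge_texts, start=1)
    )
    return (
        "Answer the question using only the reference material below. "
        "If the material is insufficient, say so.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical question and knowledge text, for illustration only.
prompt = build_prompt(
    "What controls gold mineralization in the example district?",
    ["Gold mineralization in the example district is controlled by NE-trending faults."],
)
print(prompt)
```

The resulting string would then be sent to whichever large language model the system uses; restricting the model to the retrieved material is what is meant to reduce hallucinated answers.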
In a further embodiment, extracting the text content of literature in the field of geological mineral deposits comprises:
extracting the long texts in the literature in the field of geological mineral deposits;
splitting the extracted long texts into passages of a specified size with the RecursiveCharacterTextSplitter feature of the langchain library; then removing useless information such as non-printing characters, page numbers, headers and footers with regular expressions; and then replacing redundant whitespace with a single space, which makes the text more compact and easier to process. This processing yields the text content of the literature in the field of geological mineral deposits. As shown in FIG. 4, the extraction clearly retains the main content of the literature and prepares the data for the subsequent construction of the knowledge base.
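The split-then-clean procedure described above can be sketched with the standard library alone. In this sketch `split_recursive` is a simplified stand-in for the langchain splitter, not its actual implementation, and the page-number pattern, the header pattern and the sample page are assumptions made for illustration.

```python
import re

# Control characters other than tab, newline and carriage return.
NON_PRINTING = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# A line that holds nothing but a page number (hypothetical pattern).
PAGE_NUMBER = re.compile(r"^[ \t]*(?:Page[ \t]+)?\d+[ \t]*$", re.MULTILINE)


def split_recursive(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    The coarsest separator is tried first and finer ones are used only for
    pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


def clean_chunk(chunk, header_footer_patterns=()):
    """Remove non-printing characters, page numbers, headers and footers,
    then collapse every run of whitespace into a single space."""
    chunk = NON_PRINTING.sub("", chunk)
    chunk = PAGE_NUMBER.sub("", chunk)
    for pattern in header_footer_patterns:
        chunk = re.sub(pattern, "", chunk)
    return re.sub(r"\s+", " ", chunk).strip()


def extract_passages(long_text, chunk_size, header_footer_patterns=()):
    """Split first, then clean each passage, in the order described above."""
    cleaned = (
        clean_chunk(c, header_footer_patterns)
        for c in split_recursive(long_text, chunk_size)
    )
    return [c for c in cleaned if c]


# Hypothetical page with a running header, a form feed and a page number.
page = (
    "Journal of Ore Deposits\n"
    "The deposit is hosted in   granite.\x0c\n"
    "12\n"
    "Gold occurs in quartz veins."
)
passages = extract_passages(page, 200, [r"Journal of Ore Deposits"])
print(passages)
```

Header and footer patterns depend on the source documents, so they are passed in rather than hard-coded.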
In a further embodiment, performing sentence segmentation on the text content and keyword recognition on the segmentation result to obtain keywords comprises:
Because the information in a sentence is often carried by its keywords, finding the keywords is an important task. Sentence segmentation is performed on the text content with a pre-trained sentence segmentation model, and keyword recognition is performed on the segmentation result with a pre-trained keyword recognition model, so that the keywords are obtained.
In a further embodiment, the pre-trained sentence segmentation model is obtained as follows:
a Bert model is used as the base model and is trained with first preset label samples to obtain the pre-trained sentence segmentation model, wherein the first preset label samples are created according to the descriptive aspects of the field of geological mineral deposits, which include deposit type, geological structure, mineral and rock assemblage, geophysical and geochemical anomalies, mineralization type and mineralization time, and each first preset label sample comprises a text and its corresponding segmentation.
In this embodiment, a Bert model is used as the base model and is fine-tuned with the first preset label samples to obtain the sentence segmentation model. Taking full account of information such as deposit type, geological structure, mineral and rock assemblage, geophysical and geochemical anomalies, mineralization type and mineralization time, a total of 900 label samples (see Table 1) were created according to the descriptive aspects commonly used in the study of mineral deposits and were used to fine-tune the sentence segmentation model. Because the bidirectional Transformer architecture of Bert is well suited to context understanding and word segmentation, the fine-tuned sentence segmentation model reaches a segmentation accuracy above 99%; the model parameters are shown in FIG. 5.
Table 1. Training data for the sentence segmentation model
In a further embodiment, the pre-trained keyword recognition model is obtained as follows:
a Bert model is used as the base model and is trained with second preset label samples to obtain the pre-trained keyword recognition model, wherein the second preset label samples are created from a corpus of keywords common in the field of geological mineral deposits, the corpus comprises 7 categories, namely mine names, place names, person names, time names, stratum names, structure names and other nouns, and each second preset label sample comprises a keyword and its corresponding category.
In this embodiment, a corpus of 1400 keywords common in the geological field is used (see Table 2). Because a question-answering exchange generally has the structure "who / where / when / did (what happened) / what", the keywords are divided into the 7 categories in Table 2. The keyword recognition model is finally obtained by fine-tuning the Bert model; its recognition accuracy is above 95%, and the other training parameters are shown in FIG. 6.
Table 2. Training data for the keyword recognition model
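The shape of a keyword label sample (a keyword plus one of the 7 categories) and of the resulting annotated text can be sketched as follows. A plain dictionary lookup stands in for the fine-tuned Bert recognizer, and the class name, function name and sample keywords are hypothetical and not taken from Table 2.

```python
from dataclasses import dataclass

# The seven categories named in this embodiment.
CATEGORIES = (
    "mine name", "place name", "person name", "time name",
    "stratum name", "structure name", "other noun",
)


@dataclass(frozen=True)
class KeywordSample:
    """One label sample: a keyword and its category."""
    keyword: str
    category: str

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def annotate(tokens, samples):
    """Pair every token with its category, or None if it is not a known keyword.

    The lookup table is only a stand-in for a trained recognition model.
    """
    lexicon = {s.keyword: s.category for s in samples}
    return [(token, lexicon.get(token)) for token in tokens]


# Hypothetical samples and segmented sentence, for illustration only.
samples = [
    KeywordSample("Jurassic", "time name"),
    KeywordSample("granite", "other noun"),
]
annotated = annotate(["Jurassic", "granite", "hosts", "gold"], samples)
print(annotated)
```

Keeping the category next to each token is what later allows a weight to be attached per keyword.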
In a further embodiment, the preset keyword weights are assigned as follows:
different weights are given to keywords according to their categories, wherein keywords in the categories mine name, time name, stratum name and structure name have a weight of 0.2, keywords in the category person name have a weight of 0.1, and keywords in the categories place name and other nouns have a weight of 0.05, as shown in Table 3.
Table 3. Keyword categories and their weights
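The category weights stated above can be written down as a simple lookup table. The values come from the text; the names `KEYWORD_WEIGHTS` and `keyword_weight`, and the choice of 0.0 for tokens that are not keywords, are assumptions of this sketch.

```python
# Preset weight per keyword category, as stated in this embodiment.
KEYWORD_WEIGHTS = {
    "mine name": 0.2,
    "time name": 0.2,
    "stratum name": 0.2,
    "structure name": 0.2,
    "person name": 0.1,
    "place name": 0.05,
    "other noun": 0.05,
}


def keyword_weight(category):
    """Return the preset weight of a category.

    Tokens without a category (None) get 0.0, which is an assumption of
    this sketch rather than something the text specifies.
    """
    return KEYWORD_WEIGHTS.get(category, 0.0)


print(keyword_weight("stratum name"), keyword_weight(None))
```

Taken over the seven categories, these weights add up to 1.0.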
In a further embodiment, obtaining the text embedding vectors containing the keyword annotations according to the preset keyword weights comprises:
assigning different weights to the keywords according to their categories, and then performing word embedding with a Bert model, to obtain the text embedding vectors containing the keyword annotations.
In this embodiment, after this processing the word embedding vectors contain not only the context information but also the keyword information. They are then stored in the vector database, and the resulting knowledge base for the field of geological mineral deposits achieves a good match between user questions and the knowledge base, improves retrieval accuracy, and avoids the hallucination problem of general-purpose large language models in specialized knowledge domains.
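How the keyword weights enter the embedding is not spelled out here, so the following is only one possible reading: token vectors (which would come from a Bert model) are pooled into a single text vector, with each keyword token scaled by 1 plus its weight. The function name, the pooling rule and the toy two-dimensional vectors are all assumptions of this sketch.

```python
def weighted_text_embedding(token_vectors, token_weights):
    """Pool token vectors into one text vector, up-weighting keyword tokens.

    Each token contributes with scale (1 + weight), so a non-keyword
    (weight 0.0) counts once and a keyword counts slightly more. This
    pooling rule is an illustrative assumption.
    """
    if not token_vectors or len(token_vectors) != len(token_weights):
        raise ValueError("need one weight per token vector")
    dim = len(token_vectors[0])
    pooled = [0.0] * dim
    total = 0.0
    for vector, weight in zip(token_vectors, token_weights):
        scale = 1.0 + weight
        total += scale
        for i, value in enumerate(vector):
            pooled[i] += scale * value
    return [value / total for value in pooled]


# Toy vectors standing in for Bert token embeddings.
vec = weighted_text_embedding(
    [[1.0, 0.0], [0.0, 1.0]],  # first token is a keyword, second is not
    [0.2, 0.0],
)
print(vec)
```

Under this rule the pooled vector leans toward the keyword token, which is the effect the embodiment attributes to the keyword annotations.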
In a further embodiment, performing sentence segmentation and keyword recognition on the question text input to the large language model comprises:
performing sentence segmentation on the question text with the pre-trained sentence segmentation model, and performing keyword recognition on the segmentation result of the question text with the pre-trained keyword recognition model.
In this embodiment, when a user asks a question, word segmentation and keyword recognition are likewise performed on the question, and the knowledge base is then queried according to the vector metric index; the specific flow is shown in FIG. 7.
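The knowledge base query can be illustrated with a brute-force similarity search. The text does not name the vector metric, so cosine similarity is assumed here; a vector database would answer the same query from an index instead of scanning every entry. Function names and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, knowledge_base, k=1):
    """Return the k (score, text) pairs most similar to the query vector.

    knowledge_base is a list of (text, vector) pairs.
    """
    scored = [(cosine(query_vector, vector), text) for text, vector in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy knowledge base with two-dimensional vectors, for illustration only.
knowledge_base = [
    ("text A", [1.0, 0.0]),
    ("text B", [0.0, 1.0]),
    ("text C", [1.0, 1.0]),
]
hits = top_k([1.0, 0.1], knowledge_base, k=2)
print([text for _, text in hits])
```

The highest-scoring knowledge texts are what would be placed in the prompt for the large language model.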
In a further embodiment, the method of this embodiment can be developed, on the basis of a micro-service architecture, into an interactive intelligent prospecting robot.
In a specific embodiment, the constructed large model for the vertical field of prospecting is verified by question answering and compared with the most representative current large language models with real-time web access, such as ChatGPT, Kimi and "the religion". 300 questions on prospecting in key mining areas were used, and 300 reference answers were written according to expert guidance. The questions were then put to each of these models and to the intelligent prospecting robot, and the answers were evaluated with the Bert-score metric. The results are shown in FIG. 8. The comparison shows that the intelligent prospecting question-answering method of this embodiment scores clearly higher than the other models on the two metrics Precision and F1, and that it also leads on Recall. Judged by the answers, the method effectively avoids hallucinations when answering specialized questions, whereas the answers of the general-purpose large language models mix real data with hallucinated data.
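For readers unfamiliar with the Precision, Recall and F1 values reported by Bert-score, the basic computation can be sketched as follows: each token of one text is greedily matched to its most similar token in the other text by cosine similarity. The sketch omits the optional idf weighting and baseline rescaling of the metric, and the toy vectors stand in for real Bert token embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def bert_score(candidate_vectors, reference_vectors):
    """Return (precision, recall, f1) from greedy token matching.

    Precision averages, over candidate tokens, the best similarity to any
    reference token; recall does the same from the reference side.
    """
    if not candidate_vectors or not reference_vectors:
        raise ValueError("both texts need at least one token vector")
    precision = sum(
        max(cosine(c, r) for r in reference_vectors) for c in candidate_vectors
    ) / len(candidate_vectors)
    recall = sum(
        max(cosine(r, c) for c in candidate_vectors) for r in reference_vectors
    ) / len(reference_vectors)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the answer covers one of the two reference tokens exactly.
p, r, f1 = bert_score([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(p, r, f1)
```

In this toy case everything the answer says is supported by the reference (high precision), but half of the reference is missing from the answer (lower recall).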
Example 2
This embodiment of the invention provides an intelligent prospecting question-answering system based on LLM and RAG that implements the question-answering method of Example 1 and, as shown in FIG. 9, comprises:
an extraction module, configured to extract the text content of literature in the field of geological mineral deposits;
a first keyword recognition module, configured to perform sentence segmentation on the text content and keyword recognition on the segmentation result, to obtain text content annotated with keyword information;
a word embedding module, configured to obtain, according to preset keyword weights, text embedding vectors containing the keyword annotations and the corresponding weights;
a knowledge base module, configured to store the text embedding vectors in a vector database to obtain a knowledge base for the field of geological mineral deposits;
a second keyword recognition module, configured to perform sentence segmentation and keyword recognition on a question text input to a large language model, and then to vectorize the recognized question text;
a prompt module, configured to query the knowledge base, according to a vector metric index, for the knowledge text most relevant to the question text, to obtain a prompt;
an output module, configured to output, by means of the large language model, a reply to the question according to the question text and the prompt.
Example 3
This embodiment provides a computer medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the intelligent prospecting question-answering method based on LLM and RAG described in Example 1.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
It is to be understood that the above examples are provided only to illustrate the invention clearly and do not limit its embodiments. Other variations or modifications can be made by those of ordinary skill in the art on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims (10)

1. An intelligent prospecting question-answering method based on LLM and RAG, characterized by comprising the following steps:
extracting the text content of literature in the field of geological mineral deposits;
performing sentence segmentation on the text content, and performing keyword recognition on the segmentation result, to obtain text content annotated with keyword information;
obtaining, according to preset keyword weights, text embedding vectors containing the keyword annotations and the corresponding weights;
storing the text embedding vectors in a vector database to obtain a knowledge base for the field of geological mineral deposits;
performing sentence segmentation and keyword recognition on a question text input to a large language model, and then vectorizing the recognized question text;
querying the knowledge base, according to a vector metric index, for the knowledge text most relevant to the question text, to obtain a prompt;
outputting, by the large language model, a reply to the question according to the question text and the prompt.

2. The intelligent prospecting question-answering method based on LLM and RAG according to claim 1, characterized in that extracting the text content of literature in the field of geological mineral deposits comprises:
extracting the long texts in the literature in the field of geological mineral deposits;
splitting the extracted long texts into passages of a specified size with the RecursiveCharacterTextSplitter feature of the langchain library, then removing useless information including non-printing characters, page numbers, headers and footers with regular expressions, and then replacing redundant whitespace with a single space, to obtain the text content of the literature in the field of geological mineral deposits.

3. The intelligent prospecting question-answering method based on LLM and RAG according to claim 1, characterized in that performing sentence segmentation on the text content and keyword recognition on the segmentation result to obtain text content annotated with keyword information comprises:
performing sentence segmentation on the text content with a pre-trained sentence segmentation model, and performing keyword recognition on the segmentation result with a pre-trained keyword recognition model, to obtain the text content annotated with keyword information.

4. The intelligent prospecting question-answering method based on LLM and RAG according to claim 3, characterized in that the pre-trained sentence segmentation model is obtained as follows:
a Bert model is used as the base model and is trained with first preset label samples to obtain the pre-trained sentence segmentation model, wherein the first preset label samples are created according to the descriptive aspects of the field of geological mineral deposits, which include deposit type, geological structure, mineral and rock assemblage, geophysical and geochemical anomalies, mineralization type and mineralization time, and each first preset label sample comprises a text and its corresponding segmentation.

5. The intelligent prospecting question-answering method based on LLM and RAG according to claim 4, characterized in that the pre-trained keyword recognition model is obtained as follows:
a Bert model is used as the base model and is trained with second preset label samples to obtain the pre-trained keyword recognition model, wherein the second preset label samples are created from a corpus of keywords common in the field of geological mineral deposits, the corpus comprises 7 categories, namely mine names, place names, person names, time names, stratum names, structure names and other nouns, and each second preset label sample comprises a keyword and its corresponding category.

6. The intelligent prospecting question-answering method based on LLM and RAG according to claim 5, characterized in that the preset keyword weights are assigned as follows:
different weights are given to keywords according to their categories, wherein keywords in the categories mine name, time name, stratum name and structure name have a weight of 0.2, keywords in the category person name have a weight of 0.1, and keywords in the categories place name and other nouns have a weight of 0.05.

7. The intelligent prospecting question-answering method based on LLM and RAG according to claim 6, characterized in that obtaining the text embedding vectors containing the keyword annotations according to the preset keyword weights comprises:
assigning different weights to the keywords according to their categories, and then performing word embedding with a Bert model, to obtain the text embedding vectors containing the keyword annotations.

8. The intelligent prospecting question-answering method based on LLM and RAG according to claim 3, characterized in that performing sentence segmentation and keyword recognition on the question text input to the large language model comprises:
performing sentence segmentation on the question text with the pre-trained sentence segmentation model, and performing keyword recognition on the segmentation result of the question text with the pre-trained keyword recognition model.

9. An intelligent prospecting question-answering system based on LLM and RAG, characterized by comprising:
an extraction module, which extracts the text content of literature in the field of geological mineral deposits;
a first keyword recognition module, which performs sentence segmentation on the text content and keyword recognition on the segmentation result, to obtain text content annotated with keyword information;
a word embedding module, which obtains, according to preset keyword weights, text embedding vectors containing the keyword annotations and the corresponding weights;
a knowledge base module, which stores the text embedding vectors in a vector database to obtain a knowledge base for the field of geological mineral deposits;
a second keyword recognition module, which performs sentence segmentation and keyword recognition on a question text input to a large language model, and then vectorizes the recognized question text;
a prompt module, which queries the knowledge base, according to a vector metric index, for the knowledge text most relevant to the question text, to obtain a prompt;
an output module, which uses the large language model to output a reply to the question according to the question text and the prompt.

10. A computer medium, characterized in that a computer program is stored on the computer medium, and the computer program, when executed by a processor, implements the intelligent prospecting question-answering method based on LLM and RAG according to any one of claims 1 to 8.
A computer medium, characterized in that a computer program is stored on the computer medium, and when the computer program is executed by a processor, the intelligent mineral prospecting question and answer method based on LLM and RAG as described in any one of claims 1 to 8 is implemented.
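For illustration only, the non-model parts of the claimed pipeline can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names (`clean_text`, `cosine`, `retrieve`, `build_prompt`), the English category keys, the regular expressions, the prompt layout, and the use of cosine similarity as the vector metric are all assumptions, since the claims name none of them. The BERT segmentation, keyword recognition and embedding steps are not shown, and the sketch does not apply the weights to the embeddings because the claims do not say how the two are combined. Removal of page numbers, headers and footers is also omitted, as it depends on the layout of the source documents.

```python
import math
import re

# Category weights as listed in claim 6.
# The English keys are illustrative labels, not taken from the patent.
CATEGORY_WEIGHTS = {
    "mine_name": 0.2,
    "time_name": 0.2,
    "stratum_name": 0.2,
    "structure_name": 0.2,
    "personal_name": 0.1,
    "place_name": 0.05,
    "other_noun": 0.05,
}

# Assumed patterns: control characters other than tab, newline and
# carriage return are dropped; any whitespace run becomes one space.
_NON_PRINTING = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(chunk: str) -> str:
    """Drop control characters, then collapse whitespace runs to one space."""
    chunk = _NON_PRINTING.sub("", chunk)
    # Trimming the ends is an extra convenience, not stated in the claims.
    return _WHITESPACE.sub(" ", chunk).strip()


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, an assumed choice of vector metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], knowledge_base: list[tuple[str, list[float]]]) -> str:
    """Return the knowledge text whose vector is most similar to the query.

    knowledge_base is a non-empty list of (text, vector) pairs that stands
    in for the vector database.
    """
    best_text, _ = max(knowledge_base, key=lambda item: cosine(query_vec, item[1]))
    return best_text


def build_prompt(question: str, context: str) -> str:
    """Combine the retrieved knowledge text and the question into one prompt."""
    return f"Context:\n{context}\n\nQuestion:\n{question}"
```

In use, the question vector and the knowledge-base vectors would come from the same embedding model, and the string returned by `build_prompt` would be passed to the large language model.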
CN202411689535.4A 2024-11-25 2024-11-25 An intelligent mineral prospecting question-answering system based on LLM and RAG Pending CN119938814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411689535.4A CN119938814A (en) 2024-11-25 2024-11-25 An intelligent mineral prospecting question-answering system based on LLM and RAG

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411689535.4A CN119938814A (en) 2024-11-25 2024-11-25 An intelligent mineral prospecting question-answering system based on LLM and RAG

Publications (1)

Publication Number Publication Date
CN119938814A (en) 2025-05-06

Family

ID=95545789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411689535.4A Pending CN119938814A (en) 2024-11-25 2024-11-25 An intelligent mineral prospecting question-answering system based on LLM and RAG

Country Status (1)

Country Link
CN (1) CN119938814A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120975247A (en) * 2025-10-20 2025-11-18 苏州元脑智能科技有限公司 Synthetic data set construction method and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408797A (en) * 2017-08-18 2019-03-01 普天信息技术有限公司 A kind of text sentence vector expression method and system
CN117951274A (en) * 2024-01-29 2024-04-30 上海岩芯数智人工智能科技有限公司 RAG knowledge question-answering method and device based on fusion vector and keyword retrieval
CN118193708A (en) * 2024-04-08 2024-06-14 中国地质大学(北京) A mineral knowledge question-answering method and system based on a large language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HE LUHAO et al.: "Application of Target Detection Based on Deep Learning in Intelligent Mineral Identification", MINERALS, 27 August 2024 (2024-08-27), pages 1 - 20 *

Similar Documents

Publication Publication Date Title
CN112100344B (en) Knowledge graph-based financial domain knowledge question-answering method
CN110502621B (en) Question answering method, question answering device, computer equipment and storage medium
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN110674252A (en) High-precision semantic search system for judicial domain
CN119988588A (en) A large model-based multimodal document retrieval enhancement generation method
CN117453851B (en) Text index enhanced question-answering method and system based on knowledge graph
US20190065576A1 (en) Single-entity-single-relation question answering systems, and methods
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN119003639B (en) Material extraction and generation method based on large model and multi-storage technology
CN112328800A (en) System and method for automatically generating programming specification question answers
CN114118082B (en) A resume retrieval method and device
CN118245564B (en) Method and device for constructing feature comparison library supporting semantic review and repayment
CN118885565A (en) A BERT-enhanced ES retrieval knowledge base method
CN117349420A (en) Reply methods and devices based on local knowledge base and large language model
CN108875065B (en) A content-based recommendation method for Indonesian news pages
CN116340530A (en) Intelligent Design Method Based on Mechanical Knowledge Graph
CN120373407A (en) Intelligent question-answering system training method based on machine learning
CN119938814A (en) An intelligent mineral prospecting question-answering system based on LLM and RAG
CN115470319A (en) Structured document demand rapid identification and entry organization management method
CN113190692B (en) Self-adaptive retrieval method, system and device for knowledge graph
CN111241283A (en) Rapid characterization method for portrait of scientific research student
CN115238034A (en) Data search method, apparatus, computer equipment and storage medium
CN112989811B (en) A BiLSTM-CRF-based historical book reading assistance system and its control method
CN119760099A (en) Medical knowledge question-answering method based on medical text
CN102508920B (en) Information retrieval method based on Boosting sorting algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination