CN118796872A - Query statement generation method, device, equipment, storage medium and program product - Google Patents

Info

Publication number
CN118796872A
Authority
CN
China
Prior art keywords
llm
structured query
question text
query language
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410372853.1A
Other languages
Chinese (zh)
Inventor
田江涛
温立志
刘永胜
张茜
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Hebei Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Hebei Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Hebei Co Ltd
Priority to CN202410372853.1A
Publication of CN118796872A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2452 Query translation
    • G06F16/24522 Translation of natural language queries to structured queries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a query statement generation method, device, equipment, storage medium and program product. The method comprises: in response to a first question text submitted by a user, encoding the first question text with an LLM to obtain a first sentence vector; based on the first sentence vector and the table comments of the data tables, selecting from a preset metadata database the N data tables ranked highest in similarity to the first question text; constructing first prompt information for each data table based on first table-related information of that data table, wherein the first table-related information includes table comments, field names, field comments, related questions and a thought chain; and inputting the first prompt information into the LLM to obtain a first structured query language statement, output by the LLM, corresponding to the first question text. By using related questions and thought chains, the invention makes the constructed prompt information richer and more effective, and by fine-tuning the LLM through prompt learning it improves the accuracy of the structured query language statements the LLM generates.

Description

Query statement generation method, device, equipment, storage medium and program product

Technical Field

The present invention relates to the field of computer technology, and in particular to a query statement generation method, device, equipment, storage medium and program product.

Background Art

With the rapid development of Internet and Internet of Things technologies and the spread of sensors and smart devices, the volume of data generated and stored has grown exponentially. In recent years in particular, as enterprises have pushed for digitalization, the application scenarios of big data have kept expanding across finance, healthcare, education, retail, manufacturing and other fields, covering data management, data analysis, decision support and more.

At present, the usual development process for enterprise data tables is as follows: a user describes the analysis requirements in a requirements form; requirements staff analyze them; SQL (Structured Query Language) developers write SQL from the requirement description; the application is tested; it goes live; and the user consumes the result. Once developed, however, such an application can only answer the fixed SQL queries it was built for.

In the related art, NL2SQL techniques from NLP (Natural Language Processing) are mainly used to convert natural language directly into executable SQL for data query and statistics. These techniques mostly adjust deep neural network structures to improve test metrics, perform poorly on general test sets, and offer no feasible solution that can be deployed in an enterprise.

Therefore, a query statement generation method, device, equipment, storage medium and program product is urgently needed to solve the above technical problems.

Summary of the Invention

The present invention provides a query statement generation method, device, equipment, storage medium and program product. It addresses a defect of the prior art, which mainly adjusts deep neural network structures to improve test metrics and performs poorly on general test sets, so that structured query language generation falls short of expectations and has poor accuracy. The invention aims to obtain structured query language statements accurately.

The present invention provides a query statement generation method, comprising:

in response to a first question text submitted by a user, encoding the first question text with a large language model (LLM) to obtain a first sentence vector;

based on the first sentence vector and the table comments of the data tables, selecting from a preset metadata database the N data tables ranked highest in similarity to the first question text, where N is a positive integer;

constructing first prompt information for each of the data tables based on first table-related information of that data table, wherein the first table-related information includes table comments, field names, field comments, related questions and a thought chain;

inputting the first prompt information into the LLM to obtain a first structured query language statement, output by the LLM, corresponding to the first question text.

According to a query statement generation method provided by the present invention, the related questions and the thought chain of each data table in the preset metadata database are obtained as follows:

acquiring a second question text from a training database, and obtaining through the LLM a second structured query language statement matching the second question text;

encoding the second question text and the second structured query language statement with the LLM to obtain a third sentence vector;

generating third prompt information based on the second question text and the second structured query language statement, and inputting the third prompt information and the third sentence vector into the LLM to obtain a description result output by the LLM, wherein the third prompt information instructs the LLM to describe the steps by which the second structured query language statement is generated from the second question text;

extracting key sentences from the description result, and storing the extracted key sentences in the related-questions field and the thought-chain field of the data table in the preset metadata database that matches the second question text, wherein the related-questions field describes questions related to the data table together with their structured query language statements, and the thought-chain field records the steps for generating the structured query language statement.

According to a query statement generation method provided by the present invention, obtaining through the LLM the second structured query language statement matching the second question text includes:

acquiring the second question text from the training database, and encoding the second question text with the LLM to obtain a second sentence vector;

based on the second sentence vector and the table comments of the data tables, selecting from the preset metadata database the M data tables ranked highest in similarity to the second question text, where M is a positive integer;

constructing second prompt information for each of the data tables based on second table-related information of that data table, wherein the second table-related information includes table comments, field names and field comments;

inputting the second prompt information into the LLM to obtain the second structured query language statement, output by the LLM, corresponding to the second question text.

According to a query statement generation method provided by the present invention, encoding the second question text and the second structured query language statement with the LLM to obtain the third sentence vector further includes:

performing a syntax check on the second structured query language statement;

if the second structured query language statement passes the syntax check, sending it to a manual processing terminal so that the terminal can check whether it meets the business requirements;

upon receiving feedback that the manual processing terminal has determined that the second structured query language statement meets the business requirements, encoding the second question text and the second structured query language statement with the LLM to obtain the third sentence vector.
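The two-stage gate described above (syntax check first, manual business review second, encoding only after both pass) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes SQLite as a stand-in database, and `reviewer_approves` and `encode` are hypothetical callables standing in for the manual processing terminal and the LLM encoder. Compiling with `EXPLAIN` also rejects statements that reference unknown tables or columns, which is slightly stricter than a pure syntax check.

```python
import sqlite3

def passes_syntax_check(sql: str, schema_ddl: str) -> bool:
    """Return True if SQLite can compile the statement against the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # build an empty copy of the schema
        conn.execute("EXPLAIN " + sql)   # compile only; no data is touched
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

def encode_if_approved(question, sql, schema_ddl, reviewer_approves, encode):
    """Encode the (question, SQL) pair only after both checks pass.

    reviewer_approves and encode are hypothetical stand-ins for the manual
    processing terminal and the LLM encoder. Returns None when either
    check fails.
    """
    if not passes_syntax_check(sql, schema_ddl):
        return None
    if not reviewer_approves(question, sql):
        return None
    return encode(question + "\n" + sql)

# Hypothetical schema used only for this illustration.
SCHEMA = "CREATE TABLE sales_orders (order_no TEXT, order_date TEXT, amount REAL);"
good_sql = "SELECT SUM(amount) FROM sales_orders"
bad_sql = "SELEC SUM(amount) FROM sales_orders"
```

Ordering the checks this way keeps malformed statements away from the human reviewer, so manual effort is spent only on statements that at least compile.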

According to a query statement generation method provided by the present invention, the method further includes:

repeating the step of acquiring a second question text from the training database and obtaining through the LLM a second structured query language statement matching the second question text;

repeating the step of encoding the second question text and the second structured query language statement with the LLM to obtain a third sentence vector;

repeating the step of generating third prompt information based on the second question text and the second structured query language statement, and inputting the third prompt information and the third sentence vector into the LLM to obtain a description result output by the LLM, wherein the third prompt information instructs the LLM to describe the steps by which the second structured query language statement is generated from the second question text;

repeating the step of extracting key sentences from the description result and storing the extracted key sentences in the related-questions field and the thought-chain field of the data table in the preset metadata database that matches the second question text, until the proportion of data tables in the preset metadata database whose related-questions and thought-chain fields have been updated reaches a preset threshold.
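The stopping condition of this loop can be illustrated with a short sketch. It assumes the update ratio means the fraction of tables whose related-questions and thought-chain fields are both non-empty, which is one plausible reading of the text. The dictionary layout and the `enrich_one` callable (a stand-in for the four repeated steps) are hypothetical.

```python
def update_ratio(metadata):
    """Fraction of tables whose related-questions and thought-chain fields
    are both filled (an assumed definition of the update ratio)."""
    if not metadata:
        return 0.0
    filled = sum(
        1 for table in metadata.values()
        if table["related_questions"] and table["thought_chain"]
    )
    return filled / len(metadata)

def enrich_until(metadata, training_questions, enrich_one, threshold):
    """Process training questions until the update ratio reaches threshold.

    enrich_one(metadata, question) is a hypothetical callable that runs the
    generate / encode / describe / extract-and-store steps for one question.
    """
    for question in training_questions:
        if update_ratio(metadata) >= threshold:
            break
        enrich_one(metadata, question)
    return update_ratio(metadata)

# Toy run: each "question" is just the name of the table it enriches.
metadata = {
    name: {"related_questions": [], "thought_chain": []}
    for name in ("orders", "products", "users", "stores")
}

def fake_enrich(md, table_name):
    md[table_name]["related_questions"].append("example question")
    md[table_name]["thought_chain"].append("example step")

final_ratio = enrich_until(
    metadata, ["orders", "products", "users", "stores"], fake_enrich, 0.5
)
```

In the toy run the loop stops after two of the four tables are enriched, because the ratio has then reached the 0.5 threshold.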

According to a query statement generation method provided by the present invention, after inputting the first prompt information into the LLM to obtain the first structured query language statement corresponding to the first question text, the method further includes:

fine-tuning the LLM on a number of first structured query language statements that users have verified as correctly generated, together with their corresponding first question texts, to obtain a fine-tuned LLM;

in response to a third question text submitted by a user, inputting the third question text into the fine-tuned LLM to obtain a third structured query language statement output by the fine-tuned LLM.

The present invention also provides a query statement generation device, comprising:

an encoding unit, configured to, in response to a first question text submitted by a user, encode the first question text with a large language model (LLM) to obtain a first sentence vector;

a selection unit, configured to select from a preset metadata database, based on the first sentence vector and the table comments of the data tables, the N data tables ranked highest in similarity to the first question text, where N is a positive integer;

a construction unit, configured to construct first prompt information for each of the data tables based on first table-related information of that data table, wherein the first table-related information includes table comments, field names, field comments, related questions and a thought chain;

a generation unit, configured to input the first prompt information into the LLM to obtain a first structured query language statement, output by the LLM, corresponding to the first question text.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the query statement generation methods described above.

The present invention also provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements any one of the query statement generation methods described above.

The present invention also provides a computer program product, comprising a computer program that, when executed by a processor, implements any one of the query statement generation methods described above.

The query statement generation method, device, equipment, storage medium and program product provided by the present invention include: in response to a first question text submitted by a user, encoding the first question text with an LLM to obtain a first sentence vector; based on the first sentence vector, selecting from a preset metadata database the N data tables ranked highest in similarity to the first question text; constructing first prompt information for each data table based on first table-related information of that data table, wherein the first table-related information includes table comments, field names, field comments, related questions and a thought chain; and inputting the first prompt information into the LLM to obtain a first structured query language statement, output by the LLM, corresponding to the first question text. By using related questions and thought chains, the invention makes the constructed prompt information richer and more effective, and by fine-tuning the LLM through prompt learning it improves the accuracy of the structured query language statements the LLM generates.

Brief Description of the Drawings

To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is the first schematic flowchart of the query statement generation method provided by the present invention;

FIG. 2 is the second schematic flowchart of the query statement generation method provided by the present invention;

FIG. 3 is a schematic structural diagram of the query statement generation device provided by the present invention;

FIG. 4 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

To make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

FIG. 1 is the first schematic flowchart of the query statement generation method provided by the present invention. As shown in FIG. 1, the method includes:

Step 110: in response to a first question text submitted by a user, encode the first question text with a large language model (LLM) to obtain a first sentence vector.

Here, the first question text submitted by the user is a question that queries data tables in a database preset for a specific business scenario of an enterprise.

Such a question text usually requires extracting specific information from the database, or performing specific operations on it, to support the enterprise's decision-making and operations in that business scenario. For example, an e-commerce company has a preset database containing a product table, an order table, a user table and other data tables. A user might submit the question text "Please list the top ten products by sales volume in the past month, together with their sales revenue."

Note that the LLM in this embodiment is deployed locally. Deploying the LLM locally and fine-tuning it through prompt learning improves the NL2SQL results, and text-based interaction makes it possible to offer customers more robust and more general data query and statistics services.

In this embodiment, the first question text submitted by the user is input into the LLM. The LLM encodes the text with its pre-trained model, converting each word, phrase or sentence into a corresponding vector representation. During encoding, the LLM takes the context, grammatical structure and semantic information of the text into account in order to capture its deeper meaning. The final result is a high-dimensional vector representing the semantic information of the whole question text, namely the first sentence vector.

Step 120: based on the first sentence vector and the table comments of the data tables, select from a preset metadata database the N data tables ranked highest in similarity to the first question text, where N is a positive integer.

Metadata is data about data. In general, the metadata database stores, for the business data tables of the enterprise's specific business scenario, the table information, table comments, table fields, field comments, primary keys/foreign keys and similar information.

Optionally, a cosine similarity algorithm is used to calculate the similarity between the first sentence vector and the table comment of each data table in the preset metadata database.

For example, denote the first sentence vector as A and the vector representation of a table comment in the metadata database as B, and compute the cosine similarity between A and B. The resulting value measures the similarity between the first question text and each data table; the closer the value is to 1, the higher the similarity. The data tables are then sorted by their similarity to the first sentence vector from high to low, and the top N are selected.
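The ranking in step 120 can be sketched in a few lines. This is only an illustration under stated assumptions: the three-dimensional vectors and the table names are made up, and in practice both the question and the table comments would be encoded by the LLM into much higher-dimensional vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_n_tables(question_vec, table_comment_vecs, n):
    """Return the names of the n tables whose comment vectors are most
    similar to the question vector, most similar first."""
    scored = [
        (cosine_similarity(question_vec, vec), name)
        for name, vec in table_comment_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n]]

# Hypothetical toy vectors standing in for LLM sentence vectors.
question_vec = [1.0, 0.0, 1.0]
table_comment_vecs = {
    "sales_orders": [0.9, 0.1, 0.8],
    "users": [0.0, 1.0, 0.0],
    "products": [0.5, 0.5, 0.5],
}
selected = top_n_tables(question_vec, table_comment_vecs, n=2)
```

With these toy vectors, `sales_orders` scores close to 1, `products` scores lower, and `users` is orthogonal to the question and scores 0, so the first two are selected for N = 2.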

Step 130: construct first prompt information for each of the data tables based on first table-related information of that data table, wherein the first table-related information includes table comments, field names, field comments, related questions and a thought chain.

Note that this embodiment adds two new fields, related questions and thought chain, to the preset metadata database.

The related-questions field records questions related to the data table together with their structured query language statements. The thought-chain field records the steps for generating structured query language statements for the data table. For a sales data table, for example, the related-questions field can contain common queries such as "query the sales revenue of the last month" or "list the top ten products by sales volume". These questions make it quicker to see how the data table can be queried and analyzed. For an order data table, the thought-chain field can record how to write structured query language statements that meet different data analysis needs. These steps show how to build complex statements and so make better use of the data table for analysis.

Optionally, a Prompt Generator can be used to combine the table comment, field names, field comments, related questions and thought chain. For example, the table comment opens the prompt and describes the basic information and purpose of the data table. The field names and field comments are added to show the fields of the data table and what they mean. Common questions related to the data table are added to guide the LLM in how to query and analyze the table. The steps for generating structured query language statements for the data table are added to help the LLM understand how to build complex query statements.

Step 140: input the first prompt information into the LLM to obtain a first structured query language statement, output by the LLM, corresponding to the first question text.

In this embodiment, the first prompt information corresponding to the first question text is input into the LLM, which then generates the first structured query language statement corresponding to the first question text.

For example, given a sales data table, the first prompt information for the user's question text "Query the sales revenue of the most recent quarter" is as follows:

[Table comment]: This data table records information about sales orders.

[Field names and field comments]:

- Order number: uniquely identifies each sales order
- Date: the transaction date of the order
- Product name: the name of the product sold
- Amount: the order amount

......

[Related questions]:

- Query the sales revenue of the most recent quarter
- List the top ten products by sales volume

......

[Thought chain]:

1. Write an SQL statement that filters by a time condition
2. Calculate the sales revenue under the filter condition
3. Sort by sales revenue and keep the top ten products

......

When such first prompt information is input into the LLM, the LLM can generate the corresponding first structured query language statement, for example:

SELECT SUM(amount) AS sales
FROM sales_data_table
WHERE date >= 'start date' AND date <= 'end date';
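To see a statement of this shape run end to end, the following sketch executes it against an in-memory SQLite database. The table name `sales_orders`, the column names and the three sample rows are hypothetical, and the start and end dates are passed as bound parameters instead of being written into the SQL text.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_orders ("
    "order_no TEXT PRIMARY KEY, order_date TEXT, product_name TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_orders VALUES (?, ?, ?, ?)",
    [
        ("A001", "2024-01-15", "phone", 100.0),
        ("A002", "2024-02-20", "laptop", 250.0),
        ("A003", "2024-04-02", "phone", 75.0),
    ],
)

def sales_between(connection, start_date, end_date):
    """Sum order amounts whose ISO-format date falls within [start_date, end_date].

    Returns None when no row matches, because SUM over zero rows is NULL.
    """
    row = connection.execute(
        "SELECT SUM(amount) AS sales FROM sales_orders "
        "WHERE order_date >= ? AND order_date <= ?",
        (start_date, end_date),
    ).fetchone()
    return row[0]

first_quarter_sales = sales_between(conn, "2024-01-01", "2024-03-31")
```

Storing dates as ISO 8601 strings lets the plain string comparison in the WHERE clause order them correctly. Only the January and February orders fall in the first quarter, so the result here is 350.0.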

The method provided in this embodiment, in response to a first question text submitted by a user, encodes the first question text with the LLM to obtain a first sentence vector; based on the first sentence vector, selects from the preset metadata database the N data tables ranked highest in similarity to the first question text; constructs first prompt information for each data table based on first table-related information of that data table, wherein the first table-related information includes table comments, field names, field comments, related questions and a thought chain; and inputs the first prompt information into the LLM to obtain a first structured query language statement, output by the LLM, corresponding to the first question text. By using related questions and thought chains, the invention makes the constructed prompt information richer and more effective, and by fine-tuning the LLM through prompt learning it improves the accuracy of the structured query language statements the LLM generates.

In some embodiments, the related questions and the thought chains of each database table in the preset metadata database are obtained in the following manner:

acquiring a second question text from a training database, and obtaining, through the LLM, a second structured query language matching the second question text;

encoding the second question text and the second structured query language through the LLM to obtain a third sentence vector;

generating third prompt information based on the second question text and the second structured query language, and inputting the third prompt information and the third sentence vector into the LLM to obtain a description result output by the LLM, wherein the third prompt information is used to instruct the LLM to describe the steps by which the second structured query language is generated from the second question text;

extracting key sentences from the description result, and storing the extracted key sentences in the related question field and the thought chain field of the data table that matches the second question text in the preset metadata database, wherein the related question field is used to describe the related questions and structured query language of the data table, and the thought chain field is used to record the generation steps of the structured query language.

In this embodiment, during the training phase of the LLM, the data tables in the preset metadata database contain only basic table information, such as table comments, field names and field comments. In this phase, ICL (In-Context Learning) can use this data to construct richer and more efficient prompt information, thereby achieving few-shot learning.

Specifically, a training database is constructed that contains second question texts labeled with structured query language tags, so that the LLM can learn how to convert a question text into a query language. After the LLM has produced the second structured query language matching the second question text, and after the structured query language tag of the second question text has been used to determine that the generated second structured query language meets the expected result, the LLM encodes the second question text and its corresponding structured query language to generate a sentence vector (the third sentence vector). Third prompt information is then generated based on the second question text and the corresponding structured query language. This prompt information serves as guidance, telling the LLM how the corresponding second structured query language is generated from the input second question text. Next, the third prompt information and the third sentence vector are input into the LLM to obtain the description result output by the LLM. Finally, key sentences are extracted from the description result, and the extracted key sentences are stored in the related question field and the thought chain field of the data table that matches the second question text in the preset metadata database.

For example, when the third prompt information and the third sentence vector are input into the LLM, the output of the LLM may be a text description that explains in detail how the second structured query language is generated from the second question text. For example, the output may include the following content:

For the second question text "How to query the sales amount of the most recent quarter", the second structured query language can be constructed by identifying the keywords 'query', 'most recent quarter' and 'sales amount'. First, the time range is determined according to 'most recent quarter', and the sum of the sales amount is calculated in the sales table, that is, using the SQL statement "SELECT SUM(SalesAmount) FROM Sales WHERE Quarter='most recent quarter'".

From this output, the two fields, related questions and thought chain, can be extracted:

The related question field describes the related questions and SQL statements of the data table. For example, the SQL statement corresponding to "How to query the sales amount of the most recent quarter" is "SELECT SUM(SalesAmount) FROM Sales WHERE Quarter='most recent quarter'".

The thought chain field records the steps by which the SQL statement was generated, for example: "The keywords 'query', 'most recent quarter' and 'sales amount' are extracted from the question text, and a query statement is constructed from these keywords. Then, according to the time range of 'most recent quarter', the sum of the sales amount is calculated in the sales table."

The method provided in this embodiment adds the two fields, related questions and thought chain, to the preset metadata database in the manner described above, analyzes each training pass and updates the two fields, and combines the two fields with in-context learning to construct prompt information, thereby enabling the subsequent fine-tuning of the LLM.
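One way to picture the two added fields is as extra entries on each table's metadata record. The sketch below assumes the metadata database is a plain dictionary keyed by table name and that key sentence extraction has already been done upstream; the name `store_enrichment` and the record layout are hypothetical.

```python
def store_enrichment(metadata: dict, table: str, question: str,
                     sql: str, steps: list[str]) -> None:
    """Write one question/SQL pair and its generation steps to a table's record."""
    entry = metadata.setdefault(table, {})
    # Related question field: the question together with its SQL statement.
    entry.setdefault("related_questions", []).append(
        {"question": question, "sql": sql}
    )
    # Thought chain field: the recorded generation steps (latest pass wins here,
    # which is a simplification chosen for this example).
    entry["thought_chain"] = list(steps)
```

Called with the example above, the record for the `Sales` table would hold the question "How to query the sales amount of the most recent quarter" with its SQL statement, plus the two generation steps as the thought chain.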

In some embodiments, obtaining, through the LLM, the second structured query language matching the second question text includes:

acquiring a second question text from the training database, and encoding the second question text through the LLM to obtain a second sentence vector;

selecting, based on the second sentence vector and the table comments of the data tables, the M data tables ranked highest in similarity to the second question text from the preset metadata database, where M is a positive integer;

constructing second prompt information for each of the data tables based on second table related information of each of the data tables, wherein the second table related information includes table comments, field names and field comments;

inputting the second prompt information into the LLM to obtain the second structured query language, output by the LLM, that corresponds to the second question text.

In this embodiment, the second question text is first obtained from the training database and input into the LLM. The LLM encodes the text with its pre-trained model, converting each word, phrase or sentence in the text into a corresponding vector representation. The final result is a high-dimensional vector representing the semantic information of the entire question text, namely the second sentence vector.

Based on the second sentence vector and the table comments of the data tables, the M data tables ranked highest in similarity to the second question text are selected from the preset metadata database. Optionally, a cosine similarity algorithm is used to calculate the similarity between the second sentence vector and the table comment of each data table in the preset metadata database. According to the similarity results, the data tables in the database are sorted from high to low by their similarity to the second sentence vector, and the top M data tables are selected.
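The optional cosine similarity ranking can be sketched in a few lines. The sketch assumes that each table comment has already been encoded into a vector of the same dimension as the question vector; the function names and the dictionary layout are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_m_tables(question_vec: list[float],
                 comment_vecs: dict[str, list[float]], m: int) -> list[str]:
    """Return the names of the M tables whose comment vectors are most similar."""
    ranked = sorted(
        comment_vecs,
        key=lambda name: cosine_similarity(question_vec, comment_vecs[name]),
        reverse=True,
    )
    return ranked[:m]
```

For a metadata database with many tables, a vector index would normally replace the full sort, but the ranking criterion stays the same.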

Next, optionally, a Prompt Generator can be used to integrate the table comments, field names and field comments. For example, the table comment is placed at the beginning of the prompt to describe the basic information and purpose of the data table, and the field names and field comments are integrated into the prompt information to present the fields of the data table together with their meanings and purposes.

Finally, the second prompt information is input into the LLM to obtain the second structured query language, output by the LLM, that corresponds to the second question text.

In some embodiments, encoding the second question text and the second structured query language through the LLM to obtain a third sentence vector further includes:

performing a syntax check on the second structured query language;

in a case where the second structured query language passes the syntax check, sending the second structured query language to a manual processing terminal, so that the manual processing terminal verifies whether the second structured query language meets the business requirements;

in a case where a feedback result is received in which the manual processing terminal determines that the second structured query language meets the business requirements, encoding the second question text and the second structured query language through the LLM to obtain the third sentence vector.

It should be noted that a database query language has a specific syntax format and specification. If the generated query language does not conform to this specification, errors will occur when the database query is executed. Therefore, in this embodiment, a syntax check is first performed on the second structured query language to ensure that it conforms to the syntax rules of the specific database query language.

It should also be noted that passing the syntax check alone usually cannot fully ensure that the query language matches the actual business scenario and logic requirements. Therefore, after the second structured query language passes the syntax check, it is sent to the manual processing terminal, which verifies whether it meets the business requirements.

Once the manual processing terminal determines that the second structured query language meets the business requirements, the second question text and the second structured query language can be encoded through the LLM to obtain the third sentence vector.
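The two-stage gate described above (automatic syntax check, then manual confirmation) can be sketched as follows. The sketch assumes SQLite as the target dialect and an in-memory test database created from schema statements; the embodiment does not name a database engine, and `passes_syntax_check`, `accept_sql` and the `reviewer_approves` callback are hypothetical names. `EXPLAIN` is used so that the statement is compiled against the schema without being run against real data.

```python
import sqlite3
from typing import Callable


def passes_syntax_check(sql: str, schema_ddl: list[str]) -> bool:
    """Compile the statement against an empty in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        for ddl in schema_ddl:
            conn.execute(ddl)
        # EXPLAIN makes SQLite parse and plan the statement; syntax errors and
        # references to unknown tables or columns raise an exception here.
        conn.execute("EXPLAIN " + sql)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def accept_sql(sql: str, schema_ddl: list[str],
               reviewer_approves: Callable[[str], bool]) -> bool:
    """Only statements that pass both the syntax check and manual review proceed."""
    return passes_syntax_check(sql, schema_ddl) and reviewer_approves(sql)
```

Because `and` short-circuits, the reviewer callback is only invoked for statements that already passed the automatic check, which mirrors the order of the steps in the embodiment.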

In some embodiments, the method further includes:

continuing to perform the step of acquiring a second question text from the training database and obtaining, through the LLM, a second structured query language matching the second question text;

continuing to perform the step of encoding the second question text and the second structured query language through the LLM to obtain a third sentence vector;

continuing to perform the step of generating third prompt information based on the second question text and the second structured query language, and inputting the third prompt information and the third sentence vector into the LLM to obtain a description result output by the LLM, wherein the third prompt information is used to instruct the LLM to describe the steps by which the second structured query language is generated from the second question text;

continuing to perform the step of extracting key sentences from the description result and storing the extracted key sentences in the related question field and the thought chain field of the data table that matches the second question text in the preset metadata database, until the proportion of database tables in the preset metadata database whose related question field and thought chain field have been updated reaches a preset threshold.

In this embodiment, during the training phase of the LLM, training is iterated with the update proportion of the related question fields and thought chain fields of the database tables in the preset metadata database as the target.

Specifically, the second question text is first obtained from the training database, and the corresponding second structured query language is obtained through the LLM. The second question text and the second structured query language are then input into the LLM to obtain the third sentence vector. Next, third prompt information is generated based on the second question text and the second structured query language, and this prompt information and the third sentence vector are input into the LLM to obtain the description result output by the LLM, which describes the steps by which the second structured query language is generated from the second question text. Key sentences are then extracted from the description result and stored in the related question field and the thought chain field of the data table that matches the second question text in the preset metadata database. The above steps are repeated until the update proportion of the related question fields and thought chain fields of the database tables in the preset metadata database reaches the preset threshold. Through these steps, the related question fields and thought chain fields of the database tables in the preset metadata database are gradually accumulated and updated.

For ease of understanding, refer to FIG. 2. FIG. 2 is the second schematic flowchart of the query statement generation method provided by the present invention. As shown in FIG. 2, the steps of the training phase in the query statement generation method include:

Step 1: The user submits a question text, which is converted into a sentence vector by the LLM's Embedding.

Step 2: The vectorized question is compared with the table comments in the metadata database and their similarity is calculated; the cosine similarity algorithm can be used for this.

Step 3: The Top N tables with the highest similarity are selected, and the Prompt Generator builds basic prompt information from the table comments, field names and field comments.

Step 4: The prompt generated in step 3 is sent to the LLM to generate SQL.

Step 5: The syntax of the generated SQL is verified against a test database built from the metadata database. If verification succeeds, go to step 6; otherwise, go to step 3.

Step 6: It is manually confirmed whether the generated SQL meets the business requirements. If so, go to step 7; otherwise, go to step 3.

Step 7: The question and SQL have now been generated and manually confirmed. The question and SQL are converted into a sentence vector by LLM Embedding and input into the LLM, and an instruction is added to the prompt information: "Please describe the answer generation process step by step". Finally, the result output by the LLM is stored in the related question and thought chain fields of the metadata database, where the related question field describes the questions and SQL statements related to the table, and the thought chain field records the process by which the SQL statement was generated.

The training process in this embodiment uses the update proportion of the related question and thought chain fields of the data tables in the preset metadata database as its stopping criterion; that is, the update training may stop only once most of the data tables in the preset metadata database already have related question and thought chain content.
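The stopping criterion can be sketched as a ratio check wrapped around the per-sample steps. The dictionary layout of the metadata, the `process_sample` callback (standing in for steps 1 to 7 on one question) and the default threshold of 0.8 are assumptions made for this illustration; the embodiment only requires that a preset threshold be reached.

```python
from typing import Callable, Iterable


def update_ratio(metadata: dict[str, dict]) -> float:
    """Share of tables whose related question and thought chain fields are filled."""
    if not metadata:
        return 0.0
    updated = sum(
        1 for entry in metadata.values()
        if entry.get("related_questions") and entry.get("thought_chain")
    )
    return updated / len(metadata)


def enrich_until(metadata: dict[str, dict], samples: Iterable,
                 process_sample: Callable[[dict, object], None],
                 threshold: float = 0.8) -> int:
    """Process training samples until the update ratio reaches the threshold."""
    processed = 0
    for sample in samples:
        if update_ratio(metadata) >= threshold:
            break
        process_sample(metadata, sample)  # hypothetical: runs steps 1-7 for one question
        processed += 1
    return processed
```

The function returns how many samples were consumed, which makes it easy to see how much of the training database was needed to reach the threshold.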

In some embodiments, after inputting the first prompt information into the LLM and obtaining the first structured query language, output by the LLM, that corresponds to the first question text, the method further includes:

fine-tuning the LLM according to a number of first structured query languages that the user has verified as correctly generated, together with their corresponding first question texts, to obtain a fine-tuned LLM;

in response to a third question text submitted by a user, inputting the third question text into the fine-tuned LLM to obtain a third structured query language output by the fine-tuned LLM.

In this embodiment, the LLM-based query statement generation method is divided into three phases: a training phase, an application phase and an optimization phase.

In this embodiment, after the application phase has run for a period of time, the LLM is fine-tuned according to the first structured query languages that users have reported as correctly generated and their corresponding first question texts, so that the LLM acquires long-term fixed memory. Fine-tuning tools include, but are not limited to, LoRA and P-Tuning v2.

After the LLM has been fine-tuned, because it now has long-term fixed memory, a third question text submitted by a user can be input directly into the fine-tuned LLM, which outputs the third structured query language.
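Before any fine-tuning tool is run, the user-verified pairs have to be collected into training records. The sketch below shows one possible way to do this; the feedback item keys (`question`, `sql`, `verified`) and the `prompt`/`completion` record layout are assumptions for illustration, and a real LoRA or P-Tuning v2 toolchain may expect a different format.

```python
import json


def to_finetune_records(feedback: list[dict]) -> list[str]:
    """Turn user-verified question/SQL pairs into JSON Lines training records."""
    records = []
    for item in feedback:
        # Only pairs the user confirmed as correctly generated are kept.
        if not item.get("verified"):
            continue
        record = {"prompt": item["question"], "completion": item["sql"]}
        # ensure_ascii=False keeps non-ASCII question text readable in the file.
        records.append(json.dumps(record, ensure_ascii=False))
    return records
```

Each returned string is one line of a JSON Lines file, so the result can be written out with `"\n".join(records)`.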

The query statement generation device provided by the present invention is described below. The query statement generation device described below and the query statement generation method described above may be referred to in correspondence with each other.

FIG. 3 is a schematic structural diagram of the query statement generation device provided by the present invention. As shown in FIG. 3, the device includes:

an encoding unit 310, configured to, in response to a first question text submitted by a user, encode the first question text through a large language model (LLM) to obtain a first sentence vector;

a selection unit 320, configured to select, based on the first sentence vector and the table comments of the data tables, the N data tables ranked highest in similarity to the first question text from a preset metadata database, where N is a positive integer;

a construction unit 330, configured to construct first prompt information for each of the data tables based on first table related information of each of the data tables, wherein the first table related information includes table comments, field names, field comments, related questions and thought chains;

a generation unit 340, configured to input the first prompt information into the LLM to obtain the first structured query language, output by the LLM, that corresponds to the first question text.

The device provided by the present invention, in response to a first question text submitted by a user, encodes the first question text through the LLM to obtain a first sentence vector; based on the first sentence vector, selects the N data tables ranked highest in similarity to the first question text from a preset metadata database; constructs first prompt information for each data table based on that table's first table related information, where the first table related information includes table comments, field names, field comments, related questions and thought chains; and inputs the first prompt information into the LLM to obtain the first structured query language, output by the LLM, that corresponds to the first question text. By using related questions and thought chains, the present invention makes the constructed prompt information richer and more efficient, and by fine-tuning the LLM through prompt learning it improves the accuracy of the structured query language generated by the LLM.

FIG. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include a processor 410, a communications interface 420, a memory 430 and a communication bus 440, where the processor 410, the communications interface 420 and the memory 430 communicate with one another through the communication bus 440. The processor 410 may call logic instructions in the memory 430 to execute the query statement generation method, which includes: in response to a first question text submitted by a user, encoding the first question text through a large language model (LLM) to obtain a first sentence vector; selecting, based on the first sentence vector and the table comments of the data tables, the N data tables ranked highest in similarity to the first question text from a preset metadata database, where N is a positive integer; constructing first prompt information for each of the data tables based on first table related information of each of the data tables, where the first table related information includes table comments, field names, field comments, related questions and thought chains; and inputting the first prompt information into the LLM to obtain the first structured query language, output by the LLM, that corresponds to the first question text.

In addition, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to execute the query statement generation method provided by the methods above, which includes: in response to a first question text submitted by a user, encoding the first question text through a large language model (LLM) to obtain a first sentence vector; selecting, based on the first sentence vector and the table comments of the data tables, the N data tables ranked highest in similarity to the first question text from a preset metadata database, where N is a positive integer; constructing first prompt information for each of the data tables based on first table related information of each of the data tables, where the first table related information includes table comments, field names, field comments, related questions and thought chains; and inputting the first prompt information into the LLM to obtain the first structured query language, output by the LLM, that corresponds to the first question text.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the query statement generation method provided by the methods above, which includes: in response to a first question text submitted by a user, encoding the first question text through a large language model (LLM) to obtain a first sentence vector; selecting, based on the first sentence vector and the table comments of the data tables, the N data tables ranked highest in similarity to the first question text from a preset metadata database, where N is a positive integer; constructing first prompt information for each of the data tables based on first table related information of each of the data tables, where the first table related information includes table comments, field names, field comments, related questions and thought chains; and inputting the first prompt information into the LLM to obtain the first structured query language, output by the LLM, that corresponds to the first question text.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

From the description of the above implementations, a person skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solution in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A query statement generation method, comprising:
in response to a first question text submitted by a user, encoding the first question text through a large language model (LLM) to obtain a first sentence vector;
selecting, based on the first sentence vector and the table comments of the data tables, the N data tables ranked highest in similarity to the first question text from a preset metadata database, wherein N is a positive integer;
constructing first prompt information for each data table based on first table related information of each data table, wherein the first table related information comprises table comments, field names, field comments, related questions and thought chains;
and inputting the first prompt information into the LLM to obtain a first structured query language, output by the LLM, that corresponds to the first question text.
2. The query statement generation method according to claim 1, wherein the related questions and the thought chain of each data table in the preset metadata base are obtained by:
acquiring a second question text from a training database, and obtaining, through the LLM, a second structured query language (SQL) statement matching the second question text;
encoding the second question text and the second SQL statement through the LLM to obtain a third sentence vector;
generating third prompt information based on the second question text and the second SQL statement, and inputting the third prompt information and the third sentence vector into the LLM to obtain a description result output by the LLM, wherein the third prompt information instructs the LLM to describe the steps for generating the second SQL statement from the second question text; and
extracting key sentences from the description result, and storing the extracted key sentences in the related question field and the thought chain field, respectively, of the data table in the preset metadata base that matches the second question text, wherein the related question field describes the related questions and SQL statements of the data table, and the thought chain field records the steps for generating the SQL statements.
3. The query statement generation method according to claim 2, wherein the obtaining, through the LLM, of the second structured query language (SQL) statement matching the second question text comprises:
acquiring the second question text from the training database, and encoding the second question text through the LLM to obtain a second sentence vector;
selecting, from the preset metadata base, the M data tables ranked highest in similarity to the second question text, based on the second sentence vector and the table comments of the data tables, wherein M is a positive integer;
constructing second prompt information for each data table based on second table-related information of that data table, wherein the second table-related information comprises table comments, field names, and field comments; and
inputting the second prompt information into the LLM to obtain the second SQL statement, output by the LLM, corresponding to the second question text.
4. The query statement generation method according to claim 2, wherein the encoding, through the LLM, of the second question text and the second structured query language (SQL) statement to obtain the third sentence vector further comprises:
performing a syntax check on the second SQL statement;
when the second SQL statement passes the syntax check, sending the second SQL statement to a manual processing terminal so that the manual processing terminal can verify whether the second SQL statement meets business requirements; and
upon receiving, from the manual processing terminal, a feedback result determining that the second SQL statement meets the business requirements, encoding the second question text and the second SQL statement through the LLM to obtain the third sentence vector.
5. The query statement generation method according to claim 2, wherein the method further comprises:
continuing to perform the step of acquiring a second question text from the training database and obtaining, through the LLM, a second structured query language (SQL) statement matching the second question text;
continuing to perform the step of encoding the second question text and the second SQL statement through the LLM to obtain a third sentence vector;
continuing to perform the step of generating third prompt information based on the second question text and the second SQL statement, and inputting the third prompt information and the third sentence vector into the LLM to obtain a description result output by the LLM, wherein the third prompt information instructs the LLM to describe the steps for generating the second SQL statement from the second question text; and
continuing to perform the step of extracting key sentences from the description result and storing the extracted key sentences in the related question field and the thought chain field, respectively, of the data table in the preset metadata base that matches the second question text, until the proportion of data tables in the preset metadata base whose related question field and thought chain field have been updated reaches a preset threshold.
6. The query statement generation method according to claim 1, wherein, after the first prompt information is input into the LLM to obtain the first structured query language (SQL) statement corresponding to the first question text output by the LLM, the method further comprises:
fine-tuning the LLM on a plurality of first SQL statements that the user has verified as correctly generated, together with their corresponding first question texts, to obtain a fine-tuned LLM; and
in response to a third question text submitted by a user, inputting the third question text into the fine-tuned LLM to obtain a third SQL statement output by the fine-tuned LLM.
7. A query statement generation device, comprising:
an encoding unit, configured to, in response to a first question text submitted by a user, encode the first question text through a large language model (LLM) to obtain a first sentence vector;
a selecting unit, configured to select, from a preset metadata base, the N data tables ranked highest in similarity to the first question text, based on the first sentence vector and the table comments of the data tables, wherein N is a positive integer;
a construction unit, configured to construct first prompt information for each data table based on first table-related information of that data table, wherein the first table-related information comprises table comments, field names, field comments, related questions, and a thought chain; and
a generating unit, configured to input the first prompt information into the LLM to obtain a first structured query language (SQL) statement, output by the LLM, corresponding to the first question text.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the query statement generation method of any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the query statement generation method of any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the query statement generation method of any one of claims 1 to 6.
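
As an illustration only, the retrieval and prompt-construction steps of claim 1 and the stopping condition of claim 5 might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the `TableMeta` layout, the function names, the use of cosine similarity, and the prompt wording are all hypothetical, and the sentence vectors are assumed to come from an LLM encoder that is not shown here.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class TableMeta:
    """One entry of the preset metadata base (this layout is an assumption)."""
    name: str
    comment: str                    # table comment
    comment_vector: list[float]     # sentence vector of the table comment
    fields: list[tuple[str, str]]   # (field name, field comment) pairs
    related_questions: list[str] = field(default_factory=list)
    thought_chain: list[str] = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_top_tables(question_vector, tables, n):
    """Return the n tables whose comment vectors are most similar to the question."""
    ranked = sorted(
        tables,
        key=lambda t: cosine(question_vector, t.comment_vector),
        reverse=True,
    )
    return ranked[:n]


def build_prompt(question, table):
    """Assemble prompt text for one table from its table-related information."""
    lines = [f"Table: {table.name} ({table.comment})", "Fields:"]
    lines += [f"  {name}: {comment}" for name, comment in table.fields]
    if table.related_questions:
        lines.append("Related questions:")
        lines += [f"  {q}" for q in table.related_questions]
    if table.thought_chain:
        lines.append("Thought chain:")
        lines += [f"  {i}. {step}" for i, step in enumerate(table.thought_chain, 1)]
    lines.append(f"Question: {question}")
    lines.append("Write one SQL statement that answers the question.")
    return "\n".join(lines)


def update_ratio(tables):
    """Share of tables whose related-question and thought-chain fields are both filled."""
    if not tables:
        return 0.0
    done = sum(1 for t in tables if t.related_questions and t.thought_chain)
    return done / len(tables)
```

In a fuller pipeline, each prompt returned by `build_prompt` would be sent to the LLM to obtain the SQL statement, and the enrichment loop of claim 5 would stop once `update_ratio` reaches the preset threshold. Both of those steps depend on the specific model and deployment and are therefore left out of the sketch.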

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410372853.1A CN118796872A (en) 2024-03-29 2024-03-29 Query statement generation method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410372853.1A CN118796872A (en) 2024-03-29 2024-03-29 Query statement generation method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN118796872A true CN118796872A (en) 2024-10-18

Family

ID=93024516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410372853.1A Pending CN118796872A (en) 2024-03-29 2024-03-29 Query statement generation method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN118796872A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119226318A (en) * 2024-11-29 2024-12-31 朗新科技集团股份有限公司 Structured query language generation method, device, electronic device and storage medium
CN119248909A (en) * 2024-12-04 2025-01-03 浙商期货有限公司 A method and system for generating futures content based on large language model
CN119377255A (en) * 2024-12-30 2025-01-28 浙江大学计算机创新技术研究院 A method and device for generating SQL based on semantic alignment and hierarchical agent

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119226318A (en) * 2024-11-29 2024-12-31 朗新科技集团股份有限公司 Structured query language generation method, device, electronic device and storage medium
CN119226318B (en) * 2024-11-29 2025-05-30 朗新科技集团股份有限公司 Structured query language generation method and device, electronic equipment and storage medium
CN119248909A (en) * 2024-12-04 2025-01-03 浙商期货有限公司 A method and system for generating futures content based on large language model
CN119377255A (en) * 2024-12-30 2025-01-28 浙江大学计算机创新技术研究院 A method and device for generating SQL based on semantic alignment and hierarchical agent

Similar Documents

Publication Publication Date Title
US11086601B2 (en) Methods, systems, and computer program product for automatic generation of software application code
US11250033B2 (en) Methods, systems, and computer program product for implementing real-time classification and recommendations
US10705796B1 (en) Methods, systems, and computer program product for implementing real-time or near real-time classification of digital data
US10467122B1 (en) Methods, systems, and computer program product for capturing and classification of real-time data and performing post-classification tasks
CN118796872A (en) Query statement generation method, device, equipment, storage medium and program product
CN113254619A (en) Automatic reply method and device for user query and electronic equipment
US9672490B2 (en) Procurement system
CN113779062B (en) SQL statement generation method, device, storage medium and electronic device
CN111736804B (en) A method and device for identifying key functions of App based on user comments
CN111177307A (en) Test scheme and system based on semantic understanding similarity threshold configuration
CN117992791B (en) Sentence generation model training method, sentence generation method, system and device
CN115510193B (en) Query result vectorization method, query result determination method and related devices
CN117520503A (en) Financial customer service dialogue generation method, device, equipment and medium based on LLM model
CN118227766B (en) Tool enhancement-based intelligent question-answering method for financial field
CN117370190A (en) Test case generation method and device, electronic equipment and storage medium
CN113407677B (en) Method, apparatus, device and storage medium for evaluating consultation dialogue quality
CN117709358A (en) Dialogue response method, device, equipment and medium of insurance intelligent question-answering system
CN118779315A (en) Database query method, device, electronic device and storage medium
CN119719321A (en) Query statement generation method, device, equipment and storage medium
CN119441443A (en) Automatic prompt construction method based on human-computer dialogue history and semantic retrieval
CN118916618A (en) Network operation and maintenance analysis method, device and storage medium
Roychowdhury et al. ERATTA: Extreme RAG for Table To Answers with Large Language Models
CN119960938A (en) Interface calling method, device, computer equipment and storage medium
CN118627471B (en) Automatic government affair data labeling method and system based on dependency attention diagram convolution
CN114780577A (en) SQL statement generation method, device, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination