CN119201966A

CN119201966A - SQL statement generation method, device, equipment and medium based on large language model

Info

Publication number: CN119201966A
Application number: CN202410731907.9A
Authority: CN
Inventors: 余建学; 叶国栋; 易飞乐; 冯丰; 侯轶; 李苑兰; 徐超
Original assignee: China Merchants Shekou Digital City Technology Co ltd
Current assignee: China Merchants Shekou Digital City Technology Co ltd
Priority date: 2024-06-06
Filing date: 2024-06-06
Publication date: 2024-12-27

Abstract

The present invention discloses a method, device, equipment and medium for generating SQL statements based on a large language model. The method comprises: obtaining a user query text; inputting the user query text into a pre-trained Text2SQL model to obtain a SQL statement output by the Text2SQL model; inputting the SQL statement into a pre-trained SQL2Text model to obtain a target query text output by the SQL2Text model; calculating the semantic similarity between the user query text and the target query text; if the semantic similarity is greater than or equal to a preset target value, determining that the SQL statement is the SQL statement corresponding to the user query text. Through the above method, the problem of low accuracy in generating SQL statements is solved, and the accuracy and efficiency of generating SQL statements are improved.

Description

SQL sentence generation method, device, equipment and medium based on large language model

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a medium for generating an SQL statement based on a large language model.

Background

In enterprise data analysis applications, Business Intelligence (BI) report analysis has become an integral part of enterprise decision making and business management. In recent years, with the rapid development of artificial intelligence technology, and particularly the advent of large language models, Natural Language Processing (NLP) technology has come into wide use. Generating SQL (Structured Query Language) statements from natural language to carry out report analysis significantly improves the convenience and efficiency of data query and analysis.

However, when generating SQL statements, existing large models may of their own accord guess, add, or omit certain key field information, resulting in errors in the generated SQL statements. How to improve the accuracy of SQL statement generation is therefore an urgent problem to be solved.

Disclosure of Invention

The embodiment of the invention provides a large language model-based SQL sentence generation method, device, equipment and medium, which are used for solving the problem of low accuracy in generating SQL sentences.

A SQL sentence generation method based on a large language model comprises the following steps:

Acquiring a user query text;

inputting the user query Text into a pre-trained Text2SQL model to obtain an SQL sentence output by the Text2SQL model;

Inputting the SQL sentence into a pre-trained SQL2Text model to obtain a target query Text output by the SQL2Text model;

Calculating semantic similarity between the user query text and the target query text;

and if the semantic similarity is greater than or equal to a preset target value, determining the SQL sentence to be the SQL sentence corresponding to the user query text.

In an embodiment, the inputting the user query Text into a pre-trained Text2SQL model to obtain an SQL statement output by the Text2SQL model includes:

Preprocessing the user query text to obtain a standard user query text;

And inputting the standard user query Text into a pre-trained Text2SQL model to obtain SQL sentences output by the Text2SQL model.

In an embodiment, the inputting the SQL statement into a pre-trained SQL2Text model to obtain the target query Text output by the SQL2Text model includes:

Checking the SQL statement to obtain a checking result;

And if the verification result is verification passing, inputting the SQL sentence passing the verification into a pre-trained SQL2Text model to obtain a target query Text output by the SQL2Text model.

In an embodiment, the method further comprises:

If the verification result is that the verification is not passed, obtaining error information in the verification result;

splicing the error information with the standard user query text to obtain a target spliced text;

and re-inputting the target spliced Text to the Text2SQL model until the SQL sentence output by the Text2SQL model passes verification.
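
The regenerate-on-failure loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `text2sql` and `verify` are hypothetical callables standing in for the Text2SQL model and the verification module, the splicing format is invented for the example, and `max_attempts` is an added safety bound that the text itself does not specify.

```python
from typing import Callable, Optional, Tuple


def generate_until_valid(
    standard_query: str,
    text2sql: Callable[[str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,  # assumption: the source does not bound the loop
) -> Optional[str]:
    """Regenerate SQL, feeding verification errors back, until a check passes."""
    prompt = standard_query
    for _ in range(max_attempts):
        sql = text2sql(prompt)
        passed, error_info = verify(sql)
        if passed:
            return sql
        # Splice the error information with the standard user query text.
        prompt = f"{standard_query}\nError information: {error_info}"
    return None


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def fake_text2sql(prompt: str) -> str:
    # Produces a corrected statement only once error feedback is present.
    if "Error information" in prompt:
        return "SELECT amount FROM sale_table"
    return "SELECT amount FORM sale_table"


def fake_verify(sql: str) -> Tuple[bool, str]:
    if " FROM " in sql:
        return True, ""
    return False, "syntax error near 'FORM'"


result = generate_until_valid("query the sales amount", fake_text2sql, fake_verify)
never = generate_until_valid("query", lambda p: "bad", lambda s: (False, "invalid"))
```

Returning `None` after the bound is reached is one possible design choice; a caller could equally surface the last error information to the user.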

In an embodiment, the verifying the SQL statement to obtain a verification result includes:

carrying out grammar checking on the SQL sentence to obtain a grammar checking result;

performing semantic verification on the SQL statement to obtain a semantic verification result;

Performing security verification on the SQL statement to obtain a security verification result;

And performing performance verification on the SQL statement to obtain a performance verification result.
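
One way to organize the checks is as an ordered list of independent functions whose failures are collected into a single verification result. The sketch below is only an illustration: `run_checks`, `CheckResult`, and the two toy rules (`check_read_only`, `check_has_filter`) are hypothetical stand-ins, because the concrete security and performance rules are not specified at this point in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A check returns None on success or an error message on failure.
Check = Callable[[str], Optional[str]]


@dataclass
class CheckResult:
    passed: bool
    errors: List[str] = field(default_factory=list)


def run_checks(sql: str, checks: List[Tuple[str, Check]]) -> CheckResult:
    """Run every named check and collect the messages of those that fail."""
    errors = []
    for name, check in checks:
        message = check(sql)
        if message is not None:
            errors.append(f"{name}: {message}")
    return CheckResult(passed=not errors, errors=errors)


def check_read_only(sql: str) -> Optional[str]:
    # Hypothetical security rule: only SELECT statements are allowed.
    if not sql.lstrip().upper().startswith("SELECT"):
        return "only SELECT statements are permitted"
    return None


def check_has_filter(sql: str) -> Optional[str]:
    # Hypothetical performance rule: flag queries without a WHERE clause.
    if " WHERE " not in sql.upper():
        return "query has no WHERE clause and may scan the whole table"
    return None


CHECKS = [("security", check_read_only), ("performance", check_has_filter)]
ok = run_checks("SELECT amount FROM sale_table WHERE project = 'A item'", CHECKS)
bad = run_checks("DELETE FROM sale_table", CHECKS)
```

Collecting all error messages, rather than stopping at the first failure, gives the regeneration step more information to splice back into the query text.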

In an embodiment, the performing the grammar check on the SQL statement to obtain a grammar check result includes:

analyzing the grammar structure of the SQL sentence;

And comparing the grammar structure with a predefined SQL grammar rule to obtain the grammar checking result.

In one embodiment, the SQL2Text model is trained by:

Acquiring a training sample data set and a test sample data set, wherein the training sample data set comprises a plurality of training sample pairs, the test sample data set comprises a plurality of test sample pairs, the training sample pairs comprise a training prompt field and a training response field, and the test sample pairs comprise a test prompt field and a test response field;

Training the SQL2Text model to be trained based on the training prompt field and the training response field;

After the training of the SQL2Text model to be trained is completed, testing the SQL2Text model after the training is completed based on the test prompt field and the test response field to obtain a test result;

If the test result meets the expected target, determining that the trained SQL2Text model is a trained SQL2Text model;

And if the test result does not meet the expected target, adjusting hidden parameters of the SQL2Text model after training and/or modifying the training sample data set, and retraining the SQL2Text model after training until the test result meets the expected target.

An SQL statement generation device based on a large language model, comprising:

The acquisition module is used for acquiring a user query text;

the input module is used for inputting the user query Text into a pre-trained Text2SQL model to obtain an SQL sentence output by the Text2SQL model, and inputting the SQL sentence into the pre-trained SQL2Text model to obtain a target query Text output by the SQL2Text model;

the calculating module is used for calculating the semantic similarity between the user query text and the target query text;

And the determining module is used for determining the SQL sentence to be the SQL sentence corresponding to the user query text if the semantic similarity is greater than or equal to a preset target value.

A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above-described large language model based SQL statement generation method when executing the computer program.

A computer readable storage medium storing a computer program which when executed by a processor implements the above-described large language model-based SQL statement generation method.

In one scheme provided by the large language model-based SQL sentence generation method, device, equipment and medium, a user query Text is acquired, the user query Text is input into a pre-trained Text2SQL model to acquire an SQL sentence output by the Text2SQL model, the SQL sentence is input into the pre-trained SQL2Text model to acquire a target query Text output by the SQL2Text model, semantic similarity between the user query Text and the target query Text is calculated, and if the semantic similarity is greater than or equal to a preset target value, the SQL sentence corresponding to the user query Text is determined. In the embodiment, whether the generated SQL sentence is accurate or not is judged by comparing the semantic similarity between the user query text and the target query text, the query intention of the user is accurately reflected, and when the semantic similarity is larger than or equal to a preset target value, the SQL sentence is determined to be the SQL sentence corresponding to the user query text, so that the accuracy and the efficiency of the SQL sentence generation are improved, the generated SQL sentence is ensured to be in accordance with the grammar of a database, the query intention of the user is also in accordance, and the user can acquire the required information from the database more conveniently and accurately.
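
The round-trip check summarized above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: `text2sql`, `sql2text`, and `similarity` are hypothetical callables standing in for the two fine-tuned models and the semantic similarity calculation, the word-overlap score is a toy substitute for a real semantic similarity, and the threshold 0.8 is an assumed value for the preset target value.

```python
from typing import Callable, Optional


def generate_verified_sql(
    user_query: str,
    text2sql: Callable[[str], str],
    sql2text: Callable[[str], str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed value for the "preset target value"
) -> Optional[str]:
    """Return the SQL only if its back-translation matches the user query."""
    sql = text2sql(user_query)        # natural language -> SQL
    target_query = sql2text(sql)      # SQL -> natural language
    score = similarity(user_query, target_query)
    return sql if score >= threshold else None


# Toy stand-ins (all hypothetical) so the sketch runs end to end.
def fake_text2sql(query: str) -> str:
    return "SELECT amount FROM sale_table WHERE project = 'A'"


def fake_sql2text(sql: str) -> str:
    return "query the sales amount of item A"


def word_overlap(a: str, b: str) -> float:
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


accepted = generate_verified_sql(
    "query the sales amount of item A", fake_text2sql, fake_sql2text, word_overlap
)
rejected = generate_verified_sql(
    "list all employees in Beijing", fake_text2sql, fake_sql2text, word_overlap
)
```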

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of SQL2Text model training in an embodiment of the invention;

FIG. 2 is a flow chart of a large language model based SQL statement generation method in accordance with one embodiment of the invention;

FIG. 3 is another flow chart of a large language model based SQL statement generation method in an embodiment of the invention;

FIG. 4 is another flow chart of a large language model based SQL statement generation method in an embodiment of the invention;

FIG. 5 is a schematic diagram of a large language model based SQL statement generation device according to an embodiment of the invention;

FIG. 6 is a schematic diagram of a computer device in accordance with an embodiment of the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The SQL sentence generation method based on the large language model provided by the embodiment of the invention can be applied to a plurality of fields, and particularly comprises but is not limited to the fields of real estate development, business operation, building operation and maintenance and the like. In addition, the large language model-based SQL sentence generation method of the present invention can be applied to various systems which may exist in various terminal devices, including, but not limited to, personal computers, notebook computers, smart phones, tablet computers, etc.

In order to facilitate understanding of the embodiments of the present invention, the SQL2Text model, the Text2SQL model, the Prompt preprocessing module, the Embedding module, the verification module, the execution module, and the data presentation module involved in the present invention are explained below:

SQL2Text model:

The SQL2Text model is a model that converts Structured Query Language (SQL) into natural language (Text). Its main task is to convert the SQL statements used in database queries into natural language descriptions that humans can understand. In essence, it extracts key information by analyzing the structure and semantics of an SQL statement and re-expresses that information in natural language. Specifically, the SQL2Text model in the invention can be obtained by training and fine-tuning a base large model, or by training and fine-tuning another large model; any large model that supports fine-tuning and transfer learning may be used, and the invention is not limited in this respect. Here, a base large model refers to a machine learning model with a very large number of parameters and a complex network architecture that can handle many different tasks, supports fine-tuning, and can be transferred to other task domains similar to the target task domain, for example, a neural network model with billions of parameters or more.

As shown in fig. 1, the SQL2Text model may be obtained by training the following steps:

S1, acquiring a training sample data set and a test sample data set, wherein the training sample data set comprises a plurality of training sample pairs, the test sample data set comprises a plurality of test sample pairs, the training sample pairs comprise training prompt fields and training response fields, and the test sample pairs comprise test prompt fields and test response fields.

It is to be understood that the training sample data set may include, but is not limited to, thousands, tens of thousands, or hundreds of thousands of training sample pairs; the invention is not specifically limited in this respect. A training sample pair (e.g., a QA pair) includes a training prompt field (e.g., a Prompt field) and a training response field (e.g., a Response field). Specifically, the training prompt field may include a task instruction, an SQL statement, and metadata of a data table, and the training response field includes the user query text corresponding to the SQL statement in the training prompt field. The task instruction describes the specific task that the SQL2Text model is asked to perform, for example, "Please generate a user query text according to the SQL statement." The SQL statement is an SQL statement conforming to the SQL2000 standard, for example, "SELECT amount FROM sale_table WHERE project = 'A item' AND year_month = '2021-01'". The metadata of the data table is the metadata information defining the table structure, and includes the name of the data table in the database and, for each column, the field name, data type, value range, unit, constraints, field usage description, and the like. It should be noted that the foregoing is merely an example, and the present invention is not limited thereto. The test sample data set may likewise include, but is not limited to, thousands, tens of thousands, or hundreds of thousands of test sample pairs. A test sample pair includes a test prompt field and a test response field. The test prompt field and the test response field in the test sample data set may be the same as or different from the training prompt field and the training response field in the training sample data set; the invention is not specifically limited in this respect.

S2, training the SQL2Text model to be trained based on the training prompt field and the training response field.

It can be understood that after the training sample data set is obtained, the SQL2Text model to be trained is fine-tuned using the training prompt fields and training response fields in the training sample data set. The fine-tuning process enables the SQL2Text model to learn how to extract key information from the prompt field (for example, from the SQL statement) and convert it into an accurate and fluent natural language description, so that the SQL2Text model acquires the ability to generate natural language from SQL statements. The fine-tuning method may be full fine-tuning or incremental fine-tuning; the incremental fine-tuning may include, but is not limited to, techniques such as LoRA and Prompt Tuning.

And S3, testing the trained SQL2Text model based on the test prompt field and the test response field after the training of the SQL2Text model to be trained is completed, and obtaining a test result.

It can be appreciated that after the training of the SQL2Text model to be trained is completed, the trained SQL2Text model is tested based on the test prompt fields and test response fields in the test sample data set. For example, the task instruction, SQL statement, and data table metadata in each test prompt field are input to the trained SQL2Text model to obtain the natural language descriptions it generates. These generated descriptions are then compared with the standard descriptions (e.g., the user query text) in the test sample data set to derive the test result, for example by calculating a similarity or accuracy metric between the generated and standard descriptions to assess the accuracy of the trained SQL2Text model; the invention is not specifically limited in this respect.
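
The comparison between generated and standard descriptions can be sketched as follows. This is only an illustration under stated assumptions: the function name `evaluate` is hypothetical, exact-match accuracy is one possible accuracy metric, and `difflib.SequenceMatcher` is a character-level stand-in for whatever similarity measure is actually used, not a semantic one.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def evaluate(generated: List[str], reference: List[str]) -> Dict[str, float]:
    """Compare generated descriptions with the standard descriptions."""
    if len(generated) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    exact = sum(g.strip() == r.strip() for g, r in zip(generated, reference))
    # Character-level similarity in [0, 1]; a stand-in for semantic similarity.
    sims = [SequenceMatcher(None, g, r).ratio() for g, r in zip(generated, reference)]
    return {
        "exact_match": exact / len(reference),
        "mean_similarity": sum(sims) / len(sims),
    }


metrics = evaluate(
    generated=[
        "query the sales amount of item A",
        "query the visit count of item B",
    ],
    reference=[
        "query the sales amount of item A",
        "query the sales amount of item B",
    ],
)
```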

S4, if the test result meets the expected target, determining that the trained SQL2Text model is the trained SQL2Text model;

And S5, if the test result does not meet the expected target, adjusting hidden parameters of the SQL2Text model after training and/or modifying the training sample data set, and retraining the SQL2Text model after training until the test result meets the expected target.

It can be understood that after the test is completed, whether the trained SQL2Text model meets expectations is judged by comparing the test result with the expected target, so that the resulting SQL2Text model has high performance and stability and can provide a high-quality SQL-statement-to-natural-language conversion service in practical applications. If the test result shows that the SQL2Text model reaches the expected target on the key indicators, for example the accuracy, recall, and F1 score all meet the expected standard, the trained SQL2Text model is determined to be the final trained model. If the test result shows that the performance of the SQL2Text model does not meet the expected target, the hidden layer parameters of the trained SQL2Text model are adjusted, for example by changing the learning rate, adjusting the number of hidden layer neurons, or changing the activation function, so as to optimize its performance. In addition, if the performance does not improve noticeably after the parameters are adjusted, the training sample data set is modified, for example by increasing the number of samples, improving sample quality, or re-collecting sample data that better matches the actual application scenario. After the parameter adjustment or data set modification is completed, the SQL2Text model is trained again and tested again with the test sample data set. This process is repeated until the test result reaches the expected target.

By the above method, the SQL2Text model learns the semantic mapping among the metadata of the database table, the SQL statements, and the user query text, so that it can convert an SQL statement into user query text in natural language form.
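
The train, test, and adjust cycle of steps S2 to S5 can be sketched as a simple loop. Everything here is a hypothetical stand-in: `train_fn` and `adjust_fn` are placeholder callables, exact-match accuracy is an assumed test metric, the target of 0.9 is an assumed expected target, and `max_rounds` is an added safety bound that the text does not mention.

```python
from typing import Callable, Dict, List, Tuple

SamplePair = Tuple[str, str]      # (prompt field, response field)
Model = Callable[[str], str]      # maps a prompt to generated text


def train_until_target(
    train_fn: Callable[[List[SamplePair], Dict[str, float]], Model],
    adjust_fn: Callable[[Dict[str, float]], Dict[str, float]],
    train_set: List[SamplePair],
    test_set: List[SamplePair],
    params: Dict[str, float],
    target: float = 0.9,   # assumed "expected target"
    max_rounds: int = 5,   # assumed safety bound, not in the source
) -> Tuple[Model, float, int]:
    for round_no in range(1, max_rounds + 1):
        model = train_fn(train_set, params)                                  # S2
        accuracy = sum(model(p) == r for p, r in test_set) / len(test_set)   # S3
        if accuracy >= target:                                               # S4
            return model, accuracy, round_no
        params = adjust_fn(params)                                           # S5
    raise RuntimeError("expected target not reached within max_rounds")


# Toy stand-ins: the "model" memorizes its training pairs once epochs >= 2.
def toy_train(train_set: List[SamplePair], params: Dict[str, float]) -> Model:
    memory = dict(train_set) if params["epochs"] >= 2 else {}
    return lambda prompt: memory.get(prompt, "")


def toy_adjust(params: Dict[str, float]) -> Dict[str, float]:
    return {**params, "epochs": params["epochs"] + 1}


pairs = [("SELECT amount FROM sale_table", "query the sales amount")]
model, accuracy, rounds = train_until_target(
    toy_train, toy_adjust, pairs, pairs, {"epochs": 1}
)
```

A fuller version would also allow the training sample data set to be modified between rounds, as the text describes; the sketch adjusts parameters only.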

An example sample from the training and test sample data sets of the SQL2Text model is as follows:

{"prompt": "
Task instruction: Please generate a user query text according to the SQL statement.
SQL statement: SELECT amount FROM sale_table WHERE project = 'A item' AND year_month = '2021-01'
Metadata of the data table:
CREATE TABLE sale_table (
    year_month VARCHAR(100) COMMENT 'time dimension; data format: yyyy-mm',
    project VARCHAR(90) NOT NULL COMMENT 'organization dimension: item name',
    amount DECIMAL(24, 6) NOT NULL COMMENT 'sales amount, in units of ten thousand yuan'
) COMMENT 'sales data sheet'
",
"response": "User query text: Query the sales amount of item A in January 2021"
}

Text2SQL model:

The Text2SQL (also called NL2SQL) model is a model that converts a natural language (NL) question in the database domain into a Structured Query Language (SQL) statement that can be executed in a relational database. In essence, it converts the user's natural language sentence into a canonical semantic representation that a computer can understand and execute. Specifically, the Text2SQL model in the invention is obtained by training and fine-tuning a base large model. The training process is similar to that of the SQL2Text model (see steps S1-S5 above) and is not repeated here. The difference lies in the training sample pairs and test sample pairs of the training and test sample data sets. Specifically, the training prompt field (e.g., the Prompt field) may include a task instruction, a user query text, metadata of the data table, and sample data of the data table, while the training response field (e.g., the Response field) includes the SQL statement corresponding to the user query text in the training prompt field.

The task instruction describes the specific task that the Text2SQL model is asked to perform, for example, "Please generate an SQL statement conforming to the SQL2000 standard according to the user question." The user query text is the user question expressed in natural language text, which may contain Chinese, English, numbers, symbols, and the like, for example, "What is the sales amount of item A in January 2023?" The metadata of the data table is the metadata information defining the table structure, and includes the name of the data table in the database and, for each column, the field name, data type, value range, value unit, value constraints, field usage description, and the like. As for the sample data of the data table: to help the Text2SQL model understand the structure and content of the data table, a small amount of sample data from the data table needs to be provided. The sample data need not be identical to the real data of the data table, as long as its form of expression, value range, and data granularity are basically consistent with those of the data table; the invention is not specifically limited in this respect.

To ensure the quality of the training sample pairs, the correctness of the SQL statements must be ensured, for example by checking them manually or with automated tools, so that the syntax is correct, the semantics are clear, and the business requirements are met. In this way, the Text2SQL model learns the mapping among the implicit semantics of the user query text, the metadata of the database table, and the SQL statements, so that user query text in natural language form can be converted into the corresponding SQL statement. It should be noted that the foregoing is merely an example, and the present invention is not limited thereto.
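
The four components of the Text2SQL prompt field can be assembled as in the following sketch. It is an illustration only, not the sample format used by the invention: the function `build_text2sql_prompt`, the section labels, and the sample row values are all hypothetical and chosen to mirror the sales table used elsewhere in this description.

```python
from typing import Dict, List


def build_text2sql_prompt(
    instruction: str,
    user_query: str,
    table_ddl: str,
    sample_rows: List[Dict[str, object]],
) -> str:
    """Concatenate the four prompt components into one prompt string."""
    rows = "\n".join(
        ", ".join(f"{col}={val!r}" for col, val in row.items())
        for row in sample_rows
    )
    return (
        f"Task instruction: {instruction}\n"
        f"User query text: {user_query}\n"
        f"Metadata of the data table:\n{table_ddl}\n"
        f"Sample data of the data table:\n{rows}"
    )


prompt = build_text2sql_prompt(
    instruction="Please generate an SQL statement according to the user question.",
    user_query="What is the sales amount of item A in January 2023?",
    table_ddl=(
        "CREATE TABLE sale_table (year_month VARCHAR(100), "
        "project VARCHAR(90), amount DECIMAL(24, 6))"
    ),
    # Made-up illustrative row, not real data.
    sample_rows=[{"year_month": "2023-01", "project": "A item", "amount": 120.5}],
)

# A training sample pair then couples the prompt with the expected SQL statement.
sample_pair = {
    "prompt": prompt,
    "response": (
        "SELECT amount FROM sale_table "
        "WHERE project = 'A item' AND year_month = '2023-01'"
    ),
}
```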

An example of each sample of the training sample dataset, the test sample dataset of the Text2SQL model is as follows:

The Prompt preprocessing module:

The Prompt preprocessing module is the part of a Natural Language Processing (NLP) system that preprocesses the input user query text by designing specific prompts, so that the model's output is optimized and the user's intent can be accurately recognized and SQL statements correctly generated. For example, before the user query text (e.g., the user question text) is submitted to the Text2SQL model, it is preprocessed by the Prompt preprocessing module to obtain a standard user query text that is semantically complete and conforms to the specification.

Embedding module:

The Embedding module (also referred to as a word embedding module) implements a technique for representing data in which high-dimensional data (e.g., text, pictures, video) is mapped to a low-dimensional space, so that the input text is represented as a point in a continuous numerical space, yielding an Embedding vector. An Embedding vector captures the inherent semantics of natural language text: Embedding vectors that are geometrically close correspond to natural language texts with similar meanings. The Embedding module may be built with the Word2Vec, GloVe, or FastText algorithm, and is used to calculate the geometric distance between the Embedding vectors of two text phrases or sentences, thereby determining their semantic similarity.
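
As a rough sketch of how a geometric distance between two Embedding vectors yields a similarity score, the example below computes cosine similarity. The `embed` function is a toy bag-of-words stand-in (an assumption for illustration), not Word2Vec, GloVe, or FastText, and cosine similarity is one common choice of measure that the text does not itself prescribe.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a learned embedding: sparse word-count vector.
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query = "query the sales amount of item A"
similar = cosine_similarity(embed(query), embed("query the sales amount of item A in January"))
unrelated = cosine_similarity(embed(query), embed("list all employees in Beijing"))
identical = cosine_similarity(embed(query), embed(query))
```

With a real embedding model, paraphrases that share few words would still score high, which a word-count vector cannot capture.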

Verification module:

The verification module is mainly used to check the correctness, security, and performance of SQL statements.

Execution module:

The execution module is mainly used to establish the connection between the natural language processing system and the database, manage the connection pool, and execute SQL scripts.

Data presentation module:

The data presentation module is mainly used to receive the query result data passed from the execution module, parse the query result, extract the data set required for chart display, and offer the user a choice of chart types, such as histograms, line charts, pie charts, scatter charts, and area charts, so that a suitable chart is displayed according to the characteristics of the data and the needs of the user.

The method for generating SQL sentences based on the large language model provided by the invention is described in detail below through various embodiments.

In one embodiment, as shown in fig. 2, there is provided a method for generating an SQL statement based on a large language model, including the steps of:

S10, acquiring a user query text.

It will be appreciated that the user query text may be a user question text, e.g., "What is the sales amount of item A in January 2023?"

S20, inputting the user query Text into a pre-trained Text2SQL model to obtain SQL sentences output by the Text2SQL model.

It can be understood that after the user query Text is obtained, the obtained user query Text is input into the Text2SQL model trained in advance to obtain SQL sentences output by the Text2SQL model, so that the user can query in a natural language form without directly writing complex SQL sentences, and the database query becomes more visual and convenient. As shown in fig. 3, in inputting the user query Text to the Text2SQL model trained in advance to obtain the SQL statement output by the Text2SQL model, the method specifically further includes the following steps:

S21, preprocessing the user query text to obtain a standard user query text.

It can be appreciated that, in order for the Text2SQL model to accurately identify the user's intent and accurately generate the SQL statement, the user query text (i.e., the user question text) needs to be preprocessed by the Prompt preprocessing module before it is input to the Text2SQL model. For example, the user query text may be preprocessed based on Natural Language Processing (NLP) technology to obtain a standard user query text that is semantically complete and conforms to the specification, which eliminates noise and redundant information in the user query text and improves the accuracy and efficiency of subsequent processing.

The specific preprocessing process is as follows:

First, the user query text input by the user is split into individual words or terms, so that the Text2SQL model can understand and process the meaning of each word one by one.

Next, the split words or terms are filtered: words that appear frequently in the user query text but contribute little to information retrieval and semantic processing (i.e., stop words), such as "yes" and "in", are removed, which saves storage space and improves search and processing efficiency.

Then, a knowledge base of standard vocabulary explanations and industry term definitions is established according to business requirements and industry knowledge. Through semantic search of the knowledge base, colloquial expressions are replaced with standardized terms. For example, "please check how well the houses of item A are selling" is replaced with "please query the sales amount of item A", and "please check how many customers have viewed houses at item A" is replaced with "query the visit count of item A".

Finally, common questions and colloquial habits of users are collected, an abbreviation mapping table is established, and common abbreviations in the business field are mapped to their full text descriptions. For example, "A project" is replaced with "XX company A project". Through the above processing, the user query text is converted into the standard user query text, improving the accuracy and efficiency of subsequent processing.

It should be noted that the foregoing is merely an example, and the present invention is not limited thereto.
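
The four preprocessing steps can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the word lists and mapping tables are hypothetical, English whitespace-style tokenization stands in for word segmentation, and a plain dictionary lookup stands in for the semantic search of the knowledge base.

```python
import re

# Hypothetical resources; a real system would load these from a knowledge base.
STOP_WORDS = {"please", "the", "of", "is", "in"}
STANDARD_TERMS = {"revenue": "sales amount"}   # colloquial word -> standard term
ABBREVIATIONS = {"proj": "project"}            # abbreviation -> full description


def preprocess(user_query: str) -> str:
    """Turn a raw user query into a normalized 'standard user query text'."""
    # 1. Split the text into individual words (lowercased for table lookups).
    tokens = re.findall(r"\w+", user_query.lower())
    # 2. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Replace colloquial words with standardized terms.
    tokens = [STANDARD_TERMS.get(t, t) for t in tokens]
    # 4. Expand abbreviations into full descriptions.
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)


standard_query = preprocess("Please query the revenue of proj A")
```

Lowercasing simplifies the table lookups but discards case information, so a production version would likely preserve the original casing of names such as project identifiers.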

S22, inputting the standard user query Text into a pre-trained Text2SQL model to obtain SQL sentences output by the Text2SQL model.

It will be appreciated that after the standard user query text is obtained, it is passed as input data to the Text2SQL model. The inference (prediction) function of the Text2SQL model is then called, which carries out a complex computation including text encoding, semantic understanding, SQL structure generation, and other steps to generate the SQL statement.

By way of example, assume that the user query text entered by the user is "I want to find information on all employees in 'Beijing' and wages exceeding 10000 yuan." To convert this user query text into a format that the Text2SQL model can understand and to output the corresponding SQL statement, the following steps are required:

First, the text is cleaned to remove redundant information such as punctuation marks and unnecessary spaces. Next, word segmentation splits the text into separate words or phrases: "I", "want", "find", "all", "in", "Beijing", "and", "wage", "exceed", "10000 yuan", "employee", "information". Stop words, i.e., words that contribute little to the query intent, such as "I" and "want", are then removed, leaving the core words "find", "all", "in", "Beijing", "wage", "exceed", "10000 yuan", "employee", "information". These words are then recombined into a simple and clear query sentence, i.e., the standard user query text: "find all employee information in Beijing and wages exceeding 10000 yuan". Finally, this standard user query text is input into the pre-trained Text2SQL model, which analyzes its semantic and grammatical information and generates the corresponding SQL statement: "SELECT * FROM employee_table WHERE wage > 10000 AND location = 'Beijing'". Through the preprocessing and the application of the Text2SQL model, the natural language query is converted into an SQL statement, improving query efficiency and accuracy. It should be noted that the foregoing is merely an example, and the present invention is not limited thereto.
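
To make the example concrete, the sketch below runs the generated statement against a small in-memory SQLite table. It assumes the statement is intended as `SELECT * FROM employee_table WHERE wage > 10000 AND location = 'Beijing'`; the table layout (`name`, `location`, `wage`) and the three rows are hypothetical illustration data, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical table layout and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_table (name TEXT, location TEXT, wage REAL)")
conn.executemany(
    "INSERT INTO employee_table VALUES (?, ?, ?)",
    [
        ("Li", "Beijing", 12000),
        ("Wang", "Beijing", 8000),
        ("Zhao", "Shanghai", 15000),
    ],
)

# The statement the Text2SQL model is assumed to have generated.
generated_sql = "SELECT * FROM employee_table WHERE wage > 10000 AND location = 'Beijing'"
rows = conn.execute(generated_sql).fetchall()
conn.close()
```

Only the employee who is both located in Beijing and paid more than 10000 satisfies the two conditions joined by `AND`.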

S30, inputting the SQL sentence into a pre-trained SQL2Text model to obtain a target query Text output by the SQL2Text model.

It can be appreciated that after the SQL statement corresponding to the user query text is output by the Text2SQL model, as in the above example, the obtained SQL statement is "SELECT * FROM employee_table WHERE wage > 10000 AND location = 'Beijing'". This SQL statement is then input into a pre-trained SQL2Text model. The SQL2Text model generates a corresponding natural language text (namely, the target query text) by analyzing the structure and meaning of the SQL statement, for example, "search employee information in Beijing with wages exceeding 10000 yuan". The target query text is subsequently compared with the user query text in order to assess the accuracy of the SQL statement generated for the query.

As shown in Fig. 4, inputting the SQL statement into the pre-trained SQL2Text model to obtain the target query text output by the SQL2Text model specifically further includes the following steps:

S31, checking the SQL sentence to obtain a checking result.

It can be understood that the obtained SQL statement is verified by the verification module. Specifically, the grammar, semantics, security and performance of the SQL statement can be verified to obtain a corresponding verification result, which may be either "verification passed" or "verification failed"; the present invention does not specifically limit this. Because the SQL statement is checked before it is input into the SQL2Text model, the quality, security and performance of the SQL statement are further improved, and the risks and problems caused by SQL statement errors are reduced. The specific verification process is described in steps S311-S314 below:

S311, performing grammar checking on the SQL statement to obtain a grammar checking result.

It can be understood that before the grammar checking, the SQL statement needs to be parsed to obtain its grammar structure, and the parsed grammar structure is then compared with predefined SQL grammar rules to obtain a corresponding grammar checking result. Specifically, the grammatical correctness of the SQL statement can be verified, including checking whether the use of keywords, identifiers, data types, operators and the like conforms to the predefined SQL grammar rules. The grammar checking result may be "grammar check passed" or "grammar check failed". When the result is "grammar check failed", the error log, error location, error type and the like recorded when the grammar error was detected are obtained. Otherwise, if all the checks pass, a grammar checking result of "grammar check passed" is returned.

S312, performing semantic verification on the SQL sentence to obtain a semantic verification result.

It can be understood that after the grammar checking is performed on the SQL statement, semantic checking is further performed on the SQL statement to obtain a semantic checking result. Specifically, the semantic correctness of the SQL statement can be verified, including checking whether the referenced tables, columns, functions, etc. exist, and whether the logic in the SQL statement is correct. For example, it is checked whether the condition of the WHERE clause is reasonable, whether the ordering field in the ORDER BY clause is valid, and so on. If any logical error is found during the check, such as a column name that does not exist or a wrong table name, a semantic checking result of "semantic check failed" is generated, and the location and type of the error are explicitly indicated. If all the checks pass, a semantic checking result of "semantic check passed" is returned, indicating that the SQL statement is logically correct.

Illustratively, suppose the SQL statement is "SELECT amount FROM sales_table WHERE project = 'A project'". When performing semantic checking, it is first checked whether the "sales_table" table exists and whether the "amount" column belongs to sales_table. Next, it is verified whether "project" is a valid column name and whether such a column exists in the "sales_table" table. It should be noted that the foregoing is merely an example, and the present invention is not limited thereto.
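As a minimal sketch of how grammar and semantic checks of this kind might be approximated, the statement can be compiled (without being run) against an empty copy of the schema. The use of SQLite, the function name `check_sql` and the result dictionary layout are assumptions made for this illustration; the present invention does not specify the database engine or the internals of the verification module. SQLite reports missing tables, columns and functions with messages beginning "no such", which the sketch treats as semantic failures; all other errors are treated as grammar failures.

```python
import sqlite3


def check_sql(sql: str, schema_ddl: str) -> dict:
    """Classify a statement as passed, a grammar failure, or a semantic failure."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # empty copy of the schema
        conn.execute("EXPLAIN " + sql)   # compiles the statement without running it
        return {"passed": True, "type": None, "message": None}
    except sqlite3.Error as exc:
        message = str(exc)
        # Assumption: "no such table/column/function" indicates a semantic error.
        kind = "semantic" if message.startswith("no such") else "grammar"
        return {"passed": False, "type": kind, "message": message}
    finally:
        conn.close()


DDL = "CREATE TABLE sales_table (project TEXT, amount REAL, date TEXT);"
ok = check_sql("SELECT amount FROM sales_table WHERE project = 'A project'", DDL)
bad_column = check_sql("SELECT amount FROM sales_table WHERE product = 'A project'", DDL)
bad_grammar = check_sql("SELEC amount FROM sales_table", DDL)
print(ok["passed"], bad_column["type"], bad_grammar["type"])
```

The `message` field carries the error description that would later be spliced onto the standard user query text when regeneration is required.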

S313, carrying out security verification on the SQL sentence to obtain a security verification result.

It can be appreciated that after the grammar check and the semantic check are performed on the SQL statement, a security check needs to be further performed on the SQL statement to obtain a security checking result. Specifically, it can be verified whether the SQL statement carries a risk of SQL injection and whether it complies with the principle of least privilege, so as to ensure that the SQL statement performs only the required operation and does not perform redundant or sensitive operations. If a security risk is detected, a security checking result of "security check failed" is generated, and the location and type of the security risk are clearly indicated. If all the checks pass, a security checking result of "security check passed" is returned.

For example, suppose the SQL statement is built as "SELECT * FROM users WHERE username = '" + userInput + "'". If userInput is not properly processed and escaped, it may contain malicious code such as ' OR '1'='1, which would cause the query to return all user information rather than only that of the specified user. When such a potential SQL injection risk is detected, for example when the user input contains SQL special characters, a security checking result of "security check failed" is generated. If the check passes, a security checking result of "security check passed" is returned. It should be noted that the foregoing is merely an example, and the present invention is not limited thereto.
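The injection risk in the above example can be reproduced with a small, self-contained sketch. The table contents and the use of SQLite are assumptions made purely for illustration. The sketch contrasts string concatenation with a bound parameter, which is one common way to keep user input from altering the structure of a statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a@example.com"), ("bob", "b@example.com")],
)

user_input = "' OR '1'='1"

# Unsafe: concatenation lets the input rewrite the WHERE clause.
unsafe_sql = "SELECT * FROM users WHERE username = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: a bound parameter is treated purely as a value, never as SQL.
safe_rows = conn.execute(
    "SELECT * FROM users WHERE username = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The concatenated form matches every row, whereas the parameterized form matches none, since no user is literally named `' OR '1'='1`.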

S314, performing performance verification on the SQL statement to obtain a performance verification result.

It can be understood that after the grammar, the semantics and the security of the SQL sentence are verified, the performance of the SQL sentence needs to be further verified to obtain a performance verification result. Specifically, it can verify whether the SQL statement lacks an index, whether there is an unnecessary full table scan, an unreasonable join operation, etc., and if a performance risk is detected, a performance verification result of "performance verification failed" is generated, and the location and type of the performance risk are explicitly indicated. If all the checks pass, a performance check result of "performance check pass" is returned.
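One possible way to detect an unnecessary full table scan, sketched here under the assumption of an SQLite database, is to inspect the query plan reported by `EXPLAIN QUERY PLAN`. The helper name `full_scans` and the index name `idx_project` are hypothetical; other database engines expose query plans through different commands.

```python
import sqlite3


def full_scans(conn: sqlite3.Connection, sql: str) -> list:
    """Return the query-plan steps that scan an entire table."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable step description is the last column of each row.
    return [row[-1] for row in plan if row[-1].startswith("SCAN")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_table (project TEXT, amount REAL, date TEXT)")
sql = "SELECT amount FROM sales_table WHERE project = 'A project'"

before = full_scans(conn, sql)   # no index yet, so a full scan is reported
conn.execute("CREATE INDEX idx_project ON sales_table (project)")
after = full_scans(conn, sql)    # the index allows a search instead of a scan

result = "performance check failed" if before else "performance check passed"
print(result, before, after)
```

A non-empty list corresponds to a "performance check failed" result, and its entries give the location of the performance risk.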

In summary, take as an example the user query text "query the sales amount of the A project for January 2023". When the natural language processing system receives the user query text, it first preprocesses the text using natural language processing (NLP) techniques. The preprocessing includes text segmentation, stop-word filtering, term replacement and shorthand mapping, so as to generate a standard user query text that is semantically complete and standardized. Through the preprocessing, the user query text is standardized as "please query the sales amount of the A project of XX company for January 2023". Then, the standard user query text is input into the pre-trained Text2SQL model to generate an SQL statement. For example, the SQL statement generated by the Text2SQL model is "SELECT SUM(amount) FROM sales_table WHERE project = 'A project' AND date = '2023-01'". In order to ensure the accuracy of the generated SQL statement, the generated SQL statement further needs to be comprehensively checked, including the grammar check, semantic check, security check and performance check, so as to obtain a verification result.

S32, if the verification result is verification passing, inputting the SQL sentence passing the verification into the pre-trained SQL2Text model to obtain a target query Text output by the SQL2Text model.

It can be understood that after the above SQL statement passes the verification, the verified SQL statement is input into the pre-trained SQL2Text model to obtain the corresponding target query text, so as to ensure the accuracy and efficiency of the SQL statement. If the verification result is that the verification failed, steps S33-S35 are executed:

S33, obtaining error information in the verification result;

S34, splicing the error information with the standard user query text to obtain a target spliced text;

S35, re-inputting the target spliced Text into the Text2SQL model until the SQL sentence output by the Text2SQL model passes the verification.

It will be appreciated that during the SQL verification process, if any error is found, the error information in the verification result is obtained. Assume that, in the above example, a format error in the "date" field is found during the grammar check, and the following error information is generated: the error type is "syntax error", the error location is the "WHERE clause", and the error description is "the format of the 'date' field should be 'YYYY-MM'". The error information is then spliced with the standard user query text to generate a target spliced text. The splicing process may be as follows:

First, the standard user query text "please query the sales amount of the A project of XX company for January 2023" is spliced with the error information to generate the target spliced text. The splicing may be done by arbitrary combination or in another way, which the present invention does not specifically limit. For example, the generated target spliced text is "please query the sales amount of the A project of XX company for January 2023. Note that the format of the 'date' field should be 'YYYY-MM'." The target spliced text is then input into the Text2SQL model again to regenerate the SQL statement. Since the target spliced text already contains the error information, the Text2SQL model adjusts its generation logic based on this information and avoids making the same error again. For example, when regenerating the SQL statement from the target spliced text, the model may generate the corrected SQL statement "SELECT SUM(amount) FROM sales_table WHERE project = 'A project' AND date_format(date, '%Y-%m') = '2023-01'". The newly generated SQL statement is then comprehensively checked again for grammar, semantics, security and performance. If the verification passes, the verified SQL statement is input into the pre-trained SQL2Text model to obtain the target query text output by the SQL2Text model. For example, the SQL2Text model may convert the generated SQL statement back into the natural language query text "please query the sales amount of the A project of XX company for January 2023".

By this method, the natural language processing system can effectively improve the accuracy of SQL statements generated by the large model and verify the generated SQL statements in all of the above respects, thereby avoiding analysis errors caused by SQL errors, reducing adverse effects on enterprise business decisions, and providing accurate and reliable data support for enterprises. It should be noted that when the verification fails, the error information is fed back to the Text2SQL model, the SQL statement is regenerated, and the verification module is called again. If the number of such cycles exceeds a preset number of times, an abnormality is deemed to have occurred, and the user is prompted to re-enter the user query text so as to avoid consuming excessive resources.
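The regeneration cycle of steps S33-S35, together with the preset limit on the number of cycles, can be sketched as follows. This is an illustrative outline only: `generate_verified_sql`, `fake_text2sql` and `fake_verify` are hypothetical names, the two stubs merely replay the "date" example above in place of a real Text2SQL model and verification module, and the "Note that ..." splicing format is one possible choice.

```python
from typing import Callable, Optional


def generate_verified_sql(
    standard_query: str,
    text2sql: Callable[[str], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate SQL with error feedback until it passes verification."""
    prompt = standard_query
    for _ in range(max_attempts):
        sql = text2sql(prompt)
        error = verify(sql)          # S33: None means "verification passed"
        if error is None:
            return sql
        # S34: splice the error information onto the standard user query text.
        prompt = f"{standard_query} Note that {error}."
    return None                      # limit exceeded: ask the user to re-enter


# Stubs standing in for the real model and verification module.
def fake_text2sql(prompt: str) -> str:
    base = "SELECT SUM(amount) FROM sales_table WHERE project = 'A project' AND "
    if "YYYY-MM" in prompt:
        return base + "date_format(date, '%Y-%m') = '2023-01'"
    return base + "date = '2023-01'"


def fake_verify(sql: str) -> Optional[str]:
    if "date_format" in sql:
        return None
    return "the format of the 'date' field should be 'YYYY-MM'"


query = "please query the sales amount of the A project of XX company for January 2023"
print(generate_verified_sql(query, fake_text2sql, fake_verify))
```

Returning `None` rather than looping indefinitely reflects the preset limit described above; the caller would then prompt the user for a new query text.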

S40, calculating semantic similarity between the user query text and the target query text;

S50, if the semantic similarity is greater than or equal to a preset target value, determining that the SQL statement is the SQL statement corresponding to the user query text;

And S60, if the semantic similarity is smaller than a preset target value, repeating the steps S10-S40 until the semantic similarity is larger than or equal to the preset target value.

It will be appreciated that after the target query text is obtained, in order to verify whether the generated SQL statement accurately reflects the query intent of the user, the target query text needs to be compared with the user query text. This is done by the Embedding module, which calculates the semantic similarity between the user query text and the target query text. For example, the user query text is "query the sales amount of the A project for January 2023", the generated SQL statement is "SELECT SUM(amount) FROM sales_table WHERE project = 'A project' AND date_format(date, '%Y-%m') = '2023-01'", and the SQL2Text model converts the SQL statement back into the target query text "please query the total sales amount of the A project for January 2023". The semantic similarity of the user query text and the target query text can be calculated as follows:

First, the user query text and the target query text are each processed by the Embedding module to generate two corresponding embedding vectors. For example, the user query text "query the sales amount of the A project for January 2023" is mapped to vector A, and the target query text "please query the total sales amount of the A project for January 2023" is mapped to vector B. Then, the similarity between vector A and vector B is calculated using a measure such as cosine similarity or Euclidean distance. Cosine similarity measures how closely the directions of two vectors agree; its value lies between -1 and 1, and the closer the value is to 1, the more similar the two vectors are. For example, a computed cosine similarity of 0.98 indicates that the semantics of the two texts are very close. Finally, the calculated semantic similarity is compared with the preset target value. If the semantic similarity is greater than or equal to the preset target value (for example, 0.95), the generated SQL statement is considered to accurately reflect the query intent of the user, and the query process ends. If the semantic similarity is smaller than the preset target value, the generated SQL statement is considered not to accurately reflect the query intent of the user; the process returns to SQL generation and verification, the SQL statement is regenerated, and steps S10-S40 are repeated until the semantic similarity is greater than or equal to the preset target value. By this method, inaccurate query results caused by errors in model-generated SQL statements can be effectively avoided, which further improves the trust and satisfaction of enterprises in using data analysis and business intelligence tools.
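For reference, the cosine similarity of vectors A and B is their dot product divided by the product of their lengths. A minimal sketch of the calculation and of the comparison with the preset target value follows. The three-dimensional vectors are toy values chosen for illustration and are not the output of any real Embedding module, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def meets_target(similarity: float, target: float = 0.95) -> bool:
    """S50: accept when the similarity is greater than or equal to the target."""
    return similarity >= target


# Toy embedding vectors standing in for the user and target query texts.
vector_a = [0.20, 0.70, 0.10]
vector_b = [0.25, 0.65, 0.10]
similarity = cosine_similarity(vector_a, vector_b)
print(similarity, meets_target(similarity))
```

If Euclidean distance were used instead, a smaller value would indicate greater similarity, so the comparison with the target value would have to be inverted.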

It should be noted that, to prevent an abnormality from causing steps S10-S40 to be repeated many times and consuming excessive system resources, a maximum number of repetitions may be set. If the maximum number is exceeded, the user is prompted to re-enter the query text. The maximum number may be 30, 40 or 50, and the present invention does not specifically limit it.
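The overall round-trip check with a maximum number of repetitions can be outlined as below. The function `round_trip_sql` and its callable parameters are hypothetical placeholders for the Text2SQL model, the SQL2Text model and the Embedding-based similarity; acquisition, preprocessing and SQL verification are omitted for brevity, and the default of 30 rounds is simply one of the values mentioned above.

```python
from typing import Callable, Optional


def round_trip_sql(
    user_query: str,
    text2sql: Callable[[str], str],
    sql2text: Callable[[str], str],
    similarity: Callable[[str, str], float],
    target: float = 0.95,
    max_rounds: int = 30,
) -> Optional[str]:
    """Repeat generation and back-translation until the target value is met."""
    for _ in range(max_rounds):
        sql = text2sql(user_query)                 # S20: generate SQL
        target_text = sql2text(sql)                # S30: back-translate
        score = similarity(user_query, target_text)  # S40: compare texts
        if score >= target:                        # S50: accept
            return sql
    return None  # S60 limit exceeded: prompt the user to re-enter the query
```

A caller receiving `None` would prompt the user to re-enter the query text instead of continuing to consume system resources.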

In summary, in the scheme provided by the method, device, equipment and medium for generating SQL statements based on a large language model, the user query text is acquired; the user query text is input into a pre-trained Text2SQL model to obtain the SQL statement output by the Text2SQL model; the SQL statement is input into a pre-trained SQL2Text model to obtain the target query text output by the SQL2Text model; the semantic similarity between the user query text and the target query text is calculated; and if the semantic similarity is greater than or equal to a preset target value, the SQL statement is determined to be the SQL statement corresponding to the user query text. In this embodiment, the semantic similarity between the user query text and the target query text is used to judge whether the generated SQL statement is accurate and faithfully reflects the query intent of the user. Only when the semantic similarity is greater than or equal to the preset target value is the SQL statement determined to be the SQL statement corresponding to the user query text. This improves the accuracy and efficiency of SQL statement generation and ensures that the generated SQL statement conforms both to the grammar of the database and to the query intent of the user, so that the user can obtain the required information from the database more conveniently and accurately.

In an embodiment, after step S50, that is, after determining that the SQL statement is the SQL statement corresponding to the user query text if the semantic similarity is greater than or equal to the preset target value, the method further includes the following steps:

S70, inputting SQL sentences into a target database to acquire corresponding query result data;

S80, analyzing the query result data to obtain a data set for chart display, wherein the data set comprises data values, data labels and classification information;

And S90, receiving a chart type selected by a user, and displaying the data set based on the chart type, wherein the chart type comprises a histogram, a line graph, a pie chart, a scatter chart and an area chart.

It can be understood that after the SQL statement is obtained, the SQL statement is input directly into the target database and executed by the execution module to obtain the corresponding query result data, which is sent to the data display module. For example, the query result data includes detailed information such as the commodity name, sales amount and sales time. The target database is the database storing the data relevant to the query. The data display module then parses the query result data, which includes extracting the commodity names as data labels, the sales amounts as data values, and the sales times as classification information, thereby forming a data set dedicated to chart display. In addition, the data display module provides a plurality of chart types for the user to select, such as a histogram, a line graph, a pie chart, a scatter chart and an area chart. The data display module then displays the parsed data set according to the chart type selected by the user (e.g., a histogram). In the chart, the commodity names are clearly marked, the sales amounts are represented by column heights, and the sales times are distinguished by different colors or groups. Through the chart, the user can intuitively understand the sales of each commodity, which provides strong data support for subsequent decisions. This avoids the situation in which the user cannot tell whether the SQL statement output by the model is correct, and an incorrectly generated statement leads to faulty analysis and serious losses in enterprise business decisions.
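As a simple illustration of step S80, query result rows can be separated into data labels, data values and classification information as follows. The row layout (commodity name, sales amount, sales time), the sample figures and the function name `build_chart_dataset` are assumptions made for this sketch.

```python
from typing import Dict, Iterable, List, Tuple


def build_chart_dataset(rows: Iterable[Tuple[str, float, str]]) -> Dict[str, List]:
    """Split (commodity name, sales amount, sales time) rows into chart fields."""
    dataset: Dict[str, List] = {"labels": [], "values": [], "categories": []}
    for name, amount, sale_time in rows:
        dataset["labels"].append(name)           # data labels
        dataset["values"].append(float(amount))  # data values
        dataset["categories"].append(sale_time)  # classification information
    return dataset


# Hypothetical query result data.
rows = [("Commodity A", 1200, "2023-01"), ("Commodity B", 950.5, "2023-01")]
dataset = build_chart_dataset(rows)
print(dataset)
```

The resulting dictionary can be handed to whatever charting component renders the chart type selected by the user in step S90.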

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present invention in any way.

In an embodiment, an SQL statement generating device based on a large language model is provided, where the SQL statement generating device based on the large language model corresponds to the SQL statement generating method based on the large language model in the above embodiment one by one. As shown in fig. 5, the SQL statement generation device based on the large language model comprises an acquisition module, an input module, a calculation module and a determination module. The functional modules are described in detail as follows:

The acquisition module is used for acquiring a user query text;

In one embodiment, the large language model based SQL sentence generating device further comprises:

The preprocessing module is used for preprocessing the user query text to obtain a standard user query text;

and the input module is used for inputting the standard user query Text into a pre-trained Text2SQL model to acquire SQL sentences output by the Text2SQL model.

And if the verification result is that the verification is passed, inputting the SQL sentence passed by the verification into a pre-trained SQL2Text model to obtain a target query Text output by the SQL2Text model.

In an embodiment, the verification module is further configured to:

analyzing the grammar structure of the SQL sentence;

For specific limitations regarding the large language model-based SQL statement generation device, reference may be made to the above limitations regarding the large language model-based SQL statement generation method, and details are not repeated herein. The modules in the large language model-based SQL statement generation device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data required by an SQL sentence generating method based on a large language model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a large language model based SQL statement generation method.

In an embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the large language model-based SQL statement generation method of the foregoing embodiments, for example the content of steps S10-S90, which is not repeated here. Alternatively, when executing the computer program, the processor implements the functions of each module/unit in the embodiment of the large language model-based SQL statement generation device, which are likewise not repeated here.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the large language model-based SQL statement generation method of the foregoing embodiments, for example the content of steps S10-S90, which is not repeated here. Alternatively, when executed by the processor, the computer program implements the functions of each module/unit in the embodiment of the large language model-based SQL statement generation device, which are likewise not repeated here.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.

The foregoing embodiments are merely illustrative of the technical solutions of the present invention, and not restrictive, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that modifications may still be made to the technical solutions described in the foregoing embodiments or equivalent substitutions of some technical features thereof, and that such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The SQL sentence generation method based on the large language model is characterized by comprising the following steps:

Acquiring a user query text;

2. The large language model based SQL statement generation method according to claim 1, wherein the inputting the user query Text to a pre-trained Text2SQL model to obtain the SQL statement output by the Text2SQL model comprises:

Preprocessing the user query text to obtain a standard user query text;

3. The method for generating SQL statement based on large language model according to claim 2, wherein the inputting the SQL statement into a pre-trained SQL2Text model to obtain the target query Text output by the SQL2Text model comprises:

Checking the SQL statement to obtain a checking result;

4. The large language model based SQL statement generation method of claim 3, further comprising:

5. The method for generating SQL statement based on large language model according to claim 3, wherein the verifying the SQL statement to obtain the verification result comprises:

6. The method for generating SQL statement based on large language model according to claim 5, wherein said performing syntax checking on said SQL statement to obtain syntax checking result comprises:

analyzing the grammar structure of the SQL sentence;

7. The large language model based SQL statement generation method according to claim 1, wherein the SQL2Text model is trained by:

8. An SQL statement generation device based on a large language model, comprising:

The acquisition module is used for acquiring a user query text;

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the large language model based SQL statement generation method according to any one of claims 1 to 7 when the computer program is executed.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the large language model-based SQL statement generation method according to any one of claims 1 to 7.