Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present invention provides a method for converting natural language into SQL sentences, which improves the accuracy of converting natural language into SQL sentences in specific fields.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a method for converting natural language into SQL statements, comprising the steps of:
S1, constructing a vector database and an industry knowledge graph according to a table description file;
S2, constructing an entity data set in the industry field for training a BERT model, wherein the entity data set comprises entity names and entity classifications;
S3, labeling the entity content of the natural sentence through the BERT model to obtain a labeled sentence; querying the knowledge graph and performing text replacement on the labeled sentence to obtain a replaced sentence;
S4, vectorizing the replaced sentence to obtain a sentence to be retrieved; searching for the sentence to be retrieved in the vector database and obtaining the corresponding table structure information;
S5, generating an SQL statement through a large language model according to the table structure information.
Industry information is stored in tables in the original database, each table having a table description file. A table description file is a document that details the structure, content, and use of each table in a database, providing database administrators, developers, and data analysts with comprehensive information about the table and facilitating better understanding and use of the data in the database. The table description file generally contains: 1. basic table information, such as the table name, the general function of the table, and a brief description of its use; 2. field information, such as field names, field data types, field descriptions, default values, and index information; 3. relationship information, such as foreign keys and association tables; 4. constraint information, such as primary key constraints, unique constraints, and check constraints.
The table description file in the table data is converted into vectorization representation, namely, text information is converted into numerical type vector which can be understood and processed by a computer. And storing the vectorized table description file information into a vector database, so that efficient similarity searching and retrieving operation can be conveniently carried out subsequently.
The industry knowledge graph is a knowledge graph constructed specifically for the industry field; by representing the industry's related concepts, entities, attributes, and relationships in structured form, it supports data integration, semantic understanding, and intelligent applications for the industry. Entity nouns are non-logical, non-connective words. Entity nodes are the basic units of the knowledge graph and represent specific objects or concepts in the industry; they may be table structure information entity nodes, field information entity nodes, relationship nodes, etc. For example, for table structure information entity nodes, each table in the original database serves as an entity node; taking the salary table (salary) as an entity node, the node information is:
{"type":"Table",
"properties":{
"table_name":"salary",
"table_description" employee payroll, record company employee payroll data "}
}
Each attribute column of the table is also used as an entity node, and the node information is as follows:
{"type":"Field",
"properties":{
"field_name":"development_salary",
"field_description": "employee payroll under development fee" }
}
The relationship between a table and its attribute columns is also represented in the graph; for example, a HAS_FIELD relationship is constructed between salary and development_salary:
{"type":"HAS_FIELD",
"from":"salary",
"to":"development_salary",
"properties":{}
}
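The node and relationship records above can be sketched in plain Python; this is an illustrative construction (not the patented implementation), building the same dictionaries one might later load into a graph database.

```python
# Illustrative sketch: building Table/Field entity nodes and HAS_FIELD
# relationships (as shown above) from a table-metadata dictionary.

def build_graph(tables):
    """tables: {table_name: {"description": str, "fields": {name: desc}}}"""
    nodes, edges = [], []
    for t_name, t_info in tables.items():
        nodes.append({"type": "Table",
                      "properties": {"table_name": t_name,
                                     "table_description": t_info["description"]}})
        for f_name, f_desc in t_info["fields"].items():
            nodes.append({"type": "Field",
                          "properties": {"field_name": f_name,
                                         "field_description": f_desc}})
            # One HAS_FIELD edge per table/field pair, as in the example above.
            edges.append({"type": "HAS_FIELD", "from": t_name,
                          "to": f_name, "properties": {}})
    return nodes, edges

tables = {"salary": {"description": "employee payroll",
                     "fields": {"development_salary":
                                "employee payroll under development fee"}}}
nodes, edges = build_graph(tables)
```

The input schema (`tables` dictionary shape) is an assumption chosen for illustration.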
The industry data in the original database is classified into different types according to the specific content of the industry field. For example, an entity data set L = {<Ei, Ci>} of a vertical field is constructed, where Ei represents an entity name and Ci represents an entity class; e.g., with L = {<A Company, Company>}, if "A Company" appears in a natural sentence, "A Company" is labeled as "Company". By classifying the entities, the data set can be better organized and managed, making subsequent model training and application more targeted and efficient. Classification also helps the model better understand and process different types of entities, improving the accuracy and relevance of entity extraction. The BERT model is a pre-trained language-representation model widely applied to tasks such as text classification, named entity recognition, and question answering; it learns rich language representations and semantic information through pre-training on large amounts of text data, and its training method is prior art. The entity data set is used to train the BERT model so that it can identify and extract professional entity nouns in natural sentences and label the entity content of natural sentences according to the entity classifications.
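The labeling behavior described above can be sketched as follows; the entity names and the sentence are illustrative placeholders, and a trained BERT model would replace this simple string lookup in practice.

```python
# Hedged sketch: labeling entity mentions in a sentence using an entity
# data set L = {<Ei, Ci>} of (entity name, entity class) pairs.

entity_set = [("A Company", "Company"), ("development fee", "Field")]

def label_entities(sentence, entity_set):
    """Return the (name, class) pairs found in the sentence."""
    labels = []
    for name, cls in entity_set:
        if name in sentence:
            labels.append((name, cls))
    return labels

labels = label_entities("What is the payroll of A Company?", entity_set)
```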
The knowledge graph is queried, and text replacement is performed on the labeled sentence to obtain a replaced sentence. The replaced sentence can take various forms. For example, the entity nouns of each non-logical, non-connective word are looked up in the knowledge graph to find the corresponding table names and field names, and the entity nouns are replaced with those table names and field names according to the lookup result, in the format Table.Column. For example, "developer payroll" is replaced with "salary.development_salary".
And carrying out vectorization representation on the replaced sentences to obtain vector representation of the sentences to be searched. Then, the vector is input into a vector database for similarity retrieval, and the table structure information most similar to the sentence to be retrieved is obtained, so that support is provided for generating the SQL sentence.
And S5, generating the final SQL statement through a large language model using the table structure information obtained through preprocessing and vectorized retrieval. A large language model (e.g., Qwen1.5-7B) is a pre-trained deep learning model that can understand and generate natural language text. These models are pre-trained on large amounts of text data, learning rich language representations and semantic information. In the task of converting natural language into SQL, the large language model can understand the intention of the natural language query and generate the corresponding SQL statement. The table structure information refers to the information retrieved from the vector database, including table names, field names, table descriptions, and the like. This information helps the large language model understand the specific data structures involved in the query, thereby generating an accurate SQL statement.
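A minimal sketch of feeding retrieved table structure information to the model is shown below; the prompt wording is an assumption for illustration, not the patented prompt, and the actual model call is omitted.

```python
# Illustrative sketch: assembling retrieved table structure information
# into a prompt for the large language model.

def build_prompt(question, table_info):
    """table_info: list of {"table": name, "description": text} dicts."""
    schema = "\n".join(
        f"Table {t['table']}: {t['description']}" for t in table_info)
    return (f"Given the following table structures:\n{schema}\n"
            f"Write an SQL statement answering: {question}")

prompt = build_prompt(
    "total developer payroll",
    [{"table": "salary", "description": "employee payroll data"}])
```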
Preferably, the step S1 comprises converting the table description file into a unified JSON format.
The obtained table description file is standardized and converted into a unified JSON format; this standardization facilitates subsequent data processing and analysis. First, the table description file is read from the original database, and key information including the table name, table description, field names, and field descriptions is extracted from it. This information is used to construct the standardized JSON format. The extracted key information is then converted into the unified JSON format, whose structure is set according to the actual application scenario, for example: {Table: {Description, Fields: [[Column_i, Comment_i], ...]}}, where Table is the table name, Description is the description of the table, and Fields is the field array of the table, in which each element's Column_i is the column field and Comment_i is the meaning represented by that column.
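The conversion step can be sketched as below; the exact key names are assumptions based on the format described above.

```python
# Hedged sketch: converting extracted table metadata into a unified
# JSON document keyed by table name, with Description and Fields entries.

import json

def to_unified_json(table, description, fields):
    """fields: list of (column, comment) pairs."""
    return json.dumps({table: {"Description": description,
                               "Fields": [[c, m] for c, m in fields]}},
                      ensure_ascii=False)

doc = to_unified_json("salary", "employee payroll",
                      [("development_salary",
                        "employee payroll under development fee")])
```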
Preferably, the step S3 comprises correcting the natural sentence and normalizing the number and time in the natural sentence.
Correcting the natural sentence includes correcting spelling, grammar, and punctuation errors that may occur in the input sentence. Numbers and time/years are normalized uniformly so that their format is consistent with the storage format of the original database. By correcting the input natural sentences and normalizing numbers and time/years, the quality and consistency of the input sentences are ensured, providing high-quality input for subsequent entity recognition, knowledge graph querying, and SQL statement generation, and improving the accuracy and reliability of the whole system.
Preferably, the step S3 comprises the steps of segmenting the natural language into a plurality of words and performing BIO labeling on the words.
And performing word segmentation processing on natural language content by using a word segmentation tool, and segmenting continuous texts into independent vocabulary units. Word segmentation is an important step in chinese natural language processing because there are no obvious separators between words in chinese text. BIO labeling is carried out on the text after word segmentation, which is a common sequence labeling method used for labeling information such as entities, logical words and the like in the text.
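The BIO labeling step can be sketched as follows; a real pipeline would use a segmenter such as HanLP, and the token, entity, and logical-word vocabularies here are illustrative placeholders (multi-token entities would additionally receive I- tags).

```python
# Hedged sketch: BIO labeling over an already-segmented sentence.

def bio_label(tokens, entities, logic_words):
    """Tag each token as entity (B-Entity), logical word (B-Logic), or O."""
    tags = []
    for tok in tokens:
        if tok in entities:
            tags.append("B-Entity")
        elif tok in logic_words:
            tags.append("B-Logic")
        else:
            tags.append("O")
    return tags

tags = bio_label(["developer", "payroll", "and", "bonus"],
                 {"payroll", "bonus"}, {"and"})
```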
Preferably, the step S3 includes splitting the natural language into a plurality of clauses according to the logic word marked by the BIO, and obtaining a clause set.
And carrying out sentence splitting on the marked entity according to the logic words, so as to obtain clause sets, thereby being beneficial to simplifying complex inquiry and improving the efficiency and accuracy of subsequent processing. Logical words refer to words, such as "and", "or", "but" and the like, used in natural language queries to connect multiple clauses. These logical terms generally correspond to logical operators, such as "AND, OR," etc., in SQL queries. The logical word is commonly labeled "B-Logic" or "I-Logic". The marked sentences are split into a plurality of clauses according to the logical words, and each clause can be independently processed, so that the complex query is simplified, and the complex query is more easily converted into SQL sentences. And collecting the split clauses into a set to form a clause set, wherein each clause can be independently processed to generate a corresponding SQL fragment.
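The clause-splitting step above can be sketched as follows; the tokens and tags are illustrative, with logical words acting as split points that are dropped from the resulting clauses.

```python
# Hedged sketch: splitting a BIO-labeled token sequence into a clause
# set Q at the logical words (B-Logic / I-Logic tags).

def split_clauses(tokens, tags):
    clauses, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag in ("B-Logic", "I-Logic"):
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        clauses.append(current)
    return clauses

Q = split_clauses(["payroll", "of", "A", "and", "bonus", "of", "B"],
                  ["B-Entity", "O", "B-Entity", "B-Logic",
                   "B-Entity", "O", "B-Entity"])
```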
Preferably, the step S3 includes vectorizing the entity nouns in the labeled sentence to obtain a text to be matched, calculating cosine similarity between the text to be matched and the entity nouns in the vector database, and finding out the most similar entity nouns for replacement.
In order to solve the problem that the Chinese input content does not match the entities in the knowledge graph, the entity in the vector database most similar to the input entity can be found through vectorized representation and cosine similarity calculation, improving the accuracy of entity matching. Each labeled entity noun is vectorized to obtain a text to be matched, and the cosine similarity between the text to be matched and the entity vectors in the vector database is calculated to find the most similar entity and perform the replacement. These steps resolve the mismatch between the Chinese input content and the entities in the knowledge graph and improve the accuracy and reliability of entity matching.
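The matching step can be sketched with toy vectors; the 3-dimensional vectors below stand in for real text-embedding vectors.

```python
# Minimal sketch: find the stored entity whose vector has the highest
# cosine similarity with the query vector.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(query_vec, entity_vecs):
    """entity_vecs: {entity_name: vector}; returns the best-matching name."""
    return max(entity_vecs, key=lambda name: cosine(query_vec, entity_vecs[name]))

best = most_similar([1.0, 0.0, 0.1],
                    {"development_salary": [0.9, 0.1, 0.0],
                     "travel_expense": [0.0, 1.0, 0.2]})
```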
Preferably, the step S4 includes extracting common parts and different parts for the entity nouns in the clauses of the replaced sentence, and forming attribute-state pairs.
For each sub-sentence after knowledge graph conversion, when the entities belong to attribute columns of different tables, the common part and the different parts need to be extracted. The common part refers to entities that co-occur in multiple clauses, typically a table name or a shared field. The different parts refer to entities unique to each clause, typically fields of a table. An attribute-state pair is a tuple containing the common part and the different parts. These steps facilitate generating accurate SQL statements, especially when handling complex queries involving multiple tables.
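One simple reading of this step, sketched under the assumption that the shared table name is the common part and the per-clause columns are the different parts:

```python
# Hedged sketch: form an attribute-state pair <SHARE, {E1, ..., En}>
# from the Table.Column entities of several clauses.

def attr_state(clause_entities):
    """clause_entities: list of 'table.column' strings, one per clause."""
    tables = {e.split(".")[0] for e in clause_entities}
    # If all clauses reference the same table, that table is the common part.
    share = tables.pop() if len(tables) == 1 else None
    return (share, clause_entities)

pair = attr_state(["salary.development_salary", "salary.travel_salary"])
```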
Preferably, the step S5 comprises constructing corresponding query-statement pairs for each table and fine-tuning the large language model with QLoRA.
For the content of each table, corresponding query-statement pairs are constructed; the question part of each pair covers single-table field queries, join field queries, and the like, and is used to fine-tune the model. Fine-tuning is performed using QLoRA, a fine-tuning method that combines LoRA with quantization techniques to reduce the memory consumption of the model while maintaining or improving model performance. Fine-tuning the large language model with QLoRA thus reduces its memory consumption while maintaining or improving performance, improves the model's accuracy and efficiency in the specific field, and ensures that the generated SQL statements conform to the database structure and query requirements.
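The construction of fine-tuning data can be sketched as below; the question template is an assumption chosen to show the single-table-query case, and the actual QLoRA training loop is omitted.

```python
# Hedged sketch: build (question, SQL) training pairs for one table,
# single-table field queries only.

def build_pairs(table, fields):
    """fields: list of (column, comment) pairs describing the table."""
    pairs = []
    for field, comment in fields:
        question = f"What is the {comment} in {table}?"
        sql = f"SELECT {field} FROM {table};"
        pairs.append({"question": question, "sql": sql})
    return pairs

pairs = build_pairs("salary", [("development_salary",
                                "employee payroll under development fee")])
```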
Preferably, the method comprises the step S6 of extracting a Select statement of the large language model through a regular expression and checking the form of the SQL text through sqlparse.
In order to ensure that the generated SQL statement conforms to the correct SQL text form, the content of the Select statement in the large language model answer is extracted through the regular expression, and a sqlparse library is used for checking. Since the SQL statement generated by the large language model may contain some additional content, the valid Select statement content needs to be extracted by the regular expression. sqlparse is a Python library that can parse and format SQL statements to help check if the syntax of the SQL statement is correct. Verifying the form of the SQL text is helpful to improve the accuracy and reliability of generating the SQL sentence and ensure that the SQL sentence can be correctly executed in the database.
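The extraction half of this step can be sketched with a regular expression; the sample model answer is a placeholder, and the subsequent sqlparse syntax check described above is not reproduced here.

```python
# Hedged sketch: pull the first SELECT ... ; statement out of a model
# answer that may contain extra explanatory text.

import re

def extract_select(answer):
    m = re.search(r"SELECT\b.*?;", answer, re.IGNORECASE | re.DOTALL)
    return m.group(0) if m else None

sql = extract_select("Sure, here is the query:\nSELECT development_salary "
                     "FROM salary;\nHope this helps.")
```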
Compared with the prior art, the invention has the beneficial effects that:
By constructing an industry-field entity data set containing entity names and entity classifications and training the BERT model with it, professional entity nouns in natural language can be identified and extracted, improving the model's recognition of industry terms and therefore its accuracy. Classifying and managing the entities helps the model better understand and process different types of entities and select the correct tables and fields when generating SQL statements. The BERT model labels the entity content of the natural language query, identifying the key entities and their classes. By querying the knowledge graph, entity nouns in the processed sentence are replaced with the corresponding table names and field names, so that the generated SQL statement accurately reflects the user's query intention. The structured information of the knowledge graph reduces ambiguity in natural language and lets the model select the most suitable tables and fields when generating the SQL statement. Through the above steps and technical means, the technical scheme of the invention effectively improves the accuracy of converting natural language into SQL statements in the specific field; together, these links ensure that the system accurately understands the user's query intention, avoids querying the wrong industry field, and generates SQL statements that conform to the database structure. By constructing a knowledge graph for a specific industry and training the BERT model with an industry-specific entity data set, the system can more accurately understand and identify the terms and concepts of the specific field, thereby improving the accuracy of converting natural language into SQL statements in that field.
Meanwhile, by utilizing the powerful semantic understanding capability of the large language model and combining the table structure information in the vector database, SQL sentences conforming to the user intention can be generated more accurately, and ambiguity and errors are reduced.
Detailed Description
The invention will be further described with reference to the specific drawings in order to make the technical means, the inventive features, the achievement of the purpose and the effect of the implementation of the invention easily understood. The present invention is not limited to the following examples.
It should be understood that the structures, proportions, sizes, etc. shown in the drawings are for illustration only and should not be construed as limiting the scope in which the invention can be practiced; modifications, changes in proportion, or other adjustments made in practicing the invention do not affect its essential characteristics and still achieve the objective and effect taught by the invention.
Example 1:
the method for converting natural language into SQL sentences shown in fig. 1 and 2 comprises the following steps:
And S1, constructing a vector database and an industry knowledge graph according to the table description file.
S1.1, when an application program is started, all table structure information is obtained through codes according to a connected original database, and a table description file is obtained.
Industry information is stored in tables in the original database, each table having a table description file. A table description file is a document that details the structure, content, and use of each table in a database, providing database administrators, developers, and data analysts with comprehensive information about the table and facilitating better understanding and use of the data in the database. The table description file generally contains: 1. basic table information, such as the table name, the general function of the table, and a brief description of its use; 2. field information, such as field names, field data types, field descriptions, default values, and index information; 3. relationship information, such as foreign keys and association tables; 4. constraint information, such as primary key constraints, unique constraints, and check constraints.
S1.2, vectorized representations of the tables are obtained by applying a text embedding model to each table description file. This embodiment uses Alibaba's NLP Chinese text-to-vector (embedding) model with the maximum text length set to 512, then stores the vectors in a Milvus vector database, creating a Collection with (Id, Table, Description) as storage fields, where Table represents the table name and Description represents the structural information of the table; "Table" is used as an index to speed up queries. Converting the table description files into vectorized representations means converting text information into numerical vectors that a computer can understand and process. Storing the vectorized table description information in a vector database facilitates efficient similarity search and retrieval operations later.
S1.3, converting the table description file into a unified JSON format.
The obtained table description file is standardized and converted into a unified JSON format; this standardization facilitates subsequent data processing and analysis. First, the table description file is read from the original database, and key information including the table name, table description, field names, and field descriptions is extracted from it. This information is used to construct the standardized JSON format. The extracted key information is then converted into the unified JSON format, whose structure is set according to the actual application scenario, for example: {Table: {Description, Fields: [[Column_i, Comment_i], ...]}}, where Table is the table name, Description is the description of the table, and Fields is the field array of the table, in which each element's Column_i is the column field and Comment_i is the meaning represented by that column.
S1.4, because the entity names in the input text may have mismatch, for example, wrongly written words or representing the same meaning but different field names, extracting the column attribute of each table, performing de-duplication, performing vectorization characterization through a text embedding model, and storing the vectorized representation in a vector database.
S1.5, constructing a corresponding industry knowledge graph according to the standardized table JSON files, with table structure information and field information as entity nodes, and creating relationship edges according to the relationships between tables and fields and between tables. There are two types of relationships: HAS_FIELD, representing the relationship between a table and a field, and HAS_RELATION, representing the relationship between tables. The industry knowledge graph is a knowledge graph constructed specifically for the industry field; by representing the industry's related concepts, entities, attributes, and relationships in structured form, it supports data integration, semantic understanding, and intelligent applications for the industry. Entity nouns are non-logical, non-connective words. Entity nodes are the basic units of the knowledge graph and represent specific objects or concepts in the industry; they may be table structure information entity nodes, field information entity nodes, relationship nodes, etc. For example, for table structure information entity nodes, each table in the original database serves as an entity node; taking the salary table (salary) as an entity node, the node information is:
{"type":"Table",
"properties":{
"table_name":"salary",
"table_description" employee payroll, record company employee payroll data "}
}
Each attribute column of the table is also used as an entity node, and the node information is as follows:
{"type":"Field",
"properties":{
"field_name":"development_salary",
"field_description": "employee payroll under development fee" }
}
The relationship between a table and its attribute columns is also represented in the graph; for example, a HAS_FIELD relationship is constructed between salary and development_salary:
{"type":"HAS_FIELD",
"from":"salary",
"to":"development_salary",
"properties":{}
}
S2, constructing an entity data set in the industry field for training the BERT model, wherein the entity data set comprises entity names and entity classifications.
The industry data in the original database is classified into different types according to the specific content of the industry field. For example, an entity data set L = {<Ei, Ci>} of a vertical field is constructed, where Ei represents an entity name and Ci represents an entity class; e.g., with L = {<A Company, Company>}, if "A Company" appears in a natural sentence, "A Company" is labeled as "Company". By classifying the entities, the data set can be better organized and managed, making subsequent model training and application more targeted and efficient. Classification also helps the model better understand and process different types of entities, improving the accuracy and relevance of entity extraction. The BERT model is a pre-trained language-representation model widely applied to tasks such as text classification, named entity recognition, and question answering; it learns rich language representations and semantic information through pre-training on large amounts of text data, and its training method is prior art. The entity data set is used to train the BERT model so that it can identify and extract professional entity nouns in natural sentences and label the entity content of natural sentences according to the entity classifications.
And S3, marking the entity content of the natural sentence through the BERT model to obtain a marked sentence, inquiring the knowledge graph, and carrying out text replacement on the marked sentence to obtain a replaced sentence.
S3.1, correcting the natural sentence, and carrying out standardization processing on numbers and time in the natural sentence.
As shown in fig. 3, correcting the natural sentence includes correcting spelling, grammar, and punctuation errors that may occur in the input sentence with pycorrector. Numbers and time/years are normalized uniformly so that their format is consistent with the storage format of the original database. By correcting the input natural sentences and normalizing numbers and time/years, the quality and consistency of the input sentences are ensured, providing high-quality input for subsequent entity recognition, knowledge graph querying, and SQL statement generation, and improving the accuracy and reliability of the whole system. pycorrector is a Python-based text correction library that can automatically detect and correct common spelling, grammar, and punctuation errors.
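The number/time normalization part of this step can be sketched as below; the character mapping and the "year" suffix pattern are illustrative assumptions (the pycorrector spelling/grammar correction is not reproduced here).

```python
# Hedged sketch: normalize Chinese numerals and year suffixes so that
# numbers and years match the original database's storage format.

import re

CN_DIGITS = {"一": "1", "二": "2", "三": "3", "四": "4", "五": "5",
             "六": "6", "七": "7", "八": "8", "九": "9", "零": "0"}

def normalize(text):
    # Replace Chinese numerals with Arabic digits.
    for cn, ar in CN_DIGITS.items():
        text = text.replace(cn, ar)
    # Collapse '2023 year' style suffixes into the bare year.
    return re.sub(r"(\d{4})\s*year", r"\1", text)

out = normalize("二零二三 year development payroll")
```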
And S3.2, transmitting the standardized natural sentences into a pre-trained first-layer BERT model to label entity contents.
S3.2.1, segmenting the natural language into a plurality of words, and performing BIO labeling on the words. The BERT training data set is constructed, some natural language contents are firstly subjected to word segmentation processing through HanLP word segmentation tools and marked through BIO, and [ CLS ] and [ SEP ] special identifiers are added to represent the beginning and the end of sentences. The labeling rules for BIO are as follows:
B-Entity: represents the beginning of an entity,
I-Entity: represents the middle part of an entity,
B-Logic: represents the beginning of a logical word,
I-Logic: represents the middle part of a logical word,
B-Operator-Ca: represents an arithmetic operator (addition, subtraction, multiplication, division),
B-Operator-Co: represents a comparison operator (e.g., greater than, less than, equal to),
O: represents a common non-entity or non-logical word.
Word segmentation is an important step in chinese natural language processing because there are no obvious separators between words in chinese text. BIO labeling is carried out on the text after word segmentation, which is a common sequence labeling method used for labeling information such as entities, logical words and the like in the text.
S3.2.2, training entity labeling based on the BERT model.
The model input sequence X is:
X = [[CLS], x1, x2, ..., xn, [SEP], y1, y2, ..., [SEP]]
Pretraining is carried out by using the BERT model, fine tuning is carried out on a training data set, and accuracy of named entity identification is optimized.
And S3.3, splitting the natural language into a plurality of clauses according to the logical words labeled by BIO, obtaining a clause set Q = {q1, q2, ..., qn}.
And carrying out sentence splitting on the marked entity according to the logic words, so as to obtain clause sets, thereby being beneficial to simplifying complex inquiry and improving the efficiency and accuracy of subsequent processing. Logical words refer to words, such as "and", "or", "but" and the like, used in natural language queries to connect multiple clauses. These logical terms generally correspond to logical operators, such as "AND, OR," etc., in SQL queries. The logical word is commonly labeled "B-Logic" or "I-Logic". The marked sentences are split into a plurality of clauses according to the logical words, and each clause can be independently processed, so that the complex query is simplified, and the complex query is more easily converted into SQL sentences. And collecting the split clauses into a set to form a clause set, wherein each clause can be independently processed to generate a corresponding SQL fragment.
And S3.4, vectorizing the entity nouns in the labeled sentence to obtain texts to be matched, calculating the cosine similarity between each text to be matched and the entity nouns in the vector database, and finding the most similar entity noun for replacement. The calculation uses the standard cosine similarity: cos(A, B) = (A · B) / (||A|| × ||B||).
In order to solve the problem that the Chinese input content does not match the entities in the knowledge graph, the entity in the vector database most similar to the input entity can be found through vectorized representation and cosine similarity calculation, improving the accuracy of entity matching. Each labeled entity noun is vectorized to obtain a text to be matched, and the cosine similarity between the text to be matched and the entity vectors in the vector database is calculated to find the most similar entity and perform the replacement. These steps resolve the mismatch between the Chinese input content and the entities in the knowledge graph and improve the accuracy and reliability of entity matching.
And S3.5, after word segmentation, the entity nouns of each non-logical, non-connective word are queried through the knowledge graph, text replacement is performed, and the entity nouns are replaced in the Table.Column format.
The knowledge graph is queried, and text replacement is performed on the labeled sentence to obtain a replaced sentence. The replaced sentence can take various forms. For example, the entity nouns of each non-logical, non-connective word are looked up in the knowledge graph to find the corresponding table names and field names, and the entity nouns are replaced with those table names and field names according to the lookup result, in the format Table.Column. For example, "developer payroll" is replaced with "salary.development_salary".
S3.6, if there is an entity requiring calculation in a clause qi of the clause set Q, the required entity pair is retrieved from the vector library through vectorized representation, and the entity pair and the operation symbol are substituted into the clause qi.
And S4, vectorizing the replaced sentence to obtain a sentence to be retrieved, and searching for the sentence to be retrieved in the vector database to obtain the corresponding table structure information.
And carrying out vectorization representation on the replaced sentences to obtain vector representation of the sentences to be searched. Then, the vector is input into a vector database for similarity retrieval, and the table structure information most similar to the sentence to be retrieved is obtained, so that support is provided for generating the SQL sentence.
S4.1, each processed sub-sentence qi is vectorized, and cosine similarity is calculated through the Milvus vector retrieval tool. The calculation formula is:

cos(q, t) = (q · t) / (|q| |t|)

where q denotes the vector of the sub-sentence to be retrieved and t denotes a stored table-structure vector.
The corresponding database table structure information T = (Id, Table, Description) is retrieved from the constructed vector database, where Table represents the table name and Description represents the structure information of the table, including all attribute column information. Considering the complexity of the query, this embodiment retrieves up to 10 table structures to obtain a correlation list D = {T1, T2, ..., T10}.
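The top-10 retrieval step can be sketched as an in-memory ranking by cosine similarity; in the actual system the ranking would be done by Milvus. The table entries and 2-dimensional vectors below are invented for illustration.

```python
import numpy as np

def top_k_tables(query_vec, table_index, k=10):
    # Rank stored table-structure vectors by cosine similarity and
    # return the k best entries as tuples T = (Id, Table, Description).
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for entry in table_index:
        v = entry["vec"] / np.linalg.norm(entry["vec"])
        scored.append((float(np.dot(q, v)), entry))
    scored.sort(key=lambda s: -s[0])
    return [(e["Id"], e["Table"], e["Description"]) for _, e in scored[:k]]

# Hypothetical vectorized table-description index
table_index = [
    {"Id": 1, "Table": "salary", "Description": "id, employee_id, amount",
     "vec": np.array([1.0, 0.0])},
    {"Id": 2, "Table": "department", "Description": "id, name",
     "vec": np.array([0.0, 1.0])},
]
D = top_k_tables(np.array([0.9, 0.1]), table_index, k=10)
```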
S4.2, if, after knowledge graph conversion, the entities in a sub-sentence qi are attribute columns of different tables, the common part and the different parts are extracted according to the table names and table structure information to form an attribute state pair COL_STATE = <SHARE, {E1, E2, ..., En}>, where SHARE represents the common part and each Ei represents a clause-specific part in Table.Column form. This embodiment uses a pre-trained and fine-tuned large language model to perform the entity extraction and classification tasks, and performs attribute mapping in combination with the obtained industry knowledge graph.
For each sub-sentence after knowledge graph conversion, when the entities belong to attribute columns of different tables, the common part and the different parts are extracted. The common part refers to entities that co-occur in multiple clauses, typically a table name or a shared field; the different parts refer to entities unique to each clause, typically fields of a single table. The attribute state pair is a tuple containing the common part and the different parts. These steps facilitate generating accurate SQL statements, especially when handling complex queries involving multiple tables.
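A minimal sketch of building COL_STATE: the common part is the intersection of the entity sets of all clauses, and each Ei is what remains unique to clause i. The Table.Column entities below are invented for illustration.

```python
def build_col_state(clause_entities):
    # clause_entities: one set of "Table.Column" entities per sub-sentence.
    # SHARE = entities common to all sub-sentences; each Ei = entities
    # unique to sub-sentence i, giving COL_STATE = <SHARE, {E1, ..., En}>.
    share = set.intersection(*clause_entities)
    uniques = [sorted(s - share) for s in clause_entities]
    return {"SHARE": sorted(share), "E": uniques}

col_state = build_col_state([
    {"employee.id", "salary.amount"},
    {"employee.id", "department.name"},
])
```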
S5, generating SQL sentences through a large language model according to the table structure information.
S5, the table structure information obtained through preprocessing and vectorized retrieval is used to generate the final SQL statement through a large language model. A large language model (e.g., Qwen1.5-7B) is a pre-trained deep learning model that can understand and generate natural language text. These models are pre-trained on large amounts of text data, learning rich language representations and semantic information. In the task of converting natural language into SQL, the large language model can understand the intention of the natural language query and generate the corresponding SQL statement. The table structure information refers to the information retrieved from the vector database, including table names, field names, table descriptions, and the like. This information helps the large language model understand the specific data structures involved in the query, so that it generates an accurate SQL statement.
S5.1, constructing corresponding query statement pairs for each table, and fine-tuning the large language model with QLoRA.
For the content of each table, corresponding query statement pairs are constructed; the question pairs include single-table field queries, join field queries, and the like, and are used for fine-tuning the model. Fine-tuning is performed with QLoRA, a method that combines LoRA with quantization techniques to reduce the memory consumption of the model while maintaining or improving model performance. Fine-tuning the large language model in this way improves its accuracy and efficiency in the specific field and ensures that the generated SQL statements conform to the database structure and the query requirements.
In this embodiment, the large language model is the Qwen1.5-7B model. Corresponding query statement pairs SQL_TRAIN = {QUESTION, ANSWER} are constructed according to the content of each table; the question pairs include single-table field queries, join field queries, and the like, to realize fine-tuning training of the model.
The model is fine-tuned with the constructed training text in QLoRA mode; on top of the LoRA weight matrix decomposition, the model is further quantized so as to consume less memory.
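A QLoRA setup of this kind might be configured as follows with the Hugging Face transformers, peft, and bitsandbytes libraries. This is a configuration sketch only: the model identifier, target modules, and hyperparameter values are illustrative assumptions, not values stated in this disclosure.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B", quantization_config=bnb_config
)

# Attach trainable low-rank adapters (the LoRA part); only A and B train
lora_config = LoraConfig(
    r=8,                    # rank of the decomposition
    lora_alpha=16,          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Training then proceeds over the constructed SQL_TRAIN question/answer pairs with a standard causal-language-modeling trainer.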
Let the original weight matrix be W, let LoRA decompose the trainable update, and let the introduced low-rank matrices be A and B; the new weight matrix W' can then be expressed as:

W' = W + (lora_alpha / r) · A · B

where A ∈ R^(d×r), B ∈ R^(r×k), r << min(d, k), r represents the rank of the decomposition, and lora_alpha represents the scaling factor.
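A small numeric sketch of the low-rank update, assuming the standard LoRA scaling W' = W + (lora_alpha / r) · A · B; the dimensions are toy values chosen for illustration.

```python
import numpy as np

d, k, r, lora_alpha = 6, 4, 2, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))   # frozen original weight, d x k
A = rng.normal(size=(d, r))   # trainable low-rank factor, d x r
B = rng.normal(size=(r, k))   # trainable low-rank factor, r x k

# The full d x k matrix stays frozen; only A and B (d*r + r*k parameters)
# are trained, then the scaled product is added back.
W_new = W + (lora_alpha / r) * (A @ B)
```

Note that the adapters add only d·r + r·k = 20 trainable parameters here, fewer than the d·k = 24 entries of W; at realistic transformer dimensions the saving is several orders of magnitude.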
S5.2, a prompt is constructed from the queried table structure list D. The prompt includes: role positioning that defines the large model as a database language expert; the retrieved table structure list D = {T1, T2, ..., T10} (structure information of the tables possibly used); if sub-sentences exist, each sub-sentence after conversion together with its connection STATE (AND, OR, NONE); the information to be calculated, Entity = {(E1, E2, ..., Ei), OPERATION} (the set of entity attributes and the operation); COL_STATE (the connection part); and other industry-specific technical term documentation.
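Assembling such a prompt can be sketched as simple string composition over the components listed above. The field names, sample table, and domain note below are invented for illustration.

```python
def build_prompt(D, state, entity, col_state, domain_notes):
    # Compose the prompt: role positioning, retrieved table structures D,
    # the clause connection STATE, the entities and operation to be
    # calculated, COL_STATE, and extra domain documentation.
    tables = "\n".join(f"- {t['Table']}: {t['Description']}" for t in D)
    return (
        "You are a database language expert.\n"
        f"Candidate table structures:\n{tables}\n"
        f"Clause connection STATE: {state}\n"
        f"Entities and operation: {entity}\n"
        f"COL_STATE: {col_state}\n"
        f"Domain notes: {domain_notes}\n"
    )

prompt = build_prompt(
    D=[{"Table": "salary", "Description": "id, employee_id, amount"}],
    state="AND",
    entity={"entities": ["salary.amount"], "OPERATION": "SUM"},
    col_state={"SHARE": [], "E": [["salary.amount"]]},
    domain_notes="payroll is paid monthly",
)
```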
S5.3, providing the large model with few-shot examples to improve the conversion success rate of the model.
And S5.4, the large model generates corresponding answers for the input sentences according to the prompt word template.
S6, extracting the SELECT statement from the large language model output through a regular expression, and checking the form of the SQL text through sqlparse.
In order to ensure that the generated SQL statement conforms to correct SQL text form, the content of the SELECT statement in the large language model's answer is extracted through a regular expression and checked with the sqlparse library. Since the SQL statement generated by the large language model may be accompanied by additional content, the valid SELECT statement must first be extracted by the regular expression. sqlparse is a Python library that can parse and format SQL statements, helping to check whether the syntax of an SQL statement is correct. Verifying the form of the SQL text improves the accuracy and reliability of the generated SQL statement and helps ensure that it can be correctly executed in the database.
According to the invention, an industry knowledge graph is constructed, sentences are preprocessed in advance using the relationship information of the industry knowledge graph, entity labeling is performed with the BERT model, and sentences are split. This solves the problem of inaccurate conversion to table fields when the industry database contains too many tables, and improves the success rate and overall effect of the model's conversion of natural language into SQL statements.
By constructing an industry-field entity data set containing entity names and entity classifications and training the BERT model on it, professional entity nouns in natural language can be identified and extracted, improving the model's recognition of industry terms and therefore its accuracy. Classifying and managing the entities helps the model better understand and handle different types of entities, so that correct tables and fields can be selected when generating SQL statements. The BERT model labels the entity content of the natural language query, identifying the key entities and their classes. By querying the knowledge graph, the entity nouns in the processed sentence are replaced with the corresponding table names and field names, so that the generated SQL statement accurately reflects the user's query intention. The structured information of the knowledge graph reduces ambiguity in natural language, enabling the model to select the most suitable tables and fields when generating the SQL statement. Through the above steps and technical means, the technical scheme of the invention can effectively improve the accuracy of converting natural language into SQL statements in a specific field; together, these links ensure that the system accurately understands the user's query intention, avoids querying the wrong tables and fields, and generates SQL statements conforming to the database structure. By constructing a knowledge graph of a specific industry and training the BERT model with an entity data set specific to that industry, the system can more accurately understand and identify the terms and concepts of the specific field, thereby improving the accuracy of converting natural language into SQL statements in that field.
Meanwhile, by utilizing the powerful semantic understanding capability of the large language model and combining the table structure information in the vector database, SQL sentences conforming to the user intention can be generated more accurately, and ambiguity and errors are reduced.