
CN117971800A - A method for generating data quality SQL scripts based on large models - Google Patents

A method for generating data quality SQL scripts based on large models

Info

Publication number
CN117971800A
CN117971800A (application CN202410030412.3A)
Authority
CN
China
Prior art keywords
data
database
sql
large model
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410030412.3A
Other languages
Chinese (zh)
Inventor
王乔晨
刘虎
吴振刚
冯艺军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Building Materials Xinyun Zhilian Technology Co ltd
China National Building Materials Xinyun Zhilian Technology Co ltd Beijing Branch
Cnbm Technology Corp ltd
Original Assignee
China Building Materials Xinyun Zhilian Technology Co ltd
China National Building Materials Xinyun Zhilian Technology Co ltd Beijing Branch
Cnbm Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Building Materials Xinyun Zhilian Technology Co ltd, China National Building Materials Xinyun Zhilian Technology Co ltd Beijing Branch, Cnbm Technology Corp ltd
Priority to CN202410030412.3A
Publication of CN117971800A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention relates to the technical field of data quality management, and specifically to a method for generating data quality SQL scripts based on a large model. The method comprises the following steps: basic configuration; database construction and management; large-model selection and management; large-model operation; selection and management of the SQL generation mode; SQL script generation, execution, and diagnosis; data quality detection; and data processing. The design makes full use of the management capabilities of the large model: based on the large model's external knowledge-base QA function, the database management system DBMS can efficiently detect and manage data quality on top of full life-cycle data management, so that data with quality problems can be corrected in time. By invoking different SQL script generation modes according to the type of data, the method works efficiently and accurately, shortens computation time, reduces computing-power requirements, and meets the data quality management requirements of large databases.

Description

Data quality SQL script generation method based on large model
Technical Field
The invention relates to the technical field of data quality management, in particular to a data quality SQL script generation method based on a large model.
Background
Data has become an important new asset on which internet enterprises depend. "Speaking with data" has become a common practice; however, not all data is fit to speak with. At present, government institutions, enterprises, and other organizations have largely established data warehouses whose data comes from various subordinate or information systems, but data quality problems arise for many reasons, such as unclear statistical calibers, missing reports, delayed reports, or open items. The core of data in the big-data era is not "big" but "valuable", and the key to value is quality. Data quality directly determines the accuracy of information and indirectly affects the survival and competitiveness of enterprises. It is therefore critical to learn how to judge and detect data quality; if the quality of the data in a big-data platform cannot be guaranteed, data mining on it is meaningless. High-quality data is the foundation of all data applications. An enterprise's goal for data quality management can be summarized as obtaining trusted and usable data. Data quality management is a cyclic process whose ultimate goal is to increase the value of data in use through reliable data and, finally, to generate economic benefit for the enterprise.
Data quality measures how well a data set meets criteria of accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose, and is critical to all data governance plans within an organization. Data quality criteria help ensure that a company makes data-driven decisions toward its business goals. Data quality, data integrity, and data profiling are interrelated. Data quality is the broader category under which organizations evaluate their data for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose. Data integrity concerns only a subset of these attributes, in particular accuracy, consistency, and completeness; it also carries a data security focus, covering the protection measures implemented to prevent malicious actors from corrupting data. Data profiling, in turn, focuses on the process of reviewing and cleansing data to maintain data quality standards within an organization, and may include techniques that support these processes.
Data quality is evaluated along multiple dimensions, which may differ from one information source to another. These dimensions are used to classify data quality metrics:
Completeness: the amount of data that is usable or complete; if the percentage of missing values is high, the data may not represent a typical sample and analysis may be biased or misleading;
Uniqueness: the amount of duplicate data in the data set; for example, when reviewing customer data, each customer should be expected to have a unique customer ID;
Validity: this dimension measures how much of the data matches the format required by business rules; formatting typically covers metadata such as valid data types, ranges, and patterns;
Timeliness: the readiness of the data within the expected time frame; for example, a customer may expect to receive an order number immediately after a purchase, so that data must be generated in real time;
Accuracy: the correctness of data values against an agreed "source of truth"; since multiple sources may report the same indicator, it is important to designate a primary data source, and other sources may be used to confirm its accuracy; for example, a tool may check whether each data source trends in the same direction to strengthen confidence in data accuracy;
Consistency: this dimension compares data records from two different data sets; as noted above, multiple sources may be designated to report a single indicator; checking for consistent data trends and behavior across sources lets organizations trust the operational insights of their analyses; the same logic also applies to relationships between data; for example, the number of employees in a department should not exceed the total number of employees in the company;
Fitness for purpose: finally, fitness for purpose helps ensure that the data asset meets business requirements; this dimension can be difficult to evaluate, especially for newly created data sets.
These metrics help teams conduct data quality assessments within their organizations, evaluating how much data they have and how useful it is for a particular purpose.
Data quality management (Data Quality Management) refers to the set of management activities, such as identification, measurement, monitoring, and early warning, applied to the quality problems that may arise at each stage of the data life cycle, from planning, acquisition, storage, sharing, and maintenance through application and retirement, and to the further improvement of data quality achieved by raising the organization's management capability.
At present, the main means of data quality detection include:
Data source inspection: checking whether the source of the data is trustworthy, for example whether the data comes from a legitimate channel or a trusted authority;
Data acquisition inspection: checking whether errors or tampering occurred during data acquisition, and whether the data transmission is safe and reliable;
Data verification: applying a verification algorithm or checksum to the data to ensure its integrity and accuracy;
Record count checking: verifying the state of the data by comparing record counts, mainly checking whether the number of records in a data table falls at a given value or within a given range;
Total key-indicator analysis: for key indicators, comparing whether data totals are consistent; this mainly means checking summary logic that carries the same business meaning but is computed from different dimensions; for example, computing statistics on the same field of one table from different dimensions and checking the totals where a summary relation exists; or, where a field of the table has the same business meaning as a field in another table, statistics are computed from different dimensions, a summary relation exists, and the data of the two tables are not produced from the same data source, in which case a total check is needed; examples include an enterprise's total revenue, total profit, total cost, and total investment.
Furthermore, beyond the methods above, statistical and machine-learning methods may be employed to detect data quality, for example chi-square tests and correlation analysis to check the consistency and reliability of the data.
However, these detection means are conventional, inconvenient to operate, not applicable to all data, and unable to exploit the advantages of a large model for the data quality management of a large database. SQL is a language developed for operating databases; if data quality management could be performed on a large database via SQL, the detection and management of data quality could be expected to improve. In view of this, we propose a data quality SQL script generation method based on a large model.
Disclosure of Invention
The invention aims to provide a data quality SQL script generation method based on a large model, which aims to solve the problems in the background technology.
In order to solve the above technical problems, one object of the present invention is to provide a method for generating data quality SQL scripts based on a large model, comprising the following steps:
S1, basic configuration: deploying a hardware configuration based on a consumer-grade graphics card, ensuring that it can host the database management system DBMS, meeting the computing-power requirements of large-model operation, and reserving a certain expansion space;
S2, database construction and management: constructing the databases and loading the database management system DBMS to manage the whole life cycle of the data;
S3, large-model selection and management: large models fall into four major classes, large language models, computer vision (covering images and video), audio, and multi-modal large models; a corresponding large model is selected based on the data type;
S4, running the large model: implementing the external knowledge-base QA function based on the large model;
S5, selection and management of the SQL generation mode: different SQL generation modes are invoked for different types of data;
S6, generating and running SQL scripts, and performing SQL diagnosis;
S7, data quality detection: continuously performing quality detection in five aspects, timeliness, accuracy, completeness, change, and data flow, through the generated executable SQL;
S8, data processing: automatically performing the corresponding correction operations on data with detected problems based on the database management system DBMS, and feeding back unprocessable data problems to the user in real time.
As a further improvement of the technical solution, in step S1, the expansion space reserved in the basic configuration at least includes reserved terminal interfaces (USB ports, signal-line sockets, etc.) for connecting external devices, and reserved storage space for installing the system and the plug-ins that support the operation of the method.
As a further improvement of the technical solution, in step S2, the databases at least include a local MySQL database, a default built-in database, a custom newly-added database, and an autonomously updated database, wherein:
The local MySQL database is the basic service supporting the operation of the method; a locally installed MySQL database or a Chroma vector database may be adopted (if a miniconda virtual environment is used, first create the virtual environment and install the Python dependency packages);
The default built-in database is the basic native knowledge base supporting the construction of the large model;
The custom newly-added database is a business database built to the user's own requirements; it is constructed as follows:
1) The user sets the parameters of the database to be built, such as its name, type, storage location, and security, through the customization facilities of the database management system DBMS;
2) The user adds data at their own discretion, and the database management system DBMS intelligently classifies the newly added data according to preset rules;
3) The DBMS continuously supports user operations for adding, deleting, and rewriting the database, and automatically scans the existing database each time data is added so as to quickly detect duplicates, thereby reducing wasted database storage;
The autonomously updated database is a database that the database management system DBMS builds automatically, based on AI and big-data technology, by crawling through a plug-in function, so as to suit various usage scenarios;
The database management system DBMS covers hierarchical databases (Hierarchical Database, HDB), relational databases (Relational Database, RDB), object-oriented databases (Object-Oriented Database, OODB), and key-value stores (Key-Value Store, KVS); in particular, a relational database management system (Relational Database Management System, RDBMS) is adopted;
In addition, in the database management system DBMS, type recognition and classification of the data must be emphasized; the classification modes at least include classifying by data file format, by data file size, and by data content; the classified data are then clustered separately.
As a further improvement of the technical solution, the clustering of the classified data in the database management system DBMS adopts the K-means clustering algorithm, which comprises the following steps:
Step 1: select $K$ initial cluster centers $Z_1(1), Z_2(1), \dots, Z_K(1)$, where the parenthesized index marks the iteration of the search for cluster centers;
Step 2: assign the pattern samples $\{X\}$ one by one to one of the $K$ clusters according to the minimum-distance criterion: if $\|X - Z_j(k)\| < \|X - Z_i(k)\|$ for all $i \neq j$, $i, j = 1, 2, \dots, K$, then $X \in S_j(k)$, where $k$ is the iteration number (first iteration $k = 1$), $S_j$ denotes the $j$-th cluster, and its center is $Z_j$;
Step 3: compute the new cluster-center vectors $Z_j(k+1)$, $j = 1, 2, \dots, K$, as the mean vector of the samples contained in each cluster domain:
$$Z_j(k+1) = \frac{1}{N_j} \sum_{X \in S_j(k)} X, \qquad j = 1, 2, \dots, K$$
where $N_j$ is the number of samples contained in the $j$-th cluster domain $S_j(k)$; taking the mean vector as the new cluster center minimizes the following clustering criterion function:
$$J_j = \sum_{X \in S_j(k)} \|X - Z_j(k+1)\|^2, \qquad j = 1, 2, \dots, K$$
Step 4: if $Z_j(k+1) \neq Z_j(k)$ for any $j = 1, 2, \dots, K$, return to Step 2, reclassify the pattern samples one by one, and repeat the iteration; if $Z_j(k+1) = Z_j(k)$ for all $j = 1, 2, \dots, K$, the algorithm has converged and the computation ends.
As a further improvement of the technical solution, in step S4, the external knowledge-base QA function is implemented based on the large model, specifically comprising the following steps:
S4.1, knowledge-base preparation: prepare documents for the mainstream databases and classify them;
S4.2, embedding (Embedding): convert the text into vectors for storage, and store the embedded knowledge in a vector database for later retrieval; embedding mainly comprises three steps: file reading (load_file), vectorization plus persistent storage (init_vector_store), and query;
S4.3, large-model knowledge query: obtain the relevant knowledge through the query step;
S4.4, generative reasoning: use the capability of the large model to reason and summarize over the retrieved prior knowledge through in-context learning (ICL, In-Context Learning).
As a further improvement of the technical solution, in step S5, different SQL generation modes are invoked for different types of data, specifically:
The SQL generation modes mainly comprise three: generating scripts programmatically, using system views, and operating the database manually; the programmatic mode is suitable for objects, such as tables and indexes, that cannot be generated in batch by other methods; the system-view mode is suitable for obtaining scalar functions, table-valued functions, stored procedures, triggers, and views; manual operation of the database is suitable for generating scripts for objects that need not, or cannot, be handled automatically and programmatically;
Before an SQL generation mode is invoked, the SQL statement type must be confirmed; by function, SQL statements divide into three types: the data definition language DDL, the data manipulation language DML, and the data control language DCL;
DDL (Data Definition Language) is used to create or delete objects such as the databases that store data and the tables within them;
DML (Data Manipulation Language) is used to query or modify the records in tables;
DCL (Data Control Language) is used to confirm or cancel changes made to the data in a database; it can also set whether an RDBMS user has the authority to operate on objects in the database (tables, etc.); of the three, DML is used the most.
As a further improvement of the technical solution, in step S6, the process of generating and running the SQL script specifically comprises the following steps:
S6.1, confirm the data type, invoke the SQL generation mode corresponding to that type, select the SQL statement type according to the user's operation requirements on the data, and retrieve a pre-written SQL template from the local MySQL database;
S6.2, generate executable SQL: first select the corresponding database, then use the model to generate SQL according to the schema information of that database;
S6.3, automatically parse and execute the SQL and output the run result;
S6.4, train on the executable SQL with machine learning; a machine-learning model trained for a sufficient number of iterations performs intelligent diagnosis on newly generated SQL and directly halts the execution of problematic SQL.
As a further improvement of the technical solution, in step S6.1, when a pre-written SQL template is retrieved from the local MySQL database, a keyword extraction algorithm performs the template matching; keyword extraction adopts the SKE algorithm, in which a word's keyness is determined jointly by its semantic contribution value and its statistical feature values, computed as:
$$key(W_i) = vdw \cdot Vd_i + tw \cdot \Big( \sum_{j=1}^{3} locw_j \cdot loc_{ij} + lenw \cdot len_i + posw \cdot pos_i + tfidfw \cdot tfidf_i \Big)$$
where $Vd_i$ is the semantic contribution value of $W_i$; $vdw$ is the weight of the semantic contribution value; $tw$ is the weight of the statistical feature values; $loc_{ij}$ indicates whether $W_i$ occurs at position $j$; $locw_j$ is the weight of position $j$ among the statistical features, where $j$ takes the values 1, 2, 3, corresponding to the title, the start of a paragraph, and the end of a paragraph; $len_i$ is the word length of $W_i$; $lenw$ is the word-length weight among the statistical features; $pos_i$ is the part-of-speech value of $W_i$; $posw$ is the part-of-speech weight among the statistical features; $tfidf_i$ is the TF-IDF value of $W_i$; and $tfidfw$ is the TF-IDF weight among the statistical features.
As a further improvement of the technical solution, in step S7, the quality detection performed on the data in five aspects is specifically:
Timeliness: judging whether the data lags, to ensure that the data is up to date in real time;
Accuracy: judging at the field level whether the data is accurate and whether the range of each field's values meets expectations;
Completeness: judging whether the data has missing parts;
Change: judging whether the database schema of the data has changed;
Data flow: judging the dependency flow of the data, and the upstream and downstream effects should the data be erroneous.
As a further improvement of the technical solution, in step S8, for the five kinds of data quality problems detected in step S7, the corresponding update, adjustment, completion, correction, and full-flow adjustment operations are performed on the data to guarantee its quality; redundant duplicate data and residual fragment data may remain after processing and must be cleaned; data cleansing adopts the information-entropy algorithm, computed as:
$$H(X) = -\sum_{i=1}^{n} P(X_i) \log_2 P(X_i)$$
where $i = 1, 2, 3, \dots, n$; $X_i$ denotes the $i$-th state ($n$ states in total); $P(X_i)$ denotes the probability that the $i$-th state occurs; and $H(X)$ is the amount of information required to eliminate the uncertainty, in bits.
The second object of the present invention is to provide a data quality SQL script generating platform device, which includes a processor, a memory, and a computer program stored in the memory and running on the processor, wherein the processor is configured to implement the steps of the above-mentioned large model-based data quality SQL script generating method when executing the computer program.
It is a further object of the present invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the large model based data quality SQL script generation method described above.
Compared with the prior art, the invention has the following beneficial effects:
1. In this large-model-based data quality SQL script generation method, hardware of sufficient performance meets the computing-power requirements of the database management system DBMS and of large-model operation, while external devices and plug-ins can be added conveniently, satisfying the management, operation, and maintenance requirements of a large model and a large database;
2. The method makes full use of the management capability of the large model: based on the large model's external knowledge-base QA function, the database management system DBMS can efficiently detect and manage data quality on top of full life-cycle data management, so that data with quality problems can be corrected in time;
3. By invoking different SQL script generation modes according to the type of data, the method works efficiently and accurately, shortens computation time, reduces computing-power requirements, and meets the data quality management requirements of large databases.
Drawings
FIG. 1 is an exemplary overall process flow diagram of the present invention;
FIG. 2 is a C/S block diagram of an exemplary relational database management system RDBMS of the present invention;
FIG. 3 is a block diagram of an exemplary multiple client accessing the same RDBMS in the present invention;
FIG. 4 is one of exemplary partial process flow diagrams of the present invention;
FIG. 5 is a second exemplary partial process flow diagram of the present invention;
Fig. 6 is a block diagram of an exemplary electronic computer platform device according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, the present embodiment provides a data quality SQL script generation method based on a large model, which includes the following steps:
S1, basic configuration: deploying a hardware configuration based on a consumer-grade graphics card, ensuring that it can host the database management system DBMS, meeting the computing-power requirements of large-model operation, and reserving a certain expansion space;
The hardware configuration may preferably adopt an RTX 4090 GPU with 24 GB of video memory; this baseline performance supports smooth dialogue inference without stalling.
In this step, the expansion space reserved in the basic configuration at least includes reserved terminal interfaces (USB ports, signal-line sockets, etc.) for connecting external devices, and reserved storage space for installing the system and the plug-ins that support the operation of the method.
S2, database construction and management: constructing the databases and loading the database management system DBMS to manage the whole life cycle of the data;
In this step, the databases at least include a local MySQL database, a default built-in database, a custom newly-added database, and an autonomously updated database, wherein:
The local MySQL database is the basic service supporting the operation of the method; a locally installed MySQL database or a Chroma vector database may be adopted; if miniconda is used, the virtual environment must be created first and the Python dependency packages installed;
The default built-in database is the basic native knowledge base supporting the construction of the large model;
The custom newly-added database is a business database built to the user's own requirements; it is constructed as follows:
1) The user sets the parameters of the database to be built, such as its name, type, storage location, and security, through the customization facilities of the database management system DBMS;
2) The user adds data at their own discretion, and the database management system DBMS intelligently classifies the newly added data according to preset rules;
3) The DBMS continuously supports user operations for adding, deleting, and rewriting the database, and automatically scans the existing database each time data is added so as to quickly detect duplicates, thereby reducing wasted database storage;
The autonomously updated database is a database that the database management system DBMS builds automatically, based on AI and big-data technology, by crawling through a plug-in function, so as to suit various usage scenarios;
The database management system DBMS covers hierarchical databases (Hierarchical Database, HDB), relational databases (Relational Database, RDB), object-oriented databases (Object-Oriented Database, OODB), and key-value stores (Key-Value Store, KVS); in particular, a relational database management system (Relational Database Management System, RDBMS) is adopted;
In particular, when the relational database management system RDBMS is used, the most common system architecture is the client/server (C/S) architecture, as shown in FIG. 2. A server is a program (software) that receives requests from other programs and processes them accordingly, or the device (computer) on which such a program runs; after processing a request, it keeps running on the computer and waits to receive the next one. An RDBMS is also a server: it can read data from a database stored on hard disk and return it, and it can change the data to specified content.
In contrast, a program (software) that sends requests to a server, or the device (computer) on which it runs, is called a client. A client that accesses a database managed by an RDBMS to write and read data is called an RDBMS client. The RDBMS client tells the RDBMS server, via SQL statements, what data it wants to obtain or what changes it wants to make; the RDBMS returns the requested data or updates the stored data according to the content of the statement. SQL statements thus implement the read and write operations on a relational database.
The RDBMS may be installed on the same computer as its client or on a different one. In the latter case, the two communicate over a network, and multiple clients can access the same RDBMS, as shown in FIG. 3. The clients need not run the same program: any client that can send SQL to the RDBMS can operate the database, and several clients can read and write the same database simultaneously.
In addition, besides handling concurrent requests from many clients, the RDBMS must operate a database holding a large amount of data, so it is typically installed on a computer with better performance than the clients; when a database with a particularly large data volume is operated, multiple computers may be combined. Although RDBMS system architectures vary, the SQL statements sent from clients are basically the same.
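To make the client/server exchange concrete, the following minimal Python sketch shows an RDBMS client sending SQL statements to a MySQL server and reading the replies. It assumes the PyMySQL package and a reachable server; the host, credentials, table, and column names are hypothetical placeholders.

```python
# Minimal RDBMS client sketch: sends SQL to a MySQL server and reads results.
# Assumptions: PyMySQL is installed and a MySQL server is reachable; the
# connection parameters and the `customer` table are hypothetical placeholders.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="dq_user",
                       password="secret", database="dq_demo")
try:
    with conn.cursor() as cur:
        # The client sends SQL statements; the RDBMS executes them and replies.
        cur.execute("SELECT id, name FROM customer WHERE id = %s", (42,))
        for row in cur.fetchall():
            print(row)
        cur.execute("UPDATE customer SET name = %s WHERE id = %s", ("Alice", 42))
    conn.commit()  # confirm the change (COMMIT semantics)
finally:
    conn.close()
```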
In addition, in the database management system DBMS, type recognition and classification of the data must be emphasized; the classification modes at least include classifying by data file format, by data file size, and by data content; the classified data are then clustered separately.
Further, the clustering of the classified data in the database management system DBMS adopts the K-means clustering algorithm, which comprises the following steps:
Step 1: select $K$ initial cluster centers $Z_1(1), Z_2(1), \dots, Z_K(1)$, where the parenthesized index marks the iteration of the search for cluster centers;
Step 2: assign the pattern samples $\{X\}$ one by one to one of the $K$ clusters according to the minimum-distance criterion: if $\|X - Z_j(k)\| < \|X - Z_i(k)\|$ for all $i \neq j$, $i, j = 1, 2, \dots, K$, then $X \in S_j(k)$, where $k$ is the iteration number (first iteration $k = 1$), $S_j$ denotes the $j$-th cluster, and its center is $Z_j$;
Step 3: compute the new cluster-center vectors $Z_j(k+1)$, $j = 1, 2, \dots, K$, as the mean vector of the samples contained in each cluster domain:
$$Z_j(k+1) = \frac{1}{N_j} \sum_{X \in S_j(k)} X, \qquad j = 1, 2, \dots, K$$
where $N_j$ is the number of samples contained in the $j$-th cluster domain $S_j(k)$; taking the mean vector as the new cluster center minimizes the following clustering criterion function:
$$J_j = \sum_{X \in S_j(k)} \|X - Z_j(k+1)\|^2, \qquad j = 1, 2, \dots, K$$
Step 4: if $Z_j(k+1) \neq Z_j(k)$ for any $j = 1, 2, \dots, K$, return to Step 2, reclassify the pattern samples one by one, and repeat the iteration; if $Z_j(k+1) = Z_j(k)$ for all $j = 1, 2, \dots, K$, the algorithm has converged and the computation ends.
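For illustration, Steps 1 through 4 above translate directly into the following NumPy sketch (a minimal version under the stated iteration scheme; the sample data and K are arbitrary, and a production system would more likely use a tested library implementation):

```python
# Minimal K-means sketch following Steps 1-4 above; data and K are arbitrary.
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K initial cluster centers from the samples.
    Z = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to the nearest center (minimum distance).
        d = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each center as the mean vector of its cluster.
        Z_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else Z[j] for j in range(K)])
        # Step 4: stop when the centers no longer move (convergence).
        if np.allclose(Z_new, Z):
            break
        Z = Z_new
    return labels, Z

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, K=2)
```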
S3, large-model selection and management: large models fall into four major classes, large language models, computer vision (covering images and video), audio, and multi-modal large models; a corresponding large model is selected based on the data type;
S4, running the large model: implementing the external knowledge-base QA function based on the large model;
In this step, as shown in FIG. 4, the external knowledge-base QA function is implemented based on the large model, specifically comprising the following steps:
S4.1, knowledge-base preparation: prepare documents for the mainstream databases and classify them;
S4.2, embedding (Embedding): convert the text into vectors for storage, and store the embedded knowledge in a vector database for later retrieval; embedding mainly comprises three steps: file reading (load_file), vectorization plus persistent storage (init_vector_store), and query;
S4.3, large-model knowledge query: obtain the relevant knowledge through the query step;
S4.4, generative reasoning: use the capability of the large model to reason and summarize over the retrieved prior knowledge through in-context learning (ICL, In-Context Learning).
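As a sketch of steps S4.1 through S4.4 using the Chroma vector database mentioned in step S2 (the collection name and documents are hypothetical, Chroma's default embedding function is assumed, and `ask_large_model` is a stand-in for whichever large model is actually deployed):

```python
# Sketch of the external knowledge-base QA flow: load -> embed/store -> query.
# Assumptions: chromadb is installed; documents and names are placeholders;
# `ask_large_model` stands in for whatever LLM endpoint is actually deployed.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="db_docs")  # init_vector_store

# S4.1 + S4.2: load documents and store them (Chroma embeds them by default).
collection.add(
    ids=["doc1", "doc2"],
    documents=["MySQL stores rows in tables...",
               "A NULL ratio above 5% usually signals missing data..."],
)

# S4.3: query - retrieve the knowledge most relevant to the question.
question = "How do I check a table for missing values?"
hits = collection.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# S4.4: in-context learning - hand question plus retrieved knowledge to the LLM.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
# answer = ask_large_model(prompt)  # hypothetical call to the deployed model
```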
S5, selection and management of the SQL generation mode: different SQL generation modes are invoked for different types of data;
In this step, different SQL generation modes are invoked for different types of data, specifically:
The SQL generation modes mainly comprise three: generating scripts programmatically, using system views, and operating the database manually; the programmatic mode is suitable for objects, such as tables and indexes, that cannot be generated in batch by other methods; the system-view mode is suitable for obtaining scalar functions, table-valued functions, stored procedures, triggers, and views; manual operation of the database is suitable for generating scripts for objects that need not, or cannot, be handled automatically and programmatically;
Before an SQL generation mode is invoked, the SQL statement type must be confirmed; by function, SQL statements divide into three types: the data definition language DDL, the data manipulation language DML, and the data control language DCL, of which DML is used the most;
DDL (Data Definition Language) is used to create or delete objects such as the databases that store data and the tables within them; DDL contains the following instructions:
CREATE: creates objects such as databases and tables;
DROP: deletes objects such as databases and tables;
ALTER: modifies the structure of objects such as databases and tables;
DML (Data Manipulation Language) is used to query or modify the records in tables; DML contains the following instructions:
SELECT: looks up data in a table;
INSERT: inserts new data into a table;
UPDATE: updates data in a table;
DELETE: deletes data from a table;
DCL (Data Control Language) is used to confirm or cancel changes made to the data in a database; it can also set whether an RDBMS user has the authority to operate on objects in the database (tables, etc.); DCL contains the following instructions:
COMMIT: confirms the changes made to the data in the database;
ROLLBACK: cancels the changes made to the data in the database;
GRANT: grants operation authority to a user;
REVOKE: revokes a user's operation authority.
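The three statement classes can be exercised end to end with Python's built-in sqlite3 module, as sketched below. Note that SQLite expresses COMMIT/ROLLBACK through the connection object and, unlike a server RDBMS such as MySQL, does not implement GRANT/REVOKE.

```python
# DDL / DML / DCL demo with Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: CREATE / ALTER define objects such as tables (DROP would delete them).
cur.execute("CREATE TABLE Product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
cur.execute("ALTER TABLE Product ADD COLUMN stock INTEGER")

# DML: INSERT / UPDATE / SELECT manipulate and read records.
cur.execute("INSERT INTO Product (name, price, stock) VALUES ('pen', 1.5, 100)")
cur.execute("UPDATE Product SET price = 1.2 WHERE name = 'pen'")
conn.commit()  # COMMIT: confirm the changes made so far

# ROLLBACK: cancel an uncommitted change.
cur.execute("DELETE FROM Product WHERE name = 'pen'")
conn.rollback()
print(cur.execute("SELECT * FROM Product").fetchall())  # the row is still there
conn.close()
```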
S6, generating and running SQL scripts and performing SQL diagnosis;
In this step, as shown in FIG. 5, the process of generating and running the SQL script specifically comprises the following steps:
S6.1, confirm the data type, invoke the SQL generation mode corresponding to that type, select the SQL statement type according to the user's operation requirements on the data, and retrieve a pre-written SQL template from the local MySQL database;
S6.2, generate executable SQL: first select the corresponding database, then use the model to generate SQL according to the schema information of that database;
S6.3, automatically parse and execute the SQL and output the run result; in this step, SQL Server is preferably used to generate and run the SQL script;
S6.4, train on the executable SQL with machine learning; a machine-learning model trained for a sufficient number of iterations performs intelligent diagnosis on newly generated SQL and directly halts the execution of problematic SQL.
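Structurally, steps S6.1 through S6.4 amount to a short pipeline, sketched below with trivial placeholder helpers; in a real system `match_template`, `llm_generate_sql`, and `diagnose` would be backed by the template store of S6.1, the deployed large model of S6.2, and the trained diagnostic model of S6.4 respectively.

```python
# Structural sketch of S6.1-S6.4 with trivial placeholder helpers.
import sqlite3

def match_template(user_request: str) -> str:
    # S6.1 placeholder: keyword-match the request to a stored SQL template.
    return "SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"

def llm_generate_sql(template: str, schema: dict) -> str:
    # S6.2 placeholder: a real system would prompt the large model with the
    # template plus the database schema; here we just fill in the names.
    return template.format(**schema)

def diagnose(sql: str) -> bool:
    # S6.4 placeholder: the trained model would score the SQL; here we only
    # reject obviously destructive statements.
    return not sql.lstrip().upper().startswith(("DROP", "DELETE"))

def generate_and_run(user_request: str, schema: dict, conn: sqlite3.Connection):
    sql = llm_generate_sql(match_template(user_request), schema)
    if not diagnose(sql):                  # halt problematic SQL (S6.4)
        raise RuntimeError("problematic SQL halted before execution")
    return conn.execute(sql).fetchall()    # parse, execute, output (S6.3)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, NULL)")
print(generate_and_run("count missing customers",
                       {"table": "orders", "column": "customer_id"}, conn))
```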
In step S6.1, pre-written SQL templates must comply with the writing rules of SQL statements, including but not limited to:
1) An SQL statement ends with a semicolon (;);
2) SQL statements are case-insensitive; for readability, however, SQL may be written with the following conventions: keywords in upper case, table names with initial capitals, and everything else (column names, etc.) in lower case;
3) Constants are written in a fixed manner;
4) Words are separated by half-width spaces or line breaks.
Further, in step S6.1, when a pre-written SQL template is retrieved from the local MySQL database, a keyword extraction algorithm performs the template matching; keyword extraction adopts the SKE algorithm, in which a word's keyness is determined jointly by its semantic contribution value and its statistical feature values, computed as:
$$key(W_i) = vdw \cdot Vd_i + tw \cdot \Big( \sum_{j=1}^{3} locw_j \cdot loc_{ij} + lenw \cdot len_i + posw \cdot pos_i + tfidfw \cdot tfidf_i \Big)$$
where $Vd_i$ is the semantic contribution value of $W_i$; $vdw$ is the weight of the semantic contribution value; $tw$ is the weight of the statistical feature values; $loc_{ij}$ indicates whether $W_i$ occurs at position $j$; $locw_j$ is the weight of position $j$ among the statistical features, where $j$ takes the values 1, 2, 3, corresponding to the title, the start of a paragraph, and the end of a paragraph; $len_i$ is the word length of $W_i$; $lenw$ is the word-length weight among the statistical features; $pos_i$ is the part-of-speech value of $W_i$; $posw$ is the part-of-speech weight among the statistical features; $tfidf_i$ is the TF-IDF value of $W_i$; and $tfidfw$ is the TF-IDF weight among the statistical features.
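Written out, the keyness computation is a weighted combination of the semantic contribution value and the statistical features, as in this minimal sketch; all weights and per-word feature values below are illustrative placeholders that would be tuned on a real corpus.

```python
# SKE-style keyness score mirroring the formula above; weights and feature
# values are illustrative placeholders, not tuned parameters.
def ske_keyness(vd, loc, length, pos, tfidf,
                vdw=0.5, tw=0.5,
                locw=(0.5, 0.3, 0.2),  # weights: title, para start, para end
                lenw=0.1, posw=0.2, tfidfw=0.7):
    # vd: semantic contribution value; loc: 0/1 flags for the three positions.
    positional = sum(w * flag for w, flag in zip(locw, loc))
    statistical = positional + lenw * length + posw * pos + tfidfw * tfidf
    return vdw * vd + tw * statistical

# Example: a two-character word appearing in the title and at a paragraph end.
print(ske_keyness(vd=0.8, loc=(1, 0, 1), length=2, pos=1.0, tfidf=0.35))
```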
S7, data quality detection: continuously performing quality detection in five aspects, timeliness, accuracy, completeness, change, and data flow, through the generated executable SQL;
In this step, the quality detection performed on the data in five aspects is specifically:
Timeliness: judging whether the data lags, to ensure that the data is up to date in real time;
Accuracy: judging at the field level whether the data is accurate and whether the range of each field's values meets expectations;
Completeness: judging whether the data has missing parts;
Change: judging whether the database schema of the data has changed;
Data flow: judging the dependency flow of the data, and the upstream and downstream effects should the data be erroneous.
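For illustration, each of the five checks can reduce to an executable SQL probe of the kind the method generates; in the sketch below the `orders` and `shipments` tables, their columns, and the one-day freshness threshold are hypothetical placeholders.

```python
# Illustrative quality-check SQL (MySQL dialect) for the five aspects; all
# table names, columns, and thresholds are hypothetical placeholders.
QUALITY_CHECKS = {
    # timeliness: flag data that lags behind by more than one day
    "timeliness": ("SELECT MAX(updated_at) < NOW() - INTERVAL 1 DAY "
                   "FROM orders"),
    # accuracy: field-level range check on values
    "accuracy": "SELECT COUNT(*) FROM orders WHERE amount < 0 OR amount > 1e9",
    # completeness: count rows with a missing required field
    "completeness": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    # change: track the column count to detect schema drift
    "change": ("SELECT COUNT(*) FROM information_schema.columns "
               "WHERE table_name = 'orders'"),
    # data flow: find upstream rows with no downstream counterpart
    "data_flow": ("SELECT COUNT(*) FROM orders o LEFT JOIN shipments s "
                  "ON o.id = s.order_id WHERE s.order_id IS NULL"),
}

for aspect, sql in QUALITY_CHECKS.items():
    print(f"{aspect}: {sql}")
```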
S8, data processing: based on the database management system DBMS, automatically performing corresponding correction operations, such as updating, checking, and filling, on the data with detected problems, and feeding back unprocessable data problems to the user in real time;
In this step, for the five kinds of data quality problems detected in step S7, the corresponding update, adjustment, completion, correction, and full-flow adjustment operations are performed on the data to guarantee its quality; redundant duplicate data and residual fragment data may remain after processing and must be cleaned; data cleansing adopts the information-entropy algorithm, computed as:
$$H(X) = -\sum_{i=1}^{n} P(X_i) \log_2 P(X_i)$$
where $i = 1, 2, 3, \dots, n$; $X_i$ denotes the $i$-th state ($n$ states in total); $P(X_i)$ denotes the probability that the $i$-th state occurs; and $H(X)$ is the amount of information required to eliminate the uncertainty, in bits.
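The entropy above is simple to compute; a minimal sketch, with an illustrative probability distribution:

```python
# Shannon entropy H(X) = -sum(P(Xi) * log2(P(Xi))), measured in bits.
import math

def entropy(probabilities):
    # States with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Illustrative example: four states with these occurrence probabilities.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # -> 1.75 bits
```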
As shown in fig. 6, the present embodiment further provides a data quality SQL script generation platform device, which includes a processor, a memory, and a computer program stored in the memory and running on the processor.
The processor comprises one or more than one processing core, the processor is connected with the memory through a bus, the memory is used for storing program instructions, and the processor realizes the steps of the large-model-based data quality SQL script generation method when executing the program instructions in the memory.
Alternatively, the memory may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
In addition, the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the large model-based data quality SQL script generation method when being executed by a processor.
Optionally, the present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the data quality SQL script generation method of the above aspects based on a large model.
It will be appreciated by those of ordinary skill in the art that the processes for implementing all or part of the steps of the above embodiments may be implemented by hardware, or may be implemented by a program for instructing the relevant hardware, and the program may be stored in a computer readable storage medium, where the above storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. A data quality SQL script generation method based on a large model, characterized by comprising the following steps:
S1, basic configuration: deploying a hardware configuration based on a consumer-grade graphics card, ensuring that it can host the database management system DBMS, meeting the computing-power requirements of large-model operation, and reserving a certain expansion space;
S2, database construction and management: constructing the databases and loading the database management system DBMS to manage the whole life cycle of the data;
S3, large-model selection and management: large models fall into four major classes: large language models, computer vision, audio, and multi-modal large models; a corresponding large model is selected based on the data type;
S4, running the large model: implementing the external knowledge-base QA function based on the large model;
S5, selection and management of the SQL generation mode: different SQL generation modes are invoked for different types of data;
S6, generating and running SQL scripts, and performing SQL diagnosis;
S7, data quality detection: continuously performing quality detection in five aspects, timeliness, accuracy, completeness, change, and data flow, through the generated executable SQL;
S8, data processing: automatically performing the corresponding correction operations on data with detected problems based on the database management system DBMS, and feeding back unprocessable data problems to the user in real time.
2. The large model-based data quality SQL script generation method of claim 1, wherein: in step S1, the expansion space reserved in the basic configuration at least includes reserved terminal interfaces for connecting external devices, and reserved storage space for installing the system and the plug-ins that support the operation of the method.
3. The large model-based data quality SQL script generation method of claim 1, wherein: in step S2, the databases at least include a local MySQL database, a default built-in database, a custom newly-added database, and an autonomously updated database, wherein:
the local MySQL database is the basic service supporting the operation of the method; a locally installed MySQL database or a Chroma vector database requiring no dedicated installation may be adopted;
the default built-in database is the basic native knowledge base supporting the construction of the large model;
the custom newly-added database is a business database built to the user's own requirements; it is constructed as follows:
1) the user sets the parameters of the database to be built, its name, type, storage location, and security, through the customization facilities of the database management system DBMS;
2) the user adds data at their own discretion, and the database management system DBMS intelligently classifies the newly added data according to preset rules;
3) the DBMS continuously supports user operations for adding, deleting, and rewriting the database, and automatically scans the existing database each time data is added so as to quickly detect duplicates;
the autonomously updated database is a database that the database management system DBMS builds automatically, based on AI and big-data technology, by crawling through a plug-in function, so as to suit various usage scenarios;
the database management system DBMS covers hierarchical databases, relational databases, object-oriented databases, and key-value stores; in particular, a relational database management system RDBMS is adopted;
in addition, in the database management system DBMS, type recognition and classification of the data must be emphasized; the classification modes at least include classifying by data file format, by data file size, and by data content; the classified data are then clustered separately.
4. The large model based data quality SQL script generation method of claim 3, wherein: the clustering of the classified data in the DBMS adopts the K-means clustering algorithm, which comprises the following steps:
Step 1: select $K$ initial cluster centers $Z_1(1), Z_2(1), \dots, Z_K(1)$, where the parenthesized index marks the iteration of the search for cluster centers;
Step 2: assign the pattern samples $\{X\}$ one by one to one of the $K$ clusters according to the minimum-distance criterion: if $\|X - Z_j(k)\| < \|X - Z_i(k)\|$ for all $i \neq j$, $i, j = 1, 2, \dots, K$, then $X \in S_j(k)$, where $k$ is the iteration number (first iteration $k = 1$), $S_j$ denotes the $j$-th cluster, and its center is $Z_j$;
Step 3: compute the new cluster-center vectors $Z_j(k+1)$, $j = 1, 2, \dots, K$, as the mean vector of the samples contained in each cluster domain:
$$Z_j(k+1) = \frac{1}{N_j} \sum_{X \in S_j(k)} X, \qquad j = 1, 2, \dots, K$$
where $N_j$ is the number of samples contained in the $j$-th cluster domain $S_j(k)$; taking the mean vector as the new cluster center minimizes the clustering criterion function
$$J_j = \sum_{X \in S_j(k)} \|X - Z_j(k+1)\|^2, \qquad j = 1, 2, \dots, K;$$
Step 4: if $Z_j(k+1) \neq Z_j(k)$ for any $j$, return to Step 2, reclassify the pattern samples one by one, and repeat the iteration; if $Z_j(k+1) = Z_j(k)$ for all $j$, the algorithm has converged and the computation ends.
5. The large model-based data quality SQL script generation method of claim 1, wherein: in step S4, the external knowledge-base QA function is implemented based on the large model, specifically comprising the following steps:
S4.1, knowledge-base preparation: preparing documents for the mainstream databases and classifying them;
S4.2, embedding (Embedding): converting the text into vectors for storage, and storing the embedded knowledge in a vector database for later retrieval; embedding mainly comprises three steps: file reading, vectorization plus persistent storage, and query;
S4.3, large-model knowledge query: obtaining the relevant knowledge through the query step;
S4.4, generative reasoning: using the capability of the large model to reason and summarize over the retrieved prior knowledge through ICL.
6. The large model-based data quality SQL script generation method of claim 1, wherein: in step S5, different SQL generation modes are invoked for different types of data, specifically:
the SQL generation modes mainly comprise three: generating scripts programmatically, using system views, and operating the database manually; the programmatic mode is suitable for objects, such as tables and indexes, that cannot be generated in batch by other methods; the system-view mode is suitable for obtaining scalar functions, table-valued functions, stored procedures, triggers, and views; manual operation of the database is suitable for generating scripts for objects that need not, or cannot, be handled automatically and programmatically;
before an SQL generation mode is invoked, the SQL statement type must be confirmed; by function, SQL statements divide into three types: the data definition language DDL, the data manipulation language DML, and the data control language DCL;
DDL is used to create or delete objects such as the databases that store data and the tables within them;
DML is used to query or modify the records in tables;
DCL is used to confirm or cancel changes made to the data in a database; it can also set whether an RDBMS user has the authority to operate on objects in the database.
7. The large model-based data quality SQL script generation method of claim 1, wherein: in step S6, the process of generating and running the SQL script specifically comprises the following steps:
S6.1, confirming the data type, invoking the SQL generation mode corresponding to that type, selecting the SQL statement type according to the user's operation requirements on the data, and retrieving a pre-written SQL template from the local MySQL database;
S6.2, generating executable SQL: first selecting the corresponding database, then using the model to generate SQL according to the schema information of that database;
S6.3, automatically parsing and executing the SQL and outputting the run result;
S6.4, training on the executable SQL with machine learning; a machine-learning model trained for a sufficient number of iterations performs intelligent diagnosis on newly generated SQL and directly halts the execution of problematic SQL.
8. The large-model-based data quality SQL script generation method of claim 7, wherein: in step S6.1, when the pre-written SQL template stored in the local MySQL database is retrieved, a keyword extraction algorithm is adopted for template matching; the keyword extraction adopts an SKE algorithm in which the keyword score of a word is jointly determined by its semantic contribution value and its statistical feature values, and is calculated as follows:
where Vd_i represents the semantic contribution value of word W_i; vdw denotes the weight of the semantic contribution value; tw is the weight of the statistical feature values; Loc_ij indicates whether W_i occurs at position j; locw_j denotes the weight of position j among the statistical features, where j takes the values 1, 2 and 3, corresponding to the title, the beginning of a paragraph and the end of a paragraph, respectively; Len_i denotes the word length of W_i; lenw denotes the word-length weight among the statistical features; Pos_i represents the part-of-speech value of W_i; posw denotes the part-of-speech weight among the statistical features; tfidf_i represents the TF-IDF value of W_i; and tfidfw represents the TF-IDF weight among the statistical features.
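The keyword-score formula itself is not reproduced in this text (it appears as an image in the original publication). A plausible linear-combination form, offered only as an assumption consistent with the variable definitions above and not as the patent's exact expression, would be:

F(W_i) = vdw * Vd_i + tw * ( sum_{j=1..3} Loc_ij * locw_j + lenw * Len_i + posw * Pos_i + tfidfw * tfidf_i )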
9. The large model-based data quality SQL script generation method of claim 1, wherein: in step S7, the quality detection of the data in five aspects specifically includes:
Timeliness: judging whether the data lags behind, and ensuring that the data is kept up to date in real time;
Accuracy: judging, at the field level, whether the data is accurate and whether the value range of each field meets expectations;
Completeness: judging whether the data is missing;
Schema change: judging whether the database schema of the data has changed;
Data lineage: judging the dependency flow of the data and the upstream and downstream parties that would be affected if the data were erroneous.
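An illustrative Python sketch of the five checks; the thresholds, column names and the row format (a list of dicts) are assumptions made for the example, not the patented detection logic:

from datetime import datetime, timedelta

def check_timeliness(latest_ts: datetime, max_lag: timedelta) -> bool:
    # Timeliness: the newest record must not lag behind by more than max_lag.
    return datetime.now() - latest_ts <= max_lag

def check_accuracy(rows: list[dict], column: str, lo: float, hi: float) -> bool:
    # Accuracy: every value of the field falls within the expected range.
    return all(lo <= r[column] <= hi for r in rows)

def check_completeness(rows: list[dict], required: list[str]) -> bool:
    # Completeness: no required field is missing.
    return all(r.get(col) is not None for r in rows for col in required)

def check_schema_change(current_schema: dict, expected_schema: dict) -> bool:
    # Schema change: the live schema still matches the expected definition.
    return current_schema == expected_schema

def check_lineage(downstream: dict[str, list[str]], failed_table: str) -> list[str]:
    # Data lineage: which downstream consumers are affected if this table is erroneous.
    return downstream.get(failed_table, [])

rows = [{"amount": 120.0, "customer_id": 7}, {"amount": 80.5, "customer_id": None}]
print(check_completeness(rows, ["customer_id"]))                  # False: a value is missing
print(check_lineage({"orders": ["daily_report", "billing"]}, "orders"))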
10. The large-model-based data quality SQL script generation method of claim 9, wherein: in step S8, for the five kinds of data quality problems detected in step S7, the corresponding processing operations of updating, adjusting, filling, correcting and full-flow adjustment are performed on the data to ensure its quality; redundant duplicate data and residual fragment data may remain after the data has been processed, so the data needs to be cleaned; the data cleaning adopts an information entropy algorithm, calculated as follows:
H(X) = -sum_{i=1}^{n} P(X_i) * log2 P(X_i), where i = 1, 2, 3, ..., n; X_i denotes the i-th of n states in total; P(X_i) denotes the probability of occurrence of the i-th state; and H(X) is the amount of information, measured in bits, required to eliminate the uncertainty.
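For illustration, the entropy above can be computed per column to flag redundant data after processing; treating near-zero entropy as a redundancy signal is this sketch's assumption, not a claim of the patent:

import math
from collections import Counter

def entropy(values: list) -> float:
    """Shannon entropy H(X) = -sum(P(x_i) * log2(P(x_i))), in bits."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A column whose entropy is (near) zero carries no information: every row holds
# the same value, so it is a candidate for cleaning as redundant data.
print(entropy(["A", "A", "A", "A"]))   # 0.0 bits
print(entropy(["A", "B", "A", "C"]))   # 1.5 bits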
CN202410030412.3A, filed 2024-01-08: A method for generating data quality SQL scripts based on large models (Pending), published as CN117971800A

Priority Applications (1)

Application Number: CN202410030412.3A; Priority/Filing Date: 2024-01-08; Title: A method for generating data quality SQL scripts based on large models

Publications (1)

Publication Number: CN117971800A; Publication Date: 2024-05-03

Family ID: 90855797

Country Status (1): CN; CN117971800A (en)



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination