CN115617830A - Data query optimization processing method and device based on machine learning - Google Patents
Data query optimization processing method and device based on machine learning Download PDFInfo
- Publication number
- CN115617830A CN115617830A CN202110785793.2A CN202110785793A CN115617830A CN 115617830 A CN115617830 A CN 115617830A CN 202110785793 A CN202110785793 A CN 202110785793A CN 115617830 A CN115617830 A CN 115617830A
- Authority
- CN
- China
- Prior art keywords
- query
- information
- execution plan
- machine learning
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
技术领域technical field
本发明涉及计算机应用技术领域,具体涉及一种基于机器学习的数据查询优化处理方法和装置。另外,还涉及一种电子设备及处理器可读存储介质。The invention relates to the technical field of computer applications, in particular to a machine learning-based data query optimization processing method and device. In addition, it also relates to an electronic device and a processor-readable storage medium.
背景技术Background technique
结构化查询语言(Structured Query Language)简称SQL,是一种数据库查询和程序设计语言,用于存取数据以及查询、更新和管理关系数据库系统中存储的数据。结构化查询语言能够允许用户在高层数据结构上工作。其不要求用户指定数据的存放方法,也不需要用户了解具体的数据存放方式,所以具有完全不同底层结构的不同数据库系统,可以使用相同的结构化查询语言作为数据输入与管理的接口。同时,结构化查询语言语句可以嵌套,从而使其具有极大的灵活性和强大的功能。Structured Query Language (Structured Query Language), referred to as SQL, is a database query and programming language for accessing data and querying, updating and managing data stored in relational database systems. Structured Query Language allows users to work on high-level data structures. It does not require the user to specify the data storage method, nor does it require the user to understand the specific data storage method, so different database systems with completely different underlying structures can use the same structured query language as the interface for data input and management. At the same time, structured query language statements can be nested, so that it has great flexibility and powerful functions.
近年来,随着互联网技术的快速发展,借助结构化查询语言实现数据查询应用越来越广泛。然而,由于数据查询过程的执行环境是极其复杂的,尤其是随着数据量和数据分布的变化,现有技术中用静态统计数据进行优化方案匹配,优化质量准确性不高,同时无法应对数据查询的多种执行场景,导致数据查询效率较低。因此,如何设计一种稳定、高效的基于机器学习的数据查询优化处理方案成为本领域技术人员研究的重要课题。In recent years, with the rapid development of Internet technology, data query with the help of structured query language has become more and more widely used. However, since the execution environment of the data query process is extremely complex, especially with the change of data volume and data distribution, in the prior art, static statistical data is used for optimization scheme matching, the accuracy of optimization quality is not high, and at the same time, it cannot cope with data Various execution scenarios of queries lead to low data query efficiency. Therefore, how to design a stable and efficient data query optimization processing scheme based on machine learning has become an important research topic for those skilled in the art.
发明内容Contents of the invention
为此,本发明提供一种基于机器学习的数据查询优化处理方法及装置,以解决现有技术存在的数据查询优化处理方案局限性较高,导致复杂执行环境下的数据查询效率和准确性较差的问题。For this reason, the present invention provides a data query optimization processing method and device based on machine learning to solve the problem that existing data query optimization processing schemes in the prior art have relatively high limitations, resulting in relatively low efficiency and accuracy of data query in complex execution environments. bad question.
第一方面,本发明提供一种基于机器学习的数据查询优化处理方法,包括:获取输入的结构化查询语句信息;In the first aspect, the present invention provides a data query optimization processing method based on machine learning, including: acquiring input structured query statement information;
确定所述结构化查询语句信息对应的执行计划,并将所述执行计划对应的执行环境参数信息转化成相应的表示向量;Determining the execution plan corresponding to the structured query statement information, and converting the execution environment parameter information corresponding to the execution plan into a corresponding representation vector;
将所述表示向量输入到基于长短期记忆网络的查询优化器模型中,得到所述查询优化器模型输出的代价信息,并基于所述代价信息确定相应的目标执行计划;Inputting the representation vector into a query optimizer model based on a long short-term memory network, obtaining cost information output by the query optimizer model, and determining a corresponding target execution plan based on the cost information;
其中,所述查询优化器模型是将历史执行计划作为样本数据、所述历史执行计划对应的实际代价作为目标值进行训练得到的。Wherein, the query optimizer model is obtained by training the historical execution plan as sample data and the actual cost corresponding to the historical execution plan as the target value.
在一个实施例中,所述执行环境参数信息包含查询解析树、执行所述结构化查询语句信息的查询负载数据、执行环境的硬件特征信息以及软件特征信息。In one embodiment, the execution environment parameter information includes query parsing tree, query load data for executing the structured query statement information, hardware feature information and software feature information of the execution environment.
在一个实施例中,所述查询优化器模型,包括:基数估计模块,成本估计模块,连接顺序选择模块、具有机器学习和强化学习能力的学习型优化器模块;In one embodiment, the query optimizer model includes: a cardinality estimation module, a cost estimation module, a connection order selection module, and a learning optimizer module with machine learning and reinforcement learning capabilities;
其中,所述基数估计模块,用于确定所述执行计划每个级别上处理的总行数,将所述总行数确定为所述执行计划的基数信息;Wherein, the cardinality estimation module is configured to determine the total number of rows processed at each level of the execution plan, and determine the total number of rows as the cardinality information of the execution plan;
所述成本估计模块,用于基于所述结构化查询语句信息对应的统计信息,确定相应的查询优化器中各个物理路径的成本信息;其中,统计信息包括:高频值统计信息,直方图统计信息,目标系数统计信息,类型高频值信息,数组类型直方图信息以及范围值类型统计信息;The cost estimation module is configured to determine the cost information of each physical path in the corresponding query optimizer based on the statistical information corresponding to the structured query statement information; wherein, the statistical information includes: high-frequency value statistical information, histogram statistics information, target coefficient statistics, type high frequency value information, array type histogram information, and range value type statistics;
所述连接顺序选择模块,用于基于所述长短期记忆网络的强化学习能力进行连接顺序选择优化,得到对应的执行计划;The connection sequence selection module is used to select and optimize the connection sequence based on the reinforcement learning capability of the long-short-term memory network to obtain a corresponding execution plan;
所述学习型优化器模块,用于基于所述基数信息和所述成本信息,利用机器学习和强化学习的方式从所述执行计划中确定相应的目标执行计划。The learning optimizer module is configured to determine a corresponding target execution plan from the execution plan by using machine learning and reinforcement learning based on the cardinality information and the cost information.
在一个实施例中,所述的基于机器学习的数据查询优化处理方法,还包括:基于预设的改写规则对预先确定的结构化查询语句信息的优化类型对所述结构化查询语句信息中的目标查询语句进行优化改写;或者,基于预设的学习规则,采用机器学习和强化学习的方式对所述结构化查询语句信息进行优化选择,得到相应的查询改写结果。In one embodiment, the machine learning-based data query optimization processing method further includes: optimizing the predetermined structured query statement information based on preset rewriting rules and optimizing the structured query statement information. Optimizing and rewriting the target query statement; or, based on preset learning rules, using machine learning and reinforcement learning to optimize and select the information of the structured query statement to obtain a corresponding query rewriting result.
在一个实施例中,所述的基于机器学习的数据查询优化处理方法,还包括:采用对应数据库的查询解析工具中的SQL解析器对所述结构化查询语句信息进行解析,得到包含运算符表达式的查询解析树;利用所述查询优化器模型,根据预设的插拔原则对所述运算符表达式进行优化处理,得到相应的查询解析结果。In one embodiment, the machine learning-based data query optimization processing method further includes: using the SQL parser in the query parsing tool of the corresponding database to parse the structured query statement information to obtain the expression containing the operator formula query parsing tree; using the query optimizer model to optimize the operator expression according to the preset plugging principle, and obtain corresponding query parsing results.
在一个实施例中,所述长短期记忆网络查询优化器模型用于,基于从所述历史执行计划进行特征提取得到的表示特征,对所述结构化查询语句信息进行分析处理,确定对应的代价信息。In one embodiment, the long-short-term memory network query optimizer model is used to analyze and process the structured query statement information based on the representation features obtained by feature extraction from the historical execution plan, and determine the corresponding cost information.
第二方面,本发明还提供基于机器学习的数据查询优化处理装置,包括:结构化查询语句获取单元,用于获取输入结构化查询语句信息;In the second aspect, the present invention also provides a data query optimization processing device based on machine learning, including: a structured query statement acquisition unit, configured to acquire input structured query statement information;
向量表示单元,用于确定所述结构化查询语句信息对应的执行计划,将所述执行计划对应的执行环境参数信息转化成相应的表示向量;A vector representation unit, configured to determine the execution plan corresponding to the structured query statement information, and convert the execution environment parameter information corresponding to the execution plan into a corresponding representation vector;
查询优化处理单元,用于将所述表示向量输入到基于长短期记忆网络的查询优化器模型中,得到所述查询优化器模型输出的代价信息,并基于所述代价信息确定相应的目标执行计划;A query optimization processing unit, configured to input the representation vector into a query optimizer model based on a long short-term memory network, obtain cost information output by the query optimizer model, and determine a corresponding target execution plan based on the cost information ;
其中,所述查询优化器模型是将历史执行计划作为样本数据、所述历史执行计划对应的实际代价作为目标值进行训练得到的。Wherein, the query optimizer model is obtained by training the historical execution plan as sample data and the actual cost corresponding to the historical execution plan as the target value.
在一个实施例中,所述执行环境参数信息包含查询解析树、执行所述结构化查询语句信息的查询负载数据、执行环境的硬件特征信息以及软件特征信息。In one embodiment, the execution environment parameter information includes query parsing tree, query load data for executing the structured query statement information, hardware feature information and software feature information of the execution environment.
在一个实施例中,所述查询优化器模型,包括:基数估计模块,成本估计模块,连接顺序选择模块、具有机器学习和强化学习能力的学习型优化器模块;In one embodiment, the query optimizer model includes: a cardinality estimation module, a cost estimation module, a connection order selection module, and a learning optimizer module with machine learning and reinforcement learning capabilities;
其中,所述基数估计模块,用于确定所述执行计划每个级别上处理的总行数,将所述总行数确定为所述执行计划的基数信息;Wherein, the cardinality estimation module is configured to determine the total number of rows processed at each level of the execution plan, and determine the total number of rows as the cardinality information of the execution plan;
所述成本估计模块,用于基于所述结构化查询语句信息对应的统计信息,确定相应的查询优化器中各个物理路径的成本信息;其中,统计信息包括:高频值统计信息,直方图统计信息,目标系数统计信息,类型高频值信息,数组类型直方图信息以及范围值类型统计信息;The cost estimation module is configured to determine the cost information of each physical path in the corresponding query optimizer based on the statistical information corresponding to the structured query statement information; wherein, the statistical information includes: high-frequency value statistical information, histogram statistics information, target coefficient statistics, type high frequency value information, array type histogram information, and range value type statistics;
所述连接顺序选择模块,用于基于所述长短期记忆网络的强化学习能力进行连接顺序选择优化,得到对应的执行计划;The connection sequence selection module is used to select and optimize the connection sequence based on the reinforcement learning capability of the long-short-term memory network to obtain a corresponding execution plan;
所述学习型优化器模块,用于基于所述基数信息和所述成本信息,利用机器学习和强化学习的方式从所述执行计划中确定相应的目标执行计划。The learning optimizer module is configured to determine a corresponding target execution plan from the execution plan by using machine learning and reinforcement learning based on the cardinality information and the cost information.
在一个实施例中,所述的基于机器学习的数据查询优化处理装置,还包括:查询改写单元,用于基于预设的改写规则对预先确定的结构化查询语句信息的优化类型对所述结构化查询语句信息中的目标查询语句进行优化改写;或者,基于预设的学习规则,采用机器学习和强化学习的方式对所述结构化查询语句信息进行优化选择,得到相应的查询改写结果。In one embodiment, the machine learning-based data query optimization processing device further includes: a query rewriting unit, configured to optimize the structure of the predetermined structured query statement information based on preset rewriting rules Optimize and rewrite the target query statement in the structured query statement information; or, based on preset learning rules, use machine learning and reinforcement learning to optimize and select the structured query statement information to obtain corresponding query rewriting results.
在一个实施例中,所述的基于机器学习的数据查询优化处理方法,还包括:查询解析单元,用于采用对应数据库的查询解析工具中的SQL解析器对所述结构化查询语句信息进行解析,得到包含运算符表达式的查询解析树;利用所述查询优化器模型,根据预设的插拔原则对所述运算符表达式进行优化处理,得到相应的查询解析结果。In one embodiment, the machine learning-based data query optimization processing method further includes: a query parsing unit for parsing the structured query statement information by using the SQL parser in the query parsing tool of the corresponding database , to obtain a query parsing tree including an operator expression; using the query optimizer model, the operator expression is optimized according to a preset plugging principle, and a corresponding query parsing result is obtained.
在一个实施例中,所述长短期记忆网络查询优化器模型用于,基于从所述历史执行计划进行特征提取得到的表示特征,对所述结构化查询语句信息进行分析处理,确定对应的代价信息。In one embodiment, the long-short-term memory network query optimizer model is used to analyze and process the structured query statement information based on the representation features obtained by feature extraction from the historical execution plan, and determine the corresponding cost information.
第三方面,本发明还提供一种电子设备,包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现如上述任意一项所述的基于机器学习的数据查询优化处理方法的步骤。In a third aspect, the present invention also provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and operable on the processor, when the processor executes the program, any one of the above-mentioned The steps of the machine learning-based data query optimization processing method.
第四方面,本发明还提供一种处理器可读存储介质,所述处理器可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现如上述任意一项所述的基于机器学习的数据查询优化处理方法的步骤。In a fourth aspect, the present invention also provides a processor-readable storage medium, where a computer program is stored on the processor-readable storage medium, and when the computer program is executed by a processor, the machine-based Learn the steps of the data query optimization processing method.
采用本发明所述的基于机器学习的数据查询优化处理方法,能够基于动态的执行环境参数信息,在不同执行环境下得到更准确的代价估计,从而快速、准确的选择到更优的执行计划,提高了查询优化处理的效率和精确度。By adopting the data query optimization processing method based on machine learning described in the present invention, based on the dynamic execution environment parameter information, more accurate cost estimates can be obtained in different execution environments, so as to quickly and accurately select a better execution plan, The efficiency and accuracy of query optimization processing are improved.
附图说明Description of drawings
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获取其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description These are some embodiments of the present invention. For those skilled in the art, other drawings can also be obtained based on these drawings without creative effort.
图1为本发明实施例提供的基于机器学习的数据查询优化处理方法的流程示意图;FIG. 1 is a schematic flowchart of a data query optimization processing method based on machine learning provided by an embodiment of the present invention;
图2为本发明实施例提供的用于实现基于机器学习的数据查询优化处理方法的结构示意图;FIG. 2 is a schematic structural diagram for implementing a machine learning-based data query optimization processing method provided by an embodiment of the present invention;
图3为本发明实施例提供的基于机器学习的数据查询优化处理方法中查询优化模块的结构示意图;3 is a schematic structural diagram of a query optimization module in a data query optimization processing method based on machine learning provided by an embodiment of the present invention;
图4为本发明实施例提供的基于机器学习的数据查询优化处理方法中SQL语句强化学习模型的结构示意图;4 is a schematic structural diagram of a SQL statement reinforcement learning model in a data query optimization processing method based on machine learning provided by an embodiment of the present invention;
图5为本发明实施例提供的基于机器学习的数据查询优化处理方法对应的具体实施流程示意图;FIG. 5 is a schematic diagram of a specific implementation process corresponding to a data query optimization processing method based on machine learning provided by an embodiment of the present invention;
图6为本发明实施例提供的基于机器学习的数据查询优化处理装置的结构示意图;6 is a schematic structural diagram of a data query optimization processing device based on machine learning provided by an embodiment of the present invention;
图7为本发明实施例提供的电子设备的实体结构示意图。FIG. 7 is a schematic diagram of a physical structure of an electronic device provided by an embodiment of the present invention.
具体实施方式detailed description
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获取的所有其他实施例,都属于本发明保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments It is a part of embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.
下面基于本发明所述的基于机器学习的数据查询优化处理方法,对其实施例进行详细描述。如图1所示,其为本发明实施例提供的基于机器学习的数据查询优化处理方法的流程示意图,具体实现过程包括以下步骤:Embodiments of the machine learning-based data query optimization processing method described in the present invention will be described in detail below. As shown in Figure 1, it is a schematic flow chart of a data query optimization processing method based on machine learning provided by an embodiment of the present invention, and the specific implementation process includes the following steps:
步骤101:获取输入的结构化查询语句信息。Step 101: Obtain the information of the input structured query statement.
在本发明实施例中,在执行本步骤之前,需要预先进行模型训练。具体的,首先需要将历史执行过的SQL(Structured Query Language)执行计划、该执行计划相关的属性列、表以及查询树、执行SQL语句的负载数据、硬件特征信息以及软件特征信息等执行环境信息转化成相应的向量表示,即确定对应的表示向量,将表示向量作为原始查询优化器模型的输入,以执行该执行计划的实际代价作为目标值,将目标值和预测值进行比较算出相应的损失,进行反向传播更新参数,实现模型的训练。所述原始查询优化器模型是指基于LSTM(Long Short-Term Memory;长短期记忆网络)的成本模型。进一步的,在本步骤中,获取SQL数据查询时输入的结构化查询语句信息,即SQL查询语句。In the embodiment of the present invention, before performing this step, model training needs to be performed in advance. Specifically, it is first necessary to include historically executed SQL (Structured Query Language) execution plans, attribute columns, tables, and query trees related to the execution plan, load data for executing SQL statements, hardware feature information, software feature information, and other execution environment information. Transform into the corresponding vector representation, that is, determine the corresponding representation vector, use the representation vector as the input of the original query optimizer model, take the actual cost of executing the execution plan as the target value, and compare the target value with the predicted value to calculate the corresponding loss , perform backpropagation to update the parameters, and realize the training of the model. The original query optimizer model refers to a cost model based on LSTM (Long Short-Term Memory; long short-term memory network). Further, in this step, the structured query statement information input during the SQL data query, that is, the SQL query statement is obtained.
步骤102:确定所述结构化查询语句信息对应的执行计划,并将所述执行计划对应的执行环境参数信息转化成相应的表示向量。其中,所述执行环境参数信息包含查询解析树、执行所述结构化查询语句信息的查询负载数据、执行环境的硬件特征信息以及软件特征信息等。所述执行计划为SQL执行计划。Step 102: Determine the execution plan corresponding to the structured query statement information, and convert the execution environment parameter information corresponding to the execution plan into a corresponding representation vector. Wherein, the execution environment parameter information includes query parsing tree, query load data for executing the structured query statement information, hardware feature information and software feature information of the execution environment, and the like. The execution plan is an SQL execution plan.
在本发明实施例中,首先确定SQL查询语句信息对应的各个执行计划,将该执行计划相关的属性列、表以及查询解析树、负载数据、硬件特征信息以及软件特征信息等执行环境信息转化成相应的表示向量,输入训练好的成本模型中,获取成本模型输出的代价,从而便于后续选取代价最低的执行计划进行执行。所述查询解析树,即抽象语法树(AbstractSyntax Tree,AST)。In the embodiment of the present invention, first determine each execution plan corresponding to the SQL query statement information, and convert the execution environment information such as attribute columns, tables, query parsing trees, load data, hardware feature information, and software feature information related to the execution plan into The corresponding representation vector is input into the trained cost model, and the cost output by the cost model is obtained, so that the execution plan with the lowest cost can be selected for subsequent execution. The query parse tree is an abstract syntax tree (AbstractSyntax Tree, AST).
在查询特征表示过程中,由于机器学习方法的性能在很大程度上取决于通过向量表示数据的形式或特征。因此,本申请抽取用户进行数据查询时的执行环境参数信息、数据库元数据(表、属性、统计信息等)等特征,可采用One-Hot方式进行特征表示。具体的,所述特征表示为:将SQL执行计划与该执行计划相关的属性列、表等对象,以及生成的查询解析树等转化成相应的表示向量,作为成本模型的输入。所述向量表示为:将属性列、表、查询解析树等,转化为相应的表示向量作为成本模型的输入。其中,查询解析树表示结构为Tree-LSTM,其是图神经的网络(GNN,Graph Neural Networks)的一种,该Tree-LSTM的参数可按照预设规则进行设置,在此不再详细赘述。In the process of query feature representation, since the performance of machine learning methods largely depends on the form or characteristics of data represented by vectors. Therefore, this application extracts features such as execution environment parameter information and database metadata (tables, attributes, statistical information, etc.) when users perform data queries, and can use the One-Hot method for feature representation. Specifically, the features are expressed as: transforming the SQL execution plan and objects such as attribute columns and tables related to the execution plan, and the generated query parsing tree into corresponding representation vectors, which are used as input of the cost model. The vector representation is as follows: attribute columns, tables, query parsing trees, etc. are converted into corresponding representation vectors as the input of the cost model. Among them, the representation structure of the query parsing tree is Tree-LSTM, which is a kind of graph neural network (GNN, Graph Neural Networks). The parameters of the Tree-LSTM can be set according to preset rules, and will not be described in detail here.
在具体实施过程中,SQL查询语句的表示可以包括:令n为数据库中的表数,每个表具有从0到n-1的唯一标识符;令m为每个元素为0或1的n*n矩阵;Mi,j为1表示在第i个表和第j个表之间存在连接关系,不存在则为0;然后,通过将原始矩阵m的所有行(从0到n-1)放入向量中来将该矩阵展平为向量v,即Vi*n+j=mi,j;|v|=n*n(把一个矩阵转化为一个一维的向量)。In the specific implementation process, the expression of the SQL query statement may include: let n be the number of tables in the database, each table has a unique identifier from 0 to n-1; let m be n where each element is 0 or 1 *n matrix; Mi, j being 1 means that there is a connection relationship between the i-th table and the j-th table, and 0 if it does not exist; then, by adding all rows of the original matrix m (from 0 to n-1) put into a vector to flatten the matrix into a vector v, that is, Vi*n+j=mi, j; |v|=n*n (convert a matrix into a one-dimensional vector).
属性列的表示:基于预设的SQL查询语句,只需要使用SQL中的谓词来构造列的表示形式,而无需扫描数据库。Representation of attribute columns: Based on preset SQL query statements, only the predicates in SQL need to be used to construct the representation of columns without scanning the database.
对于具有k列的表,使用其k个列的向量表示其表的表示向量。对于SQL查询语句中的一些符号化的操作符拟采用One-Hot的形式进行编码。除此之外,在实际实施过程中,针对基数估计中的Hash Join,Seq Scan等、每个表以及每个表的属性值等,也进行了相关的One-Hot向量化,在此不再详细赘述。For a table with k columns, use the vector of its k columns to represent its table's representation vector. For some symbolic operators in the SQL query statement, it is planned to use the One-Hot form to encode. In addition, in the actual implementation process, related One-Hot vectorization is also carried out for Hash Join, Seq Scan, etc. in cardinality estimation, each table and the attribute value of each table, etc., which will not be repeated here Go into details.
在模型处理过程中,嵌入层:将向量化之后的操作符和各种表的属性、字段等,经过一层Embedding将原本稀疏高维的表示向量变成低维稠密的表示向量。将最后处理好的数据输入到成本模型中进行计算。将操作符、元数据、基数估计操作方法以及节点图分别用Ot,Mt,Pt,Bt表示。经过激活层的处理,比如可以表示为如下:In the process of model processing, the embedding layer: transforms the vectorized operators and the attributes and fields of various tables into low-dimensional dense representation vectors through a layer of Embedding. Input the final processed data into the cost model for calculation. The operator, metadata, cardinality estimation operation method and node graph are denoted by O t , M t , P t , B t respectively. After the activation layer processing, for example, it can be expressed as follows:
embed(Ot)=ReLU(WoOt+bo)embed(O t )=ReLU(W o O t +b o )
embed(Mt)=ReLU(WmMt+bm)embed(M t )=ReLU(W m M t +b m )
embed(Bt)=ReLU(WbBt+bb)embed(B t )=ReLU(W b B t +b b )
其中,embed表示嵌入层函数;ReLU为线性整流函数;Wo、Wm、Wb、bo、bm、bb为预设参数。Among them, embed represents the embedding layer function; ReLU is the linear rectification function; W o , W m , W b , b o , b m , and b b are preset parameters.
之后的节点类型,比如可以表示为如下:The following node types, for example, can be expressed as follows:
其中,Pt表示为基数估计操作方法的选择类型,具体包含AND、OR及其他选择类型的操作;embed表示嵌入层函数;expr表示表达式;Wm、bp、为预设参数。Among them, P t represents the selection type of the cardinality estimation operation method, specifically including AND, OR and other selection types of operations; embed represents the embedded layer function; expr represents the expression; W m , b p , is a preset parameter.
表示层:本发明实施例中需要解决的问题是避免全局节点之间与根节点之间的损失,因此提出了使用基于LSTM的成本模型进行模型训练。将嵌入层返回值输入到表示层中,可对应如下形式化定义:Presentation layer: The problem to be solved in the embodiment of the present invention is to avoid the loss between the global nodes and the root node, so it is proposed to use the LSTM-based cost model for model training. Inputting the return value of the embedding layer into the presentation layer can correspond to the following formal definition:
output=LSTM(embed(Ot),embed(Mt),embed(Bt),embed(Pt))output=LSTM(embed(O t ),embed(M t ),embed(B t ),embed(P t ))
评估层:使用两层的全连接网络(Fully Connected,FC)进行连接,设置特定的Dropout以防止训练的过拟合。激活函数使用ReLU(Rectified Linear Unit,ReLU)线性整流函数或Sigmoid(S型函数)进行,最后将目标值和预测值进行比较算出相应的损失,进行反向传播更新参数。整体的形式化可以定义如下:Evaluation layer: use a two-layer fully connected network (Fully Connected, FC) to connect, and set a specific dropout to prevent overfitting of training. The activation function is performed using ReLU (Rectified Linear Unit, ReLU) linear rectification function or Sigmoid (S-type function). Finally, the target value is compared with the predicted value to calculate the corresponding loss, and the backpropagation is performed to update the parameters. The overall formalization can be defined as follows:
Final_output=sigmoid(Dropout(FC(output)))Final_output = sigmoid(Dropout(FC(output)))
以上拟采用方法可根据实际业务进行调整,以达到最优效果,最终可以得到训练好的基于长短期记忆网络的查询优化器模型。The above proposed method can be adjusted according to the actual business to achieve the optimal effect, and finally a trained query optimizer model based on the long-term and short-term memory network can be obtained.
步骤103:将所述表示向量输入到基于长短期记忆网络的查询优化器模型中,得到所述查询优化器模型输出的代价信息,并基于所述代价信息确定相应的目标执行计划。其中,所述查询优化器模型是将历史执行计划作为样本数据、所述历史执行计划对应的实际代价作为目标值进行训练得到的。所述长短期记忆网络查询优化器模型用于,基于从所述历史执行计划进行特征提取得到的表示特征,对所述结构化查询语句信息进行分析处理,确定对应的代价信息。Step 103: Input the representation vector into the query optimizer model based on the long short-term memory network, obtain the cost information output by the query optimizer model, and determine the corresponding target execution plan based on the cost information. Wherein, the query optimizer model is obtained by training the historical execution plan as sample data and the actual cost corresponding to the historical execution plan as the target value. The long-short-term memory network query optimizer model is used to analyze and process the structured query statement information based on the representation features obtained by feature extraction from the historical execution plan, and determine corresponding cost information.
如图2所示,本发明实施例中,查询处理系统包含查询解析器、查询改写器、查询优化器、查询执行器等;存储管理系统包含缓冲管理模块等;机器学习模型管理器。As shown in FIG. 2, in the embodiment of the present invention, the query processing system includes a query parser, query rewriter, query optimizer, query executor, etc.; the storage management system includes a buffer management module, etc.; and a machine learning model manager.
在查询解析过程中,可采用对应数据库的查询解析工具中的SQL解析器对所述结构化查询语句信息进行解析,得到包含运算符表达式的查询解析树;利用所述查询优化器模型,根据预设的插拔原则对所述运算符表达式进行优化处理,得到相应的查询解析结果。具体的,可采用对应数据库的查询解析工具Calcite,应用于各种离线、搜索、实时查询引擎,如Drill、Hive、Kylin、Solr、flink、Samza等,对SQL查询语句进行解析、SQL校验等功能。通过Calcite的SQL解析器解析得到查询解析树,该查询解析树包含了运算符表达式。查询优化器可根据插拔原则对运算符表达式进行优化,其中可使用元数据库提供的信息进行成本计算等操作。In the query parsing process, the SQL parser in the query parsing tool of the corresponding database can be used to parse the structured query statement information to obtain a query parsing tree containing operator expressions; using the query optimizer model, according to The preset insertion and removal principle optimizes the operator expression to obtain a corresponding query analysis result. Specifically, Calcite, a query analysis tool corresponding to the database, can be used to apply to various offline, search, and real-time query engines, such as Drill, Hive, Kylin, Solr, flink, Samza, etc., to analyze SQL query statements, SQL verification, etc. Features. The query parsing tree is parsed by Calcite's SQL parser, and the query parsing tree contains operator expressions. The query optimizer can optimize operator expressions according to the principle of plugging and unplugging, and can use the information provided by the metadata database for cost calculation and other operations.
在查询改写过程中,可基于预设的改写规则对预先确定的结构化查询语句信息的优化类型对所述结构化查询语句信息中的目标查询语句进行优化改写;或者,基于预设的学习规则,采用机器学习和强化学习的方式对所述结构化查询语句信息进行优化选择,得到相应的查询改写结果。具体的,提供一个预设的SQL改写工具,并对索引推荐进行优化,基于模型可以学习在给定查询工作量的情况下建立索引的优势和成本,然后根据学习到的信息自动建立索引,例如:外键索引,以提升查询速度。与传统的查询改写方法相比,只有基于SQL中的内容、用户的点击行为、语义信息以及获取到的不同用户的Session信息进行SQL语句的改写。本发明在此基础之上,采用两种方式进行改进,即基于预设的改写规则将预先确定的SQL优化类型对所述结构化查询语句信息中的目标查询语句进行优化改写;而基于预设的学习方式,通过机器学习和强化学习进行SQL查询语句的优化选择。During the query rewriting process, the target query statement in the structured query statement information may be optimized and rewritten based on the predetermined optimization type of structured query statement information based on preset rewriting rules; or, based on preset learning rules , using machine learning and reinforcement learning to optimize and select the structured query statement information to obtain a corresponding query rewriting result. Specifically, a preset SQL rewriting tool is provided, and the index recommendation is optimized. Based on the model, the advantages and costs of indexing can be learned in the case of a given query workload, and then the index is automatically established according to the learned information, for example : Foreign key index to improve query speed. Compared with the traditional query rewriting method, SQL statements are rewritten only based on the content in SQL, user click behavior, semantic information, and obtained Session information of different users. On this basis, the present invention adopts two ways to improve, that is, based on the preset rewriting rules, the predetermined SQL optimization type is optimized and rewritten to the target query statement in the structured query statement information; and based on the preset The learning method uses machine learning and reinforcement learning to optimize the selection of SQL query statements.
在查询优化过程中,所述查询优化器模型,包括:基数估计模块,成本估计模块,连接顺序选择模块、基于规则的优化器和基于成本的优化器、具有机器学习和强化学习能力的学习型优化器模块。其中,所述基数估计模块,用于确定所述执行计划每个级别上处理的总行数,将所述总行数确定为所述执行计划的基数信息;所述成本估计模块,用于基于所述结构化查询语句信息对应的统计信息,确定相应的查询优化器中各个物理路径的成本信息;其中,统计信息包括:高频值统计信息,直方图统计信息,目标系数统计信息,类型高频值信息,数组类型直方图信息以及范围值类型统计信息;所述连接顺序选择模块,用于基于所述长短期记忆网络的强化学习能力进行连接顺序选择优化,得到对应的执行计划;所述学习型优化器模块,用于基于所述基数信息和所述成本信息,利用机器学习和强化学习的方式从所述执行计划中确定相应的目标执行计划。In the query optimization process, the query optimizer model includes: a cardinality estimation module, a cost estimation module, a connection order selection module, a rule-based optimizer and a cost-based optimizer, a learning type with machine learning and reinforcement learning capabilities Optimizer module. Wherein, the cardinality estimation module is used to determine the total number of rows processed at each level of the execution plan, and determine the total number of rows as the cardinality information of the execution plan; the cost estimation module is used to determine the number of rows based on the The statistical information corresponding to the structured query statement information determines the cost information of each physical path in the corresponding query optimizer; among them, the statistical information includes: high-frequency value statistical information, histogram statistical information, target coefficient statistical information, type high-frequency value Information, array type histogram information and range value type statistical information; the connection order selection module is used to optimize the connection order selection based on the reinforcement learning ability of the long short-term memory network, and obtain the corresponding execution plan; the learning type An optimizer module, configured to determine a corresponding target execution plan from the execution plan by using machine learning and reinforcement learning based on the cardinality information and the cost information.
在查询优化器中,先用机器学习使用Spark MLib库中的分类和回归将SQL查询语句进行相应的分类,从分类之后的种类再提取相应的优化方法。查询优化器的整体结构如图3所示。其主要包括两个大部分:SQL优化部分和查询优化器部分。其中进行SQL优化部分时,可将SQL优化部分改写的SQL查询语句进行优化,通过对数据的特征提取、降维、协同过滤、关联分析以及趋势预测等,形成有索引推荐、视图推荐、分区推荐等优化功能。查询优化器部分中拟使用结合基数估计,成本估计,连接顺序选择(采用Tree-LSTM模型结构),同时采用基于规则的优化器(RBO,Rule-Based Optimizer),基于成本的优化器(CBO,Cost-Based Optimizer)以及具有机器学习和强化学习功能的学习型优化器。In the query optimizer, machine learning is first used to classify SQL query statements using classification and regression in the Spark MLib library, and then extract corresponding optimization methods from the classified types. The overall structure of the query optimizer is shown in Figure 3. It mainly includes two parts: SQL optimization part and query optimizer part. When performing the SQL optimization part, the SQL query statement rewritten in the SQL optimization part can be optimized, and index recommendations, view recommendations, and partition recommendations can be formed through data feature extraction, dimensionality reduction, collaborative filtering, association analysis, and trend prediction. and other optimization functions. In the query optimizer part, it is planned to use combined cardinality estimation, cost estimation, and connection order selection (using Tree-LSTM model structure), while using rule-based optimizer (RBO, Rule-Based Optimizer), cost-based optimizer (CBO, Cost-Based Optimizer) and Learning Optimizer with Machine Learning and Reinforcement Learning.
所述规模估算,即基数估算。首先确定执行计划每个级别上处理的总行数,称为该执行计划的基数;由查询中所使用的运算符规定的算法的开销模式;第一个因素(基数)用作第二个因素(开销模式)的输入参数;因此,增大基数将减少估计开销,从而加快执行计划。The size estimate is the base estimate. First determine the total number of rows processed at each level of the execution plan, known as the cardinality of the execution plan; the overhead pattern of the algorithm dictated by the operators used in the query; the first factor (the cardinality) is used as the second factor ( cost mode); therefore, increasing the cardinality will reduce the estimated cost and thus speed up the execution plan.
所述成本估算,即成本模型。确定查询优化器进行物理优化需要计算各种物理路径的成本,而成本估算严重依赖统计信息。所述统计信息是否能准确地描述表中的数据分布情况是决定成本的准确性的重要条件之一。在具体实施过程中,通过成本估算就可以了解一个表有多少条数据,用了多少个页面,某个值出现的频率等等,然后根据这些信息计算出一个约束条件能够过滤掉多少数据,这种约束条件过滤出的数据占总数据量的比例,称之为“选择率”。The cost estimate is a cost model. Determining the query optimizer for physical optimization requires calculating the cost of various physical paths, and cost estimation relies heavily on statistics. Whether the statistical information can accurately describe the data distribution in the table is one of the important conditions for determining the accuracy of the cost. In the specific implementation process, through cost estimation, you can know how many pieces of data a table has, how many pages are used, the frequency of a certain value, etc., and then calculate how much data can be filtered out by a constraint based on this information. The ratio of the data filtered out by such constraints to the total amount of data is called the "selection rate".
根据SQL查询语句提供多种形式的统计信息,其包括单列的统计信息和多列的统计信息,单列的统计信息是指每个表的属性列都在系统表中产生一个对应的统计元组,这个元组负责从多个角度描绘这个属性中的数据分布。Various forms of statistical information are provided according to SQL query statements, including single-column statistical information and multi-column statistical information. Single-column statistical information means that the attribute columns of each table generate a corresponding statistical tuple in the system table. This tuple is responsible for depicting the distribution of data in this attribute from multiple perspectives.
该统计信息包括:高频值统计,直方图统计,相关系数统计,类型高频值,数组类型直方图,最后还有范围值类型统计。从以上统计出的信息来形成对数据总体的把握,选择率的估算需要借助于统计信息(包括直方图、高频值、NULL值率等)。The statistical information includes: high-frequency value statistics, histogram statistics, correlation coefficient statistics, type high-frequency values, array type histogram, and finally range value type statistics. From the above statistical information to form a grasp of the overall data, the estimation of the selection rate needs the help of statistical information (including histogram, high frequency value, NULL value rate, etc.).
在实际实施过程中,确定了约束条件的选择率,也就是确定了通过扫描路径要扫描出来的结果所占的比例或者通过连接操作所获得的元组所占的比例,通过这个比例就可以推算出中间结果和最终结果的数量,进而使用这些数量来计算成本,从而确定相应的代价。In the actual implementation process, the selection rate of the constraints is determined, that is, the proportion of the results to be scanned through the scanning path or the proportion of the tuples obtained through the connection operation is determined, and this proportion can be calculated The number of intermediate results and final results is calculated, and then these numbers are used to calculate the cost, so as to determine the corresponding price.
所述计划生成,即连接顺序选择。采用Tree-LSTM模型和强化学习进行查询的连接优化。具体而言,其使用带有树结构的长期短期记忆(LSTM)的强化学习进行连接顺序选择。首先采用图神经网络来捕获连接树的结构;另外其支持数据库模式和多别名表名称的修改。在基于JOB(Join Order Benchmark)和TPC-H上进行的大量实验表明,RTOS(Real-timeoperating system)优于传统优化器和现有的基于DRL(Deep Reinforcement Learning)的学习型优化器。尤其是与动态规划相比,RTOS连接计划(估计)成本为101%,执行时间为67%。Said plan generation, ie connection order selection. Join optimization of queries using Tree-LSTM model and reinforcement learning. Specifically, it uses reinforcement learning with tree-structured long short-term memory (LSTM) for connection order selection. First, a graph neural network is used to capture the structure of the connection tree; in addition, it supports the modification of the database schema and the name of the multi-alias table. A large number of experiments based on JOB (Join Order Benchmark) and TPC-H show that RTOS (Real-time operating system) is superior to traditional optimizers and existing learning optimizers based on DRL (Deep Reinforcement Learning). Especially compared to dynamic programming, RTOS connection planning (estimated) costs 101% and execution time 67%.
进一步的,在选取高效执行计划过程中,可采用强化学习方法将实际执行计划时间是作为执行代价(Reward)进行训练,即执行计划(即调度策略)的调整。目标是找到数据查询的执行计划时间最小的执行计划。Furthermore, in the process of selecting an efficient execution plan, the actual execution plan time can be used as the execution cost (Reward) for training by means of reinforcement learning, that is, the adjustment of the execution plan (ie, the scheduling strategy). The goal is to find the execution plan with the smallest execution plan time for the data query.
强化学习主要模型结构如图4所示。每次进行SQL查询语句解析都会改变查询的负载数据和SQL查询语句的有效性,从而可看出对SQL查询语句的调优是一个序列的调优问题。而强化学习可以较好解决序列决策过程,能减轻对训练样本的依赖。强化学习通过调度器(Agent)与环境不断试错的方式进行交互,寻找最优调度策略(Action),以最大化与环境交互获得的累计奖励。在数学上可建模为马尔科夫决策过程(MDP,Markov DecisionProcess),用于在系统状态具有马尔可夫性质的环境中模拟智能体可实现的随机策略与回报,要素包括状态、动作、策略函数和奖励函数。本发明可采用Q-Learning、DQN(Deep Q-Learing Network)异步、离线RL算法,并通过情节性元强化学习加快强化学习效率。强化学习主要从与环境的交互获得反馈,其可解释性有限,用深度学习进行目标求解时,会进一步导致可解释性差。在应用到数据库的场景计算中,主要依赖实践经验,例如:其Reward计算主要从前后查询时间获得收益。因此本项目不仅如上的成本模型的Tree-LSTM方法作为基线方法,还拟与本强化学习方法进行集成或融合,进一步提高效率和可解释性。The main model structure of reinforcement learning is shown in Figure 4. Every time the SQL query statement is parsed, the query load data and the validity of the SQL query statement will be changed, so it can be seen that the optimization of the SQL query statement is a sequential optimization problem. Reinforcement learning can better solve the sequential decision-making process and reduce the dependence on training samples. Reinforcement learning interacts with the environment through trial and error through the scheduler (Agent) to find the optimal scheduling strategy (Action) to maximize the cumulative rewards obtained from interacting with the environment. Mathematically, it can be modeled as a Markov decision process (MDP, Markov Decision Process), which is used to simulate the random strategy and reward that the agent can achieve in an environment where the system state has Markov properties. The elements include state, action, strategy function and reward function. The present invention can adopt Q-Learning, DQN (Deep Q-Learing Network) asynchronous, off-line RL algorithm, and accelerate the efficiency of reinforcement learning through episodic meta-reinforcement learning. Reinforcement learning mainly obtains feedback from the interaction with the environment, and its interpretability is limited. When using deep learning to solve the goal, it will further lead to poor interpretability. In the scenario calculation applied to the database, it mainly relies on practical experience. For example, its Reward calculation mainly obtains benefits from the front and rear query times. Therefore, this project not only uses the Tree-LSTM method of the above cost model as the baseline method, but also intends to integrate or fuse with this reinforcement learning method to further improve efficiency and interpretability.
如图2所示,在一个实施过程中,用户通过用户界面将SQL查询语句信息进行输入,机器学习库和强化学习库对其通过扩展,将其转化为关系SQL查询语句。当查询处理系统接收到SQL查询语句之后,经过对SQL查询语句的解析、改写、优化、和执行之后,再从索引文件或缓存中提取出执行结果。查询处理系统依赖于采集器,可将机器学习库和强化学习库的负载性能指数与查询处理的性能进行相互结合,最终达到一个对环境有效的交互。存储管理模块(即存储管理系统)采用混合数据存储管理系统(Database Management System;DBMS)所提供的异构数据平台交换技术,实现多种数据库的数据读取和复杂关联查询,从而简化数据预处理过程。查询优化模型具体可细化为规模估算,成本估算,以及最终的计划生成等。As shown in Figure 2, in an implementation process, the user inputs SQL query statement information through the user interface, and the machine learning library and reinforcement learning library extend it to convert it into a relational SQL query statement. After receiving the SQL query statement, the query processing system extracts the execution result from the index file or cache after parsing, rewriting, optimizing, and executing the SQL query statement. The query processing system relies on the collector, which can combine the load performance index of the machine learning library and the reinforcement learning library with the performance of query processing, and finally achieve an effective interaction with the environment. The storage management module (that is, the storage management system) adopts the heterogeneous data platform exchange technology provided by the hybrid data storage management system (Database Management System; DBMS) to realize data reading and complex correlation query of various databases, thereby simplifying data preprocessing process. The query optimization model can be detailed into scale estimation, cost estimation, and final plan generation.
在本发明实施例中,机器学习库主要采用Spark MLib、ML Algorithm、StoredProcedure、Spark UDF等机器学习库和相关方法。除此之外,本发明还可以对所提供的机器学习库进行自定义算法的开发。所示的元数据库用于保存数据库模式、知识图谱、用户查询负载日志、系统性能日志、机器学习库和强化学习库等信息。In the embodiment of the present invention, the machine learning library mainly uses machine learning libraries such as Spark MLib, ML Algorithm, Stored Procedure, Spark UDF and related methods. In addition, the present invention can also develop custom algorithms for the provided machine learning library. The metabase shown is used to hold information such as database schemas, knowledge graphs, user query load logs, system performance logs, machine learning libraries, and reinforcement learning libraries.
如图5所示,在另一个实际实施过程中,可使用数据集A进行模型测试和训练。数据集A是一个基于点击量排行榜的真实数据集,可以提供真实的负载数据。其包含来自33个模板中的113个SQL查询语句,以及包含3.6GB的数据(当计算索引时为11GB)和21个表。每个SQL查询语句中的表范围为4到17个。具体的,首先要对数据集A中的SQL查询语句进行抽取和加载,而且对加载的SQL查询语句按训练集和测试集9:1的比例进行分配,SQL查询语句加载完成后对表的特征(表名,列名,谓词等)进行编码,获得表示向量,将表示向量作为基于深度强化学习的查询优化器模型的输入,然后对编码提取特征进行训练。其中间状态是包含部分状态的连接树。中间状态将传递给查询优化器模型以选择现在应执行的操作,并根据所选操作不断更新自身的状态。系统的终止状态是将所有表连接在一起,构成包含所有表的连接树,其将被转换为SQL查询语句的执行计划并传递给管理存储模块的混合数据存储管理系统进行执行。As shown in Figure 5, in another practical implementation process, data set A can be used for model testing and training. Dataset A is a real dataset based on the traffic leaderboard, which can provide real load data. It contains 113 SQL queries from 33 templates, and contains 3.6GB of data (11GB when indexes are calculated) and 21 tables. The range of tables in each SQL query statement is 4 to 17. Specifically, the SQL query statements in the data set A must be extracted and loaded first, and the loaded SQL query statements are allocated according to the ratio of 9:1 between the training set and the test set. After the SQL query statements are loaded, the characteristics of the table (table name, column name, predicate, etc.) are encoded to obtain a representation vector, and the representation vector is used as the input of the query optimizer model based on deep reinforcement learning, and then the encoding extraction feature is trained. The intermediate state is a connected tree containing partial states. The intermediate state is passed to the query optimizer model to choose what action should be taken now, and it keeps updating its own state based on the chosen action. The terminal state of the system is to connect all tables together to form a connection tree containing all tables, which will be converted into the execution plan of the SQL query statement and passed to the hybrid data storage management system that manages the storage module for execution.
本发明采用了基于机器学习、强化学习等方法,提升查询优化器运算效率,持续学习用户的查询的负载数据,提升了数据库SQL数据查询的自适应优化能力,即针对用户的数据规模和查询的负载数据的特点进行系统优化,从而促进为数据查询分析注智赋能。The present invention adopts methods based on machine learning, reinforcement learning, etc. to improve the computing efficiency of the query optimizer, continuously learn the load data of the user's query, and improve the self-adaptive optimization capability of the database SQL data query, that is, for the user's data scale and query efficiency. The system is optimized according to the characteristics of the load data, so as to promote the intelligent empowerment of data query analysis.
采用本发明实施例所述的基于机器学习的数据查询优化处理方法,能够基于动态的执行环境参数信息,在不同执行环境下得到更准确的代价估计,从而快速、准确的选择到更优的执行计划,提高了查询优化处理的效率和精确度。Using the machine learning-based data query optimization processing method described in the embodiment of the present invention can obtain more accurate cost estimates in different execution environments based on dynamic execution environment parameter information, so as to quickly and accurately select better execution plan, which improves the efficiency and accuracy of query optimization processing.
与上述提供的一种位于基站侧的基于机器学习的数据查询优化处理方法相对应,本发明还提供一种位于基站侧的基于机器学习的数据查询优化处理装置。由于该装置的实施例相似于上述方法实施例,所以描述得比较简单,相关之处请参见上述方法实施例部分的说明即可,下面描述的基于机器学习的数据查询优化处理装置的实施例仅是示意性的。请参考图6所示,其为本发明实施例提供的一种基于机器学习的数据查询优化处理装置的结构示意图。Corresponding to the machine learning-based data query optimization processing method provided above at the base station side, the present invention also provides a machine learning-based data query optimization processing device at the base station side. Since the embodiment of the device is similar to the above-mentioned method embodiment, the description is relatively simple. For relevant information, please refer to the description of the above-mentioned method embodiment. The embodiment of the data query optimization processing device based on machine learning described below is only is indicative. Please refer to FIG. 6 , which is a schematic structural diagram of a data query optimization processing device based on machine learning provided by an embodiment of the present invention.
本发明所述的一种基于机器学习的数据查询优化处理装置具体包括如下部分:A machine learning-based data query optimization processing device according to the present invention specifically includes the following parts:
结构化查询语句获取单元601,用于获取输入的结构化查询语句信息;A structured query
向量表示单元602,用于确定所述结构化查询语句信息对应的执行计划,并将所述执行计划对应的执行环境参数信息转化成相应的表示向量;A
查询优化处理单元603,用于将所述表示向量输入到基于长短期记忆网络的查询优化器模型中,得到所述查询优化器模型输出的代价信息,并基于所述代价信息确定相应的目标执行计划。The query
其中,所述查询优化器模型是将历史执行计划作为样本数据、所述历史执行计划对应的实际代价作为目标值进行训练得到的。Wherein, the query optimizer model is obtained by training the historical execution plan as sample data and the actual cost corresponding to the historical execution plan as the target value.
采用本发明实施例所述的基于机器学习的数据查询优化处理装置,能够基于动态的执行环境参数信息,在不同执行环境下得到更准确的代价估计,从而快速、准确的选择到更优的执行计划,提高了查询优化处理的效率和精确度。Using the machine learning-based data query optimization processing device described in the embodiment of the present invention can obtain more accurate cost estimates in different execution environments based on dynamic execution environment parameter information, so as to quickly and accurately select better execution plan, which improves the efficiency and accuracy of query optimization processing.
与上述提供的基于机器学习的数据查询优化处理方法相对应,本发明还提供一种电子设备。由于该电子设备的实施例相似于上述方法实施例,所以描述得比较简单,相关之处请参见上述方法实施例部分的说明即可,下面描述的电子设备仅是示意性的。如图7所示,其为本发明实施例公开的一种电子设备的实体结构示意图。该电子设备可以包括:处理器(processor)701、存储器(memory)702和通信总线703,其中,处理器701,存储器702通过通信总线703完成相互间的通信,通过通信接口704与外部进行通信。处理器701可以调用存储器702中的逻辑指令,以执行基于机器学习的数据查询优化处理方法。该方法包括:获取输入的结构化查询语句信息;确定所述结构化查询语句信息对应的执行计划,并将所述执行计划对应的执行环境参数信息转化成相应的表示向量;将所述表示向量输入到基于长短期记忆网络的查询优化器模型中,得到所述查询优化器模型输出的代价信息,并基于所述代价信息确定相应的目标执行计划;其中,所述查询优化器模型是将历史执行计划作为样本数据、所述历史执行计划对应的实际代价作为目标值进行训练得到的。Corresponding to the machine learning-based data query optimization processing method provided above, the present invention also provides an electronic device. Since the embodiment of the electronic device is similar to the above-mentioned method embodiment, the description is relatively simple. For related details, please refer to the description of the above-mentioned method embodiment. The electronic device described below is only illustrative. As shown in FIG. 7 , it is a schematic diagram of a physical structure of an electronic device disclosed in an embodiment of the present invention. The electronic device may include: a processor (processor) 701, a memory (memory) 702 and a
此外,上述的存储器702中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:存储芯片、U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the above-mentioned logic instructions in the
另一方面,本发明实施例还提供一种计算机程序产品,所述计算机程序产品包括存储在处理器可读存储介质上的计算机程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,计算机能够执行上述各方法实施例所提供的基于机器学习的数据查询优化处理方法。该方法包括:获取输入的结构化查询语句信息;确定所述结构化查询语句信息对应的执行计划,并将所述执行计划对应的执行环境参数信息转化成相应的表示向量;将所述表示向量输入到基于长短期记忆网络的查询优化器模型中,得到所述查询优化器模型输出的代价信息,并基于所述代价信息确定相应的目标执行计划;其中,所述查询优化器模型是将历史执行计划作为样本数据、所述历史执行计划对应的实际代价作为目标值进行训练得到的。On the other hand, an embodiment of the present invention also provides a computer program product, the computer program product includes a computer program stored on a processor-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by the computer During execution, the computer can execute the machine learning-based data query optimization processing method provided by the above method embodiments. The method includes: acquiring input structured query statement information; determining an execution plan corresponding to the structured query statement information, and converting the execution environment parameter information corresponding to the execution plan into a corresponding representation vector; and converting the representation vector Input into the query optimizer model based on the long short-term memory network, obtain the cost information output by the query optimizer model, and determine the corresponding target execution plan based on the cost information; wherein, the query optimizer model is the historical The execution plan is used as sample data, and the actual cost corresponding to the historical execution plan is used as a target value for training.
又一方面,本发明实施例还提供一种处理器可读存储介质,所述处理器可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现以执行上述各实施例提供的基于机器学习的数据查询优化处理方法。该方法包括:获取输入的结构化查询语句信息;确定所述结构化查询语句信息对应的执行计划,并将所述执行计划对应的执行环境参数信息转化成相应的表示向量;将所述表示向量输入到基于长短期记忆网络的查询优化器模型中,得到所述查询优化器模型输出的代价信息,并基于所述代价信息确定相应的目标执行计划;其中,所述查询优化器模型是将历史执行计划作为样本数据、所述历史执行计划对应的实际代价作为目标值进行训练得到的。In yet another aspect, an embodiment of the present invention further provides a processor-readable storage medium, where a computer program is stored on the processor-readable storage medium, and the computer program is implemented when executed by a processor to perform the functions provided by the above-mentioned embodiments. Data query optimization processing method based on machine learning. The method includes: acquiring input structured query statement information; determining an execution plan corresponding to the structured query statement information, and converting the execution environment parameter information corresponding to the execution plan into a corresponding representation vector; and converting the representation vector Input into the query optimizer model based on the long short-term memory network, obtain the cost information output by the query optimizer model, and determine the corresponding target execution plan based on the cost information; wherein, the query optimizer model is the historical The execution plan is used as sample data, and the actual cost corresponding to the historical execution plan is used as a target value for training.
所述处理器可读存储介质可以是处理器能够存取的任何可用介质或数据存储设备,包括但不限于磁性存储器(例如软盘、硬盘、磁带、磁光盘(MO)等)、光学存储器(例如CD、DVD、BD、HVD等)、以及半导体存储器(例如ROM、EPROM、EEPROM、非易失性存储器(NANDFLASH)、固态硬盘(SSD))等。The processor-readable storage medium can be any available medium or data storage device that can be accessed by a processor, including but not limited to magnetic storage (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memory (such as ROM, EPROM, EEPROM, non-volatile memory (NANDFLASH), solid-state disk (SSD)), etc.
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in One place, or it can be distributed to multiple network elements. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. It can be understood and implemented by those skilled in the art without any creative efforts.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the above description of the implementations, those skilled in the art can clearly understand that each implementation can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the essence of the above technical solution or the part that contributes to the prior art can be embodied in the form of software products, and the computer software products can be stored in computer-readable storage media, such as ROM/RAM, magnetic discs, optical discs, etc., including several instructions to make a computer device (which may be a personal computer, server, or network device, etc.) execute the methods described in various embodiments or some parts of the embodiments.
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: it can still be Modifications are made to the technical solutions described in the foregoing embodiments, or equivalent replacements are made to some of the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the various embodiments of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110785793.2A CN115617830A (en) | 2021-07-12 | 2021-07-12 | Data query optimization processing method and device based on machine learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110785793.2A CN115617830A (en) | 2021-07-12 | 2021-07-12 | Data query optimization processing method and device based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115617830A true CN115617830A (en) | 2023-01-17 |
Family
ID=84856112
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110785793.2A Pending CN115617830A (en) | 2021-07-12 | 2021-07-12 | Data query optimization processing method and device based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115617830A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116244333A (en) * | 2023-02-16 | 2023-06-09 | 山东大学 | A database query performance prediction method and system based on cost factor calibration |
CN116339744A (en) * | 2023-03-23 | 2023-06-27 | 深圳计算科学研究院 | Custom function optimization method, device, computer equipment and medium |
CN116561166A (en) * | 2023-05-09 | 2023-08-08 | 每日互动股份有限公司 | Optimization processing method of SQL (structured query language) statement |
CN117033416A (en) * | 2023-08-21 | 2023-11-10 | 宁夏大学 | BERT-based database query connection sequence adjustment method |
CN117056361A (en) * | 2023-07-03 | 2023-11-14 | 杭州拓数派科技发展有限公司 | Data query method and device for distributed database |
CN117216091A (en) * | 2023-09-15 | 2023-12-12 | 中国银行股份有限公司 | Optimization method, device, equipment and storage medium for HiveSQL multi-connection query |
CN117312357A (en) * | 2023-09-01 | 2023-12-29 | 厦门市美亚柏科信息股份有限公司 | Method and system for realizing machine learning algorithm based on SQL |
CN117520380A (en) * | 2023-11-21 | 2024-02-06 | 深圳计算科学研究院 | Method, device, equipment and medium for executing database query optimization plan |
CN117609288A (en) * | 2024-01-17 | 2024-02-27 | 矩阵起源(深圳)信息科技有限公司 | Data query policy optimization method, device, terminal equipment and storage medium |
CN117874071A (en) * | 2023-12-28 | 2024-04-12 | 电子科技大学长三角研究院(衢州) | Database query optimizer cost model based on machine learning |
WO2024152634A1 (en) * | 2023-01-18 | 2024-07-25 | 华为技术有限公司 | Method and device for selecting execution plan |
CN118520008A (en) * | 2024-07-25 | 2024-08-20 | 浙江大学 | Spark SQL-oriented intelligent query optimization method and system |
WO2025015940A1 (en) * | 2023-07-19 | 2025-01-23 | 华为云计算技术有限公司 | Data query method and related device |
-
2021
- 2021-07-12 CN CN202110785793.2A patent/CN115617830A/en active Pending
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024152634A1 (en) * | 2023-01-18 | 2024-07-25 | 华为技术有限公司 | Method and device for selecting execution plan |
CN116244333A (en) * | 2023-02-16 | 2023-06-09 | 山东大学 | A database query performance prediction method and system based on cost factor calibration |
CN116339744B (en) * | 2023-03-23 | 2024-01-16 | 深圳计算科学研究院 | Custom function optimization method, device, computer equipment and medium |
CN116339744A (en) * | 2023-03-23 | 2023-06-27 | 深圳计算科学研究院 | Custom function optimization method, device, computer equipment and medium |
CN116561166A (en) * | 2023-05-09 | 2023-08-08 | 每日互动股份有限公司 | Optimization processing method of SQL (structured query language) statement |
CN117056361A (en) * | 2023-07-03 | 2023-11-14 | 杭州拓数派科技发展有限公司 | Data query method and device for distributed database |
WO2025015940A1 (en) * | 2023-07-19 | 2025-01-23 | 华为云计算技术有限公司 | Data query method and related device |
CN117033416A (en) * | 2023-08-21 | 2023-11-10 | 宁夏大学 | BERT-based database query connection sequence adjustment method |
CN117312357A (en) * | 2023-09-01 | 2023-12-29 | 厦门市美亚柏科信息股份有限公司 | Method and system for realizing machine learning algorithm based on SQL |
CN117216091A (en) * | 2023-09-15 | 2023-12-12 | 中国银行股份有限公司 | Optimization method, device, equipment and storage medium for HiveSQL multi-connection query |
CN117520380A (en) * | 2023-11-21 | 2024-02-06 | 深圳计算科学研究院 | Method, device, equipment and medium for executing database query optimization plan |
CN117874071A (en) * | 2023-12-28 | 2024-04-12 | 电子科技大学长三角研究院(衢州) | Database query optimizer cost model based on machine learning |
CN117609288A (en) * | 2024-01-17 | 2024-02-27 | 矩阵起源(深圳)信息科技有限公司 | Data query policy optimization method, device, terminal equipment and storage medium |
CN117609288B (en) * | 2024-01-17 | 2024-04-30 | 矩阵起源(深圳)信息科技有限公司 | Data query policy optimization method, device, terminal equipment and storage medium |
CN118520008A (en) * | 2024-07-25 | 2024-08-20 | 浙江大学 | Spark SQL-oriented intelligent query optimization method and system |
CN118520008B (en) * | 2024-07-25 | 2024-10-11 | 浙江大学 | An intelligent query optimization method and system for Spark SQL |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115617830A (en) | Data query optimization processing method and device based on machine learning | |
Zhao et al. | Queryformer: A tree transformer model for query plan representation | |
US11681702B2 (en) | Conversion of model views into relational models | |
US8108399B2 (en) | Filtering of multi attribute data via on-demand indexing | |
US7933894B2 (en) | Parameter-sensitive plans for structural scenarios | |
CN117609470B (en) | Question-answering system based on large language model and knowledge graph, construction method thereof and intelligent data management platform | |
US20190034485A1 (en) | System and method for optimizing large database management systems with multiple optimizers | |
US7509311B2 (en) | Use of statistics on views in query optimization | |
CN111444220B (en) | Cross-platform SQL query optimization method combining rule driving and data driving | |
US20120117054A1 (en) | Query Analysis in a Database | |
US12339841B2 (en) | Database management system and method for graph view selection for a relational-graph database | |
Wang et al. | Aop: Automated and interactive llm pipeline orchestration for answering complex queries | |
Bruno et al. | Polynomial heuristics for query optimization | |
Sharma et al. | Indexer++ workload-aware online index tuning with transformers and reinforcement learning | |
EP2151772A1 (en) | Method and system for using queries for multidimensional databases with databases of other types | |
CN118760761B (en) | Agent-based intelligent patent retrieval method and system | |
Barioni et al. | Seamlessly integrating similarity queries in SQL | |
CN119719424A (en) | Data lineage analysis method, device, terminal equipment and computer program product | |
Nandi | Mimir: Bringing ctables into practice | |
Ma et al. | CHASE: A Native Relational Database for Hybrid Queries on Structured and Unstructured Data | |
US12135714B2 (en) | Query graph embedding | |
Tran et al. | Metalevel Information in Ontology-Based Applications. | |
CN116881230A (en) | Automatic relational database optimization method based on cloud platform | |
CN115576970B (en) | Database cost model parameter optimization method and query method based on decision tree | |
CN118916385B (en) | Materialized view selection method and materialized view selection system based on large model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |