CN119088800A

CN119088800A - Data indexing and query optimization system and method based on database

Info

Publication number: CN119088800A
Application number: CN202411068003.9A
Authority: CN
Inventors: 张晖; 郝运凯; 李明; 李正蛟
Original assignee: Shandong Inspur Database Technology Co Ltd
Current assignee: Shandong Inspur Database Technology Co Ltd
Priority date: 2024-08-06
Filing date: 2024-08-06
Publication date: 2024-12-06

Abstract

The invention discloses a database-based data indexing and query optimization system and method, belonging to the technical field of databases. The technical problem to be solved is how to improve the data indexing efficiency and the query performance of a database system. The system comprises a data storage module, an index construction module and a query optimization module. The data storage module is used for storing data records in a database; the data records are stored in columnar format, which improves the data compression rate and query performance and reduces disk I/O operations. The index construction module is used for constructing an index according to the data records in the data storage module; the index adopts a B+ tree design, which reduces the number of I/O operations when data is transferred between the disk and the memory. The query optimization module is used for receiving a query request and optimizing the query path according to the index constructed by the index construction module.

Description

Database-based data index and query optimization system and method

Technical Field

The invention relates to the technical field of databases, in particular to a database-based data index and query optimization system and method.

Background

A database is a collection of data that is stored on a computer's storage device for a long period of time, organized according to certain rules, and shareable by various users or applications. It is a centralized data store, typically consisting of one or more data tables, each containing a plurality of rows and columns, for storing particular types of data. A database is mainly used for effectively organizing, storing and managing large amounts of data, and provides functions such as structured storage, querying, modification, sorting and statistics of the data.

A database system is a data processing system developed to meet data processing requirements; it is a software system that provides data storage, maintenance and data services for application systems. A database system is typically composed of four parts: the database (DB), hardware, software (including the operating system, the database management system (DBMS), and applications), and personnel (including system analysts, database designers, application programmers, end users, and database administrators (DBAs)). Among these, the DBMS is the core software of the database system.

In recent years, with the rapid growth of data volume, conventional database systems often suffer from low index efficiency and long query response times when processing large-scale data, which limits the performance of the database system.

Therefore, how to improve the data index efficiency and the query performance of the database system is a technical problem to be solved.

Disclosure of Invention

The technical task of the invention is to provide a database-based data indexing and query optimizing system and method, which are used for solving the problem of how to improve the data indexing efficiency and the query performance of a database system.

The technical task of the invention is realized as follows: a database-based data indexing and query optimization system comprises a query optimization module, an index construction module and a data storage module;

The data storage module is used for storing data records in the database, wherein the data records are stored in columnar format, which improves the data compression rate and query performance and reduces disk I/O operations;

The index construction module is used for constructing an index according to the data records in the data storage module, wherein the index adopts a B+ tree design, which reduces the number of I/O operations when data is transferred between the disk and the memory;

The query optimization module is used for receiving the query request and optimizing the query path according to the index constructed by the index construction module.
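The benefit attributed above to columnar storage in the data storage module can be illustrated with a minimal sketch. It is not the patented implementation: the sample rows, the helper names `to_columns` and `run_length_encode`, and the choice of run-length encoding as the compression scheme are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical sample data; the patent does not specify any schema.
rows = [
    {"id": 1, "city": "Jinan", "amount": 10},
    {"id": 2, "city": "Jinan", "amount": 20},
    {"id": 3, "city": "Jinan", "amount": 30},
    {"id": 4, "city": "Qingdao", "amount": 40},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Values of one column sit together, so repeated values compress well.
encoded_city = run_length_encode(columns["city"])

# An aggregate over one column reads only that column, not whole rows.
total_amount = sum(columns["amount"])
```

Because a query that aggregates `amount` never touches `id` or `city`, fewer bytes need to be read, which is the sense in which columnar layout can reduce disk I/O.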

Preferably, the data storage module distributes data to a plurality of physical disks through RAID, and when different data is read and written, different disk blocks are used, so that the overall performance of data reading and writing is improved;

During actual reading and writing, the data storage module uses a buffer pool maintained inside the database, which reduces physical read and write operations on the disk.
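The distribution of data over several physical disks described above can be sketched as a simple block-to-disk mapping. The text does not name a RAID level; the round-robin striping below (in the style of RAID 0) and the function name `locate_block` are assumptions for illustration only.

```python
def locate_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk).

    Assumes plain round-robin striping with one block per stripe unit.
    """
    return logical_block % num_disks, logical_block // num_disks


# Consecutive logical blocks land on different disks, so reads and writes
# of different data can proceed on different disks at the same time.
placement = [locate_block(block, num_disks=4) for block in range(8)]
```

With four disks, logical blocks 0 to 3 fall on disks 0 to 3 at offset 0, and blocks 4 to 7 fall on the same disks at offset 1.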

Preferably, the index construction module adopts a dynamic index structure based on data access frequency and data distribution, and dynamically adjusts the size of index nodes according to the data access frequency, wherein the adjustment of the size of the index nodes is based on a preset performance threshold value;

The index construction module analyzes the distribution of the data on the storage medium and optimizes the index structure by adopting a distribution-aware algorithm based on cluster analysis and data skew detection.
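The data skew detection mentioned above is not specified further in the text. One simple way to detect skew, sketched here under that caveat, is to measure the share of a sample taken by its most frequent key and compare it with a threshold. The function name `detect_skew` and the default threshold of 0.5 are hypothetical, and the cluster analysis part of the algorithm is not covered.

```python
from collections import Counter


def detect_skew(keys, threshold=0.5):
    """Return (is_skewed, hottest_key, share) for a sample of key values.

    A sample counts as skewed when its most frequent key accounts for at
    least `threshold` of all sampled values. The threshold is an assumption.
    """
    counts = Counter(keys)
    hottest_key, hottest_count = counts.most_common(1)[0]
    share = hottest_count / len(keys)
    return share >= threshold, hottest_key, share


skewed_sample = ["a"] * 8 + ["b", "c"]
uniform_sample = ["a", "b", "c", "d"]

skewed_result = detect_skew(skewed_sample)
uniform_result = detect_skew(uniform_sample)
```

A skewed key range could then, for example, be given its own index partition, although the patent does not say which adjustment is applied.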

Preferably, the query optimization module comprises a query plan generator and a cost evaluator;

Wherein the query plan generator is configured to generate an index-based query execution plan;

The cost evaluator is used for estimating the query time of different paths by adopting a machine learning model (cost model).

More preferably, the index-based query execution plan includes a multi-stage query execution flow and an intermediate result caching strategy;

The multi-stage query execution flow comprises data retrieval, data filtering, data projection, sorting, joining and aggregation operations, wherein the data retrieval retrieves data from the storage engine based on the index, the data filtering uses the WHERE clause to filter the data so as to reduce the amount of data handled by subsequent operations, and the data projection selects the required columns according to the SELECT clause;

The intermediate result caching strategy comprises a selection strategy, a cache invalidation strategy and a cache replacement strategy, wherein the selection strategy selects intermediate results to cache according to the temporal locality and spatial locality of the intermediate results generated by the query execution flow, the cache invalidation strategy specifies a time-based criterion for cache invalidation when data is cached, and the cache replacement strategy uses the LRU algorithm to remove from the cache area data whose usage frequency within a set time range is lower than a set frequency threshold.
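The cache invalidation and cache replacement strategies described above can be sketched as a small cache with time-based expiry and least-recently-used eviction. This is an illustrative sketch only: the class name `IntermediateResultCache`, its parameters, and the injectable `clock` are assumptions, and the sketch uses plain LRU ordering rather than the frequency threshold the text also mentions.

```python
import time
from collections import OrderedDict


class IntermediateResultCache:
    """Toy cache: entries expire after a TTL, and LRU entries are evicted."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = OrderedDict()  # key -> (stored_at, value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self.entries[key]  # time-based invalidation
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self.entries[key] = (self.clock(), value)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


# Usage with a fake clock so the example is deterministic.
now = [0.0]
cache = IntermediateResultCache(capacity=2, ttl_seconds=10, clock=lambda: now[0])
cache.put("q1", [1, 2])
cache.put("q2", [3])
hit = cache.get("q1")        # q1 becomes the most recently used entry
cache.put("q3", [4])         # capacity exceeded, so q2 is evicted
evicted = cache.get("q2")
now[0] = 11.0                # q1 is now older than the TTL
expired = cache.get("q1")
```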

More preferably, the reference indicators input to the machine learning model comprise CPU processing time, number of I/O operations, data transmission quantity and data transmission mode;

The weights are assigned according to the degree of influence of each reference indicator on the query time, and the relationship is described linearly by the formula $y = a x_1 + b x_2 + c x_3 + d x_4$, where $x_1$, $x_2$, $x_3$, $x_4$ are the values of CPU processing time, number of I/O operations, data transmission quantity and data transmission mode, respectively, and $a$, $b$, $c$, $d$ are the weights of the respective reference indicators.
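The linear formula above can be evaluated directly to compare candidate query paths. The sketch below reads y as the estimated query time, which is an interpretation of the text. The weight values, the metric values, the path names, and the numeric encoding of the transmission mode (0 or 1) are all invented for illustration.

```python
def estimated_query_time(metrics, weights):
    """Evaluate y = a*x1 + b*x2 + c*x3 + d*x4 for one candidate path."""
    return sum(w * x for w, x in zip(weights, metrics))


# Hypothetical weights a, b, c, d.
weights = (0.5, 2.0, 0.25, 1.0)

# Hypothetical metrics per path: (CPU time, I/O operations,
# data transmission quantity, transmission mode encoded as 0 or 1).
candidate_paths = {
    "index_scan": (4.0, 3.0, 10.0, 1.0),
    "full_scan": (6.0, 50.0, 200.0, 0.0),
}

costs = {
    name: estimated_query_time(metrics, weights)
    for name, metrics in candidate_paths.items()
}
best_path = min(costs, key=costs.get)
```

With these made-up numbers the index scan is estimated at 11.5 and the full scan at 153.0, so the cost evaluator would select the index scan.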

More preferably, the query plan generator predicts the distribution of query results using statistical information and a machine learning algorithm, and can dynamically adjust the query plan according to the prediction; the dynamic adjustment includes adjusting the parallelism of query execution and the resource allocation, specifically as follows:

firstly, acquiring load resource information of the database system, namely the number of remaining CPU cores, the remaining memory capacity and the remaining bandwidth capacity;

then, collecting query statistical information covering data access patterns, data volume and index usage, and controlling parallelism and memory usage by means of the configuration parameters for maximum parallelism (max_parallel_depth) and working memory (work_mem);

meanwhile, applying a partitioning technology to any data table whose size exceeds a set threshold, and executing each query operation on the different partitions in parallel.
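The steps above (bounding parallelism by the available resources and running a query operation over partitions in parallel) can be sketched as follows. The function names and the rule used in `choose_parallelism` are assumptions; the `max_parallelism` argument merely plays the role of the maximum parallelism parameter named in the text. Python threads are used only to show the structure, since CPU-bound work in CPython threads does not actually run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_parallelism(free_cores, max_parallelism, num_partitions):
    """Pick a worker count bounded by free cores, the configured cap,
    and the number of partitions, but never below one."""
    return max(1, min(free_cores, max_parallelism, num_partitions))


def count_matching(partition, predicate):
    """Per-partition work: count rows that satisfy the predicate."""
    return sum(1 for row in partition if predicate(row))


def parallel_count(partitions, predicate, free_cores, max_parallelism):
    """Run the same counting operation on every partition and merge."""
    workers = choose_parallelism(free_cores, max_parallelism, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(
            lambda part: count_matching(part, predicate), partitions
        )
        return sum(partial_counts)


# Hypothetical table split into three range partitions.
partitions = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
total = parallel_count(
    partitions, lambda value: value % 2 == 0, free_cores=2, max_parallelism=4
)
```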

A data index and query optimization method based on a database comprises the following steps:

S1, receiving a query request and parsing the content of the query request, wherein the content of the query request comprises query conditions, sorting requirements and data aggregation operations;

S2, constructing or updating an index by using an index construction module, and adapting to the change of the data record in the data storage module; the index adjustment strategy of the index construction module is based on data access frequency, data distribution and system load setting, and specifically comprises the steps of identifying data needing to be frequently accessed according to past query logs, optimizing related indexes, and adjusting the related indexes by combining the change condition of the data distribution after updating the data;

S3, generating a query execution plan by utilizing a query optimization module, wherein the content of the query execution plan comprises the steps of evaluating and selecting an optimal query path by using a cost evaluator and predicting query result distribution by using a machine learning model;

S4, executing the query and returning the result, and collecting statistical information of the query execution for subsequent optimization, wherein the statistical information comprises query execution time, resource consumption and user satisfaction.

Preferably, constructing or updating the index includes incremental updating and full reconstruction;

The incremental updating adopts a partial updating mode, namely, when the amount of data change does not exceed a set threshold, only the existing index is updated so as to reflect the insertion, deletion and updating of data records;

When the scale of the data change exceeds the set threshold, the index is reconstructed to optimize performance, namely, the index is rebuilt for all the data and the corresponding mapping relation is established.
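The choice between incremental updating and full reconstruction described above can be sketched with a toy index. A sorted list of keys stands in for the real index, and expressing the threshold as a ratio of changed keys to indexed keys is an assumption, as are the class and parameter names.

```python
import bisect


class SortedIndex:
    """Toy stand-in for an index: a sorted list of keys."""

    def __init__(self, keys, rebuild_ratio=0.3):
        self.keys = sorted(keys)
        self.rebuild_ratio = rebuild_ratio  # hypothetical threshold
        self.rebuilds = 0

    def apply_changes(self, inserted, deleted, all_keys_after):
        """Update in place for small changes, rebuild for large ones."""
        changed = len(inserted) + len(deleted)
        if changed > self.rebuild_ratio * max(len(self.keys), 1):
            self.keys = sorted(all_keys_after)  # full reconstruction
            self.rebuilds += 1
            return "full_rebuild"
        for key in deleted:  # incremental update: remove deleted keys
            position = bisect.bisect_left(self.keys, key)
            if position < len(self.keys) and self.keys[position] == key:
                self.keys.pop(position)
        for key in inserted:  # incremental update: add new keys in order
            bisect.insort(self.keys, key)
        return "incremental"


table = set(range(10))
index = SortedIndex(table)

table -= {3}
table |= {10}
first = index.apply_changes(inserted=[10], deleted=[3], all_keys_after=table)

table |= {20, 21, 22, 23}
second = index.apply_changes(
    inserted=[20, 21, 22, 23], deleted=[], all_keys_after=table
)
```

Here two changed keys out of ten stay under the threshold and are applied in place, while four changed keys exceed it and trigger a rebuild from the full table contents.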

More preferably, the query optimization module works as follows:

S301, an initial query plan and the solution space saved from past query operations are sent to the query optimization module as external input;

S302, the query optimization module performs continuous optimization based on the established machine learning model and generates a corresponding query plan;

S303, the query plan is used directly for the query operation of the database, the time information of the query operation is fed back to the machine learning model for calculation, and the above steps are repeated until a set of optimal results is found.
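The feedback loop in S301 to S303 is described only at a high level. One possible reading, sketched here as an assumption rather than as the patented procedure, is that the measured query time is used to correct the weights of a linear cost model by a small gradient step after each execution. The executor `run_query`, the hidden weights it uses to simulate measurements, the learning rate, and the plan metric vectors are all hypothetical.

```python
def predict(weights, metrics):
    """Linear cost model: weighted sum of a plan's metrics."""
    return sum(w * x for w, x in zip(weights, metrics))


def feedback_step(weights, metrics, observed_time, learning_rate=0.1):
    """Move the weights a small step toward the observed query time."""
    error = predict(weights, metrics) - observed_time
    return [w - learning_rate * error * x for w, x in zip(weights, metrics)]


# Stand-in for the real system: 'measured' times follow hidden weights.
HIDDEN_TRUE_WEIGHTS = [0.5, 2.0, 0.25, 1.0]


def run_query(metrics):
    """Hypothetical executor that returns the measured time of a plan."""
    return predict(HIDDEN_TRUE_WEIGHTS, metrics)


# Hypothetical metric vectors for the plans in the saved solution space.
plans = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1)]

weights = [0.0, 0.0, 0.0, 0.0]  # initial model
for _ in range(200):  # repeat: execute, measure, feed back
    for metrics in plans:
        weights = feedback_step(weights, metrics, run_query(metrics))

worst_error = max(abs(predict(weights, m) - run_query(m)) for m in plans)
```

After enough rounds the model's predictions agree closely with the measured times, at which point plan selection based on the model stops changing.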

The database-based data index and query optimization system and method of the invention have the following advantages:

First, the invention improves the efficiency of data indexing and the performance of queries through the data storage module, the index construction module and the query optimization module;

Second, the data records in the database are stored by the data storage module in columnar format, which improves the data compression rate and query performance and reduces disk I/O operations;

Third, the index is constructed according to the data records and adopts a B+ tree design, which reduces the number of I/O operations when data is transferred between the disk and the memory.
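The I/O saving attributed to the B+ tree design comes from its high fan-out: the number of node reads per lookup is roughly the height of the tree. The sketch below estimates that height under the simplifying assumption that every node, leaf or internal, holds up to `fanout` entries. The function name and the example figures are illustrative and not taken from the patent.

```python
def bplus_tree_height(num_keys: int, fanout: int) -> int:
    """Levels needed so that a tree with the given fan-out can hold
    `num_keys` keys (same capacity assumed for leaves and inner nodes)."""
    if num_keys <= 0:
        return 0
    height = 1
    capacity = fanout  # keys reachable with a single level
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return height


# One million keys: a fan-out of 100 needs far fewer levels (and thus
# far fewer node reads per lookup) than a binary structure.
wide_tree_height = bplus_tree_height(1_000_000, fanout=100)
binary_tree_height = bplus_tree_height(1_000_000, fanout=2)
```

Under these assumptions a lookup touches 3 nodes with a fan-out of 100 versus 20 with a fan-out of 2.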

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a topology diagram of a database system;

FIG. 2 is a block diagram of a database-based data indexing and query optimization system;

FIG. 3 is a flow chart of the operation of the query optimization module;

FIG. 4 is a flow chart of a database-based data indexing and query optimization method.

Detailed Description

The database-based data indexing and query optimization system and method of the present invention are described in detail below with reference to the drawings and detailed description.

Example 1:

As shown in FIG. 1 and FIG. 2, the present embodiment provides a database-based data indexing and query optimization system, which includes a query optimization module, an index construction module, and a data storage module;

The data storage module in the embodiment distributes data to a plurality of physical disks through RAID, and when different data is read and written, different disk blocks are used, so that the overall performance of data reading and writing is improved;

The index construction module in the embodiment adopts a dynamic index structure based on data access frequency and data distribution, and dynamically adjusts the size of index nodes according to the data access frequency, wherein the adjustment of the size of the index nodes is based on a preset performance threshold value;

The query optimization module in this embodiment includes a query plan generator and a cost evaluator;

In this embodiment, the index-based query execution plan includes a multi-stage query execution flow and an intermediate result caching policy;
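The stages of the multi-stage query execution flow named in this embodiment (index-based retrieval, filtering, projection, sorting and aggregation) can be sketched as a small pipeline. The table, the dictionary used as an index, and the function `execute` are hypothetical, and the joining stage is omitted for brevity.

```python
# Hypothetical table and a secondary index from city to row positions.
table = [
    {"city": "Jinan", "amount": 30, "year": 2024},
    {"city": "Qingdao", "amount": 40, "year": 2024},
    {"city": "Jinan", "amount": 10, "year": 2023},
]
index = {"Jinan": [0, 2], "Qingdao": [1]}


def execute(city, min_year):
    """Run retrieval, filtering, projection, sorting and aggregation."""
    # Data retrieval: use the index instead of scanning the whole table.
    retrieved = [table[position] for position in index.get(city, [])]
    # Data filtering: the WHERE condition shrinks the working set early.
    filtered = [row for row in retrieved if row["year"] >= min_year]
    # Data projection: keep only the column the SELECT clause asks for.
    projected = [{"amount": row["amount"]} for row in filtered]
    # Sorting.
    ordered = sorted(projected, key=lambda row: row["amount"])
    # Aggregation.
    return ordered, sum(row["amount"] for row in ordered)


all_jinan = execute("Jinan", min_year=2023)
recent_jinan = execute("Jinan", min_year=2024)
```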

The reference indicators input to the machine learning model in this embodiment comprise CPU processing time, number of I/O operations, data transmission quantity and data transmission mode, wherein a different weight is initially assigned to each factor;

The query plan generator in this embodiment predicts the distribution of query results by using statistical information and a machine learning algorithm, and can dynamically adjust the query plan according to the predicted results, wherein the dynamic adjustment of the content of the query plan includes adjustment of parallelism of query execution and resource allocation, and specifically comprises the following steps:

Example 2:

As shown in fig. 4, this embodiment provides a data index and query optimization method based on a database, which specifically includes the following steps:

In this embodiment, constructing or updating the index includes incremental updating and full reconstruction;

As shown in fig. 3, the working process of the query optimization module in this embodiment is specifically as follows:

It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that the technical solution described in the above embodiments may be modified or some or all of the technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiments of the present invention.

Claims

1. A data indexing and query optimization system based on a database, characterized in that the system comprises a query optimization module, an index building module and a data storage module;

Wherein the data storage module is used to store data records in the database; the data records are stored in columnar format to improve the data compression rate and query performance and to reduce disk I/O operations;

The index building module is used to build indexes based on the data records in the data storage module. The index adopts B+ tree design to reduce the number of I/O operations when data is transferred between disk and memory;

The query optimization module is used to receive query requests and optimize the query path according to the index built by the index building module.

2. The database-based data indexing and query optimization system according to claim 1 is characterized in that the data storage module distributes data to multiple physical disks through RAID, and uses different disk blocks when reading and writing different data, thereby improving the overall performance of data reading and writing;

During actual reading and writing, the data storage module uses a buffer pool maintained inside the database to reduce physical read and write operations on the disk.

3. The database-based data indexing and query optimization system according to claim 1, characterized in that the index building module adopts a dynamic index structure based on data access frequency and data distribution, and dynamically adjusts the size of the index node according to the data access frequency; wherein the adjustment of the index node size is based on a preset performance threshold;

The index building module uses a distribution-aware algorithm based on cluster analysis and data skew detection to analyze the distribution of data on the storage medium and optimize the index structure.

4. The database-based data indexing and query optimization system according to claim 1, characterized in that the query optimization module includes a query plan generator and a cost evaluator;

Among them, the query plan generator is used to generate an index-based query execution plan;

The cost evaluator is used to evaluate the query time of different paths using a machine learning model.

5. The database-based data indexing and query optimization system according to claim 4, characterized in that the index-based query execution plan includes a multi-stage query execution process and an intermediate result caching strategy;

The multi-stage query execution process includes data retrieval, data filtering, data projection, sorting, joining and aggregation operations; data retrieval is to retrieve data from the storage engine based on the index; data filtering uses the WHERE clause to filter data to reduce the amount of data for subsequent operations; data projection selects the required columns according to the SELECT clause;

The intermediate result caching strategy includes a selection strategy, a cache invalidation strategy, and a cache replacement strategy; the selection strategy selects results according to the temporal locality and spatial locality of the intermediate results generated by the query execution process; the cache invalidation strategy specifies a time-based criterion for cache invalidation when data is cached; the cache replacement strategy uses the LRU algorithm to remove from the cache area data whose usage frequency within the set time range is lower than the set frequency threshold.

6. The database-based data indexing and query optimization system according to claim 4, characterized in that the reference indicators input into the machine learning model include CPU processing time, number of I/O operations, data transmission volume, and data transmission mode; different weights are initially assigned to each factor;

Wherein the weights are assigned according to the degree of influence of each reference indicator on the query time, and the relationship is described linearly by the formula $y = a x_1 + b x_2 + c x_3 + d x_4$, where $x_1$, $x_2$, $x_3$, $x_4$ are the values of CPU processing time, number of I/O operations, data transmission volume, and data transmission mode, respectively, and $a$, $b$, $c$, $d$ are the weights of the respective reference indicators.

7. The database-based data indexing and query optimization system according to claim 4, characterized in that the query plan generator uses statistical information and machine learning algorithms to predict the distribution of query results, and can dynamically adjust the query plan according to the prediction results, wherein the dynamic adjustment of the query plan includes adjusting the parallelism and resource allocation of query execution, specifically as follows:

First, obtain the load resource information of the database system, namely the number of remaining CPU cores, the remaining memory capacity, and the remaining bandwidth capacity;

Then, query statistics including data access patterns, data volume, and index usage are collected, and configuration parameters for maximum parallelism and working memory are used to control parallelism and memory usage.

At the same time, partitioning technology is used for data tables whose size exceeds the set threshold, and each query operation is executed in parallel on different partitions.

8. A method for data indexing and query optimization based on a database, characterized in that the method is specifically as follows:

S1. Receive a query request and parse the content of the query request; wherein the content of the query request includes query conditions, sorting requirements, and data aggregation operations;

S2. Use the index building module to build or update the index to adapt to the changes in the data records in the data storage module; at the same time, the index building module dynamically adjusts the index structure according to the changes in the data access mode, and the content of the dynamic adjustment of the index structure includes splitting, merging or reorganizing the index; the index adjustment strategy of the index building module is based on the data access frequency, data distribution and system load setting, specifically: according to the past query logs, identify the data that needs to be accessed frequently, optimize the relevant indexes, and adjust the relevant indexes in combination with the changes in the data distribution after the data is updated;

S3. Generate a query execution plan using a query optimization module. The query execution plan includes using a cost evaluator to evaluate and select an optimal query path and using a machine learning model to predict the query result distribution.

S4. Execute the query and return the result, and collect the statistical information of the query execution for subsequent optimization; the statistical information includes the query execution time, resource consumption and user satisfaction.

9. The method for data indexing and query optimization based on a database according to claim 8, wherein building or updating an index includes incremental updating and full reconstruction;

Among them, incremental update adopts the partial update method, that is, when the data change amount does not exceed the set threshold, only the existing index is updated to reflect the insertion, deletion and update of data records;

When the scale of data changes exceeds the set threshold, the index is rebuilt to optimize performance, that is, the index is rebuilt for all data and the corresponding mapping relationship is established.

10. The method for data indexing and query optimization based on a database according to claim 8 or 9, wherein the working process of the query optimization module is as follows:

S301, the initial query plan and the solution space saved for the query operation in the past are sent to the query optimization module as external input;

S302, the query optimization module continuously optimizes based on the established machine learning model and generates a corresponding query plan;

S303. The query plan is directly used for the query operation of the database, and the time information of the query operation is used to feed back the machine learning model calculation, and this process is repeated until a set of optimal results is found.