
CN119046346A - Data multisource joint retrieval method and device - Google Patents


Info

Publication number
CN119046346A
CN119046346A
Authority
CN
China
Prior art keywords
data
query
complexity
resource
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202411514829.3A
Other languages
Chinese (zh)
Inventor
吴镝
李存冰
张尧臣
陈焕新
杨建�
刘金革
吕鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Technology Co Ltd
Original Assignee
Inspur Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Technology Co Ltd filed Critical Inspur Software Technology Co Ltd
Priority to CN202411514829.3A priority Critical patent/CN119046346A/en
Publication of CN119046346A publication Critical patent/CN119046346A/en
Priority to CN202510345138.3A priority patent/CN120277127A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention relates to the field of cloud information technology, and specifically provides a data multi-source joint retrieval method and device comprising the following steps: S1, containerized deployment of the original Trino worker nodes; S2, accessing heterogeneous data through a heterogeneous data protocol conversion adapter; S3, data identification and feature analysis; S4, asynchronous real-time monitoring and analysis; S5, resource scheduling and optimization; S6, result caching and materialization acceleration. Compared with the prior art, the present invention reduces resource waste and improves resource utilization.

Description

Data multisource joint retrieval method and device
Technical Field
The invention relates to the technical field of cloud information, and particularly provides a data multi-source joint retrieval method and device.
Background
In the prior art, the amount of data involved in daily work is huge and of a wide variety. These data are not only complex and variable, but are also stored in a decentralized manner across different business systems. Because these systems often adopt different technical architectures, data formats and storage modes, the data are difficult to communicate and share directly, posing no small challenge to business development.
In actual operation, cross-department, cross-source and cross-domain information retrieval and integration is a common and important task. In particular, in the process of policy making and resource allocation, it is often necessary to combine data from multiple departments for comprehensive analysis and comparison in order to make effective decisions. However, owing to the diversity and complexity of the data sources, this process faces a number of difficulties.
In cross-source and cross-domain data retrieval schemes, Trino, a mature and powerful distributed query engine, is widely applied in industry. It can connect to various data sources, including relational databases, non-relational databases and big data platforms, provides a unified query interface, and realizes joint retrieval of cross-source, cross-domain data. Although Trino is an excellent distributed cross-source data retrieval tool, challenges remain in the face of complex and diverse data.
Such data are characterized by diversity, complexity and huge volume. Owing to resource limitations and the rigidity of its deployment mode, Trino is often inefficient when processing large-scale, cross-domain heterogeneous data queries and cannot meet real-time work requirements. In particular, in high-concurrency scenarios, excessive query tasks prolong waiting time, and the traditional deployment mode cannot scale elastically, further aggravating the execution-efficiency problem of large-scale retrieval tasks. When multiple departments initiate query requests at the same time, system resources easily reach a bottleneck, so that query responses are slow or results cannot even be returned in time.
Therefore, how to realize efficient and accurate cross-source and cross-domain information retrieval and integration in massive, complex and scattered data becomes a technical problem to be solved urgently in the industry.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a highly practical data multi-source joint retrieval method.
The invention further aims to provide a data multi-source combined retrieval device which is reasonable in design, safe and applicable.
The technical scheme adopted for solving the technical problems is as follows:
A data multisource joint retrieval method comprises the following steps:
S1, carrying out containerized deployment on original Trino working nodes;
S2, accessing heterogeneous data through a heterogeneous data protocol conversion adapter;
S3, data identification and feature analysis;
S4, asynchronous real-time monitoring and analysis;
S5, resource scheduling and optimization;
S6, result caching and materialization acceleration.
Further, in step S1, based on the Docker and Kubernetes container technologies, container groups (Pod) with different resource limits are preconfigured to adapt to the query requirements under different loads;
according to differences in computing performance, a plurality of container groups (Pod) are preconfigured, each Pod having different resource limits and requests.
Further, in step S3, the newly added data identification and analysis module (Data Insight Module) automatically collects and analyzes key indexes of the accessed data sources through a timing task, and on this basis a calculation formula for the complexity of a database table is designed, as follows:
TableComplexityScore(TCS) = D × (w1×log(S) + w2×F + w3×I + w4×R + w5×C);
Wherein the parameters are defined as follows:
S is the size of the table, measured in number of rows or data volume;
F is the number of table fields, I is the number of indexes;
R is the structural complexity of the table, e.g. the reciprocal of the normal form level (1NF, 2NF, 3NF) or another quantization index of structural complexity;
C is the number of foreign keys; w1, w2, w3, w4, w5 are weight factors used to adjust the influence of each factor on the complexity, and the weights are adjusted according to actual conditions;
D is a database type coefficient reflecting the influence of the database type on table complexity; different database types may have different optimization mechanisms and usage scenarios, and therefore affect table complexity differently;
D_RDBMS is the coefficient of a relational database, D_Columnar of a columnar database, D_In-Memory of an in-memory database, and D_Document of a document database;
the calculation steps are as follows:
(1) Determining the size S of the table and calculating log(S) to balance the influence of the data volume on the complexity;
(2) Counting the field number F, the index number I and the foreign key number C in the table, and evaluating the structural complexity R of the table;
(3) Determining the coefficient D according to the database type;
(4) Assigning the weights w1, w2, w3, w4, w5 to the factors;
(5) Substituting these values into the formula to calculate the TCS;
According to the formula, the table complexity of each accessed data source is calculated and the tables are sorted from large to small; the higher-complexity tables at the head of the list are selected and marked according to actual conditions, and the relevant information is recorded for query optimization and resource scheduling.
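The TCS calculation and head-of-list marking described above can be sketched in Python as follows; the weight values, database coefficients and function names are illustrative assumptions, not part of the patent:

```python
import math

# Illustrative database-type coefficients D (assumed values, see description).
DB_COEFF = {"rdbms": 1.0, "columnar": 0.8, "in_memory": 1.2, "document": 1.1}

def table_complexity_score(rows, fields, indexes, structure, foreign_keys,
                           db_type="rdbms",
                           weights=(0.5, 0.2, 0.1, 0.1, 0.1)):
    """TCS = D * (w1*log(S) + w2*F + w3*I + w4*R + w5*C)."""
    w1, w2, w3, w4, w5 = weights
    d = DB_COEFF[db_type]
    return d * (w1 * math.log(rows) + w2 * fields + w3 * indexes
                + w4 * structure + w5 * foreign_keys)

def mark_complex_tables(tables, top_n=2):
    """Sort tables by TCS in descending order and mark the top-N
    highest-complexity tables for query optimization and scheduling."""
    scored = sorted(tables,
                    key=lambda t: table_complexity_score(**t["stats"]),
                    reverse=True)
    return [t["name"] for t in scored[:top_n]]
```

The timing task would feed `mark_complex_tables` the per-table statistics collected in step S3 and persist the marked names for the scheduler.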
Further, in step S3, further includes:
(1) The method comprises the steps of regularly scanning and identifying table information of each database, setting a timing task, automatically counting key indexes by utilizing a script or a database management tool, and calculating the complexity of the table;
(2) Setting a threshold value, namely setting reasonable threshold values for each data source or table according to service requirements and system resource conditions, wherein the threshold values are used for judging the size and complexity of a query task;
(3) Marking and recording, marking the table which reaches or exceeds the threshold value, and storing relevant information in a special cache or database for inquiring and subsequent resource scheduling.
Further, in step S4, further comprising:
S4-1, asynchronous monitoring and real-time analysis;
s4-2, multi-dimensional pre-judging;
s4-3, calculating and pre-judging in real time;
S4-4, feedback and prompt;
S4-5, optimizing and adjusting.
Further, in step S4-1, a front-end asynchronous monitor InputMonitor is constructed, and an SQL statement input by the user client is monitored in real time in an asynchronous processing manner, and the SQL statement input by the user is subjected to preliminary analysis to extract key information;
In step S4-2, it includes:
(1) The complex large-table data feature analysis is carried out, whether a table queried by a user belongs to a table with higher complexity is checked according to a data table complexity calculation result, if the table with higher complexity is hit, query conditions and data distribution conditions are further analyzed, and data quantity and processing time related to query are evaluated;
(2) Analyzing a cross-domain query structure, analyzing SQL sentences of user queries, checking whether cross-domain queries of a plurality of data sources or tables are involved, evaluating complexity and data distribution conditions of joint operation in the queries, and influence of the cross-domain queries on system performance;
(3) Checking whether SQL sentences of user inquiry contain aggregation functions, statistics or operation operations, analyzing data quantity and calculation complexity of the operations, and whether index support is involved, and evaluating the complexity and execution time of the inquiry operations;
(4) The index matching efficiency analysis is to compare SQL sentences of user query with the index of the build table to determine whether the query content has index support, analyze the coverage condition and selectivity of the index and the matching degree of the query condition and the index, and evaluate the improving effect of the index on the query speed;
In this step, the following formula is designed for calculation and determination of the user query complexity;
QueryComplexityScore(QCS) = TCS + w6×E + w7×Joins + w8×(1-Acc);
Wherein:
QCS is a query complexity score that measures the overall complexity of a query;
TCS is the data table complexity score described above;
E is the number of tables involved in the query;
Joins is the number of JOIN operations involved in the query;
Acc reflects the use of aggregation functions in the query, so the term 1-Acc denotes the absence of aggregation functions: Acc = 0 indicates that no aggregation function is used, which contributes the highest complexity;
w6, w7, w8 are weight factors used to adjust the influence of each factor on the query complexity;
The calculation steps are as follows:
(1) Calculating a TCS for each table related to the query;
(2) Determining a number E of tables involved in the query;
(3) Calculating the number of connection operations Joins involved in the query;
(4) Evaluating the usage Acc of aggregation functions in the query: Acc approaches 1 if aggregation functions are used and approaches 0 if not;
(5) Assigning the weights w6, w7, w8 to the factors;
(6) Substituting these values into the formula to calculate the QCS;
According to the actual situation, a threshold is set; when the QCS value exceeds the threshold, the query is judged to be a large or complex query task and resource scheduling is further optimized.
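The QCS formula and its threshold judgment can likewise be sketched; the weights w6, w7, w8 and the threshold value are illustrative assumptions:

```python
def query_complexity_score(tcs_sum, num_tables, joins, acc,
                           weights=(0.3, 0.5, 0.4)):
    """QCS = TCS + w6*E + w7*Joins + w8*(1 - Acc).
    acc is in [0, 1]: close to 1 when aggregation functions are used,
    close to 0 when they are not (per the description, no aggregation
    contributes the highest complexity via the 1 - Acc term)."""
    w6, w7, w8 = weights
    return tcs_sum + w6 * num_tables + w7 * joins + w8 * (1 - acc)

def is_large_query(qcs, threshold=15.0):
    """Judge a large/complex query task once QCS exceeds the threshold."""
    return qcs >= threshold
```

A query over two tables with one JOIN and no aggregation, whose combined TCS is 12.2078, would score 12.2078 + 0.6 + 0.5 + 0.4 = 13.7078 under these assumed weights.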
Further, in step S4-3, in the process of inputting the SQL statement by the user, the multi-dimensional pre-judgment is performed in real time, and the resource pre-scheduling policy is dynamically adjusted according to the pre-judgment result;
If the prejudging result is a large or complex query task, immediately triggering a resource pre-scheduling mechanism, reserving high-performance Pod in advance or performing Pod expansion, and ensuring the execution efficiency of the query task and the overall performance of the cluster;
In step S4-4, according to the pre-judging result, a corresponding prompt is given through a user interface, and if the pre-judging result is a simple query task, the query is directly executed without resource pre-scheduling or capacity expansion;
In step S4-5, the multi-dimensional pre-judging algorithm and the threshold value are continuously optimized and adjusted according to the execution condition and the resource use condition of the actual query task, and the data pre-counting result is regularly updated and maintained, so that the pre-judging algorithm can accurately reflect the actual condition of the data.
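The pre-scheduling decision of steps S4-3 and S4-4 can be sketched as a small dispatch function; the tier names, the threshold and the returned action labels are assumptions for illustration:

```python
def preschedule(qcs, threshold=15.0):
    """Map a pre-judged query complexity score (QCS) to a Pod tier and
    an action: large/complex queries trigger reservation of a
    high-performance Pod or Pod expansion, simple queries execute
    directly without pre-scheduling (threshold value illustrative)."""
    if qcs >= threshold:
        return {"tier": "high-performance", "action": "reserve_or_scale_out"}
    return {"tier": "standard", "action": "execute_directly"}
```

The user interface prompt in step S4-4 would then be driven by the returned `action` field.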
Further, in step S5, the method includes:
s5-1, reserving and adjusting resources based on pre-judgment;
By means of the prejudging result and the elastic telescopic mechanism of containerized deployment, intelligent triggering of resource scheduling, and for a large-scale query task to be executed, reserving a high-performance Pod group in advance for executing the large-scale task;
During the execution of the query task, the system continuously monitors the real-time load, supports the registration of third-party services to the Kubernetes API, accesses external services through the Kubernetes API, uses Prometheus as the monitoring data source and registers prometheus-adapter to the Kubernetes API as an extension service, dynamically adjusting the number and resource allocation of Pods to maintain efficient execution of the query task;
S5-2, manually configuring the size of the timing task management Pod group;
HPA and CronHPA are combined to cope with different scenarios: HPA controls scaling according to service metrics, while CronHPA, following the busy/idle patterns of the service, automatically schedules resources in advance, expanding them before service peaks and releasing them when the service is idle.
Further, in step S6, the data is intelligently cached in the high-speed storage area, and a plurality of cache update policies are supported by using cache policy formulation and optimization, so that timeliness and accuracy of the cached data are ensured.
A data multi-source joint search device comprises at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to perform a data multisource joint retrieval method.
Compared with the prior art, the data multisource joint retrieval method and device have the following outstanding beneficial effects:
The invention improves the data processing efficiency, and the system can dynamically adjust the resource allocation according to the real-time data load through the query engine and the resource elastic expansion mechanism which are arranged in a containerized manner, thereby obviously improving the data processing efficiency. Particularly, in policy analysis and public management tasks with large data volume and complex inquiry, the resource waste is reduced, and the resource utilization rate is improved.
And accelerating data retrieval response, namely combining real-time SQL query analysis with intelligent resource scheduling optimization, and realizing multidimensional pre-judgment and resource pre-scheduling of query tasks. The mechanism effectively improves the performance and response speed of data retrieval, so that the office can acquire key information more quickly, and thus, various public service demands can be responded quickly.
And optimizing the execution of the query tasks, namely, by monitoring and analyzing SQL sentences in real time, the system can identify complex query tasks needing a large amount of computation resources in advance, and timely adjust the scale of the container group so as to ensure that the query tasks can be executed in an environment with sufficient resources. The real-time optimization mechanism avoids query delay caused by insufficient resources, improves working efficiency, and particularly can provide rapid and accurate data support for decision making in time-sensitive decision making and decision support.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a data multisource joint retrieval method;
FIG. 2 is a diagram of a Trino architecture optimized in a data multisource joint search method;
FIG. 3 is a diagram illustrating real-time SQL parsing in a data multisource joint search method;
FIG. 4 is a schematic diagram of the scheduling and optimization of containerized resources in a data multisource joint retrieval method;
FIG. 5 is a diagram of a conventional Trino architecture in a data multisource joint search method.
Detailed Description
In order to provide a better understanding of the aspects of the present invention, the present invention will be described in further detail with reference to specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A preferred embodiment is given below:
As shown in FIG. 5, clients (Client) include clients and drivers implemented via JDBC, ODBC or other languages;
A service discovery node (Discover-node) is a service that binds the parsing nodes and the working nodes together. After a working node starts, it registers with the service discovery node, and the parsing node obtains the normally working nodes from the service discovery node.
The analysis and scheduling node (Analysis and Scheduling node) mainly receives queries submitted by the client, parses the query statements, performs lexical analysis to generate a query execution plan, generates query stages and tasks for scheduling, merges the results, and returns them to the client (Client).
The working node (Worker) is mainly responsible for read-write interaction with the data and for executing the query plan.
The heterogeneous data protocol conversion adapter (Data-Connector) connects Trino with the data sources (e.g. Hive, RDBMS). The adapter is designed with full reference to JDBC, using standard API interfaces to interact with different data sources. Each data source has a specific adapter: in the configuration file of each data source, the adapter is specified through the connector.name attribute, with data sources and adapters in one-to-one correspondence. The redesign fully considers future expansion scenarios, realizing unified access and management of heterogeneous data sources.
The data acceleration cache library (Accelerate-Cache) caches commonly used cross-domain collaborative computation result sets, realizing quick access to hot-spot data.
The conventional Trino architecture shows that the overall system architecture is rigid: the number of working nodes is fixed, resources cannot be flexibly allocated according to actual conditions when query tasks are executed, and query performance faces great challenges under high concurrency or large-scale tasks.
As shown in fig. 2, in this embodiment, a real-time SQL query analysis, data recognition and analysis module is added, and meanwhile, the working nodes are subjected to containerization deployment, so as to optimize upgrade analysis and scheduling, thereby improving the overall performance of the cluster and enhancing the ability of prejudging and resource scheduling for large-scale SQL and concurrent tasks.
As shown in fig. 1, the specific implementation procedure is as follows:
s1, carrying out containerized deployment on original Trino working nodes;
The original Trino working nodes are subjected to containerized deployment through a cloud native technology, and portability, expandability and high availability of services are ensured on the basis of container technologies such as Docker, kubernetes and the like. By pre-configuring container groups (Pod) with different resource limits, flexible management and dynamic expansion of computing resources are realized so as to adapt to query requirements under different loads. Meanwhile, the containerized deployment mode also greatly improves the stability and reliability of the whole system.
Depending on the computing performance, multiple container groups (Pod) are preconfigured, each Pod having different resource limitations (e.g., CPU, memory) and requests (requests). For example, some less resource-rich Pod is created for performing a small task, while some high-performance, resource-rich Pod is created for performing a complex query.
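The pre-configured Pod groups might be represented as resource profiles like the following sketch; the tier names and CPU/memory values are illustrative assumptions, mirroring the requests/limits stanza a Kubernetes Pod template carries under spec.containers[].resources:

```python
# Illustrative resource profiles for the pre-configured Trino worker
# Pod groups of step S1 (tier names and values are assumptions).
POD_PROFILES = {
    "small": {"requests": {"cpu": "1",  "memory": "2Gi"},
              "limits":   {"cpu": "2",  "memory": "4Gi"}},
    "large": {"requests": {"cpu": "8",  "memory": "32Gi"},
              "limits":   {"cpu": "16", "memory": "64Gi"}},
}

def pod_spec(tier):
    """Return the container resources stanza for the chosen Pod tier,
    in the shape Kubernetes expects for requests and limits."""
    return {"resources": POD_PROFILES[tier]}
```

Small, resource-lean profiles would serve simple tasks, while the resource-rich profile is reserved for complex queries, as the step describes.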
S2, accessing heterogeneous data through a heterogeneous data protocol conversion adapter;
and heterogeneous data (such as a relational database, a non-relational database, a big data platform and the like) is accessed through the heterogeneous data protocol conversion adapter, so that a unified data access interface is provided.
S3, data identification and feature analysis;
The newly added data identification and analysis module (Data Insight Module) is designed to automatically collect and analyze key indexes of the accessed data sources through timing tasks, such as the database type, table structure, table data volume, index conditions, data characteristics and foreign key conditions. On this basis, a calculation formula for the complexity of a database table is designed, as follows:
TableComplexityScore(TCS) = D × (w1×log(S) + w2×F + w3×I + w4×R + w5×C);
Wherein the parameters are defined as follows:
s is the size of the table, measured in number of rows or data volume (e.g., MB);
F is the number of table fields, I is the number of indexes;
R is the structural complexity of the table, which can be the reciprocal of the normal form level (1NF, 2NF, 3NF) or another quantization index of structural complexity;
C is the number of foreign keys; w1, w2, w3, w4, w5 are weight factors used to adjust the influence of each factor on the complexity, and can be adjusted according to actual conditions;
D is a database type coefficient that reflects the impact of the database type on table complexity. Different database types may have different optimization mechanisms and usage scenarios, and thus affect table complexity differently.
D_RDBMS is the coefficient of a relational database, e.g. 1.0; D_Columnar of a columnar database, e.g. 0.8; D_In-Memory of an in-memory database, e.g. 1.2; D_Document of a document database, e.g. 1.1.
The calculation steps are as follows:
(1) The size S of the table is determined and its log (S) is calculated to balance the effect of the amount of data on complexity.
(2) The number of fields F, the number of indexes I, the number of foreign keys C in the table are calculated, and the structural complexity R of the table is evaluated.
(3) The coefficient D is determined according to the database type.
(4) The weights w1, w2, w3, w4, w5 are assigned to the factors.
(5) Substituting these values into the formula calculates TCS.
Assume that a data table is characterized as follows:
database type: relational database (D_RDBMS = 1.0);
1,000,000 rows of data, 20 fields, 5 indexes, structural complexity 3 (3NF quantized as 3), 5 foreign keys, and weight factors w1 = 0.5, w2 = 0.2, w3 = 0.1, w4 = 0.1, w5 = 0.1;
The TCS is calculated as follows:
TCS = 1.0 × (0.5×log(1000000) + 0.2×20 + 0.1×5 + 0.1×3 + 0.1×5);
TCS = 1.0 × (0.5×13.8155 + 4 + 0.5 + 0.3 + 0.5);
TCS = 1.0 × (6.9078 + 4 + 0.5 + 0.3 + 0.5);
TCS = 1.0 × 12.2078;
TCS = 12.2078;
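The substitution can be re-checked directly in a few lines; the natural logarithm is assumed, since the example uses log(1000000) ≈ 13.8155:

```python
import math

# Re-check of the worked TCS example: D = 1.0, w = (0.5, 0.2, 0.1, 0.1, 0.1),
# S = 1,000,000 rows, F = 20, I = 5, R = 3, C = 5 (natural log assumed).
terms = (0.5 * math.log(1_000_000),  # ≈ 6.9078
         0.2 * 20,                   # 4.0
         0.1 * 5,                    # 0.5
         0.1 * 3,                    # 0.3
         0.1 * 5)                    # 0.5
tcs = 1.0 * sum(terms)
print(round(tcs, 4))  # 12.2078
```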
According to the formula, the table complexity of each access data source is calculated, the access data sources are ordered from large to small, a table with higher head complexity is selected for marking according to actual conditions, and relevant information is recorded and used for query optimization and resource scheduling.
And (3) periodically scanning and identifying the table information of each database, namely setting a timing task (such as a low peak period in the early morning every day), automatically counting key indexes such as each data source, the data volume of each table, the table size and the like by utilizing a script or a database management tool, and calculating the complexity of the table. This helps the system to understand the data distribution and scale, providing basis for subsequent query analysis and resource scheduling.
And setting a reasonable threshold (such as data quantity, line number and the like) for each data source or table according to service requirements and system resource conditions. These thresholds will be used to determine the size and complexity of the query task.
Marking and recording, namely marking a table which reaches or exceeds a threshold value, and storing related information in a special cache or database so as to quickly inquire and schedule subsequent resources.
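The scan-threshold-mark procedure above can be sketched as follows; the row-count threshold and the field names are assumptions standing in for the statistics a real database management tool would collect:

```python
def scan_and_mark(tables, row_threshold=1_000_000):
    """Periodic scan of step S3: walk the collected per-table statistics
    and mark every table whose row count reaches or exceeds the
    threshold, recording it for query analysis and resource scheduling
    (threshold value illustrative)."""
    marked = {}
    for t in tables:
        if t["rows"] >= row_threshold:
            marked[t["name"]] = {"rows": t["rows"], "flag": "large"}
    return marked
```

The returned mapping would be stored in the special cache or database the step mentions, so the query analyzer can look tables up quickly.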
S4, asynchronous real-time monitoring and analysis;
By containerizing the traditional Trino deployment, the Trino framework gains resource elasticity: resources are dynamically expanded at traffic peaks and released when the resource load is low, realizing on-demand use of resources. However, the Kubernetes HPA only supports CPU and memory monitoring data as elasticity indicators and lacks the capability of predictive expansion for large-scale query tasks. For example, when executing a TB-level cross-source data query, the system needs to occupy a large amount of resources; although excessive CPU and memory occupancy triggers the elastic capacity expansion of K8s and enhances cluster orchestration capacity, the newly spawned containers do not participate in the computation of the task already in progress, so the efficiency of the complex query task still cannot be improved.
On this basis, the execution efficiency of such large tasks is ensured by expanding containers in advance or reserving resources. However, analyzing the SQL in the user input frame only at the submission stage also wastes time, so asynchronous real-time monitoring of the SQL input frame is adopted, and the complexity and resource requirements of the user query are prejudged according to the data characteristics and index information provided by the data identification and analysis module. This works cooperatively with analysis and scheduling to ensure that the query task is executed in an environment with sufficient resources, thereby improving query efficiency and performance.
The method comprises the following steps:
S4-1, asynchronous monitoring and real-time analysis;
A front-end asynchronous listener, InputMonitor, is built to monitor the SQL statements entered by the user client in real time; asynchronous processing avoids blocking user operations. The entered SQL is given a preliminary parse to extract key information such as table names, column names, query conditions, grouping clauses (such as GROUP BY), and statistical or arithmetic operations.
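The InputMonitor idea can be sketched as an asynchronous consumer of partial SQL input; the queue-based event source and the regex-based extraction are simplifying assumptions, not the patent's actual parser:

```python
import asyncio
import re

# Hypothetical sketch of InputMonitor: input events arrive on an asyncio queue,
# and a lightweight regex pass extracts table names, GROUP BY and aggregates.
SQL_AGGREGATES = re.compile(r"\b(COUNT|SUM|AVG|MAX|MIN)\s*\(", re.IGNORECASE)

def extract_key_info(sql: str) -> dict:
    """Preliminary parse: pull table names, aggregation use and a GROUP BY flag."""
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {
        "tables": tables,
        "has_group_by": bool(re.search(r"\bGROUP\s+BY\b", sql, re.IGNORECASE)),
        "aggregates": [m.upper() for m in SQL_AGGREGATES.findall(sql)],
    }

async def input_monitor(queue: asyncio.Queue):
    """Consume partial SQL input asynchronously so the user's typing is never blocked."""
    while True:
        sql = await queue.get()
        if sql is None:  # sentinel: client closed the input box
            break
        print("analysis:", extract_key_info(sql))

async def main():
    q = asyncio.Queue()
    await q.put("SELECT region, COUNT(*) FROM orders o JOIN users u ON o.uid = u.id GROUP BY region")
    await q.put(None)
    await input_monitor(q)

asyncio.run(main())
```

A production listener would subscribe to keystroke or debounce events from the client rather than a local queue; the asynchronous consumption pattern is the same.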
S4-2, multi-dimensional pre-judging;
(1) Complex large-table data feature analysis: according to the table complexity results, check whether the table queried by the user is of high complexity; if a high-complexity table is hit, further analyze the query conditions and data distribution, and evaluate the data volume and processing time the query may involve.
(2) Cross-domain query structure analysis: analyze the SQL statement of the user's query and check whether it involves a cross-domain query over multiple data sources or tables. Evaluate the complexity and data distribution of JOIN operations in the query, and the impact of cross-domain queries on system performance.
(3) Query operation complexity analysis: check whether the SQL statement contains a grouping clause (GROUP BY) or statistical and aggregation operations (such as SUM, COUNT, AVG). Analyze the data volume and computational complexity of these operations, whether index support is involved, and evaluate the complexity and execution time of the query operations.
(4) Index matching efficiency analysis: compare the SQL statement with the table's indexes to determine whether the queried content has index support. Analyze index coverage, selectivity, and the degree of match between the query conditions and the indexes, and evaluate the index's effect on query speed.
In this step, the following formula is designed for calculating and judging user query complexity:
Query Complexity Score (QCS) = TCS + w6×E + w7×Joins + w8×(1-Acc);
Wherein:
QCS is a query complexity score that measures the overall complexity of a query;
TCS is the data table complexity score described above;
E is the number of tables involved in the query;
Joins is the number of JOIN (JOIN) operations involved in the query;
Acc is the use of aggregation functions (such as COUNT, AVG, SUM, MAX, MIN) in the query; 1-Acc represents the case where none is used, i.e., Acc = 0 means no aggregation function is used and the complexity is highest.
w6, w7, w8 are weight factors used to adjust the degree of influence of different factors on query complexity.
The calculation steps are as follows:
(1) The TCS is calculated for each table involved in the query.
(2) The number E of tables involved in the query is determined.
(3) The number of join operations Joins involved in the query is calculated.
(4) Evaluate the use of aggregation functions in the query, Acc: if aggregation functions are used, Acc is close to 1; if not, Acc is close to 0.
(5) Assign the weights w6, w7, w8 to each factor.
(6) These values are substituted into the formula to calculate QCS.
Assume a query has the following characteristics:
Number of tables involved: 3
Number of JOIN operations involved: 2
No aggregation function used (Acc = 0)
Data table complexity score TCS: 12.7078 (from the earlier calculation)
Weight factors: w6 = 0.3, w7 = 0.2, w8 = 0.5
The QCS is calculated as follows:
QCS = 12.7078 + 0.3×3 + 0.2×2 + 0.5×(1-0)
QCS = 12.7078 + 0.9 + 0.4 + 0.5
QCS = 14.5078
A threshold is set according to actual conditions; when the QCS value exceeds the threshold, the query is judged to be a large or complex task and resource scheduling is further optimized.
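The QCS formula and the worked example above can be sketched as follows; the threshold value is an assumption to be tuned per deployment:

```python
# Sketch of the QCS formula; TCS and the weights come from the worked example
# in the text (TCS = 12.7078, w6 = 0.3, w7 = 0.2, w8 = 0.5).
def query_complexity_score(tcs, num_tables, num_joins, acc, w6=0.3, w7=0.2, w8=0.5):
    """QCS = TCS + w6*E + w7*Joins + w8*(1 - Acc)."""
    return tcs + w6 * num_tables + w7 * num_joins + w8 * (1 - acc)

qcs = query_complexity_score(tcs=12.7078, num_tables=3, num_joins=2, acc=0)
print(round(qcs, 4))  # → 14.5078

QCS_THRESHOLD = 14.0  # assumed threshold; tuned per deployment
is_large_query = qcs > QCS_THRESHOLD  # True: triggers resource pre-scheduling
```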
S4-3, calculating and pre-judging in real time;
As shown in fig. 3, in the process of inputting the SQL statement by the user, the multi-dimensional pre-judgment is performed in real time, and the resource pre-scheduling policy is dynamically adjusted according to the pre-judgment result.
If the pre-judgment indicates a large or complex query task, the resource pre-scheduling mechanism is triggered immediately, reserving high-performance Pods or expanding Pods in advance to ensure the execution efficiency of the query task and the overall performance of the cluster. As shown in FIG. 3, when a query spans Oracle, PostgreSQL and MySQL data sources, real-time analysis is performed while the user enters the SQL and a conclusion is drawn.
S4-4, feedback and prompt;
According to the pre-judgment result, a corresponding prompt is given through the user interface, for example "this query may involve a large amount of data and is expected to take longer; please wait", or "high-performance resources have been reserved for you; please wait".
If the pre-judgment indicates a simple query task, the query is executed directly without resource pre-scheduling or expansion.
S4-5, optimization and adjustment: the multi-dimensional pre-judgment algorithm and its thresholds are continuously optimized and adjusted according to the execution and resource usage of actual query tasks, to improve pre-judgment accuracy and efficiency. The pre-statistics results are regularly updated and maintained so that the pre-judgment algorithm accurately reflects the actual state of the data.
S5, resource scheduling and optimization;
The system receives SQL queries submitted by clients, performs lexical analysis to generate query execution plans, and dynamically adjusts resource allocation according to real-time performance indicators. Multi-dimensional pre-judgment comprehensively considers query complexity, data size, index matching efficiency and service-specific monitoring indicators, such as message-queue backlog and request-processing wait time. The elastic capability of the container orchestration platform automatically expands the container groups during business peaks to ensure efficient execution of query tasks, and reclaims resources during troughs to save cost. In addition, this step monitors query execution time, resource usage and system health, adjusts the resource scheduling strategy according to the monitoring data to optimize query performance, and finally merges the results of the compute engines and returns them to the client, achieving high performance, high availability and cost optimization.
The method comprises the following steps:
(1) Reserving and adjusting resources based on pre-judgment;
Using the pre-judgment result and the elastic scaling mechanism of containerized deployment, the system triggers resource scheduling intelligently. For a large query task about to execute, the system reserves a high-performance Pod group in advance for that task to ensure it proceeds smoothly.
During query execution, the system continuously monitors real-time load (such as CPU, memory, task-queue wait, and message-queue backlog). The K8S HPA supports CPU and memory monitoring data as elasticity indicators, but its support for custom elasticity indicators (task-queue wait, message-queue backlog) is poor. Therefore, the Kubernetes API Aggregator feature is used: third-party services are registered with the Kubernetes API, allowing external services to be accessed directly through it. Using external services as data sources provides a richer range of metrics and flexible extensibility, better suited to complex and frequent data retrieval scenarios.
As shown in fig. 4, Prometheus is used as the monitoring data source, and prometheus-adapter is registered with the Kubernetes API as an extension service. Dynamically adjusting the number of Pods and their resource allocation helps keep query tasks executing efficiently and avoids wasting resources. The specific flow is as follows:
1) Node and Pod monitoring start point
In a Kubernetes cluster, Nodes and Pods (container groups) are the starting points of resource monitoring. Multiple Pods run on these nodes, each an instance of an application.
2) Container resource monitoring tools collect data
Container resource monitoring tools (such as cAdvisor and kubelet) collect metrics on Nodes and Pods, including CPU, memory, and possibly custom metrics. These data are the basis for subsequent decisions.
3) Metrics server stores data
The collected metrics are sent to the metrics server (metrics-server), which stores and serves them so that other components can access and use them.
4) Prometheus monitoring component collects more data
The Prometheus monitoring component further collects custom metrics beyond the Node and Pod measurements, such as task-queue wait and message-queue backlog, via probes (such as a Prometheus agent), extending the overall monitoring range, and stores them in the Prometheus database.
5) Monitoring component adapter converts the data format
The monitoring component adapter (such as prometheus-adapter) converts the metrics in Prometheus into a format Kubernetes understands, so that Kubernetes can use these data for decisions such as auto-scaling.
6) Metric aggregator consolidates data
A metrics aggregator (metrics-aggregator) can be used to integrate metrics from multiple monitoring components (such as Prometheus and other monitoring tools), providing Kubernetes with a unified view.
7) Horizontal Pod Autoscaler decision
The Horizontal Pod Autoscaler (HPA) makes decisions based on metrics obtained from the metrics aggregator or other sources (such as CPU utilization, memory usage, and custom metrics), triggering Pod scaling operations.
8) Deployments and ReplicaSets adjust the Pod count
According to the HPA's decision, Kubernetes resources such as Deployments and ReplicaSets adjust the number of Pods accordingly to maintain the stability and performance of the system.
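The HPA decision in step 7) can be sketched with Kubernetes' documented scaling formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); the custom metric used here (message-queue backlog per Pod) and the replica bounds are assumed examples:

```python
import math

# Sketch of the HPA scaling decision using the documented Kubernetes formula:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric),
# clamped to the configured min/max replica bounds.
def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 3 Pods with a backlog of 150 messages per Pod against a target of 100 → 5 Pods.
print(desired_replicas(3, 150, 100))  # → 5
```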
(2) Manually configuring the size of a timing task management Pod group;
To further save resources, HPA and CronHPA are combined to handle different scenarios. In daily operation, HPA controls scaling according to business indicators, while CronHPA, following the busy/idle pattern of the business, schedules resources automatically in advance, expanding before business peaks and releasing resources when business is idle, which greatly shortens application cold-start time.
Combining manual configuration with automatic adjustment makes resource scheduling more flexible and efficient, helping meet business needs while saving resource costs to the greatest extent.
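The HPA + CronHPA combination can be sketched as a replica floor that is raised ahead of known busy windows, while the metric-driven value still wins above that floor; the window times and floor values are assumed examples:

```python
from datetime import time as dtime

# Sketch of HPA + CronHPA: CronHPA raises the replica floor before known busy
# windows; the HPA's metric-driven value applies whenever it exceeds the floor.
BUSY_WINDOWS = [(dtime(8, 30), dtime(11, 30)), (dtime(13, 30), dtime(18, 0))]  # assumed

def replica_floor(now, busy_floor=6, idle_floor=2):
    """Scheduled (CronHPA-style) minimum replica count for the current time."""
    for start, end in BUSY_WINDOWS:
        if start <= now <= end:
            return busy_floor
    return idle_floor

def effective_replicas(metric_driven, now):
    """The schedule pre-warms capacity; the metric value still wins above the floor."""
    return max(metric_driven, replica_floor(now))

print(effective_replicas(3, dtime(9, 0)))   # busy window: floor of 6 applies
print(effective_replicas(3, dtime(22, 0)))  # idle: metric value of 3 stands
```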
S6, caching results and accelerating materialization;
Containerization and elastic scaling of Trino greatly improve users' real-time query and analysis efficiency, but for frequent queries over GB-scale data, overall performance is still limited by the I/O bottleneck of each data source. To overcome this, data access patterns are analyzed in depth through dynamic projection, accurately identifying frequently accessed hot data and the results of large-scale analytical queries. These data are intelligently cached in a high-speed storage area, such as an in-memory database or SSD. These high-speed storage areas offer very high read/write speed and capacity, ensuring data can be served quickly when needed. Caching hot data and query results there avoids recomputation and retrieval from the original data source on every query, greatly reducing query and analysis latency. The method also supports cache strategy formulation and optimization, including multiple cache update strategies such as periodic refresh and incremental update, to keep cached data timely and accurate. In addition, the aggregated cache table can be optimized to further improve query performance and data access efficiency.
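A minimal sketch of the hot-result caching idea, assuming a simple time-to-live refresh policy standing in for the configurable cache-update strategies (periodic refresh, incremental update) described above:

```python
import time

# Minimal sketch of S6 result caching: hot query results are kept in an
# in-memory store; the TTL is an assumed stand-in for the configurable
# cache-update strategies (periodic refresh, incremental update).
class ResultCache:
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._store = {}  # sql -> (result, stored_at)

    def get(self, sql):
        entry = self._store.get(sql)
        if entry is None:
            return None
        result, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:  # stale: force recompute
            del self._store[sql]
            return None
        return result

    def put(self, sql, result):
        self._store[sql] = (result, time.monotonic())

cache = ResultCache(ttl_seconds=60.0)
cache.put("SELECT COUNT(*) FROM orders", 2_500_000)
print(cache.get("SELECT COUNT(*) FROM orders"))  # cache hit → 2500000
```

A production cache would also bound memory, track hit rates to decide what stays hot, and invalidate incrementally on source-data changes.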
Based on the above method, the data multi-source joint retrieval device in this embodiment comprises at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to perform a data multisource joint retrieval method.
The specific embodiments described above are merely examples of the present invention; the scope of the present invention is not limited to them, and any suitable change or substitution made by those skilled in the art that is consistent with the technical solutions described in the above embodiments shall fall within the scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A data multi-source joint retrieval method, characterized by comprising the following steps: S1, containerized deployment of the original Trino worker nodes; S2, accessing heterogeneous data through a heterogeneous data protocol conversion adapter; S3, data identification and feature analysis; S4, asynchronous real-time monitoring and analysis; S5, resource scheduling and optimization; S6, result caching and materialization acceleration.

2. The data multi-source joint retrieval method according to claim 1, characterized in that in step S1, based on Docker and Kubernetes container technology, container groups (Pods) with different resource limits are pre-configured to adapt to query requirements under different loads; multiple Pods are pre-configured according to different computing performance, each Pod having different resource limits and requests.

3. The data multi-source joint retrieval method according to claim 2, characterized in that in step S3, a data identification and analysis module (Data Insight Module) is added, which automatically collects and analyzes key indicators of the connected data sources through scheduled tasks; on this basis, a database table complexity formula is designed as follows:
TableComplexityScore (TCS) = D × (w1×log(S) + w2×F + w3×I + w4×R + w5×C);
where the parameters are defined as follows: S is the size of the table, measured in rows or data volume; F is the number of table fields; I is the number of indexes; R is the table structure complexity, the reciprocal of the normal form level (1NF, 2NF, 3NF) or a quantitative indicator of structural complexity; C is the number of foreign keys; w1, w2, w3, w4, w5 are weight factors used to adjust the influence of different factors on complexity, adjusted according to actual conditions; D is the database type coefficient, reflecting the impact of the database type on table complexity, since different database types may have different optimization mechanisms and usage scenarios: D_RDBMS is the coefficient for relational databases, D_Columnar for columnar databases, D_In-Memory for in-memory databases, D_Document for document databases;
the calculation steps are as follows: (1) determine the table size S and compute its logarithm log(S) to balance the impact of data volume on complexity; (2) count the number of fields F, indexes I and foreign keys C in the table, and evaluate the table structure complexity R; (3) determine the coefficient D according to the database type; (4) assign the weights w1, w2, w3, w4, w5 to each factor; (5) substitute these values into the formula to calculate the TCS;
according to the above formula, the table complexity of each connected data source is calculated and sorted in descending order; according to actual conditions, the tables with the highest complexity are selected, marked and annotated, and the related information is recorded for query optimization and resource scheduling.

4. The data multi-source joint retrieval method according to claim 3, characterized in that step S3 further comprises: (1) periodically scanning and identifying database table information: scheduled tasks are set, and scripts or database management tools automatically count key indicators and compute table complexity; (2) threshold setting: a reasonable threshold is set for each data source or table according to business requirements and system resources; these thresholds are used to judge the size and complexity of query tasks; (3) marking and recording: tables that reach or exceed a threshold are marked, and the related information is stored in a dedicated cache or database for query and subsequent resource scheduling.

5. The data multi-source joint retrieval method according to claim 4, characterized in that step S4 further comprises: S4-1, asynchronous monitoring and real-time analysis; S4-2, multi-dimensional pre-judgment; S4-3, real-time computation and pre-judgment; S4-4, feedback and prompting; S4-5, optimization and adjustment.

6. The data multi-source joint retrieval method according to claim 5, characterized in that in step S4-1, a front-end asynchronous listener (InputMonitor) is built, which uses asynchronous processing to monitor in real time the SQL statements entered by the user client, performs a preliminary parse of the entered SQL and extracts key information;
step S4-2 comprises: (1) complex large-table data feature analysis: based on the table complexity results, checking whether the table queried by the user is of high complexity; if a high-complexity table is hit, further analyzing the query conditions and data distribution, and evaluating the data volume and processing time involved in the query; (2) cross-domain query structure analysis: analyzing the SQL statement of the user's query, checking whether it involves a cross-domain query over multiple data sources or tables, and evaluating the complexity and data distribution of JOIN operations in the query and the impact of cross-domain queries on system performance; (3) checking whether the SQL statement contains aggregation functions, statistics or computation operations, analyzing their data volume and computational complexity and whether index support is involved, and evaluating the complexity and execution time of the query operations; (4) index matching efficiency analysis: comparing the SQL statement with the table indexes to determine whether the queried content has index support, analyzing index coverage, selectivity and the degree of match between the query conditions and the indexes, and evaluating the index's effect on query speed;
in this step, the following formula is designed for calculating and judging user query complexity:
Query Complexity Score (QCS) = TCS + w6×E + w7×Joins + w8×(1-Acc);
where: QCS is the query complexity score, measuring the overall complexity of the query; TCS is the data table complexity score described above; E is the number of tables involved in the query; Joins is the number of JOIN operations involved in the query; Acc indicates the use of aggregation functions in the query, 1-Acc representing the case where none is used, i.e., Acc = 0 means no aggregation function is used and the complexity is highest; w6, w7, w8 are weight factors used to adjust the influence of different factors on query complexity;
calculation steps: (1) calculate the TCS of each table involved in the query; (2) determine the number of tables E involved in the query; (3) count the JOIN operations Joins involved in the query; (4) evaluate the aggregation function usage Acc: if aggregation functions are used, Acc is close to 1, otherwise close to 0; (5) assign the weights w6, w7, w8 to each factor; (6) substitute these values into the formula to calculate the QCS;
a threshold is set according to actual conditions; when the QCS value exceeds the threshold, the query is judged to be a large or complex task and resource scheduling is further optimized.

7. The data multi-source joint retrieval method according to claim 6, characterized in that in step S4-3, while the user enters the SQL statement, the above multi-dimensional pre-judgment is performed in real time and the resource pre-scheduling strategy is dynamically adjusted according to the result; if the pre-judgment indicates a large or complex query task, the resource pre-scheduling mechanism is triggered immediately, reserving high-performance Pods or expanding Pods in advance to ensure the execution efficiency of the query task and the overall performance of the cluster;
in step S4-4, a corresponding prompt is given through the user interface according to the pre-judgment result; if the pre-judgment indicates a simple query task, the query is executed directly without resource pre-scheduling or expansion;
in step S4-5, the multi-dimensional pre-judgment algorithm and its thresholds are continuously optimized and adjusted according to the execution and resource usage of actual query tasks, and the pre-statistics results are regularly updated and maintained so that the pre-judgment algorithm accurately reflects the actual state of the data.

8. The data multi-source joint retrieval method according to claim 7, characterized in that step S5 comprises:
S5-1, pre-judgment-based resource reservation and adjustment: using the pre-judgment result and the elastic scaling mechanism of containerized deployment, resource scheduling is triggered intelligently; for large query tasks about to execute, high-performance Pod groups are reserved in advance for their execution; during query execution, the system continuously monitors the real-time load; using the Kubernetes API Aggregator feature, third-party services are registered with the Kubernetes API so that external services can be accessed through it, with Prometheus as the monitoring data source and prometheus-adapter registered with the Kubernetes API as an extension service; the number of Pods and their resource allocation are adjusted dynamically to keep query execution efficient;
S5-2, manually configured scheduled tasks manage Pod group size: HPA and CronHPA are combined to handle different scenarios; HPA controls scaling according to business indicators, while CronHPA, following the busy/idle pattern of the business, schedules resources automatically in advance, expanding before business peaks and releasing resources when business is idle.

9. The data multi-source joint retrieval method according to claim 8, characterized in that in step S6, data is intelligently cached in a high-speed storage area; cache strategy formulation and optimization are used, supporting multiple cache update strategies to ensure the timeliness and accuracy of cached data; in addition, the aggregated cache table is optimized to improve query performance and data access efficiency.

10. A data multi-source joint retrieval device, characterized by comprising: at least one memory and at least one processor;
the at least one memory for storing a machine-readable program;
the at least one processor for invoking the machine-readable program to execute the method according to any one of claims 1 to 9.
CN202411514829.3A 2024-10-29 2024-10-29 Data multisource joint retrieval method and device Withdrawn CN119046346A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202411514829.3A CN119046346A (en) 2024-10-29 2024-10-29 Data multisource joint retrieval method and device
CN202510345138.3A CN120277127A (en) 2024-10-29 2025-03-24 Data multisource joint retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411514829.3A CN119046346A (en) 2024-10-29 2024-10-29 Data multisource joint retrieval method and device

Publications (1)

Publication Number Publication Date
CN119046346A true CN119046346A (en) 2024-11-29

Family

ID=93587755

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202411514829.3A Withdrawn CN119046346A (en) 2024-10-29 2024-10-29 Data multisource joint retrieval method and device
CN202510345138.3A Pending CN120277127A (en) 2024-10-29 2025-03-24 Data multisource joint retrieval method and device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202510345138.3A Pending CN120277127A (en) 2024-10-29 2025-03-24 Data multisource joint retrieval method and device

Country Status (1)

Country Link
CN (2) CN119046346A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930200A (en) * 2012-09-29 2013-02-13 北京奇虎科技有限公司 Progress identifying method and device as well as terminal equipment
US20160019249A1 (en) * 2014-07-18 2016-01-21 Wipro Limited System and method for optimizing storage of multi-dimensional data in data storage
CN113886457A (en) * 2021-09-14 2022-01-04 浪潮软件科技有限公司 A method for joint retrieval of cross-domain heterogeneous data
CN114791967A (en) * 2022-05-25 2022-07-26 武汉科技大学 Time series RDF data storage and query method based on bit matrix model
CN117992228A (en) * 2024-02-04 2024-05-07 正天技术有限公司 Elastic management method and device based on cloud native architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Song Fuyuan et al., "Secure and efficient multi-user outsourced image retrieval scheme based on access control", Chinese Journal of Network and Information Security, vol. 7, no. 5, 31 October 2021 (2021-10-31), pages 29-39 *

Also Published As

Publication number Publication date
CN120277127A (en) 2025-07-08

Similar Documents

Publication Publication Date Title
WO2020211300A1 (en) Resource allocation method and apparatus, and computer device and storage medium
US8082273B2 (en) Dynamic control and regulation of critical database resources using a virtual memory table interface
US8775413B2 (en) Parallel, in-line, query capture database for real-time logging, monitoring and optimizer feedback
US8082234B2 (en) Closed-loop system management method and process capable of managing workloads in a multi-system database environment
US8762367B2 (en) Accurate and timely enforcement of system resource allocation rules
US9135299B2 (en) System, method, and computer-readable medium for automatic index creation to improve the performance of frequently executed queries in a database system
US8423534B2 (en) Actively managing resource bottlenecks in a database system
JP4815459B2 (en) Load balancing control server, load balancing control method, and computer program
US8392404B2 (en) Dynamic query and step routing between systems tuned for different objectives
WO2019184739A1 (en) Data query method, apparatus and device
CN111752965B (en) Real-time database data interaction method and system based on micro-service
US20200012602A1 (en) Cache allocation method, and apparatus
US20090327216A1 (en) Dynamic run-time optimization using automated system regulation for a parallel query optimizer
CN103678520A (en) Multi-dimensional interval query method and system based on cloud computing
US8392461B2 (en) Virtual data maintenance
CN111522870B (en) Database access method, middleware and readable storage medium
CN113568931A (en) A routing analysis system and method for data access request
CN101043389A (en) Control system of grid service container
CN102739785A (en) Method for scheduling cloud computing tasks based on network bandwidth estimation
CN112597173A (en) Distributed database cluster system peer-to-peer processing system and processing method
WO2022266975A1 (en) Method for millisecond-level accurate slicing of time series stream data
CN119537383B (en) Storage method and device based on cold and hot data separation and multi-mode database engine
Wei et al. An optimization method for elasticsearch index shard number
CN114443686A (en) Compression graph construction method and device based on relational data
CN118819819B (en) A multi-database processing method based on load balancing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20241129)