
CN119046346A - Data multisource joint retrieval method and device - Google Patents


Info

Publication number
CN119046346A
CN119046346A
Authority
CN
China
Prior art keywords
data
query
complexity
resource
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202411514829.3A
Other languages
Chinese (zh)
Inventor
吴镝
李存冰
张尧臣
陈焕新
杨建�
刘金革
吕鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Technology Co Ltd
Original Assignee
Inspur Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Technology Co Ltd filed Critical Inspur Software Technology Co Ltd
Priority to CN202411514829.3A priority Critical patent/CN119046346A/en
Publication of CN119046346A publication Critical patent/CN119046346A/en
Priority to CN202510345138.3A priority patent/CN120277127A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention relates to the field of cloud information technology, and specifically provides a data multi-source joint retrieval method and device comprising the following steps: S1, containerized deployment of the original Trino worker nodes; S2, accessing heterogeneous data through a heterogeneous data protocol conversion adapter; S3, data identification and feature analysis; S4, asynchronous real-time monitoring and analysis; S5, resource scheduling and optimization; S6, result caching and materialization acceleration. Compared with the prior art, the present invention reduces resource waste and improves resource utilization.

Description

Data multisource joint retrieval method and device
Technical Field
The invention relates to the technical field of cloud information, and particularly provides a data multi-source joint retrieval method and device.
Background
In the prior art, the amount of data involved in daily work is huge and of a wide variety. These data are not only complex and variable, but are also stored in a decentralized manner across different business systems. Because these systems often adopt different technical architectures, data formats and storage modes, the data are difficult to communicate and share directly, posing no small challenge to business development.
In actual operation, cross-department, cross-source and cross-domain information retrieval and integration is a common and important task. In particular, in the process of policy making and resource allocation, it is often necessary to combine data from multiple departments for comprehensive analysis and comparison in order to make effective decisions. However, owing to the diversity and complexity of the data sources, this process faces a number of difficulties.
In cross-source and cross-domain data retrieval schemes, Trino, a mature and powerful distributed query engine, is widely applied in industry. It can connect to various data sources, including relational databases, non-relational databases and big data platforms, provides a unified query interface, and realizes joint retrieval of cross-source, cross-domain data. Although Trino is an excellent distributed cross-source data retrieval tool, challenges remain in the face of complex and diverse data.
Such data are characterized by diversity, complexity and huge volume. Owing to resource limitations and the rigidity of its deployment mode, Trino is often inefficient when processing large-scale, cross-domain heterogeneous data queries and cannot meet real-time work requirements. In particular, in high-concurrency scenarios, excessive query tasks prolong waiting time, and the traditional deployment mode cannot scale elastically, further aggravating the execution-efficiency problem of large-scale retrieval tasks. When multiple departments initiate query requests at the same time, system resources easily reach a bottleneck, so that query responses are slow or results cannot even be returned in time.
Therefore, how to realize efficient and accurate cross-source and cross-domain information retrieval and integration in massive, complex and scattered data becomes a technical problem to be solved urgently in the industry.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a highly practical data multi-source joint retrieval method.
The invention further aims to provide a data multi-source combined retrieval device which is reasonable in design, safe and applicable.
The technical scheme adopted for solving the technical problems is as follows:
A data multisource joint retrieval method comprises the following steps:
S1, carrying out containerized deployment on original Trino working nodes;
S2, accessing heterogeneous data through a heterogeneous data protocol conversion adapter;
S3, data identification and feature analysis;
S4, asynchronous real-time monitoring and analysis;
S5, resource scheduling and optimization;
S6, result caching and materialization acceleration.
Further, in step S1, based on the Docker and Kubernetes container technologies, container groups (Pod) with different resource limits are preconfigured to adapt to the query requirements under different loads;
according to differences in computing performance, a plurality of container groups (Pod) are preconfigured, each Pod having different resource limits and requests.
Further, in step S3, the newly added data identification and analysis module (Data Insight Module) automatically collects and analyzes key indexes of the accessed data sources through a timing task, and on this basis a calculation formula for the complexity of a database table is designed, as follows:
TableComplexityScore(TCS) = D × (w1×log(S) + w2×F + w3×I + w4×R + w5×C);
Wherein the parameters are defined as follows:
S is the size of the table, measured in number of rows or data volume;
F is the number of table fields, I is the number of indexes;
R is the structural complexity of the table, e.g. the reciprocal of the normal form level (1NF, 2NF, 3NF) or another quantization index of structural complexity;
C is the number of foreign keys; w1, w2, w3, w4, w5 are weight factors used to adjust the influence of each factor on the complexity, and the weights are adjusted according to actual conditions;
D is a database type coefficient reflecting the influence of the database type on table complexity; different database types may have different optimization mechanisms and usage scenarios, and therefore affect table complexity differently;
D_RDBMS is the coefficient of a relational database, D_Columnar of a columnar database, D_In-Memory of an in-memory database, and D_Document of a document database;
the calculation steps are as follows:
(1) Determining the size S of the table and calculating log(S) to balance the influence of the data volume on the complexity;
(2) Counting the field number F, the index number I and the foreign key number C in the table, and evaluating the structural complexity R of the table;
(3) Determining the coefficient D according to the database type;
(4) Assigning the weights w1, w2, w3, w4, w5 to the factors;
(5) Substituting these values into the formula to calculate the TCS;
According to the formula, the table complexity of each accessed data source is calculated and the tables are sorted from large to small; the higher-complexity tables at the head of the list are selected and marked according to actual conditions, and the relevant information is recorded for query optimization and resource scheduling.
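The TCS calculation and head-of-list marking described above can be sketched in Python as follows; the weight values, database coefficients and function names are illustrative assumptions, not part of the patent:

```python
import math

# Illustrative database-type coefficients D (assumed values, see description).
DB_COEFF = {"rdbms": 1.0, "columnar": 0.8, "in_memory": 1.2, "document": 1.1}

def table_complexity_score(rows, fields, indexes, structure, foreign_keys,
                           db_type="rdbms",
                           weights=(0.5, 0.2, 0.1, 0.1, 0.1)):
    """TCS = D * (w1*log(S) + w2*F + w3*I + w4*R + w5*C)."""
    w1, w2, w3, w4, w5 = weights
    d = DB_COEFF[db_type]
    return d * (w1 * math.log(rows) + w2 * fields + w3 * indexes
                + w4 * structure + w5 * foreign_keys)

def mark_complex_tables(tables, top_n=2):
    """Sort tables by TCS in descending order and mark the top-N
    highest-complexity tables for query optimization and scheduling."""
    scored = sorted(tables,
                    key=lambda t: table_complexity_score(**t["stats"]),
                    reverse=True)
    return [t["name"] for t in scored[:top_n]]
```

The timing task would feed `mark_complex_tables` the per-table statistics collected in step S3 and persist the marked names for the scheduler.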
Further, in step S3, further includes:
(1) The method comprises the steps of regularly scanning and identifying table information of each database, setting a timing task, automatically counting key indexes by utilizing a script or a database management tool, and calculating the complexity of the table;
(2) Setting a threshold value, namely setting reasonable threshold values for each data source or table according to service requirements and system resource conditions, wherein the threshold values are used for judging the size and complexity of a query task;
(3) Marking and recording, marking the table which reaches or exceeds the threshold value, and storing relevant information in a special cache or database for inquiring and subsequent resource scheduling.
Further, in step S4, further comprising:
S4-1, asynchronous monitoring and real-time analysis;
s4-2, multi-dimensional pre-judging;
s4-3, calculating and pre-judging in real time;
S4-4, feedback and prompt;
S4-5, optimizing and adjusting.
Further, in step S4-1, a front-end asynchronous monitor InputMonitor is constructed, and an SQL statement input by the user client is monitored in real time in an asynchronous processing manner, and the SQL statement input by the user is subjected to preliminary analysis to extract key information;
In step S4-2, it includes:
(1) The complex large-table data feature analysis is carried out, whether a table queried by a user belongs to a table with higher complexity is checked according to a data table complexity calculation result, if the table with higher complexity is hit, query conditions and data distribution conditions are further analyzed, and data quantity and processing time related to query are evaluated;
(2) Analyzing a cross-domain query structure, analyzing SQL sentences of user queries, checking whether cross-domain queries of a plurality of data sources or tables are involved, evaluating complexity and data distribution conditions of joint operation in the queries, and influence of the cross-domain queries on system performance;
(3) Checking whether SQL sentences of user inquiry contain aggregation functions, statistics or operation operations, analyzing data quantity and calculation complexity of the operations, and whether index support is involved, and evaluating the complexity and execution time of the inquiry operations;
(4) The index matching efficiency analysis is to compare SQL sentences of user query with the index of the build table to determine whether the query content has index support, analyze the coverage condition and selectivity of the index and the matching degree of the query condition and the index, and evaluate the improving effect of the index on the query speed;
In this step, the following formula is designed for calculation and determination of the user query complexity;
QueryComplexityScore(QCS) = TCS + w6×E + w7×Joins + w8×(1-Acc);
Wherein:
QCS is a query complexity score that measures the overall complexity of a query;
TCS is the data table complexity score described above;
E is the number of tables involved in the query;
Joins is the number of JOIN operations involved in the query;
Acc reflects the use of aggregation functions in the query, so the term 1-Acc denotes the absence of aggregation functions: Acc = 0 indicates that no aggregation function is used, which contributes the highest complexity;
w6, w7, w8 are weight factors used to adjust the influence of each factor on the query complexity;
The calculation steps are as follows:
(1) Calculating a TCS for each table related to the query;
(2) Determining a number E of tables involved in the query;
(3) Calculating the number of connection operations Joins involved in the query;
(4) Evaluating the usage Acc of aggregation functions in the query: Acc approaches 1 if aggregation functions are used and approaches 0 if not;
(5) Assigning the weights w6, w7, w8 to the factors;
(6) Substituting these values into the formula to calculate the QCS;
According to the actual situation, a threshold is set; when the QCS value exceeds the threshold, the query is judged to be a large or complex query task and resource scheduling is further optimized.
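The QCS formula and its threshold judgment can likewise be sketched; the weights w6, w7, w8 and the threshold value are illustrative assumptions:

```python
def query_complexity_score(tcs_sum, num_tables, joins, acc,
                           weights=(0.3, 0.5, 0.4)):
    """QCS = TCS + w6*E + w7*Joins + w8*(1 - Acc).
    acc is in [0, 1]: close to 1 when aggregation functions are used,
    close to 0 when they are not (per the description, no aggregation
    contributes the highest complexity via the 1 - Acc term)."""
    w6, w7, w8 = weights
    return tcs_sum + w6 * num_tables + w7 * joins + w8 * (1 - acc)

def is_large_query(qcs, threshold=15.0):
    """Judge a large/complex query task once QCS exceeds the threshold."""
    return qcs >= threshold
```

A query over two tables with one JOIN and no aggregation, whose combined TCS is 12.2078, would score 12.2078 + 0.6 + 0.5 + 0.4 = 13.7078 under these assumed weights.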
Further, in step S4-3, in the process of inputting the SQL statement by the user, the multi-dimensional pre-judgment is performed in real time, and the resource pre-scheduling policy is dynamically adjusted according to the pre-judgment result;
If the prejudging result is a large or complex query task, immediately triggering a resource pre-scheduling mechanism, reserving high-performance Pod in advance or performing Pod expansion, and ensuring the execution efficiency of the query task and the overall performance of the cluster;
In step S4-4, according to the pre-judging result, a corresponding prompt is given through a user interface, and if the pre-judging result is a simple query task, the query is directly executed without resource pre-scheduling or capacity expansion;
In step S4-5, the multi-dimensional pre-judging algorithm and the threshold value are continuously optimized and adjusted according to the execution condition and the resource use condition of the actual query task, and the data pre-counting result is regularly updated and maintained, so that the pre-judging algorithm can accurately reflect the actual condition of the data.
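The pre-scheduling decision of steps S4-3 and S4-4 can be sketched as a small dispatch function; the tier names, the threshold and the returned action labels are assumptions for illustration:

```python
def preschedule(qcs, threshold=15.0):
    """Map a pre-judged query complexity score (QCS) to a Pod tier and
    an action: large/complex queries trigger reservation of a
    high-performance Pod or Pod expansion, simple queries execute
    directly without pre-scheduling (threshold value illustrative)."""
    if qcs >= threshold:
        return {"tier": "high-performance", "action": "reserve_or_scale_out"}
    return {"tier": "standard", "action": "execute_directly"}
```

The user interface prompt in step S4-4 would then be driven by the returned `action` field.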
Further, in step S5, the method includes:
s5-1, reserving and adjusting resources based on pre-judgment;
By means of the prejudging result and the elastic telescopic mechanism of containerized deployment, intelligent triggering of resource scheduling, and for a large-scale query task to be executed, reserving a high-performance Pod group in advance for executing the large-scale task;
During the execution of the query task, the system continuously monitors the real-time load, supports the registration of third-party services to the Kubernetes API, accesses external services through the Kubernetes API, uses Prometheus as the monitoring data source and registers prometheus-adapter to the Kubernetes API as an extension service, dynamically adjusting the number and resource allocation of Pods to maintain efficient execution of the query task;
S5-2, manually configuring the size of the timing task management Pod group;
HPA and CronHPA are combined to cope with different scenarios: HPA controls scaling according to service metrics, while CronHPA, following the busy/idle patterns of the service, automatically schedules resources in advance, expanding them before service peaks and releasing them when the service is idle.
Further, in step S6, the data is intelligently cached in the high-speed storage area, and a plurality of cache update policies are supported by using cache policy formulation and optimization, so that timeliness and accuracy of the cached data are ensured.
A data multi-source joint search device comprises at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to perform a data multisource joint retrieval method.
Compared with the prior art, the data multisource joint retrieval method and device have the following outstanding beneficial effects:
The invention improves the data processing efficiency, and the system can dynamically adjust the resource allocation according to the real-time data load through the query engine and the resource elastic expansion mechanism which are arranged in a containerized manner, thereby obviously improving the data processing efficiency. Particularly, in policy analysis and public management tasks with large data volume and complex inquiry, the resource waste is reduced, and the resource utilization rate is improved.
And accelerating data retrieval response, namely combining real-time SQL query analysis with intelligent resource scheduling optimization, and realizing multidimensional pre-judgment and resource pre-scheduling of query tasks. The mechanism effectively improves the performance and response speed of data retrieval, so that the office can acquire key information more quickly, and thus, various public service demands can be responded quickly.
And optimizing the execution of the query tasks, namely, by monitoring and analyzing SQL sentences in real time, the system can identify complex query tasks needing a large amount of computation resources in advance, and timely adjust the scale of the container group so as to ensure that the query tasks can be executed in an environment with sufficient resources. The real-time optimization mechanism avoids query delay caused by insufficient resources, improves working efficiency, and particularly can provide rapid and accurate data support for decision making in time-sensitive decision making and decision support.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a data multisource joint retrieval method;
FIG. 2 is a diagram of a Trino architecture optimized in a data multisource joint search method;
FIG. 3 is a diagram illustrating real-time SQL parsing in a data multisource joint search method;
FIG. 4 is a schematic diagram of the scheduling and optimization of containerized resources in a data multisource joint retrieval method;
FIG. 5 is a diagram of a conventional Trino architecture in a data multisource joint search method.
Detailed Description
In order to provide a better understanding of the aspects of the present invention, the present invention will be described in further detail with reference to specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A preferred embodiment is given below:
As shown in FIG. 5, clients (Client) include clients and drivers implemented via JDBC, ODBC or other languages;
A service discovery node (Discover-node) is a service that binds the parsing nodes and the working nodes together. After a working node starts, it registers with the service discovery node, and the parsing node obtains the normally working nodes from the service discovery node.
The analysis and scheduling node (Analysis and Scheduling node) mainly receives queries submitted by the client, parses the query statements, performs lexical analysis to generate a query execution plan, generates query stages and tasks for scheduling, merges the results, and returns them to the client (Client).
The working node (Worker) is mainly responsible for read-write interaction with the data and for executing the query plan.
The heterogeneous data protocol conversion adapter (Data-Connector) connects Trino with the data sources (e.g. Hive, RDBMS). The adapter is designed with full reference to JDBC, using standard API interfaces to interact with different data sources. Each data source has a specific adapter: in the configuration file of each data source, the adapter is specified through the connector.name attribute, with data sources and adapters in one-to-one correspondence. The redesign fully considers future expansion scenarios, realizing unified access and management of heterogeneous data sources.
The data acceleration cache library (Accelerate-Cache) caches commonly used cross-domain collaborative computation result sets, realizing quick access to hot-spot data.
The conventional Trino architecture shows that the overall system architecture is rigid: the number of working nodes is fixed, resources cannot be flexibly allocated according to actual conditions when query tasks are executed, and query performance faces great challenges under high concurrency or large-scale tasks.
As shown in fig. 2, in this embodiment, a real-time SQL query analysis, data recognition and analysis module is added, and meanwhile, the working nodes are subjected to containerization deployment, so as to optimize upgrade analysis and scheduling, thereby improving the overall performance of the cluster and enhancing the ability of prejudging and resource scheduling for large-scale SQL and concurrent tasks.
As shown in fig. 1, the specific implementation procedure is as follows:
s1, carrying out containerized deployment on original Trino working nodes;
The original Trino working nodes are subjected to containerized deployment through a cloud native technology, and portability, expandability and high availability of services are ensured on the basis of container technologies such as Docker, kubernetes and the like. By pre-configuring container groups (Pod) with different resource limits, flexible management and dynamic expansion of computing resources are realized so as to adapt to query requirements under different loads. Meanwhile, the containerized deployment mode also greatly improves the stability and reliability of the whole system.
Depending on the computing performance, multiple container groups (Pod) are preconfigured, each Pod having different resource limitations (e.g., CPU, memory) and requests (requests). For example, some less resource-rich Pod is created for performing a small task, while some high-performance, resource-rich Pod is created for performing a complex query.
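The pre-configured Pod groups might be represented as resource profiles like the following sketch; the tier names and CPU/memory values are illustrative assumptions, mirroring the requests/limits stanza a Kubernetes Pod template carries under spec.containers[].resources:

```python
# Illustrative resource profiles for the pre-configured Trino worker
# Pod groups of step S1 (tier names and values are assumptions).
POD_PROFILES = {
    "small": {"requests": {"cpu": "1",  "memory": "2Gi"},
              "limits":   {"cpu": "2",  "memory": "4Gi"}},
    "large": {"requests": {"cpu": "8",  "memory": "32Gi"},
              "limits":   {"cpu": "16", "memory": "64Gi"}},
}

def pod_spec(tier):
    """Return the container resources stanza for the chosen Pod tier,
    in the shape Kubernetes expects for requests and limits."""
    return {"resources": POD_PROFILES[tier]}
```

Small, resource-lean profiles would serve simple tasks, while the resource-rich profile is reserved for complex queries, as the step describes.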
S2, accessing heterogeneous data through a heterogeneous data protocol conversion adapter;
and heterogeneous data (such as a relational database, a non-relational database, a big data platform and the like) is accessed through the heterogeneous data protocol conversion adapter, so that a unified data access interface is provided.
S3, data identification and feature analysis;
The newly added data identification and analysis module (Data Insight Module) is designed to automatically collect and analyze key indexes of the accessed data sources through timing tasks, such as the database type, table structure, table data volume, index conditions, data characteristics and foreign key conditions. On this basis, a calculation formula for the complexity of a database table is designed, as follows:
TableComplexityScore(TCS) = D × (w1×log(S) + w2×F + w3×I + w4×R + w5×C);
Wherein the parameters are defined as follows:
s is the size of the table, measured in number of rows or data volume (e.g., MB);
F is the number of table fields, I is the number of indexes;
R is the structural complexity of the table, which can be the reciprocal of the normal form level (1NF, 2NF, 3NF) or another quantization index of structural complexity;
C is the number of foreign keys; w1, w2, w3, w4, w5 are weight factors used to adjust the influence of each factor on the complexity, and can be adjusted according to actual conditions;
D is a database type coefficient that reflects the impact of the database type on table complexity. Different database types may have different optimization mechanisms and usage scenarios, and thus affect table complexity differently.
D_RDBMS is the coefficient of a relational database, e.g. 1.0; D_Columnar of a columnar database, e.g. 0.8; D_In-Memory of an in-memory database, e.g. 1.2; D_Document of a document database, e.g. 1.1.
The calculation steps are as follows:
(1) The size S of the table is determined and its log (S) is calculated to balance the effect of the amount of data on complexity.
(2) The number of fields F, the number of indexes I, the number of foreign keys C in the table are calculated, and the structural complexity R of the table is evaluated.
(3) The coefficient D is determined according to the database type.
(4) The weights w1, w2, w3, w4, w5 are assigned to the factors.
(5) Substituting these values into the formula calculates TCS.
Assume that a data table is characterized as follows:
database type: relational database (D_RDBMS = 1.0);
1,000,000 rows of data, 20 fields, 5 indexes, structural complexity 3 (3NF quantized as 3), 5 foreign keys, and weight factors w1 = 0.5, w2 = 0.2, w3 = 0.1, w4 = 0.1, w5 = 0.1;
The TCS is calculated as follows:
TCS = 1.0 × (0.5×log(1000000) + 0.2×20 + 0.1×5 + 0.1×3 + 0.1×5);
TCS = 1.0 × (0.5×13.8155 + 4 + 0.5 + 0.3 + 0.5);
TCS = 1.0 × (6.9078 + 4 + 0.5 + 0.3 + 0.5);
TCS = 1.0 × 12.2078;
TCS = 12.2078;
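The substitution can be re-checked directly in a few lines; the natural logarithm is assumed, since the example uses log(1000000) ≈ 13.8155:

```python
import math

# Re-check of the worked TCS example: D = 1.0, w = (0.5, 0.2, 0.1, 0.1, 0.1),
# S = 1,000,000 rows, F = 20, I = 5, R = 3, C = 5 (natural log assumed).
terms = (0.5 * math.log(1_000_000),  # ≈ 6.9078
         0.2 * 20,                   # 4.0
         0.1 * 5,                    # 0.5
         0.1 * 3,                    # 0.3
         0.1 * 5)                    # 0.5
tcs = 1.0 * sum(terms)
print(round(tcs, 4))  # 12.2078
```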
According to the formula, the table complexity of each access data source is calculated, the access data sources are ordered from large to small, a table with higher head complexity is selected for marking according to actual conditions, and relevant information is recorded and used for query optimization and resource scheduling.
And (3) periodically scanning and identifying the table information of each database, namely setting a timing task (such as a low peak period in the early morning every day), automatically counting key indexes such as each data source, the data volume of each table, the table size and the like by utilizing a script or a database management tool, and calculating the complexity of the table. This helps the system to understand the data distribution and scale, providing basis for subsequent query analysis and resource scheduling.
And setting a reasonable threshold (such as data quantity, line number and the like) for each data source or table according to service requirements and system resource conditions. These thresholds will be used to determine the size and complexity of the query task.
Marking and recording, namely marking a table which reaches or exceeds a threshold value, and storing related information in a special cache or database so as to quickly inquire and schedule subsequent resources.
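The scan-threshold-mark procedure above can be sketched as follows; the row-count threshold and the field names are assumptions standing in for the statistics a real database management tool would collect:

```python
def scan_and_mark(tables, row_threshold=1_000_000):
    """Periodic scan of step S3: walk the collected per-table statistics
    and mark every table whose row count reaches or exceeds the
    threshold, recording it for query analysis and resource scheduling
    (threshold value illustrative)."""
    marked = {}
    for t in tables:
        if t["rows"] >= row_threshold:
            marked[t["name"]] = {"rows": t["rows"], "flag": "large"}
    return marked
```

The returned mapping would be stored in the special cache or database the step mentions, so the query analyzer can look tables up quickly.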
S4, asynchronous real-time monitoring and analysis;
By containerizing the traditional Trino deployment, the Trino framework gains resource elasticity: resources are dynamically expanded at traffic peaks and released when the resource load is low, realizing on-demand use of resources. However, the Kubernetes HPA only supports CPU and memory monitoring data as elasticity indicators and lacks the capability of predictive expansion for large-scale query tasks. For example, when executing a TB-level cross-source data query, the system needs to occupy a large amount of resources; although excessive CPU and memory occupancy triggers the elastic capacity expansion of K8s and enhances cluster orchestration capacity, the newly spawned containers do not participate in the computation of the task already in progress, so the efficiency of the complex query task still cannot be improved.
On this basis, the execution efficiency of such large tasks is ensured by expanding containers in advance or reserving resources. However, analyzing the SQL in the user input frame only at the submission stage also wastes time, so asynchronous real-time monitoring of the SQL input frame is adopted, and the complexity and resource requirements of the user query are prejudged according to the data characteristics and index information provided by the data identification and analysis module. This works cooperatively with analysis and scheduling to ensure that the query task is executed in an environment with sufficient resources, thereby improving query efficiency and performance.
The method comprises the following steps:
S4-1, asynchronous monitoring and real-time analysis;
A front-end asynchronous listener, InputMonitor, is built to monitor the SQL statements entered by the user client in real time; asynchronous processing avoids blocking user operations. The entered SQL is given a preliminary parse to extract key information such as table names, column names, query conditions, grouping clauses (such as GROUP BY), and statistical or arithmetic operations.
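The InputMonitor idea can be sketched as an asynchronous consumer of partial SQL input; the queue-based event source and the regex-based extraction are simplifying assumptions, not the patent's actual parser:

```python
import asyncio
import re

# Hypothetical sketch of InputMonitor: input events arrive on an asyncio queue,
# and a lightweight regex pass extracts table names, GROUP BY and aggregates.
SQL_AGGREGATES = re.compile(r"\b(COUNT|SUM|AVG|MAX|MIN)\s*\(", re.IGNORECASE)

def extract_key_info(sql: str) -> dict:
    """Preliminary parse: pull table names, aggregation use and a GROUP BY flag."""
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {
        "tables": tables,
        "has_group_by": bool(re.search(r"\bGROUP\s+BY\b", sql, re.IGNORECASE)),
        "aggregates": [m.upper() for m in SQL_AGGREGATES.findall(sql)],
    }

async def input_monitor(queue: asyncio.Queue):
    """Consume partial SQL input asynchronously so the user's typing is never blocked."""
    while True:
        sql = await queue.get()
        if sql is None:  # sentinel: client closed the input box
            break
        print("analysis:", extract_key_info(sql))

async def main():
    q = asyncio.Queue()
    await q.put("SELECT region, COUNT(*) FROM orders o JOIN users u ON o.uid = u.id GROUP BY region")
    await q.put(None)
    await input_monitor(q)

asyncio.run(main())
```

A production listener would subscribe to keystroke or debounce events from the client rather than a local queue; the asynchronous consumption pattern is the same.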
S4-2, multi-dimensional pre-judging;
(1) Complex large-table data feature analysis: according to the table complexity results, check whether the table queried by the user is of high complexity; if a high-complexity table is hit, further analyze the query conditions and data distribution, and evaluate the data volume and processing time the query may involve.
(2) Cross-domain query structure analysis: analyze the SQL statement of the user's query and check whether it involves a cross-domain query over multiple data sources or tables. Evaluate the complexity and data distribution of JOIN operations in the query, and the impact of cross-domain queries on system performance.
(3) Query operation complexity analysis: check whether the SQL statement contains a grouping clause (GROUP BY) or statistical and aggregation operations (such as SUM, COUNT, AVG). Analyze the data volume and computational complexity of these operations, whether index support is involved, and evaluate the complexity and execution time of the query operations.
(4) Index matching efficiency analysis: compare the SQL statement with the table's indexes to determine whether the queried content has index support. Analyze index coverage, selectivity, and the degree of match between the query conditions and the indexes, and evaluate the index's effect on query speed.
In this step, the following formula is designed for calculating and judging user query complexity:
Query Complexity Score (QCS) = TCS + w6×E + w7×Joins + w8×(1-Acc);
Wherein:
QCS is a query complexity score that measures the overall complexity of a query;
TCS is the data table complexity score described above;
E is the number of tables involved in the query;
Joins is the number of JOIN (JOIN) operations involved in the query;
Acc is the use of aggregation functions (such as COUNT, AVG, SUM, MAX, MIN) in the query; 1-Acc represents the case where none is used, i.e., Acc = 0 means no aggregation function is used and the complexity is highest.
w6, w7, w8 are weight factors used to adjust the degree of influence of different factors on query complexity.
The calculation steps are as follows:
(1) The TCS is calculated for each table involved in the query.
(2) The number E of tables involved in the query is determined.
(3) The number of join operations Joins involved in the query is calculated.
(4) Evaluate the use of aggregation functions in the query, Acc: if aggregation functions are used, Acc is close to 1; if not, Acc is close to 0.
(5) Assign the weights w6, w7, w8 to each factor.
(6) These values are substituted into the formula to calculate QCS.
Assume a query has the following characteristics:
Number of tables involved: 3
Number of JOIN operations involved: 2
No aggregation function used (Acc = 0)
Data table complexity score TCS: 12.7078 (from the earlier calculation)
Weight factors: w6 = 0.3, w7 = 0.2, w8 = 0.5
The QCS is calculated as follows:
QCS = 12.7078 + 0.3×3 + 0.2×2 + 0.5×(1-0)
QCS = 12.7078 + 0.9 + 0.4 + 0.5
QCS = 14.5078
A threshold is set according to actual conditions; when the QCS value exceeds the threshold, the query is judged to be a large or complex task and resource scheduling is further optimized.
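The QCS formula and the worked example above can be sketched as follows; the threshold value is an assumption to be tuned per deployment:

```python
# Sketch of the QCS formula; TCS and the weights come from the worked example
# in the text (TCS = 12.7078, w6 = 0.3, w7 = 0.2, w8 = 0.5).
def query_complexity_score(tcs, num_tables, num_joins, acc, w6=0.3, w7=0.2, w8=0.5):
    """QCS = TCS + w6*E + w7*Joins + w8*(1 - Acc)."""
    return tcs + w6 * num_tables + w7 * num_joins + w8 * (1 - acc)

qcs = query_complexity_score(tcs=12.7078, num_tables=3, num_joins=2, acc=0)
print(round(qcs, 4))  # → 14.5078

QCS_THRESHOLD = 14.0  # assumed threshold; tuned per deployment
is_large_query = qcs > QCS_THRESHOLD  # True: triggers resource pre-scheduling
```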
S4-3, calculating and pre-judging in real time;
As shown in fig. 3, in the process of inputting the SQL statement by the user, the multi-dimensional pre-judgment is performed in real time, and the resource pre-scheduling policy is dynamically adjusted according to the pre-judgment result.
If the pre-judgment indicates a large or complex query task, the resource pre-scheduling mechanism is triggered immediately, reserving high-performance Pods or expanding Pods in advance to ensure the execution efficiency of the query task and the overall performance of the cluster. As shown in FIG. 3, when a query spans Oracle, PostgreSQL and MySQL data sources, real-time analysis is performed while the user enters the SQL and a conclusion is drawn.
S4-4, feedback and prompt;
According to the pre-judgment result, a corresponding prompt is given through the user interface, for example "this query may involve a large amount of data and is expected to take longer; please wait", or "high-performance resources have been reserved for you; please wait".
If the pre-judgment indicates a simple query task, the query is executed directly without resource pre-scheduling or expansion.
S4-5, optimization and adjustment: the multi-dimensional pre-judgment algorithm and its thresholds are continuously optimized and adjusted according to the execution and resource usage of actual query tasks, to improve pre-judgment accuracy and efficiency. The pre-statistics results are regularly updated and maintained so that the pre-judgment algorithm accurately reflects the actual state of the data.
S5, resource scheduling and optimization;
The system receives SQL queries submitted by clients, performs lexical analysis to generate query execution plans, and dynamically adjusts resource allocation according to real-time performance indicators. Multi-dimensional pre-judgment comprehensively considers query complexity, data size, index matching efficiency and service-specific monitoring indicators, such as message-queue backlog and request-processing wait time. The elastic capability of the container orchestration platform automatically expands the container groups during business peaks to ensure efficient execution of query tasks, and reclaims resources during troughs to save cost. In addition, this step monitors query execution time, resource usage and system health, adjusts the resource scheduling strategy according to the monitoring data to optimize query performance, and finally merges the results of the compute engines and returns them to the client, achieving high performance, high availability and cost optimization.
The method comprises the following steps:
(1) Reserving and adjusting resources based on pre-judgment;
Using the pre-judgment result and the elastic scaling mechanism of containerized deployment, the system triggers resource scheduling intelligently. For a large query task about to execute, the system reserves a high-performance Pod group in advance for that task to ensure it proceeds smoothly.
During query execution, the system continuously monitors real-time load (such as CPU, memory, task-queue wait, and message-queue backlog). The K8S HPA supports CPU and memory monitoring data as elasticity indicators, but its support for custom elasticity indicators (task-queue wait, message-queue backlog) is poor. Therefore, the Kubernetes API Aggregator feature is used: third-party services are registered with the Kubernetes API, allowing external services to be accessed directly through it. Using external services as data sources provides a richer range of metrics and flexible extensibility, better suited to complex and frequent data retrieval scenarios.
As shown in fig. 4, Prometheus is used as the monitoring data source, and prometheus-adapter is registered with the Kubernetes API as an extension service. Dynamically adjusting the number of Pods and their resource allocation helps keep query tasks executing efficiently and avoids wasting resources. The specific flow is as follows:
1) Node and Pod monitoring start point
In a Kubernetes cluster, Nodes and Pods (container groups) are the starting points of resource monitoring. Multiple Pods run on these nodes, each an instance of an application.
2) Container resource monitoring tools collect data
Container resource monitoring tools (such as cAdvisor and kubelet) collect metrics on Nodes and Pods, including CPU, memory, and possibly custom metrics. These data are the basis for subsequent decisions.
3) Metrics server stores data
The collected metrics are sent to the metrics server (metrics-server), which stores and serves them so that other components can access and use them.
4) Prometheus monitoring component collects more data
The Prometheus monitoring component further collects custom metrics beyond the Node and Pod measurements, such as task-queue wait and message-queue backlog, via probes (such as a Prometheus agent), extending the overall monitoring range, and stores them in the Prometheus database.
5) Monitoring component adapter converts the data format
The monitoring component adapter (such as prometheus-adapter) converts the metrics in Prometheus into a format Kubernetes understands, so that Kubernetes can use these data for decisions such as auto-scaling.
6) Metric aggregator consolidates data
A metrics aggregator (metrics-aggregator) can be used to integrate metrics from multiple monitoring components (such as Prometheus and other monitoring tools), providing Kubernetes with a unified view.
7) Horizontal Pod Autoscaler decision
The Horizontal Pod Autoscaler (HPA) makes decisions based on metrics obtained from the metrics aggregator or other sources (such as CPU utilization, memory usage, and custom metrics), triggering Pod scaling operations.
8) Deployments and ReplicaSets adjust the Pod count
According to the HPA's decision, Kubernetes resources such as Deployments and ReplicaSets adjust the number of Pods accordingly to maintain the stability and performance of the system.
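The HPA decision in step 7) can be sketched with Kubernetes' documented scaling formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); the custom metric used here (message-queue backlog per Pod) and the replica bounds are assumed examples:

```python
import math

# Sketch of the HPA scaling decision using the documented Kubernetes formula:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric),
# clamped to the configured min/max replica bounds.
def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 3 Pods with a backlog of 150 messages per Pod against a target of 100 → 5 Pods.
print(desired_replicas(3, 150, 100))  # → 5
```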
(2) Manually configuring the size of a timing task management Pod group;
To further save resources, HPA and CronHPA are combined to handle different scenarios. In daily operation, HPA controls scaling according to business indicators, while CronHPA, following the busy/idle pattern of the business, schedules resources automatically in advance, expanding before business peaks and releasing resources when business is idle, which greatly shortens application cold-start time.
Combining manual configuration with automatic adjustment makes resource scheduling more flexible and efficient, helping meet business needs while saving resource costs to the greatest extent.
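The HPA + CronHPA combination can be sketched as a replica floor that is raised ahead of known busy windows, while the metric-driven value still wins above that floor; the window times and floor values are assumed examples:

```python
from datetime import time as dtime

# Sketch of HPA + CronHPA: CronHPA raises the replica floor before known busy
# windows; the HPA's metric-driven value applies whenever it exceeds the floor.
BUSY_WINDOWS = [(dtime(8, 30), dtime(11, 30)), (dtime(13, 30), dtime(18, 0))]  # assumed

def replica_floor(now, busy_floor=6, idle_floor=2):
    """Scheduled (CronHPA-style) minimum replica count for the current time."""
    for start, end in BUSY_WINDOWS:
        if start <= now <= end:
            return busy_floor
    return idle_floor

def effective_replicas(metric_driven, now):
    """The schedule pre-warms capacity; the metric value still wins above the floor."""
    return max(metric_driven, replica_floor(now))

print(effective_replicas(3, dtime(9, 0)))   # busy window: floor of 6 applies
print(effective_replicas(3, dtime(22, 0)))  # idle: metric value of 3 stands
```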
S6, caching results and accelerating materialization;
Containerization and elastic scaling of Trino greatly improve users' real-time query and analysis efficiency, but for frequent queries over GB-scale data, overall performance is still limited by the I/O bottleneck of each data source. To overcome this, data access patterns are analyzed in depth through dynamic projection, accurately identifying frequently accessed hot data and the results of large-scale analytical queries. These data are intelligently cached in a high-speed storage area, such as an in-memory database or SSD. These high-speed storage areas offer very high read/write speed and capacity, ensuring data can be served quickly when needed. Caching hot data and query results there avoids recomputation and retrieval from the original data source on every query, greatly reducing query and analysis latency. The method also supports cache strategy formulation and optimization, including multiple cache update strategies such as periodic refresh and incremental update, to keep cached data timely and accurate. In addition, the aggregated cache table can be optimized to further improve query performance and data access efficiency.
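A minimal sketch of the hot-result caching idea, assuming a simple time-to-live refresh policy standing in for the configurable cache-update strategies (periodic refresh, incremental update) described above:

```python
import time

# Minimal sketch of S6 result caching: hot query results are kept in an
# in-memory store; the TTL is an assumed stand-in for the configurable
# cache-update strategies (periodic refresh, incremental update).
class ResultCache:
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._store = {}  # sql -> (result, stored_at)

    def get(self, sql):
        entry = self._store.get(sql)
        if entry is None:
            return None
        result, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:  # stale: force recompute
            del self._store[sql]
            return None
        return result

    def put(self, sql, result):
        self._store[sql] = (result, time.monotonic())

cache = ResultCache(ttl_seconds=60.0)
cache.put("SELECT COUNT(*) FROM orders", 2_500_000)
print(cache.get("SELECT COUNT(*) FROM orders"))  # cache hit → 2500000
```

A production cache would also bound memory, track hit rates to decide what stays hot, and invalidate incrementally on source-data changes.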
Based on the above method, the data multi-source joint retrieval device in this embodiment comprises at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to perform a data multisource joint retrieval method.
The specific embodiments described above are merely examples of the present invention; the scope of the present invention is not limited to them, and any suitable change or substitution made by those skilled in the art that is consistent with the technical solutions described in the above embodiments shall fall within the scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A data multi-source joint retrieval method, characterized by comprising the following steps: S1, containerized deployment of the original Trino worker nodes; S2, accessing heterogeneous data through a heterogeneous data protocol conversion adapter; S3, data identification and feature analysis; S4, asynchronous real-time monitoring and analysis; S5, resource scheduling and optimization; S6, result caching and materialization acceleration.

2. The data multi-source joint retrieval method according to claim 1, characterized in that in step S1, based on Docker and Kubernetes container technology, container groups (Pods) with different resource limits are pre-configured to adapt to query requirements under different loads; multiple Pods are pre-configured according to different computing performance, each Pod having different resource limits and requests.

3. The data multi-source joint retrieval method according to claim 2, characterized in that in step S3, a data identification and analysis module (Data Insight Module) is added, which automatically collects and analyzes key indicators of the connected data sources through scheduled tasks; on this basis, a database table complexity formula is designed as follows:
TableComplexityScore (TCS) = D × (w1×log(S) + w2×F + w3×I + w4×R + w5×C);
where the parameters are defined as follows: S is the size of the table, measured in rows or data volume; F is the number of table fields; I is the number of indexes; R is the table structure complexity, the reciprocal of the normal form level (1NF, 2NF, 3NF) or a quantitative indicator of structural complexity; C is the number of foreign keys; w1, w2, w3, w4, w5 are weight factors used to adjust the influence of different factors on complexity, adjusted according to actual conditions; D is the database type coefficient, reflecting the impact of the database type on table complexity, since different database types may have different optimization mechanisms and usage scenarios: D_RDBMS is the coefficient for relational databases, D_Columnar for columnar databases, D_In-Memory for in-memory databases, D_Document for document databases;
the calculation steps are as follows: (1) determine the table size S and compute its logarithm log(S) to balance the impact of data volume on complexity; (2) count the number of fields F, indexes I and foreign keys C in the table, and evaluate the table structure complexity R; (3) determine the coefficient D according to the database type; (4) assign the weights w1, w2, w3, w4, w5 to each factor; (5) substitute these values into the formula to calculate the TCS;
according to the above formula, the table complexity of each connected data source is calculated and sorted in descending order; according to actual conditions, the tables with the highest complexity are selected, marked and annotated, and the related information is recorded for query optimization and resource scheduling.

4. The data multi-source joint retrieval method according to claim 3, characterized in that step S3 further comprises: (1) periodically scanning and identifying database table information: scheduled tasks are set, and scripts or database management tools automatically count key indicators and compute table complexity; (2) threshold setting: a reasonable threshold is set for each data source or table according to business requirements and system resources; these thresholds are used to judge the size and complexity of query tasks; (3) marking and recording: tables that reach or exceed a threshold are marked, and the related information is stored in a dedicated cache or database for query and subsequent resource scheduling.

5. The data multi-source joint retrieval method according to claim 4, characterized in that step S4 further comprises: S4-1, asynchronous monitoring and real-time analysis; S4-2, multi-dimensional pre-judgment; S4-3, real-time computation and pre-judgment; S4-4, feedback and prompting; S4-5, optimization and adjustment.

6. The data multi-source joint retrieval method according to claim 5, characterized in that in step S4-1, a front-end asynchronous listener (InputMonitor) is built, which uses asynchronous processing to monitor in real time the SQL statements entered by the user client, performs a preliminary parse of the entered SQL and extracts key information;
step S4-2 comprises: (1) complex large-table data feature analysis: based on the table complexity results, checking whether the table queried by the user is of high complexity; if a high-complexity table is hit, further analyzing the query conditions and data distribution, and evaluating the data volume and processing time involved in the query; (2) cross-domain query structure analysis: analyzing the SQL statement of the user's query, checking whether it involves a cross-domain query over multiple data sources or tables, and evaluating the complexity and data distribution of JOIN operations in the query and the impact of cross-domain queries on system performance; (3) checking whether the SQL statement contains aggregation functions, statistics or computation operations, analyzing their data volume and computational complexity and whether index support is involved, and evaluating the complexity and execution time of the query operations; (4) index matching efficiency analysis: comparing the SQL statement with the table indexes to determine whether the queried content has index support, analyzing index coverage, selectivity and the degree of match between the query conditions and the indexes, and evaluating the index's effect on query speed;
in this step, the following formula is designed for calculating and judging user query complexity:
Query Complexity Score (QCS) = TCS + w6×E + w7×Joins + w8×(1-Acc);
where: QCS is the query complexity score, measuring the overall complexity of the query; TCS is the data table complexity score described above; E is the number of tables involved in the query; Joins is the number of JOIN operations involved in the query; Acc indicates the use of aggregation functions in the query, 1-Acc representing the case where none is used, i.e., Acc = 0 means no aggregation function is used and the complexity is highest; w6, w7, w8 are weight factors used to adjust the influence of different factors on query complexity;
calculation steps: (1) calculate the TCS of each table involved in the query; (2) determine the number of tables E involved in the query; (3) count the JOIN operations Joins involved in the query; (4) evaluate the aggregation function usage Acc: if aggregation functions are used, Acc is close to 1, otherwise close to 0; (5) assign the weights w6, w7, w8 to each factor; (6) substitute these values into the formula to calculate the QCS;
a threshold is set according to actual conditions; when the QCS value exceeds the threshold, the query is judged to be a large or complex task and resource scheduling is further optimized.

7. The data multi-source joint retrieval method according to claim 6, characterized in that in step S4-3, while the user enters the SQL statement, the above multi-dimensional pre-judgment is performed in real time and the resource pre-scheduling strategy is dynamically adjusted according to the result; if the pre-judgment indicates a large or complex query task, the resource pre-scheduling mechanism is triggered immediately, reserving high-performance Pods or expanding Pods in advance to ensure the execution efficiency of the query task and the overall performance of the cluster;
in step S4-4, a corresponding prompt is given through the user interface according to the pre-judgment result; if the pre-judgment indicates a simple query task, the query is executed directly without resource pre-scheduling or expansion;
in step S4-5, the multi-dimensional pre-judgment algorithm and its thresholds are continuously optimized and adjusted according to the execution and resource usage of actual query tasks, and the pre-statistics results are regularly updated and maintained so that the pre-judgment algorithm accurately reflects the actual state of the data.

8. The data multi-source joint retrieval method according to claim 7, characterized in that step S5 comprises:
S5-1, pre-judgment-based resource reservation and adjustment: using the pre-judgment result and the elastic scaling mechanism of containerized deployment, resource scheduling is triggered intelligently; for large query tasks about to execute, high-performance Pod groups are reserved in advance for their execution; during query execution, the system continuously monitors the real-time load; using the Kubernetes API Aggregator feature, third-party services are registered with the Kubernetes API so that external services can be accessed through it, with Prometheus as the monitoring data source and prometheus-adapter registered with the Kubernetes API as an extension service; the number of Pods and their resource allocation are adjusted dynamically to keep query execution efficient;
S5-2, manually configured scheduled tasks manage Pod group size: HPA and CronHPA are combined to handle different scenarios; HPA controls scaling according to business indicators, while CronHPA, following the busy/idle pattern of the business, schedules resources automatically in advance, expanding before business peaks and releasing resources when business is idle.

9. The data multi-source joint retrieval method according to claim 8, characterized in that in step S6, data is intelligently cached in a high-speed storage area; cache strategy formulation and optimization are used, supporting multiple cache update strategies to ensure the timeliness and accuracy of cached data; in addition, the aggregated cache table is optimized to improve query performance and data access efficiency.

10. A data multi-source joint retrieval device, characterized by comprising: at least one memory and at least one processor;
the at least one memory for storing a machine-readable program;
the at least one processor for invoking the machine-readable program to execute the method according to any one of claims 1 to 9.
CN202411514829.3A 2024-10-29 2024-10-29 Data multisource joint retrieval method and device Withdrawn CN119046346A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202411514829.3A CN119046346A (en) 2024-10-29 2024-10-29 Data multisource joint retrieval method and device
CN202510345138.3A CN120277127A (en) 2024-10-29 2025-03-24 Data multisource joint retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411514829.3A CN119046346A (en) 2024-10-29 2024-10-29 Data multisource joint retrieval method and device

Publications (1)

Publication Number Publication Date
CN119046346A true CN119046346A (en) 2024-11-29

Family

ID=93587755

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202411514829.3A Withdrawn CN119046346A (en) 2024-10-29 2024-10-29 Data multisource joint retrieval method and device
CN202510345138.3A Pending CN120277127A (en) 2024-10-29 2025-03-24 Data multisource joint retrieval method and device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202510345138.3A Pending CN120277127A (en) 2024-10-29 2025-03-24 Data multisource joint retrieval method and device

Country Status (1)

Country Link
CN (2) CN119046346A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930200A (en) * 2012-09-29 2013-02-13 北京奇虎科技有限公司 Progress identifying method and device as well as terminal equipment
US20160019249A1 (en) * 2014-07-18 2016-01-21 Wipro Limited System and method for optimizing storage of multi-dimensional data in data storage
CN113886457A (en) * 2021-09-14 2022-01-04 浪潮软件科技有限公司 A method for joint retrieval of cross-domain heterogeneous data
CN114791967A (en) * 2022-05-25 2022-07-26 武汉科技大学 Time series RDF data storage and query method based on bit matrix model
CN117992228A (en) * 2024-02-04 2024-05-07 正天技术有限公司 Elastic management method and device based on cloud native architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Song Fuyuan et al., "Secure and efficient multi-user outsourced image retrieval scheme based on access control", Chinese Journal of Network and Information Security, vol. 7, no. 5, 31 October 2021 (2021-10-31), pages 29-39 *

Also Published As

Publication number Publication date
CN120277127A (en) 2025-07-08

Similar Documents

Publication Publication Date Title
WO2020211300A1 (en) Resource allocation method and apparatus, and computer device and storage medium
US8082273B2 (en) Dynamic control and regulation of critical database resources using a virtual memory table interface
US8775413B2 (en) Parallel, in-line, query capture database for real-time logging, monitoring and optimizer feedback
US8082234B2 (en) Closed-loop system management method and process capable of managing workloads in a multi-system database environment
US8762367B2 (en) Accurate and timely enforcement of system resource allocation rules
US9135299B2 (en) System, method, and computer-readable medium for automatic index creation to improve the performance of frequently executed queries in a database system
US8423534B2 (en) Actively managing resource bottlenecks in a database system
JP4815459B2 (en) Load balancing control server, load balancing control method, and computer program
US8392404B2 (en) Dynamic query and step routing between systems tuned for different objectives
WO2019184739A1 (en) Data query method, apparatus and device
CN111752965B (en) Real-time database data interaction method and system based on micro-service
US20200012602A1 (en) Cache allocation method, and apparatus
US20090327216A1 (en) Dynamic run-time optimization using automated system regulation for a parallel query optimizer
CN103678520A (en) Multi-dimensional interval query method and system based on cloud computing
US8392461B2 (en) Virtual data maintenance
CN111522870B (en) Database access method, middleware and readable storage medium
CN113568931A (en) A routing analysis system and method for data access request
CN101043389A (en) Control system of grid service container
CN102739785A (en) Method for scheduling cloud computing tasks based on network bandwidth estimation
CN112597173A (en) Distributed database cluster system peer-to-peer processing system and processing method
WO2022266975A1 (en) Method for millisecond-level accurate slicing of time series stream data
CN119537383B (en) Storage method and device based on cold and hot data separation and multi-mode database engine
Wei et al. An optimization method for elasticsearch index shard number
CN114443686A (en) Compression graph construction method and device based on relational data
CN118819819B (en) A multi-database processing method based on load balancing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20241129)