
CN114116223B - A request response method, device, system and readable storage medium - Google Patents

A request response method, device, system and readable storage medium

Info

Publication number
CN114116223B
Authority
CN
China
Prior art keywords
data
engine
node
rolap
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111440837.4A
Other languages
Chinese (zh)
Other versions
CN114116223A (en)
Inventor
常雅敏
梅凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Du Xiaoman Technology Beijing Co Ltd
Original Assignee
Du Xiaoman Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Du Xiaoman Technology Beijing Co Ltd filed Critical Du Xiaoman Technology Beijing Co Ltd
Priority to CN202111440837.4A
Publication of CN114116223A
Application granted
Publication of CN114116223B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a request response method. In the method, a big data fusion engine receives a request and performs the corresponding computation: data storage information is acquired through a first connector on the master control node, and data is then acquired from the data source end through a second connector on the working nodes. Storage resources are thereby effectively isolated from computing resources, the computing nodes can scale out statelessly, and the capacity bottleneck of online analysis scenarios is resolved. The engine master control node only needs to carry the request-management tasks of parsing and stage division, while each engine working node connects directly to a data-source working node to fetch data, so the data transmission path is short and the load on the engine master control node is reduced. Because data is fetched and transmitted in a distributed manner through the splits, short-board effects and similar problems caused by fetching a large amount of data at once are avoided. The invention also discloses a big data fusion engine, a system and a readable storage medium, which have corresponding technical effects.

Description

Request response method, device, system and readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a request response method, apparatus, system, and readable storage medium.
Background
An online analytical processing (OLAP) system is the most important application of a data warehouse system. It is specifically designed to support complex analysis operations, so that complex query processing over large data volumes can be performed rapidly and flexibly according to analysts' requirements, and the query results can be presented to decision-makers in an intuitive, easily understood form. Conventional OLAP clusters are divided into two types, ROLAP (Relational OLAP) and MOLAP (Multi-dimensional OLAP), where the data organization of ROLAP-type clusters is a conventional relational database.
ROLAP clusters can satisfy most of the interactive-analysis demands of current financial-system analysts, but once the data volume reaches a certain scale, such clusters hit an obvious capacity bottleneck. Under a storage-compute coupled architecture, adding or replacing a node is very costly: data must be redistributed, and the redistribution can in turn cause transient network I/O pressure and similar problems. The larger the data volume a node computes over, the higher the failure rate of expansion or replacement; expansion carries uncontrollable factors, and while nodes are synchronizing, tasks may disconnect and retry. Consequently, a traditional ROLAP cluster is generally kept within a hundred servers and cannot cope with larger-scale business growth.
In summary, how to solve the capacity-expansion bottleneck of ROLAP clusters while guaranteeing stable and reliable system operation is a technical problem that those skilled in the art need to solve.
Disclosure of Invention
The invention aims to provide a request response method, device, system and readable storage medium, which solve the capacity-expansion bottleneck of ROLAP clusters and guarantee stable and reliable system operation.
In order to solve the technical problems, the invention provides the following technical scheme:
A request response method, comprising:
after receiving a ROLAP cluster query request initiated by a client, the big data fusion engine controls a first connector through an engine main control node to acquire a surviving calculation instance in the current ROLAP cluster through the ROLAP cluster main control node;
After receiving the returned data source list through the first connector, the engine main control node encapsulates the data source list into data fragments, and distributes the data fragments to idle engine working nodes according to a cluster task distribution strategy;
The engine working node adds the loading task of the data fragment to a working thread to wait for queuing;
After receiving the data of the ROLAP cluster working node through the second connector, the engine working node returns the data to a downstream computing stage for computing so as to generate a result set;
and after receiving the result set, the engine main control node feeds back the result set to the client.
Optionally, before the returning the data to the downstream computing stage for computing processing, the method further includes:
performing data format adaptation on the data and converting it into the unified data format of the big data fusion engine.
Optionally, after the engine working node receives the data of the ROLAP cluster working node through the second connector, the method further includes:
judging whether to configure data cache;
if data caching is configured, caching the data to a memory of the engine working node;
Correspondingly, before the first connector is controlled by the engine master node to obtain the surviving calculation instance in the current ROLAP cluster through the ROLAP cluster master node, the method further comprises:
Judging whether the acquired data source is cache data or not;
If yes, extracting corresponding cache data from the internal memory of the engine working node, and executing the step of returning the data to a downstream computing stage for computing;
If not, executing the step of controlling the first connector to obtain the surviving calculation instance in the current ROLAP cluster through the ROLAP cluster master control node by the engine master control node.
Optionally, before the caching the data in the memory of the engine working node, the method further includes:
judging whether the engine working node has enough memory to load the data;
And if yes, executing the step of caching the data to the internal memory of the engine working node.
Optionally, the determining whether the engine working node has sufficient memory to load the data includes:
Determining the free space and the dirty data space in the engine working node;
judging whether the total amount of the free space and the dirty data space reaches the size of the data to be loaded;
If so, judging that the engine working node has enough memory to load the data; if not, judging that the engine working node does not have enough memory to load the data;
Correspondingly, before the step of caching the data into the memory of the engine working node is performed, the method further comprises: and clearing dirty data of the engine working node.
Optionally, after the caching the data to the memory of the engine working node, the method further includes:
Judging whether the data in the ROLAP cluster is updated or not;
If yes, requesting the updated data from the ROLAP cluster master control node.
Optionally, before the caching the data to the memory of the engine working node, the method further includes:
judging whether the reading times of the data in a designated time period reach a threshold value or not;
And if so, executing the step of caching the data to the internal memory of the engine working node.
A big data fusion engine, comprising: an engine master control node; a first connector, one end of which is connected with the engine master control node and the other end of which is used for connecting with the ROLAP cluster master control node; an engine working node; and a second connector, one end of which is connected with the engine working node and the other end of which is used for connecting with a ROLAP cluster working node;
Wherein the engine master control node is configured to: control the first connector to acquire the currently surviving computing instances in a ROLAP cluster; after receiving the returned data source list through the first connector, encapsulate the data source list in data fragments and distribute the data fragments to idle engine working nodes according to a cluster task distribution strategy; and after receiving the result set, feed back the result set to the client that initiated the ROLAP cluster query request;
The engine working node is configured to: adding the loading task of the data fragment to a working thread to wait for queuing processing; and after receiving the data of the ROLAP cluster working node through the second connector, returning the data to a downstream computing stage for computing so as to generate a result set.
A request response system, comprising: a big data fusion engine as described above, and a ROLAP cluster;
The first connector of the big data fusion engine is connected with the main control node of the ROLAP cluster, and the second connector of the big data fusion engine is connected with the working node of the ROLAP cluster.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the request response method described above.
According to the method provided by the embodiment of the invention, the big data fusion engine is connected with the data storage end; the engine receives the request and performs the corresponding computation, acquiring data storage information through the first connector of the master control node and then acquiring the data from the data source end through the second connector of the working nodes. Storage resources are thereby effectively isolated from computing resources, decoupling storage from computation; on that basis, the computing nodes can scale out statelessly, unconstrained by the storage layer, which resolves the capacity bottleneck of online analysis scenarios and allows both computing and storage resources to be fully utilized. Moreover, between the computing engine and the data storage end, master node connects to master node to obtain storage information while working nodes connect to working nodes to obtain data: the engine master control node only needs to carry the request-management tasks of parsing and stage division, and each engine working node connects directly to a data-source working node to fetch the data.
Correspondingly, the embodiment of the invention also provides a big data fusion engine, a big data fusion system and a readable storage medium corresponding to the request response method, which have the technical effects and are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a flow chart of a request response method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a cache update according to an embodiment of the present invention;
Fig. 3 is a schematic connection diagram of a big data fusion engine according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a request response method, which solves the problem of capacity expansion bottleneck of ROLAP clusters and ensures the stability and reliability of system operation.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A ROLAP cluster can satisfy most of the interactive-analysis demands of current financial-system analysts, but once the data volume reaches a certain scale, the cluster hits an obvious capacity bottleneck; this is why a traditional ROLAP cluster is generally kept within a hundred servers and cannot cope with larger-scale business growth. Exploring a new OLAP interactive analysis engine that can handle ever-growing financial data is therefore the main objective of the present invention.
The invention selects the Presto computing engine (a Facebook open-source distributed SQL query engine suited to interactive analytical queries, supporting data sizes from GB to PB) for secondary development, providing support from the engine side. A computing engine with computation decoupled from storage is developed to replace the current engine based on RDBMS-style storage coupling, so that computing-resource nodes support stateless expansion and the storage side can connect to various third-party heterogeneous stores, making full use of both computing and storage resources.
Presto is a Facebook-developed SQL-based distributed computing engine originally designed to solve the slow-query problem of Hive (a Hadoop-based data warehouse tool). Presto is a purely memory-based computing engine, which satisfies the properties of decoupled underlying storage and easy scale-out and scale-in. Users can also develop custom components such as connectors, data types and resource management to match deep business needs, achieving pluggable management.
The Presto computing engine supports relational database connections, and business parties can query MySQL, Oracle and other RDBMS databases through Presto. The fusion engine needs to interconnect the Presto worker nodes both with online query clusters such as ROLAP and with MapReduce clusters such as Hive (MapReduce being a programming model for parallel computation over large data sets, typically greater than 1 TB); communication with MapReduce clusters is already supported by the native Presto computing engine, so the fusion engine mainly needs to solve interworking with ROLAP-class clusters. This embodiment mainly describes how the fusion engine interworks with a ROLAP cluster; for interconnection between the fusion engine and other databases or clusters, refer to descriptions of the related technologies, which are not repeated here.
Referring to fig. 1, fig. 1 is a flowchart of a request response method according to an embodiment of the invention, the method includes the following steps:
S101, after a big data fusion engine receives a ROLAP cluster query request initiated by a client, controlling a first connector through an engine main control node to obtain a calculation instance surviving in a current ROLAP cluster through the ROLAP cluster main control node;
After receiving the query request from the client, the fusion engine dispatches the request to the master control node, which performs syntax and lexical analysis for the execution plan and then decomposes the query request into several computation stages distributed to the working nodes; each working node spawns tasks to perform the execution-plan processing associated with each computation stage.
Each task judges whether the current computation stage is the final source-data stage (i.e. the stage that needs to acquire data); if so, it requests the source data, and if not, it requests the downstream computation stage;
Requesting source data specifically includes: the source-data stage judges the type of the data source to be acquired. For example, if the data source is Hive, the Hive connector is requested to connect to the Hive distributed file system; if the data source is Greenplum (a data store built on the Postgres database that offers strong parallel-computing performance and mass-data management), the Greenplum connector is used. This embodiment describes the request-handling process from the data-acquisition stage onward only for the case where the data source type is a ROLAP cluster; for the rest of request processing (such as the syntax analysis and stage division performed by the master node, and the corresponding computation performed by the working nodes), refer to descriptions of the related implementations, which are not repeated here.
In the method, the fusion engine connects directly to the ROLAP cluster computing nodes: the second connector on the computing-node side is changed to fetch data directly from the computing nodes of the ROLAP cluster. Specifically, the engine comprises a master node and computing nodes, and the first connector sits on the master-node side. After the data source type is determined to be a ROLAP cluster, the fusion engine's connector first interacts with the master node of the ROLAP cluster to obtain information about the currently surviving computing instances (i.e. the active nodes), such as addresses, ports, databases, schemas and tables. Because the data is stored distributed across all active nodes, the active nodes must be found first, data is then acquired from each of them, and the target data is obtained after all pieces are spliced together.
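The master-side discovery step above can be sketched as follows. This is an illustrative Python sketch, not the patented implementation; the field names (`alive`, `address`, `db`, `schema`, `table`) are assumptions about what the ROLAP master control node returns.

```python
from dataclasses import dataclass

@dataclass
class ComputeInstance:
    """Connection info for one surviving ROLAP compute node (illustrative fields)."""
    address: str
    port: int
    database: str
    schema: str
    table: str

def fetch_surviving_instances(master_response: list) -> list:
    """Parse the ROLAP master node's reply into ComputeInstance records,
    keeping only the entries the master marks as currently alive."""
    instances = []
    for entry in master_response:
        if entry.get("alive"):
            instances.append(ComputeInstance(
                address=entry["address"],
                port=entry["port"],
                database=entry["db"],
                schema=entry["schema"],
                table=entry["table"],
            ))
    return instances
```

The data of each table is spread across all surviving instances, which is why discovery must precede any data fetch: the engine later splices the per-instance results back together.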
S102, after receiving the returned data source list through the first connector, the engine main control node encapsulates the data source list in the data fragments, and distributes the data fragments to idle engine working nodes according to a cluster task allocation strategy;
The first connector obtains the relevant information of each surviving computing node of the ROLAP cluster, parses the returned result information, assembles the data-acquisition connection for each ROLAP computing instance, and packs the fusion engine's authentication towards the ROLAP cluster into data fragments (splits);
The splits assembled from all currently surviving computing instances of the ROLAP cluster form a split set, which is fed back to the scheduler (job scheduling) of the engine master control node. The scheduler schedules according to the current idleness of each working node: the connector receives the returned data source list, the list is encapsulated in the splits, and the splits are distributed to idle fusion-engine working nodes according to the cluster task distribution strategy. The split scheduling of the ROLAP cluster may be adjusted dynamically according to the size of each scheduled batch of splits, which is not limited here.
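One plausible reading of the split assembly and idle-node scheduling described above, as a hedged Python sketch — the `auth` field and the least-loaded scheduling rule are assumptions, since the patent only specifies distribution "according to the current idle degree of each working node":

```python
def assemble_splits(instances, auth_token):
    """Wrap each surviving instance's connection info, plus the engine-to-ROLAP
    credentials, into one split per instance."""
    return [{"source": inst, "auth": auth_token} for inst in instances]

def schedule_splits(splits, worker_load):
    """Assign each split to the currently least-loaded worker.
    worker_load maps worker name -> number of splits already queued."""
    assignment = {}
    load = dict(worker_load)
    for i, _split in enumerate(splits):
        worker = min(load, key=load.get)      # pick the idlest worker
        assignment.setdefault(worker, []).append(i)
        load[worker] += 1                     # account for the new task
    return assignment
```

A real scheduler would also batch splits and re-evaluate idleness dynamically, as the paragraph above notes; this sketch only shows the basic idle-first assignment.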
S103, the engine working node adds the loading task of the data fragment to a working thread to wait for queuing;
When a working node of the fusion engine receives distributed splits, it puts the load tasks for those splits into a working thread to wait in the queue. For how the working node handles the working thread, refer to descriptions of the related technologies, which are not repeated here.
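The queuing behaviour of S103 can be illustrated with a minimal sketch. A real worker would drain the queue continuously on a dedicated worker thread; here `drain` is called explicitly for simplicity, and `fetch_fn` stands in for the second-connector load path.

```python
import queue

class SplitLoader:
    """Minimal stand-in for an engine working node's split queue:
    split-load tasks are enqueued on arrival and processed FIFO."""
    def __init__(self, fetch_fn):
        self._tasks = queue.Queue()
        self._fetch = fetch_fn
        self.results = []

    def submit(self, split):
        """Queue a split-load task (what the node does on split distribution)."""
        self._tasks.put(split)

    def drain(self):
        """Process queued splits in arrival order, as a worker thread would."""
        while not self._tasks.empty():
            split = self._tasks.get()
            self.results.append(self._fetch(split))
            self._tasks.task_done()
```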
S104, after receiving the data of the ROLAP cluster working node through the second connector, the engine working node returns the data to a downstream computing stage for computing so as to generate a result set;
Once a working node of the fusion engine receives the data emitted by the underlying working node of the ROLAP cluster, it returns the data to the upper computation stage for processing; after that stage has acquired the data, the data is passed on to the downstream computation stages until the final result set is obtained and returned to the master control node of the fusion engine.
The data obtained directly from the underlying working nodes of the ROLAP cluster may differ somewhat from the data structures the engine can identify directly. At present, data migration between heterogeneous analysis engines is expensive: related ETL jobs must be developed (ETL being the process of extracting, cleaning and transforming business-system data before loading it into a data warehouse, so as to integrate scattered, messy and non-uniform enterprise data and provide a basis for decision-making), and data handling differs completely between engines, such as batch storage jobs in Hive versus online processing jobs in MySQL. When a business party performs interactive analysis, data stored in other engines has to be moved into the relevant ROLAP engine, which not only makes the analysis process inefficient but can also leave multiple copies across storage engines. Therefore, to prevent acquired heterogeneous data sources from affecting data analysis and related processes, the data format acquired from the ROLAP cluster can further be adapted and converted into the fusion engine's unified data format before being returned to the upper computation stage for processing.
By connecting different heterogeneous storage engines to the memory or cache layer of the big data fusion engine, the storage-model differences between acquired heterogeneous data sources are eliminated; the acquired data sources are bound to the source computation stage of the SQL execution plan, achieving data interworking between different engines, reducing data migration and solving the federated-query problem across heterogeneous engine storage systems.
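The format-adaptation idea — converting heterogeneous source rows into one unified engine representation — can be illustrated as follows. The type names `bigint`/`double`/`varchar` and the per-column coercion scheme are illustrative assumptions, not the patent's actual type system.

```python
def adapt_row(row, column_types):
    """Convert one raw ROLAP row (values possibly arriving as strings)
    into the engine's unified format by coercing each column to its
    declared engine type."""
    coerce = {"bigint": int, "double": float, "varchar": str}
    return {col: coerce[t](row[col]) for col, t in column_types.items()}
```

Applying such an adapter at the source stage means every downstream computation stage sees one data model regardless of which storage engine produced the rows.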
Of course, for movement between non-heterogeneous data, the above arrangement or other heterogeneity-elimination methods need not be adopted; this embodiment does not limit this.
And S105, after receiving the result set, the engine main control node feeds back the result set to the client.
Batch computation is performed in the fusion engine, and computation results are returned in batches to the master control node of the fusion engine until the result set is returned to the client. This completes the processing of a single request.
Based on the above description, in the technical solution provided by the embodiment of the invention, the big data fusion engine is connected with the data storage end; the engine receives the request and performs the corresponding computation, acquiring data storage information through the first connector of the master control node and then acquiring the data from the data source end through the second connector of the working nodes. Storage resources are thereby effectively isolated from computing resources, decoupling storage from computation; on that basis, the computing nodes can scale out statelessly, unconstrained by the storage layer, which resolves the capacity bottleneck of online analysis scenarios and allows both computing and storage resources to be fully utilized. Moreover, between the computing engine and the data storage end, master node connects to master node to obtain storage information while working nodes connect to working nodes to obtain data: the engine master control node only needs to carry the request-management tasks of parsing and stage division, and each engine working node connects directly to a data-source working node to fetch the data.
It should be noted that, based on the above embodiments, the embodiments of the present invention further provide corresponding improvements. Steps in the preferred/improved embodiments that are the same as or correspond to those in the embodiments above may be cross-referenced, as may the corresponding beneficial effects, so detailed descriptions of the preferred/improved embodiments are omitted here.
A business party can move data from Hive directly into the ROLAP cluster through the fusion engine for OLAP task analysis, but the data source obtained this way is remote and consumes more I/O and network resources, while updates to some offline resources need not be real-time. To reduce network interaction with remote storage resources, this embodiment further provides a data-caching scheme on top of the above embodiments: after the engine working node receives the data of the ROLAP cluster working node through the second connector in step S104, the following steps may further be executed: judging whether a data cache is configured; if so, caching the data in the memory of the engine working node.
Under the above-mentioned buffer configuration, before the first connector is controlled by the engine master node to obtain the surviving calculation instance in the current ROLAP cluster through the ROLAP cluster master node, the following steps are further executed:
(1) Judging whether the acquired data source is cache data or not;
(2) If yes, extracting corresponding cache data from the memory of the engine working node, and executing the step of returning the data to a downstream computing stage for computing;
(3) If not, executing the step of controlling the first connector through the engine main control node to acquire the surviving calculation instance in the current ROLAP cluster through the ROLAP cluster main control node.
The scheduler is modified to check whether the needed underlying resources exist in the cache; if so, they are loaded directly from the cache. Custom common filtering conditions can be applied to the cached data to further accelerate source-data loading.
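The cache-first lookup described above might look like the following sketch, where the cache is modelled as a plain dict, `fetch_remote` stands in for the second-connector path to the ROLAP working nodes, and `filters` represents the custom common filtering conditions applied to cached data — all illustrative names.

```python
def load_source_data(table, cache, fetch_remote, filters=None):
    """Serve a source-stage read from the worker's local cache when present;
    otherwise fall back to the remote ROLAP fetch and populate the cache."""
    if table in cache:
        rows = cache[table]
        if filters:  # push common filter conditions down onto cached rows
            rows = [r for r in rows if all(f(r) for f in filters)]
        return rows, "cache"
    rows = fetch_remote(table)
    cache[table] = rows
    return rows, "remote"
```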
If no data cache is configured, this embodiment does not limit the handling in that case: the computation can be performed directly without backing the data up.
A cache switch may be provided so that caching is configured on a per-table basis. Since local memory resources are precious, before caching the data to the memory of the engine working node, it may further be judged whether the number of reads of the data within a specified time period reaches a threshold; if so, the step of caching the data to the memory of the engine working node is executed. In this way, only high-frequency business data is loaded into memory, avoiding excessive occupation of memory space.
If the number of reads does not reach the threshold, the processing mode in this case is not limited in this embodiment: the data may be left uncached, or stored in another remote memory.
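The read-count threshold above amounts to a sliding-window hotness check. A minimal sketch, assuming a monotonic-clock window; the class name `HotDataGate` and its parameters are illustrative, not from the patent:

```python
import time
from collections import deque

class HotDataGate:
    """Admit data into the worker cache only after it has been read at
    least `threshold` times within the last `window` seconds."""
    def __init__(self, threshold: int, window: float):
        self.threshold = threshold
        self.window = window
        self._reads = {}  # key -> deque of read timestamps

    def record_read(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self._reads.setdefault(key, deque())
        q.append(now)
        # Drop reads that fell outside the time window.
        while q and now - q[0] > self.window:
            q.popleft()

    def should_cache(self, key) -> bool:
        return len(self._reads.get(key, ())) >= self.threshold
```

Only keys that pass `should_cache` would proceed to the memory-caching step, keeping infrequently read data out of precious local memory.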
In addition, in order to further ensure that the data is cached smoothly and to avoid write failures and similar situations caused by insufficient space, before caching the data to the memory of the engine working node, it may further be judged whether the engine working node has enough memory to load the data; if so, the step of caching the data to the memory of the engine working node is executed. In this method, the data caching step is executed only after it is determined that the node has enough space to write, which can significantly improve the success rate of data caching.
If the engine working node does not have enough memory to load the data, the processing mode in this case is not limited in this embodiment: another node with enough memory to load the current data may be sought, or the data may be cached in another space, for example a remote cluster cache (alluxio, a memory-based distributed file system), which is not described in detail here.
Determining whether the engine working node has enough memory to load the data may be done by directly judging whether the free space can accommodate the data, but is preferably performed as follows:
determining the free space and the dirty data space in the engine working node;
judging whether the total amount of the free space and the dirty data space reaches the load size of the data;
if so, judging that the engine working node has enough memory to load the data; if not, judging that the engine working node does not have enough memory to load the data;
accordingly, before performing the step of caching the data to the memory of the engine working node, the method further comprises: clearing the dirty data of the engine working node.
After the working node of the fusion engine acquires a data fragment (split), it evaluates whether the node has enough memory to load it; the evaluation algorithm also counts the table's dirty data in the working node, and if dirty data exists, it is cleared before loading. This not only reduces the space occupied by useless data in the node, but also improves the success rate of data caching.
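The memory evaluation just described (free space plus reclaimable dirty-data space versus the load size) can be sketched as below. The patent does not specify the data structures, so `node` is modeled as a plain dict with hypothetical `free`/`dirty` fields:

```python
def has_room(free_bytes: int, dirty_bytes: int, load_bytes: int) -> bool:
    """Dirty data counts as reclaimable: free + dirty must cover the load."""
    return free_bytes + dirty_bytes >= load_bytes

def try_cache_split(node: dict, split_size: int) -> bool:
    """Evaluate the node, clear dirty data first when needed, then reserve
    space for the split. Returns False when even reclaiming dirty data
    would not leave enough memory (caller falls back to another node or
    a remote cluster cache)."""
    if not has_room(node["free"], node["dirty"], split_size):
        return False
    if node["free"] < split_size:
        # Reclaim the table's dirty data before loading.
        node["free"] += node["dirty"]
        node["dirty"] = 0
    node["free"] -= split_size
    return True
```

Clearing dirty data lazily, only when free space alone is insufficient, keeps still-valid cached pages around as long as possible.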
Further, after caching the data to the memory of the engine working node, it may be judged whether the data in the ROLAP cluster has been updated; if yes, a related data update operation is requested from the ROLAP cluster master control node.
As shown in fig. 2, the fusion engine starts a background cache-refresh task (task) that launches 5 (configurable) worker threads to perform periodic scanning; if the data in the ROLAP cluster has been updated, a related data update operation is requested through the main control node (coordinator) of presto, so as to keep the cached data synchronously updated.
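The background refresh task above can be sketched as a small pool of scanning threads. This is a simplified model under stated assumptions: `check_updated` and `refresh` stand in for the update-check and update-request exchanges with the master node, and tables are sharded across workers by index; none of these names come from the patent.

```python
import threading

def start_refresh_workers(tables, check_updated, refresh, n_workers=5,
                          interval=0.05, stop_event=None):
    """Start n_workers (5 by default, configurable) background threads
    that periodically scan cached tables and refresh any whose source
    data has changed in the ROLAP cluster."""
    stop_event = stop_event or threading.Event()

    def worker(worker_id):
        while not stop_event.is_set():
            for i, table in enumerate(tables):
                if i % n_workers != worker_id:  # shard tables across workers
                    continue
                if check_updated(table):
                    refresh(table)
            stop_event.wait(interval)  # sleep between scan rounds

    threads = [threading.Thread(target=worker, args=(w,), daemon=True)
               for w in range(n_workers)]
    for t in threads:
        t.start()
    return stop_event, threads
```

Setting the returned event stops all workers; daemon threads ensure the pool never blocks engine shutdown.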
Based on the above description, in the scheme provided by this embodiment, if the user configures the data cache, the fusion engine caches the data related to the table in the memory of each fusion engine working node; when the user acquires the same data source again, the data is acquired directly from memory, which reduces the IO consumption of fetching from the remote end.
To enhance understanding of the above technical solution, this embodiment takes connecting a ROLAP cluster represented by greenplum (clusters such as doris are similar) as an example; the complete request processing procedure is as follows:
1. After receiving a query request (SQL request) from a client (client), the fusion engine distributes the request to the main control node (coordinator) of presto, which performs the lexical and syntactic analysis for the execution plan, decomposes it into a plurality of computing stages (stages), and distributes them to the working nodes of the fusion engine;
2. The working node of the fusion engine spawns a task (task) to perform the execution-plan processing associated with each computing stage;
3. The task (task) judges whether the current computing stage (stage) is the final source data end (namely, the stage connected to a third-party data source); if so, the source data is requested; if not, the request is sent to the downstream computing stage (stage);
4. The final source data end judges the type of the data source to be acquired: if the data source type is hive, the hive connector (connector) is requested to connect to the Hadoop Distributed File System (HDFS); if greenplum, a connector to greenplum is requested;
5. The connector (connector) connects heterogeneous data sources such as the greenplum or hive data source and produces N data fragments (splits), and the related data is requested by multiple threads and cached in the worker nodes according to the engine's adaptation strategy;
Specifically, taking acquiring data from greenplum as an example, the flow of interacting with greenplum is as follows:
1) The connector of the fusion engine first interacts with the main control node (master node) of greenplum to acquire information about the currently surviving stored data sources in greenplum, such as addresses, ports, databases, schemas, and tables;
Because the data is scattered and stored across all surviving nodes in a distributed manner, the surviving nodes must be found first, the data is then taken from each of them, and the target data is obtained after splicing.
2) The connector receives the returned data source list (list), encapsulates the list in data fragments, and distributes them to idle fusion engine working nodes according to the cluster task allocation strategy;
3) When a working node of the fusion engine receives an allocated data fragment, the loading task of the related data fragment is placed in a working thread to wait in the queue;
4) Once the working node of the fusion engine receives the throughput data of the data source from the greenplum bottom layer, the greenplum data format is first adapted and converted into the unified data format of the fusion engine, and then returned to the upper-layer computing stage for processing;
5) If the user has configured the data cache, the fusion engine caches the data related to the table in the memory of each fusion engine working node; when the user acquires the same greenplum data source again, it is acquired directly from memory, reducing the consumption of fetching from the remote end.
6. The computing stage (stage) at the source data end acquires the data source and transmits it to the downstream computing stage (stage) for further processing, until the final result set is returned to the main control node (coordinator) of presto;
7. The big data fusion engine internally performs batch calculation and returns the calculation results to the main control node (coordinator) of presto in batches, until the result set has been completely returned to the client (client).
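Steps 4 and 5 of the flow above, choosing a connector by data source type and cutting the source into N data fragments for distribution, can be sketched as follows. The function names, the round-robin split policy, and the connector labels are assumptions for illustration, not presto's actual API:

```python
def pick_connector(source_type: str) -> str:
    """Dispatch at the final source-data stage: select a connector by the
    type of the data source (hive -> HDFS, greenplum -> greenplum)."""
    connectors = {
        "hive": "hive-connector (HDFS)",
        "greenplum": "greenplum-connector",
    }
    try:
        return connectors[source_type]
    except KeyError:
        raise ValueError(f"no connector registered for {source_type!r}")

def make_splits(rows, n_splits: int):
    """Cut a source's row set into N data fragments (splits) that the
    master control node can distribute to idle working nodes; a simple
    round-robin assignment stands in for the cluster allocation strategy."""
    splits = [[] for _ in range(n_splits)]
    for i, row in enumerate(rows):
        splits[i % n_splits].append(row)
    return splits
```

Each split would then carry the connection details (address, port, database, schema, table) returned by the ROLAP master node, so a working node can pull its share of the data directly.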
The fusion engine in the technical scheme provided by this embodiment is used to address massive OLAP interactive analysis scenarios and satisfies an architecture in which computation and storage are separated: storage resources are effectively isolated from computing resources, and the computing nodes can be scaled horizontally without state, free from the limitation and influence of the storage layer. This solves the capacity bottleneck of online analysis scenarios, allows computing and storage resources to be fully utilized, and lays a foundation for subsequently building a cloud-native big data analysis engine. Meanwhile, based on the presto computing engine, the fusion engine bridges the common ROLAP engines and the mr offline batch-processing analysis engine, so that for analysis tasks with higher efficiency requirements a user can directly use the native ROLAP engine and place commonly analyzed data in ROLAP storage; for other tasks, the user directly uses the fusion engine to analyze the common business data in the ROLAP and the massive data in hdfs. This effectively splits the analysis pressure on the ROLAP cluster and improves the efficiency of online analysis tasks from data production to result output.
It should be noted that using the presto native engine can accomplish data communication between different heterogeneous engines. However, the native computing engine has an obvious performance shortcoming in data acquisition from ROLAP-type online interactive analysis engines: in actual production use, there is serious data delay in data fragment (split) scheduling for data sources at the GB level and above the TB level, and service unavailability may occur.
Corresponding to the above method embodiment, the present invention further provides a big data fusion engine, and a big data fusion engine described below and a request response method described above may be referred to correspondingly.
As shown in fig. 3, the big data fusion engine includes: an engine master control node; a first connector, one end of which is connected to the master control node and the other end of which is used to connect a ROLAP master control node (the greenplum database is taken as an example in fig. 3; see the bidirectional connection between the upper engine master control node and the greenplum master control node in fig. 3); an engine working node; and a second connector, one end of which is connected to the engine working node and the other end of which is used to connect a ROLAP working node (see the bidirectional connection between a task in the lower engine working node and the greenplum working nodes in fig. 3).
The engine master control node is used for: controlling the first connector to acquire the surviving computing instances in the current ROLAP cluster; after receiving the returned data source list through the first connector, encapsulating the data source list in data fragments and distributing the data fragments to idle engine working nodes according to the cluster task allocation strategy; and, after receiving the result set, feeding back the result set to the client that initiated the ROLAP cluster query request;
the engine working node is used for: adding a loading task of the data fragment into a working thread to wait for queuing processing; and after receiving the data of the ROLAP cluster working node through the second connector, returning the data to a downstream computing stage for computing so as to generate a result set.
The steps in the request response method described above may be implemented by the structure of the big data fusion engine, and the specific implementation manner may refer to the description of the above embodiment, which is not repeated herein.
Corresponding to the above method embodiments, the present invention further provides a readable storage medium, where a readable storage medium described below and a request response method described above may be referred to correspondingly.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the request response method of the above-described method embodiments.
The readable storage medium may be a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or another medium that can store program code.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not intended to be limiting.

Claims (10)

1. A request response method, comprising:
after receiving a ROLAP cluster query request initiated by a client, the big data fusion engine controls a first connector through an engine main control node to acquire a surviving calculation instance in the current ROLAP cluster through the ROLAP cluster main control node;
After receiving the returned data source list through the first connector, the engine main control node encapsulates the data source list into data fragments, and distributes the data fragments to idle engine working nodes according to a cluster task distribution strategy;
The engine working node adds the loading task of the data fragment to a working thread to wait for queuing;
After receiving the data of the ROLAP cluster working node through the second connector, the engine working node returns the data to a downstream computing stage for computing so as to generate a result set;
and after receiving the result set, the engine main control node feeds back the result set to the client.
2. The request response method of claim 1, further comprising, prior to said returning said data to the downstream computing stage for computing:
and performing data format adaptation on the data, and converting the data into a unified data format of the big data fusion engine.
3. The request response method according to claim 1, further comprising, after the engine working node receives data of the ROLAP cluster working node through the second connector:
judging whether to configure data cache;
if data caching is configured, caching the data to a memory of the engine working node;
Correspondingly, before the first connector is controlled by the engine master node to obtain the surviving calculation instance in the current ROLAP cluster through the ROLAP cluster master node, the method further comprises:
Judging whether the acquired data source is cache data or not;
If yes, extracting corresponding cache data from the internal memory of the engine working node, and executing the step of returning the data to a downstream computing stage for computing;
If not, executing the step of controlling the first connector to obtain the surviving calculation instance in the current ROLAP cluster through the ROLAP cluster master control node by the engine master control node.
4. A request response method according to claim 3, further comprising, prior to said caching said data in the memory of said engine working node:
judging whether the engine working node has enough memory to load the data;
And if yes, executing the step of caching the data to the internal memory of the engine working node.
5. The method of claim 4, wherein said determining whether said engine worker node has sufficient memory to load said data comprises:
Determining an empty space and a dirty data space in the engine working node;
judging whether the total space amount of the free space and the dirty data space reaches the loading capacity of the data or not;
If so, judging that the engine working node has enough memory to load the data; if not, judging that the engine working node does not have enough memory to load the data;
Correspondingly, before the step of caching the data into the memory of the engine working node is performed, the method further comprises: and clearing dirty data of the engine working node.
6. The request response method according to claim 3, further comprising, after said caching said data to said engine working node's memory:
Judging whether the data in the ROLAP cluster is updated or not;
if yes, requesting a data update operation from the ROLAP cluster master control node.
7. The request response method according to claim 3, further comprising, prior to said caching said data to said engine working node's memory:
judging whether the reading times of the data in a designated time period reach a threshold value or not;
And if so, executing the step of caching the data to the internal memory of the engine working node.
8. A big data fusion engine, comprising: an engine master control node; one end of the first connector is connected with the main control node, and the other end of the first connector is used for connecting with the ROLAP main control node; an engine working node; one end of the second connector is connected with the engine working node, and the other end of the second connector is used for connecting with the ROLAP working node;
Wherein, the engine master control node is used for: controlling the first connector to acquire the surviving computing instances in a current ROLAP cluster; after receiving the returned data source list through the first connector, encapsulating the data source list in data fragments, and distributing the data fragments to idle engine working nodes according to a cluster task distribution strategy; after receiving the result set, feeding back the result set to a client side initiating a ROLAP cluster query request;
The engine working node is configured to: adding the loading task of the data fragment to a working thread to wait for queuing processing; and after receiving the data of the ROLAP cluster working node through the second connector, returning the data to a downstream computing stage for computing so as to generate a result set.
9. A request response system, comprising: the big data fusion engine of claim 8, and a ROLAP cluster;
The first connector of the big data fusion engine is connected with the main control node of the ROLAP cluster, and the second connector of the big data fusion engine is connected with the working node of the ROLAP cluster.
10. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the request response method according to any of claims 1 to 7.
CN202111440837.4A 2021-11-30 2021-11-30 A request response method, device, system and readable storage medium Active CN114116223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111440837.4A CN114116223B (en) 2021-11-30 2021-11-30 A request response method, device, system and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111440837.4A CN114116223B (en) 2021-11-30 2021-11-30 A request response method, device, system and readable storage medium

Publications (2)

Publication Number Publication Date
CN114116223A CN114116223A (en) 2022-03-01
CN114116223B true CN114116223B (en) 2024-11-19

Family

ID=80368248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111440837.4A Active CN114116223B (en) 2021-11-30 2021-11-30 A request response method, device, system and readable storage medium

Country Status (1)

Country Link
CN (1) CN114116223B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312373B (en) * 2023-09-27 2025-08-19 度小满科技(北京)有限公司 Data processing method, device and data processing system of concurrent connector
CN119415607A (en) * 2025-01-08 2025-02-11 亚信科技(中国)有限公司 A data sharing method and related device for digital twin city

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2401348A1 (en) * 2000-02-28 2001-09-13 Hyperroll Israel Ltd. Multi-dimensional database and integrated aggregation server
CN107329982A (en) * 2017-06-01 2017-11-07 华南理工大学 A kind of big data parallel calculating method stored based on distributed column and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020029207A1 (en) * 2000-02-28 2002-03-07 Hyperroll, Inc. Data aggregation server for managing a multi-dimensional database and database management system having data aggregation server integrated therein
US9342557B2 (en) * 2013-03-13 2016-05-17 Cloudera, Inc. Low latency query engine for Apache Hadoop
CN108875042B (en) * 2018-06-27 2021-06-08 中国农业银行股份有限公司 Hybrid online analysis processing system and data query method
CN112099977A (en) * 2020-09-30 2020-12-18 浙江工商大学 A Real-time Data Analysis Engine for Distributed Tracking System

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2401348A1 (en) * 2000-02-28 2001-09-13 Hyperroll Israel Ltd. Multi-dimensional database and integrated aggregation server
CN107329982A (en) * 2017-06-01 2017-11-07 华南理工大学 A kind of big data parallel calculating method stored based on distributed column and system

Also Published As

Publication number Publication date
CN114116223A (en) 2022-03-01


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant