Disclosure of Invention
In view of this, the present invention provides a method and a system for data distribution and parallel processing, which can not only ensure the consistency of data, but also improve the speed and efficiency of data query.
The invention provides a data distribution and parallel processing method based on the above purpose, which comprises the following steps:
receiving an initial query instruction input by a user side;
analyzing the initial query instruction to obtain a real query statement which meets the standard and is used as a real query instruction;
sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes into which all server nodes in the database are divided according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes;
executing the received real query instruction on each data node respectively, returning a query result to the work distribution node, and keeping the update of the data information table on each server node;
on each server node, backing up a query table corresponding to the real query instruction, so that a query table copy is stored on each server node;
on each server node, when a concurrency conflict is detected, the real query instruction is executed concurrently by calling the copy of the query table on the current server node.
Optionally, when it is detected that there is a concurrency conflict, the step of concurrently executing the real query instruction by calling the copy of the query table on the current server node includes:
detecting whether the real query instruction has a concurrency conflict in each server node;
if the concurrency conflict does not exist, submitting the query task; the query task is a query task corresponding to a real query instruction;
if a concurrency conflict exists, calling the self-adaptive backup on the current server node for comparison with the current query task, and executing the query task on the self-adaptive backup;
and if the query tasks in all the server nodes are successfully executed, returning a decision to submit the query task; otherwise, returning a decision to withdraw the query task.
Optionally, the method for dividing all server nodes in the database into work distribution nodes and data nodes according to preset functions includes:
detecting the local storage capacity and the computing processing capacity of each server node;
judging whether the calculation processing capacity of the current server node is greater than a preset performance threshold value, if so, setting the current server node as a work distribution node, otherwise, setting the current server node as a data node;
or,
judging whether the local storage capacity of the current server node is greater than a preset storage threshold value; if so, setting the current server node as a data node; otherwise, setting the current server node as a work distribution node.
Optionally, the step of sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node further includes:
detecting the specific use condition of each server node;
and distributing data nodes and corresponding spaces on each server node by adopting a preset balance strategy according to the detection result.
Optionally, the balancing policy is a policy with the lowest space utilization rate.
Optionally, the step of maintaining the update of the data information table on each server node further includes:
and copying and storing the position information and the related operation information of the data information table.
Optionally, the step of maintaining the update of the data information table on each server node further includes:
all data in each data node are backed up to the designated data node;
keeping all data in each data node and synchronous updating of corresponding data backup;
the location information of each data node is stored on each data node.
Optionally, the step of parsing the initial query instruction further includes:
performing lexical analysis on characters input by a user side to obtain words meeting the standard;
carrying out syntactic analysis on a plurality of continuous words to obtain sentences conforming to syntactic logic, and simultaneously constructing to obtain an abstract syntactic tree;
and converting the logical SQL statement into a real SQL statement according to the abstract syntax tree, so as to obtain a real query statement which meets the standard and is used as a real query instruction.
The invention also provides a data distribution and parallel processing system, which comprises:
the input unit is used for receiving an initial query instruction input by a user side;
the analysis unit is used for analyzing the initial query instruction to obtain a real query statement meeting the standard and using the real query statement as a real query instruction;
the distribution unit is used for sending the real query instruction to a work distribution node and distributing the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes which are divided by all server nodes in the database according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes;
the processing unit is used for respectively executing the received real query instruction on each data node, returning a query result to the work distribution node, and simultaneously keeping the update of the data information table on each server node;
the backup unit is used for backing up the query table corresponding to the real query instruction on each server node, so that a query table copy is stored on each server node;
and the concurrency unit is used for executing the real query instruction in a concurrent manner by calling the copy of the query table on the current server node when the concurrency conflict is detected on each server node.
As can be seen from the above, according to the data distribution and parallel processing method and system provided by the invention, all server nodes in the database are divided into work distribution nodes and data nodes, so that the work distribution nodes can manage the task distribution of the data nodes and the return of query results; in this way, the data nodes can achieve load balancing according to their different use states. Keeping the data information table updated ensures consistent synchronization of the data in the server nodes. By backing up a copy of the query table on each server node, each server node can execute the query instruction concurrently, so that the response time is shortened and the query efficiency is improved. By calling the self-adaptive backup for comparison with the current query task when a concurrency conflict occurs, the consistency of the distributed data can be ensured. Therefore, the data distribution and parallel processing method and system not only can ensure the consistency of the data, but also can improve the speed and efficiency of data query.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are otherwise different; "first" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and this is not repeated in the following embodiments.
Referring to fig. 1, a flowchart of an embodiment of a data distribution and parallel processing method provided by the present invention is shown. The data distribution and parallel processing method comprises the following steps:
Step 101, receiving an initial query instruction input by a user side; the user usually inputs continuous character strings at the user side, and the database usually needs to parse them to obtain recognizable statement or instruction information.
Step 102, analyzing the initial query instruction to obtain a real query statement meeting the standard, and taking the real query statement as a real query instruction. Query parsing typically comprises steps such as lexical analysis, syntactic analysis, semantic analysis and statement conversion. The purpose of parsing is to convert an input character string into a structure describing that string, so that the computer can more easily understand what the user's input means. This stage includes three processes: lexical analysis, syntactic analysis, and output of an abstract syntax tree. The lexical analyzer is a deterministic finite automaton (DFA) that converts the input character sequence into words according to the defined lexical rules. Lexical analysis is followed by syntactic analysis: the result of the lexical analysis serves as the input of the syntactic analysis, which determines, on the basis of the lexical analysis, whether the words input by the user conform to the syntactic logic. Taking SQL as an example, semantic analysis relates to the relevant theories and concepts of the SQL standard and SQL optimization, and includes logical analysis and physical analysis; the logical analysis is a general mathematical analysis process and is independent of the underlying distributed environment, whereas the physical analysis transforms the result of the logical analysis and is closely related to the underlying execution environment.
An abstract syntax tree (AST) is a tree-structured representation of the user's input statement; each node of the tree is a word, and the structure of the tree embodies the syntax. The abstract syntax tree is constructed along with the syntactic analysis: after the syntactic analysis is finished, the syntax analyzer outputs an abstract syntax tree whose structure and content correspond to the user's input.
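The lexical-analysis and syntactic-analysis pipeline described above can be sketched with a toy lexer and parser. This is a hedged illustration only: the token names, grammar and AST shape are assumptions for a single `SELECT ... FROM ...` form, not the parser the invention actually uses.

```python
import re

# Hypothetical illustration: a minimal lexer (a DFA in spirit) and parser
# that turn "SELECT id, name FROM users" into a small abstract syntax tree.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:SELECT|FROM)\b"),
    ("IDENT",   r"[A-Za-z_][A-Za-z_0-9]*"),
    ("COMMA",   r","),
    ("WS",      r"\s+"),
]

def tokenize(text):
    """Lexical analysis: split the input string into standard-conforming words."""
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, text):
        if m.lastgroup != "WS":                 # skip whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

def parse_select(tokens):
    """Syntactic analysis: check the word sequence and build the AST."""
    it = iter(tokens)
    kind, val = next(it)
    if (kind, val.upper()) != ("KEYWORD", "SELECT"):
        raise SyntaxError("expected SELECT")
    columns = []
    for kind, val in it:
        if kind == "IDENT":
            columns.append(val)
        elif kind == "KEYWORD" and val.upper() == "FROM":
            break
    table = next(it)[1]                         # table name after FROM
    return {"node": "Select",
            "columns": columns,
            "from": {"node": "Table", "name": table}}

ast = parse_select(tokenize("SELECT id, name FROM users"))
```

Here the tokenizer plays the role of the DFA-based lexical analyzer, and the parser consumes its output to emit the AST, mirroring the three processes named above.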
Step 103, sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes into which all server nodes in the database are divided according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes. In the MPP architecture, the whole database comprises a large number of relatively independent servers, each serving as an independent node; such a single server is therefore referred to interchangeably as a server node or a server. Generally, one work distribution node can be arranged in the database to manage the remaining server nodes, namely the data nodes; alternatively, a plurality of work distribution nodes may be provided, each managing the data nodes that belong to it. By having work distribution nodes manage data nodes in this way, factors such as the storage amount and access pressure of each server node can be approximately equalized; for example, data on heavily loaded data nodes can be migrated to lightly loaded data nodes, thereby achieving load balancing.
Step 104, executing the received real query instruction on each data node, returning the query result to the work distribution node, and keeping the update of the data information table on each server node; the data information table is used for recording the update of data and maintaining the consistency of the data on each server node through the communication between the server nodes; that is, any modification made to the data is recorded in the data information table, and maintaining the update of the data information table as described herein means that the modification is synchronized on the backup or copy thereof, so that the data information table can be guaranteed to be consistent.
Step 105, backing up the query table corresponding to the real query instruction on each server node, so that a query table copy is stored on each server node. The query table is a data table that stores the query results obtained after the real query instruction (for example, a select [ id ] instruction or the like) is executed; all query instructions are also included in the query table. A query task in a distributed database is a sequence of operations: any operation on a distributed table ultimately becomes a sequence of operations on the database storage, and the execution of those operations is likewise distributed. Therefore, by backing up the query table on each server node, each server node can execute the query instruction through the query table or the query table copy, concurrent execution of the query instruction can be realized, and maintaining the query table copies improves query efficiency.
Step 106, on each server node, when a concurrency conflict is detected, concurrently executing the real query instruction by calling the query table copy on the current server node. In the prior art, when a concurrency conflict occurs, the query can proceed only after waiting for a global response, so the data query speed and efficiency are low and data consistency is poor. According to the invention, the query table copy is stored in advance, and the query task is executed by calling the query table copy when a concurrency conflict is encountered, so that concurrent execution of the query statement is realized, the response time can be greatly shortened, and the speed and efficiency of data query are further improved.
According to the data distribution and parallel processing method provided by the embodiment of the invention, all server nodes in the database are divided into work distribution nodes and data nodes, so that the work distribution nodes can manage the task distribution of the data nodes and the return of query results; in this way, the data nodes can achieve load balancing according to their different use states. Keeping the data information table updated ensures consistent synchronization of the data in the server nodes. By backing up a copy of the query table on each server node, each server node can execute the query instruction concurrently, so that the response time is shortened and the query efficiency is improved. By calling the self-adaptive backup for comparison with the current query task when a concurrency conflict occurs, the consistency of the distributed data can be ensured. Therefore, the data distribution and parallel processing method provided by the invention not only can ensure the consistency of the data, but also can improve the speed and efficiency of data query. At the same time, based on the above steps, the invention also overcomes the problem of concurrency conflicts in queries, so that the whole query process is more stable and reliable.
In some alternative embodiments of the present invention, the database first needs to be divided into several subsets according to preset rules, for example: uniformly dividing all data in the database by columns into n disjoint equal parts, each part consisting of a plurality of columns; or dividing all data by rows into n disjoint equal parts, each part consisting of a plurality of rows; or dividing the data into n disjoint subsets according to data similarity, for example, data larger than the average value is classified into one subset and the remaining data into another; or dividing the data into n disjoint subsets according to data dispersion, for example, data in columns whose variance is larger than a certain constant k is classified into one subset and the remaining data into another. In this way the entire database is divided into several mutually independent subsets, each subset being processed on its own server, for a total of n servers. Each server can then make an independent decision as to whether the current task should be aborted or submitted. The principle of judgment is as follows: the query task may be submitted when there is no concurrency conflict; if there is a concurrency conflict (for example, a file being written by one program cannot simultaneously be written by another program), the self-adaptive copy backup of the subset on the current server is called, compared and verified against the current task, and a decision is given.
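The subset-division rules above can be sketched as follows. The data layout (a list of rows) and the helper names are illustrative assumptions, not the patented implementation; the point is only to show the row-wise, column-wise and similarity-based splits side by side.

```python
import statistics

def split_by_rows(rows, n):
    """Divide the table into n disjoint row-wise parts (round-robin)."""
    return [rows[i::n] for i in range(n)]

def split_by_columns(rows, n):
    """Divide the table into n disjoint column-wise parts (round-robin over columns)."""
    ncols = len(rows[0])
    groups = [[c for c in range(ncols) if c % n == i] for i in range(n)]
    return [[[row[c] for c in cols] for row in rows] for cols in groups]

def split_by_similarity(values):
    """Two subsets: values above the average, and all remaining values."""
    avg = statistics.mean(values)
    return ([v for v in values if v > avg], [v for v in values if v <= avg])

table = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
row_parts = split_by_rows(table, 2)       # two disjoint sets of rows
col_parts = split_by_columns(table, 3)    # three disjoint sets of columns
hi, lo = split_by_similarity([1, 2, 3, 10])
```

Each resulting part would then live on its own server, so that the per-server commit/abort decision described above can be made independently.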
The self-adaptive backup refers to backing up important data on a plurality of servers in the distributed system for disaster tolerance and fault tolerance. There are many methods of self-adaptive copy backup, including data block backup, dynamic backup and the like. If the data in a plurality of servers in a distributed deployment is required to be consistent, the write operations on the data in all the servers must be guaranteed to either all execute or all not execute. However, when executing its local transaction, one server cannot know the execution result of the local transaction on another server, and therefore cannot know by itself whether the transaction should be aborted or committed. If there is a concurrency conflict on a subset, i.e. it is occupied by the write operation of one program and other programs cannot execute write operations on it, the present invention proposes to access the backup of the current subset instead of executing the local transaction on the subset itself, so that the local transaction execution results of all servers can be obtained and a decision given. The specific method is as follows: if the transaction operations on all the participating subsets of the task are executed successfully, a decision to submit the query task is returned; otherwise, a decision to withdraw the query task is returned. By contrast, conventional MPP-based shared-nothing architectures require a global deadlock detection mechanism, a two-phase commit protocol and the like to ensure the integrity and recovery of transactions; the scheme adopted by the invention does not need these time-consuming methods, so the speed of the query task is increased.
In some optional embodiments of the present invention, as shown in fig. 2, the step 106 of, when it is detected that there is a concurrency conflict, concurrently executing the real query instruction by calling the copy of the query table on the current server node further includes:
step 201, detecting the real query instruction in each server node, determining whether a concurrency conflict exists, if not, executing step 202, and if so, executing step 203;
step 202, according to step 201, if there is no concurrency conflict, submitting a query task; the query task is a query task corresponding to a real query instruction;
step 203, according to step 201, if there is a concurrency conflict, calling a self-adaptive backup on the current server node to compare with the current query task, and executing the query task on the self-adaptive backup;
step 204, judging whether the query tasks in all the server nodes are successfully executed, if so, executing step 205, otherwise, executing step 206;
step 205, according to step 204, if the query tasks in all the server nodes are successfully executed, returning to the decision of submitting the query tasks;
step 206, according to step 204, if there is an unsuccessful query task, a decision to withdraw the query task is returned.
Thus, when a concurrency conflict arises in a server node, the query task can still be executed by calling the self-adaptive backup. Specifically, each server holds self-adaptive backup record information. When a query instruction encounters a concurrency conflict and the self-adaptive backup needs to be called, the current server determines the storage location of the record, that is, which server node the self-adaptive backup is stored on, by accessing the self-adaptive backup record on the current server; the self-adaptive backup is then compared with the current query task and a decision is given. In this way, the query task is executed on the self-adaptive backup data, data consistency is maintained, the whole query task can be executed concurrently and effectively, and the speed and efficiency of data query are further greatly improved.
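The decision flow of steps 201 to 206 can be sketched as below. The conflict detector, query executor and backup executor are passed in as callables; they are assumptions standing in for the real components, and the two-valued "submit"/"withdraw" result mirrors the decisions of steps 205 and 206.

```python
def run_on_node(node, query, has_conflict, run_query, run_on_backup):
    """Steps 201-203: run the query directly, or on the self-adaptive
    backup when a concurrency conflict is detected on this node."""
    if not has_conflict(node, query):          # step 201 -> step 202
        return run_query(node, query)
    return run_on_backup(node, query)          # step 201 -> step 203

def decide(nodes, query, has_conflict, run_query, run_on_backup):
    """Steps 204-206: submit only if the query task succeeds on every node."""
    results = [run_on_node(n, query, has_conflict, run_query, run_on_backup)
               for n in nodes]
    return "submit" if all(results) else "withdraw"

# Toy scenario: node2 has a conflict, but its backup execution succeeds.
has_conflict = lambda n, q: n == "node2"
succeed = lambda n, q: True
decision = decide(["node1", "node2"], "q", has_conflict, succeed, succeed)
```

If the backup execution on any node failed instead, `decide` would return the withdraw decision, matching step 206.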
In some optional embodiments of the present invention, the method for dividing all server nodes in the database into the work distribution node and the data node according to a preset function includes:
detecting the local storage capacity and the computing processing capacity of each server node;
judging whether the calculation processing capacity of the current server node is greater than a preset performance threshold value, if so, setting the current server node as a work distribution node, otherwise, setting the current server node as a data node;
or,
judging whether the local storage capacity of the current server node is greater than a preset storage threshold value; if so, setting the current server node as a data node; otherwise, setting the current server node as a work distribution node.
Therefore, different server nodes can correspondingly realize different functions according to the characteristics of the server nodes, and the resources of the server can be utilized to the maximum extent. Meanwhile, the distribution mode enables the management of the database to be more orderly and efficient.
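The two optional role-assignment rules above can be sketched as follows. The `ServerNode` fields and the threshold values are illustrative assumptions; the rules themselves follow the two "judging whether ... greater than a preset threshold" branches.

```python
from dataclasses import dataclass

@dataclass
class ServerNode:
    name: str
    storage_gb: float      # local storage capacity
    compute_score: float   # computing processing capacity

def assign_role_by_compute(node, perf_threshold):
    """Nodes above the performance threshold become work distribution nodes;
    the rest become data nodes."""
    return "work_distribution" if node.compute_score > perf_threshold else "data"

def assign_role_by_storage(node, storage_threshold):
    """Nodes above the storage threshold become data nodes;
    the rest become work distribution nodes."""
    return "data" if node.storage_gb > storage_threshold else "work_distribution"

cluster = [ServerNode("a", 100.0, 9.0), ServerNode("b", 2000.0, 2.0)]
roles = {n.name: assign_role_by_compute(n, 5.0) for n in cluster}
```

Either rule alone partitions the cluster, which matches the "or" between the two branches in the text.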
In some optional embodiments of the present invention, the step of sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node further includes:
detecting the specific use condition of each server node;
and distributing data nodes and corresponding spaces on each server node by adopting a preset balance strategy according to the detection result.
Wherein the specific use conditions include: the communication condition of the server, the storage space utilization rate of each data node, and other conditions or data related to the performance of the server nodes. In this way, query tasks can be distributed differently according to the different conditions of the server nodes, and the processing speed and efficiency of the tasks can be optimized.
Further, in some optional embodiments, the balancing policy is a lowest space usage policy.
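The lowest-space-usage policy reduces to picking the node whose detected utilization is smallest. A minimal sketch, where the usage mapping is an illustrative stand-in for the detected conditions of each server node:

```python
def pick_node(space_usage):
    """Return the server node with the lowest space utilization rate."""
    return min(space_usage, key=space_usage.get)

# Detected utilization per node (fractions of used space; illustrative values).
usage = {"node1": 0.72, "node2": 0.35, "node3": 0.88}
target = pick_node(usage)   # "node2" has the lowest utilization
```

New data-node allocations would then be placed on `target`, gradually evening out storage pressure across the cluster.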
In some optional embodiments of the invention, the step of maintaining the update of the data information table on each server node further comprises: copying and storing the position information and the related operation information of the data information table. In this way, because the position information and related operation information of the data information table are saved, all data processing is recorded, and the user can perform processing such as undoing an operation according to this information.
In some optional embodiments of the invention, the step of maintaining the update of the data information table at each server node further comprises:
all data in each data node are backed up to the designated data node; namely, the safety of the data is ensured through the backup data, and the self-adaptive backup data is obtained.
Keeping all data in each data node and synchronous updating of corresponding data backup; by synchronous updating of the saved data, all data in each data node and the backed-up data are kept consistent.
The position information of each data node is stored on each data node, so that a data node can be located through its position information, which facilitates undo operations.
The location information of each data node may be maintained in the database by updating the primary database copy that specifies the database address. All information of the original data can be stored in the data nodes, and according to this information, the backup data can be retrieved from the corresponding data node or backup data node.
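The backup scheme above (mirror every write to a designated backup node, and keep a map of where each node's backup lives) can be sketched as follows. The class shape and the node/backup names are assumptions for illustration only.

```python
class DataNode:
    """A toy data node: local key-value data plus synchronous backup writes."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, backup):
        self.data[key] = value     # local write on this data node
        backup.data[key] = value   # synchronous update of the designated backup

# Illustrative cluster: two data nodes sharing one designated backup node.
nodes = {name: DataNode(name) for name in ("node1", "node2", "backup1")}
# Location information: which node holds each data node's backup.
location = {"node1": "backup1", "node2": "backup1"}

nodes["node1"].write("k", 42, nodes[location["node1"]])
```

Because every write goes through both copies, the data node and its backup stay consistent, and the `location` map lets any node find the backup when retrieval or an undo is needed.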
In some optional embodiments of the present invention, the step of parsing the initial query instruction further includes:
performing lexical analysis on characters input by a user side to obtain words meeting the standard;
carrying out syntactic analysis on a plurality of continuous words to obtain sentences conforming to syntactic logic, and simultaneously constructing to obtain an abstract syntactic tree;
and converting the logical SQL statement into a real SQL statement according to the abstract syntax tree, so as to obtain a real query statement which meets the standard and is used as a real query instruction.
Therefore, the real query instruction can be obtained through the gradual analysis of the initial query instruction, and the subsequent data query task is further executed.
Optionally, in the data distribution and parallel processing method of the present invention, a query table copy is pre-stored on each server node. Thus, when a query instruction in the current server node encounters a concurrency conflict and the current server node cannot execute it, after a cannot-query result is returned to the processing node, the processing node finds, through the data information table, the backup server where the self-adaptive backup corresponding to the current server's data is located, and then instructs the backup server to query the backup data according to the query instruction in the query table copy, thereby completing the query of the data.
Referring to fig. 3, a schematic diagram of an embodiment of a data distribution and parallel processing system according to the present invention is shown. The data distribution and parallel processing system comprises:
an input unit 301, configured to receive an initial query instruction input by a user end;
an analyzing unit 302, configured to analyze the initial query instruction to obtain a real query statement meeting a standard, and use the real query statement as a real query instruction;
the distributing unit 303 is configured to send the real query instruction to a work distribution node, and distribute the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes which are divided by all server nodes in the database according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes;
the processing unit 304 is configured to execute the received real query instruction on each data node, return a query result to the work distribution node, and keep updating the data information table on each server node;
a backup unit 305, configured to backup, on each server node, a lookup table corresponding to the real query instruction, so that a lookup table copy is stored on each server node;
and the concurrency unit 306 is used for executing the real query instruction concurrently by calling the copy of the query table on the current server node when the concurrency conflict is detected on each server node.
As can be seen from the foregoing embodiments, the data distribution and parallel processing system according to the present invention backs up the lookup table through the backup unit 305, and then concurrently processes the query task through the concurrent unit 306, so that the data distribution and parallel processing system can not only ensure data consistency, but also improve data query speed and efficiency.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variances that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.