CN116150179A

CN116150179A - Method and device for comparing data consistency between databases

Info

Publication number: CN116150179A
Application number: CN202310394989.8A
Authority: CN
Inventors: 卜洪涛; 刘金鑫
Original assignee: Tianjin Nankai University General Data Technologies Co ltd
Current assignee: Tianjin Nankai University General Data Technologies Co ltd
Priority date: 2023-04-14
Filing date: 2023-04-14
Publication date: 2023-05-23
Also published as: WO2024212312A1

Abstract

The application provides a method and a device for comparing data consistency between databases, in the field of inter-database data consistency comparison. The method comprises the following steps: selecting a field from the table as a condition column for calculating data block boundaries, and calculating the maximum value and the minimum value of the condition column; calculating a data block boundary starting from the minimum value, marking the maximum value of that block as the minimum value for the next boundary query, and repeating until the data block boundaries of the whole table have been calculated; configuring and starting 2n threads, with n threads allocated to process source table data and n threads allocated to process target table data, each thread obtaining data block boundary values from a condition queue; and querying all primary key values within the boundary value range of the source table, and calculating, according to the primary key, the difference data between the source table and the target table for the same data block boundary. The method uses an algorithm to decompose the data into a plurality of data block boundaries; each data block boundary can be queried and compared independently, and multiple data block boundaries can be compared in parallel, which increases comparison speed and reduces comparison difficulty.

Description

Method and device for comparing data consistency between databases

Technical Field

The present disclosure relates to the field of data consistency comparison between databases, and in particular to a method for comparing data consistency between databases. The application also relates to a device for comparing data consistency between databases.

Background

With the development of big data, many business scenarios involve data synchronization.

In the prior art, it is generally necessary to synchronize data from a primary node to a backup node, or to synchronize the data of a table in one type of database to a table in another type of database. If data inconsistency occurs during synchronization, the difference data is usually identified manually.

The drawback of the prior art is that comparing difference data manually is difficult, particularly for heterogeneous databases that cannot communicate with each other directly.

Disclosure of Invention

The present application provides a method for comparing data consistency between databases, which aims to overcome the difficulty of comparing difference data manually in the prior art. The application also relates to a device for comparing data consistency between databases.

The data consistency comparison method between databases provided by the application comprises the following steps:

selecting a field from the table data as a condition column for calculating data block boundaries, and calculating the maximum value and the minimum value of the condition column of the table;

calculating a data block boundary starting from the minimum value, marking the maximum value of that block as the next boundary query minimum value, and repeating until the data block boundaries of the whole table have been calculated;

configuring and starting 2n threads, with n threads allocated to process source table data and n threads allocated to process target table data, and obtaining data block boundary values from a condition queue;

and querying all primary key values within the boundary value range of the source table, and calculating, according to the primary key, the difference data between the source table and the target table in the data corresponding to the same data block boundary.

Optionally, the condition column is indexed.

Optionally, said calculating the maximum value and the minimum value of the condition columns of the table includes:

calculating the maximum value and the minimum value of the condition column of the table by [ select min(c1), max(c1) from t ];

wherein c1 represents a condition column.

Optionally, the calculating the data block boundary includes:

calculating the data block boundary by [ select max(c1) from t where c1 >= boundary query minimum value order by c1 limit 1000 ], the data block boundary being the range from the value of the query condition in the SQL (the boundary query minimum value) to the returned max(c1) value;

wherein c1 represents a condition column.

Optionally, the query for all primary key values within the boundary value range of the source table takes the following form:

select primary key column 1, ..., primary key column n from t where comparison column >= boundary minimum and comparison column <= boundary maximum order by comparison column desc.
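As a minimal sketch of how such a query could be assembled for a table with a joint primary key, the following Python example builds the statement from a list of key columns. The helper name `build_key_query`, the table `t` with columns `k1`, `k2` and `c1`, and the use of SQLite are assumptions made for illustration only. Identifiers are interpolated directly into the statement, so they are assumed to come from trusted configuration.

```python
import sqlite3


def build_key_query(table, key_columns, condition_column):
    """Build the primary key range query for one data block boundary.

    Hypothetical helper: only the two boundary values are bound as parameters.
    """
    columns = ", ".join(key_columns)
    return (
        f"SELECT {columns} FROM {table} "
        f"WHERE {condition_column} >= ? AND {condition_column} <= ? "
        f"ORDER BY {condition_column} DESC"
    )


# Illustrative table with a joint primary key (k1, k2) and a separate condition column c1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k1 INTEGER, k2 TEXT, c1 INTEGER, PRIMARY KEY (k1, k2))")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(i, f"k{i}", i * 10) for i in range(1, 6)],
)

sql = build_key_query("t", ["k1", "k2"], "c1")
keys = conn.execute(sql, (20, 40)).fetchall()  # rows whose c1 lies in [20, 40]
```

In this sketch the condition column `c1` is not part of the primary key, which matches the case where the boundary column and the key columns differ.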

Optionally, the method further comprises: querying all primary key values within the boundary of the source table and recording them into a source table block data container.

Optionally, the recording into the source table block data container includes:

the usage size of the data container is controlled by configuration.

Optionally, calculating the difference data between the source table and the target table in the data corresponding to the same data block boundary comprises the following steps:

marking the source table data block and the target table data block of the same data block boundary as the same group;

and reading the data marked as the same group, performing a bidirectional comparison, calculating the differing primary key data, and then writing the data to a file.

Optionally, writing the data to a file includes:

for a primary key that exists in the source table but does not exist in the target table, recording the primary key data in file 1;

and for a primary key that exists in the target table but does not exist in the source table, recording the primary key data in file 2.

The application also provides a data consistency comparison device between databases, which comprises:

the first calculation module is used for selecting a field from the table data as a condition column for calculating the boundary of the data block and calculating the maximum value and the minimum value of the condition column of the table;

the second calculation module is used for calculating a data block boundary starting from the minimum value, marking the maximum value of that block as the next boundary query minimum value, and repeating until the data block boundaries of the whole table have been calculated;

the configuration inquiry module is used for configuring and starting 2n threads, respectively distributing n threads to be responsible for processing source table data and target table data, and acquiring a data block boundary value from a condition queue;

and the comparison module is used for inquiring all the primary key values in the boundary value range of the source table and calculating difference data of the source table and the target table in the data corresponding to the same data block boundary according to the primary key.

The application has the advantages and beneficial effects that:

The data consistency comparison method between databases provided by the application comprises the following steps: selecting a field from the table data as a condition column for calculating data block boundaries, and calculating the maximum value and the minimum value of the condition column of the table; calculating a data block boundary starting from the minimum value, marking the maximum value of that block as the next boundary query minimum value, and repeating until the data block boundaries of the whole table have been calculated; configuring and starting 2n threads, with n threads allocated to process source table data and n threads allocated to process target table data, and obtaining data block boundary values from a condition queue; and querying all primary key values within the boundary value range of the source table, and calculating, according to the primary key, the difference data between the source table and the target table in the data corresponding to the same data block boundary. The method and device use an algorithm to rapidly decompose the data into a plurality of data block boundaries; each data block boundary can be queried and compared independently, and multiple data block boundaries can be compared in parallel, which increases comparison speed and reduces comparison difficulty.

Drawings

FIG. 1 is a flow chart of the data consistency comparison between databases in the present application.

FIG. 2 is a schematic diagram of data consistency comparison logic between databases in the present application.

FIG. 3 is a schematic diagram of a data consistency comparison device between databases in the present application.

Detailed Description

The present application is further described in conjunction with the drawings and detailed embodiments so that those skilled in the art may better understand the present application and practice it.

The following are examples of specific implementation provided for the purpose of illustrating the technical solutions to be protected in this application in detail, but this application may also be implemented in other ways than described herein, and one skilled in the art may implement this application by using different technical means under the guidance of the conception of this application, so this application is not limited by the following specific embodiments.

Referring to FIG. 1, the present application aims to solve the problem of slow data comparison in conventional methods. An algorithm rapidly decomposes the data into a plurality of data block boundaries; each data block (chunk) boundary can be queried and compared independently, and the data block boundaries can be compared in parallel, which improves performance. During comparison, only primary keys are compared, and the comparison is bidirectional.

For a primary key that exists in the source table but does not exist in the target table, the primary key data is recorded in file 1.

For a primary key that exists in the target table but does not exist in the source table, the primary key data is recorded in file 2.

In this technical solution, the condition column used to cut data block boundaries is not required to be a primary key, so usability and efficiency are not affected even when a joint primary key exists. In addition, memory and CPU usage are taken into account, so that the best performance of the comparison task is achieved through reasonable configuration.
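The file 1 and file 2 rule can be illustrated with a short Python sketch based on set differences. The function name `write_differences`, the file names, and the one-key-per-line, comma-separated output format are assumptions for illustration; the source does not specify a file format.

```python
import os
import tempfile


def write_differences(source_keys, target_keys, file1_path, file2_path):
    """Compare two collections of primary key tuples in both directions."""
    source_set, target_set = set(source_keys), set(target_keys)
    only_in_source = sorted(source_set - target_set)  # goes to file 1
    only_in_target = sorted(target_set - source_set)  # goes to file 2
    for path, keys in ((file1_path, only_in_source), (file2_path, only_in_target)):
        # Append so that results from several data blocks accumulate in one file.
        with open(path, "a", encoding="utf-8") as handle:
            for key in keys:
                handle.write(",".join(str(part) for part in key) + "\n")
    return only_in_source, only_in_target


out_dir = tempfile.mkdtemp()
file1 = os.path.join(out_dir, "file1.txt")
file2 = os.path.join(out_dir, "file2.txt")
missing_in_target, missing_in_source = write_differences(
    [(1,), (2,), (3,)], [(2,), (3,), (4,)], file1, file2
)
```

Keys are modelled as tuples so that a joint primary key with several columns is handled the same way as a single-column key.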

As shown in FIG. 1, in S101 a field is selected from the table data as the condition column for calculating data block boundaries, and the maximum value and the minimum value of the condition column in the table are calculated.

The data block boundaries of a table are calculated, where the tables comprise a source table and a target table.

In the present application, calculating the data block boundaries of the table is the most important step. Once the boundaries have been calculated rapidly, the data within each data block boundary can be queried in parallel by multiple threads for comparison, which improves performance.

Specifically, a field is first selected as the condition column for calculating data block boundaries; typically the column should be indexed and its values should contain as few duplicates as possible. In this application, the condition column is denoted c1.

The maximum and minimum values of the condition column in the table are calculated by [ select min(c1), max(c1) from t ], where min(c1) is recorded as the initial boundary query minimum value.
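A minimal Python sketch of this step, assuming an SQLite database for demonstration, is shown below. The function name `condition_column_range` and the sample table are hypothetical; table and column names are interpolated into the statement and are assumed to come from trusted configuration.

```python
import sqlite3


def condition_column_range(conn, table, column):
    """Return (min, max) of the condition column, mirroring `select min(c1), max(c1) from t`."""
    row = conn.execute(f"SELECT MIN({column}), MAX({column}) FROM {table}").fetchone()
    return row[0], row[1]


# Illustrative data: c1 holds the values 1..1000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 INTEGER PRIMARY KEY, c2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(1, 1001)])

initial_minimum, table_maximum = condition_column_range(conn, "t", "c1")
```

With an index on the condition column, many engines can answer both aggregates from the index alone, which is one reason the source recommends an indexed column.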

As shown in FIG. 1, in S102 a data block boundary is calculated starting from the minimum value, the maximum value of that block is marked as the next boundary query minimum value, and the calculation is repeated until the data block boundaries of the whole table are obtained.

The data block boundary is calculated by [ select max(c1) from t where c1 >= boundary query minimum value order by c1 limit 1000 ]: the boundary is the range from the value of the query condition in the SQL (the boundary query minimum value) to the returned max(c1) value, and the max(c1) value is marked as the next boundary query minimum value.

This calculation is repeated until the data block boundaries of the entire table have been calculated.
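The repeated boundary calculation can be sketched in Python as follows, using SQLite for demonstration. Two assumptions are made here and are not taken from the source: first, because most SQL engines evaluate an aggregate before applying `limit`, the sketch expresses the apparent intent (the maximum of the first `limit` rows in c1 order) with a subquery; second, the loop raises an error if a block makes no progress, which can only happen when a single value is duplicated at least `limit` times. The function name `compute_block_boundaries` is hypothetical.

```python
import sqlite3


def compute_block_boundaries(conn, table, column, limit):
    """Return a list of (lower, upper) boundaries covering the whole table."""
    lowest, highest = conn.execute(
        f"SELECT MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()
    boundaries = []
    if lowest is None:  # empty table: nothing to compare
        return boundaries
    lower = lowest
    while True:
        # Maximum of the first `limit` rows at or above the current lower bound.
        upper = conn.execute(
            f"SELECT MAX({column}) FROM ("
            f"SELECT {column} FROM {table} WHERE {column} >= ? "
            f"ORDER BY {column} LIMIT ?)",
            (lower, limit),
        ).fetchone()[0]
        boundaries.append((lower, upper))
        if upper >= highest:
            return boundaries
        if upper == lower:
            raise ValueError("no progress: too many duplicate values for this limit")
        lower = upper  # the block maximum becomes the next boundary query minimum


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 1001)])
boundaries = compute_block_boundaries(conn, "t", "c1", 100)
```

Because both ends of a boundary are inclusive and the block maximum is reused as the next minimum, adjacent blocks share one endpoint value.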

A specific example illustrates the results of the above steps as follows:

Assume that table t has the columns c1, c2 and c3 and contains 1000 rows. For ease of demonstration, assume that the values of column c1 are data1 through data1000, 1000 values in total, and that the limit used in the calculation is 100. The data blocks produced by the above rule are recorded as follows:

Data block range	Data block ID
data1-data100	1
data100-data200	2
... ...	... ...
data800-data900	9
data900-data1000	10

The data block boundaries obtained in this way are put into a condition queue for processing in the subsequent steps.

As shown in FIG. 1, in S103 2n threads are configured and started, with n threads allocated to process source table data and n threads allocated to process target table data, and each thread acquires data block boundary values from the condition queue.

Multiple threads are used: after acquiring a data block boundary from the condition queue, each thread reads the primary key data to be compared from its table and stores it in memory for the subsequent comparison.
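The following Python sketch illustrates the 2n-thread arrangement under stated assumptions. It uses one queue per side so that every boundary is read once from the source and once from the target; the source speaks only of "a condition queue", so this split is an assumption. The function name `read_all_blocks` and the `fetch_*` callables, which stand in for the primary key range queries, are hypothetical.

```python
import queue
import threading


def read_all_blocks(boundaries, fetch_source_keys, fetch_target_keys, n):
    """Start 2n threads: n read source blocks and n read target blocks."""
    source_queue, target_queue = queue.Queue(), queue.Queue()
    for block_id, bounds in enumerate(boundaries, start=1):
        source_queue.put((block_id, bounds))
        target_queue.put((block_id, bounds))

    results = {"source": {}, "target": {}}
    lock = threading.Lock()

    def worker(side, work_queue, fetch):
        while True:
            try:
                block_id, (lower, upper) = work_queue.get_nowait()
            except queue.Empty:
                return  # no boundaries left for this side
            keys = fetch(lower, upper)
            with lock:  # keep the primary keys in memory for the later comparison
                results[side][block_id] = keys

    threads = [
        threading.Thread(target=worker, args=("source", source_queue, fetch_source_keys))
        for _ in range(n)
    ] + [
        threading.Thread(target=worker, args=("target", target_queue, fetch_target_keys))
        for _ in range(n)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# In-memory stand-ins for the two tables; key 5 is missing from the target.
source_keys = list(range(1, 21))
target_keys = [key for key in source_keys if key != 5]
results = read_all_blocks(
    [(1, 10), (10, 20)],
    lambda lower, upper: {k for k in source_keys if lower <= k <= upper},
    lambda lower, upper: {k for k in target_keys if lower <= k <= upper},
    n=2,
)
```

Keying the results by data block ID lets a later step pair the source and target blocks of the same boundary.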

The condition column used for data block boundaries is not required to be the primary key, because the primary key may in principle be a joint primary key, and using multiple columns as the condition column would make boundary calculation more difficult and affect performance. The basic algorithm process is as follows:

As shown in FIG. 2, in S201 the boundary query thread queries the data block boundaries.

S202: the determined source table condition column and target table condition column are prepared.

S203: 2n threads are started through configuration, where n threads are responsible for processing source table data and n threads are responsible for processing target table data.

S204: each source table thread acquires a data block boundary value from the condition queue, then queries all primary key values of the source table within that boundary and records them into the source table block data container. Its SQL is of the form: select primary key column 1, ..., primary key column n from t where comparison column >= boundary minimum and comparison column <= boundary maximum order by comparison column desc.

Each target table thread acquires a data block boundary value from the condition queue, then queries all primary key values of the target table within that boundary and records them into the target table block data container. Its SQL takes the same form, issued against the target table.

When the block data container exceeds the specified size, a thread putting data into it is blocked; the blocked data can be put in only after subsequent comparison threads have processed and destroyed existing data blocks. To control memory usage, the usage size of the data container can be controlled by configuration.
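The blocking behaviour of a size-limited container can be sketched with Python's standard `queue.Queue`, whose `put()` blocks while the queue holds `maxsize` items. The function name `read_and_compare` and the use of a block count (rather than a byte size) as the configured limit are assumptions for illustration.

```python
import queue
import threading


def read_and_compare(blocks, max_blocks):
    """Pass key blocks from a reader thread to a comparison thread via a bounded container."""
    container = queue.Queue(maxsize=max_blocks)  # put() blocks while the container is full
    processed = []

    def reader():
        for block in blocks:
            container.put(block)  # waits here until the comparer frees a slot
        container.put(None)  # sentinel: no more blocks

    def comparer():
        while True:
            block = container.get()
            if block is None:
                return
            processed.append(len(block))  # stand-in for the real comparison
            # Dropping the reference lets the block be garbage collected ("destroyed").
            del block

    threads = [threading.Thread(target=reader), threading.Thread(target=comparer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


sizes = read_and_compare([list(range(n)) for n in (3, 5, 2, 4)], max_blocks=2)
```

A smaller `max_blocks` lowers peak memory at the cost of making reader threads wait more often, which is the trade-off the configuration is meant to control.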

As shown in FIG. 1, in S104 all primary key values within the boundary value range of the source table are queried, and the difference data between the source table and the target table in the data corresponding to the same data block boundary is calculated according to the primary key.

With continued reference to FIG. 2, in S205 the source table data block and the target table data block of the same data block boundary are marked as the same group, and a thread acquires the groups whose data has been fully read and performs the bidirectional comparison.

S206: the differing primary key data is calculated and written to a file, and the data blocks are destroyed after they have been compared, so as to release space.

Finally, the difference data is written out to generate the files.

For the comparison result, the files are written according to the following rule: a primary key that exists in the source table but not in the target table is recorded in file 1, and a primary key that exists in the target table but not in the source table is recorded in file 2.
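The grouping step can be sketched in Python by pairing blocks that share a data block ID and removing each pair from its container once compared. The function name `compare_completed_groups` and the dictionary-based containers are assumptions made for illustration.

```python
def compare_completed_groups(source_blocks, target_blocks):
    """Compare every data block that has been read on both sides, then release it.

    Both arguments map a data block ID to the set of primary keys read for it.
    """
    differences = {}
    for block_id in sorted(source_blocks.keys() & target_blocks.keys()):
        # pop() removes the block from its container so the memory can be reclaimed.
        source_keys = source_blocks.pop(block_id)
        target_keys = target_blocks.pop(block_id)
        only_in_source = source_keys - target_keys
        only_in_target = target_keys - source_keys
        if only_in_source or only_in_target:
            differences[block_id] = (only_in_source, only_in_target)
    return differences


# Block 3 has only been read from the source so far, so it is left in place.
source_blocks = {1: {1, 2, 3}, 2: {4, 5}, 3: {7}}
target_blocks = {1: {1, 3}, 2: {4, 5, 6}}
result = compare_completed_groups(source_blocks, target_blocks)
```

Blocks that are present on only one side remain in their container until the matching block arrives.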

As shown in fig. 3, the present application further provides a device for comparing data consistency between databases, where the device is configured to execute the above method.

The first calculation module 301 selects a field from the table data as the condition column for calculating data block boundaries, and calculates the maximum value and the minimum value of the condition column of the table.

Specifically, a field is selected as the condition column for calculating data block boundaries; the column should generally be indexed and its values should contain as few duplicates as possible. In this application, the condition column is denoted c1.

The second calculation module 302 calculates a data block boundary starting from the minimum value, marks the maximum value of that block as the next boundary query minimum value, and repeats the calculation until the data block boundaries of the whole table are obtained.

The data block boundary is calculated by [ select max(c1) from t where c1 >= boundary query minimum value order by c1 limit 1000 ]: the boundary is the range from the value of the query condition in the SQL (the boundary query minimum value) to the returned max(c1) value, and the max(c1) value is marked as the next boundary query minimum value.

The configuration query module 303 configures and starts 2n threads, allocates n threads to process source table data and n threads to process target table data, and acquires data block boundary values from the condition queue.

The condition column is not required to be the primary key, because the primary key may in principle be a joint primary key, and using multiple columns as the condition column would make boundary calculation more difficult and affect performance. The basic algorithm is as follows:

2n threads are started through configuration, where n threads are responsible for processing source table data and n threads are responsible for processing target table data.

Each source table thread acquires a data block boundary value from the condition queue, then queries all primary key values of the source table within that boundary and records them into the source table block data container. Its SQL is of the form: select primary key column 1, ..., primary key column n from t where comparison column >= boundary minimum and comparison column <= boundary maximum order by comparison column desc.

When the block data container exceeds the specified size, a thread putting data into it is blocked; the blocked data can be put in only after subsequent comparison threads have processed and destroyed existing data blocks. To control memory usage, the usage size of the data container can be controlled by configuration.

The comparison module 304 queries all primary key values within the boundary value range of the source table, and calculates, according to the primary key, the difference data between the source table and the target table in the data corresponding to the same data block boundary.

Specifically, the source table data block and the target table data block of the same data block boundary are marked as the same group; a thread acquires the groups whose data has been fully read, performs the bidirectional comparison, calculates the differing primary key data and writes it to a file, and destroys the data blocks after they have been compared so as to release space.

Finally, the difference data is written out to generate the files.

For the comparison result, the files are written according to the following rule: a primary key that exists in the source table but not in the target table is recorded in file 1, and a primary key that exists in the target table but not in the source table is recorded in file 2.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. The data consistency comparison method between databases is characterized by comprising the following steps:

2. The method of claim 1, wherein the condition columns are indexed.

3. The method for comparing data consistency between databases according to claim 1, wherein said calculating the maximum and minimum values of the condition column of the table comprises:

wherein c1 represents the condition column, min(c1) is marked as the initial boundary query minimum value, and max(c1) is the next boundary query minimum value.

4. A method of data consistency comparison between databases as claimed in claim 3, wherein said calculating data block boundaries comprises:

calculating the data block boundary by [ select max(c1) from t where c1 >= boundary query minimum value order by c1 limit 1000 ], the data block boundary being the range from the value of the query condition in the SQL to the max(c1) value, wherein max(c1) is the next boundary query minimum value;

wherein c1 represents the condition column.

5. The method for comparing data consistency among databases according to claim 1, wherein the query for all primary key values within the boundary value range of the source table takes the following form:

6. The method for comparing data consistency among databases according to claim 1, further comprising: all primary key values of the boundary of the source table are queried and recorded into a source table block data container.

7. The method for comparing data consistency between databases according to claim 6, wherein said recording into a source table block data container comprises:

the usage size of the data container is controlled by configuration.

8. The method for comparing data consistency between databases according to any one of claims 1 to 7, wherein calculating the difference data between the source table and the target table in the data corresponding to the same data block boundary comprises the following steps:

9. The method for comparing data consistency among databases according to claim 8, wherein writing the data to a file comprises:

10. A data consistency comparison apparatus between databases, comprising: