[go: up one dir, main page]

CN116010412A - Database full synchronization breakpoint continuous transmission method, system and storage medium - Google Patents

Database full synchronization breakpoint continuous transmission method, system and storage medium Download PDF

Info

Publication number
CN116010412A
CN116010412A CN202211681968.6A CN202211681968A CN116010412A CN 116010412 A CN116010412 A CN 116010412A CN 202211681968 A CN202211681968 A CN 202211681968A CN 116010412 A CN116010412 A CN 116010412A
Authority
CN
China
Prior art keywords
synchronization
database
batch
field
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211681968.6A
Other languages
Chinese (zh)
Inventor
林佩航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Cloud Technology Co Ltd
Original Assignee
China Telecom Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Cloud Technology Co Ltd filed Critical China Telecom Cloud Technology Co Ltd
Priority to CN202211681968.6A priority Critical patent/CN116010412A/en
Publication of CN116010412A publication Critical patent/CN116010412A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明涉及信息技术领域,具体涉及一种数据库全量同步断点续传方法、系统及存储介质,所述方法包括以下步骤:选定数据库的每个表的切分字段,切分为若干个批次,将批次排序;记录起始值和结束值,持久化存储到外部存储;建立子任务线程池和与子任务线程的数量匹配的发送窗口;当出现断点恢复时,从外部存储读取对应批次的切分字段的起始值和结束值;构建查询条件,从数据库对应表中查询满足切分字段的起始值和结束值的全部数据,作为恢复的批次,将恢复的批次提交给子任务线程进行同步。本发明的有益技术效果包括:能够合理地控制重传的粒度,达到重传数据量和并发量之间的平衡;断点恢复时速度快,易于实践。

Figure 202211681968

The present invention relates to the field of information technology, and in particular to a method, system and storage medium for full synchronous breakpoint resume transmission of a database. times, sort the batches; record the start value and end value, and store them persistently in external storage; establish a subtask thread pool and a sending window that matches the number of subtask threads; when a breakpoint recovery occurs, read from the external storage Get the start value and end value of the segmentation field corresponding to the batch; construct the query condition, and query all the data satisfying the start value and end value of the segmentation field from the corresponding table in the database, as the restored batch, the restored Batches are submitted to subtask threads for synchronization. The beneficial technical effects of the present invention include: the granularity of retransmission can be reasonably controlled, and the balance between the amount of retransmitted data and the amount of concurrency can be achieved; the speed of breakpoint recovery is fast, and it is easy to practice.

Figure 202211681968

Description

一种数据库全量同步断点续传方法、系统及存储介质A method, system, and storage medium for full-volume synchronous breakpoint resuming of database

技术领域technical field

本发明涉及信息技术领域,具体涉及一种数据库全量同步断点续传方法、系统及存储介质。The present invention relates to the field of information technology, in particular to a method, system and storage medium for full-scale synchronous breakpoint resume transmission of databases.

背景技术Background technique

当下主流的数据库全量同步技术无法实现断点续传或者不能很好地控制断点续传的粒度,不能够在并发度与重传数据量之间达到均衡。传统的数据库全量同步方法当出现部分数据入库失败时通过重试的方式进行再入库,对程序进程奔溃、机器重启等场景无能为力,无法实现断点续传。虽然存在公开技术实现数据库全量同步断点续传的其中一种方式是在同步过程中实时标记同步进度,恢复时从标记的断点处继续传输。但此种方式存在多线程并发同步场景难以实施,单线程同步场景效率低下的问题。为此需要研究能够提高数据同步效率的数据库同步技术。The current mainstream database full-volume synchronization technology cannot implement resuming transmission from breakpoints or control the granularity of resuming transmission from breakpoints well, and cannot achieve a balance between the degree of concurrency and the amount of retransmitted data. The traditional full database synchronization method re-enters the database by retrying when some data fails to be stored in the database. It is helpless for scenarios such as program process crashes and machine restarts, and cannot achieve resuming transmission from breakpoints. Although there is a public technology, one of the ways to realize the resuming of the full database synchronization breakpoint is to mark the synchronization progress in real time during the synchronization process, and continue the transmission from the marked breakpoint when restoring. However, this method has the problems of difficulty in implementing multi-threaded concurrent synchronization scenarios and low efficiency in single-threaded synchronization scenarios. Therefore, it is necessary to study database synchronization technology that can improve the efficiency of data synchronization.

现有技术公开了一种基于HTTP协议高速度、高稳定以及断点续传的数据同步方法,包括中心系统数据同步模块和子节点数据同步模块。中心系统的数据录入管理模块用于对更新的业务数据进行处理,处理存储后调用中心系统的数据更新通知模块,通知模块通过消息队列链路向子节点的数据更新监听模块发送通知,数据更新监听模块再调用子节点的数据同步请求模块来向中心系统的同步数据下载模块发起请求,获取到请求的版本数据,再循环请求处理直到子节点此种业务数据的最大版本号与中心系统最大版本号一致后停止数据同步,等待下一次中心系统的更新通知或者到达定时周期再向中心系统的同步数据下载模块请求新的版本数据。其技术方案能够在网络不稳定情况下一中心系统与多子节点之间也能够快速高效稳定的进行增量数据同步。但其技术方案不能实现断点续传粒度的控制,且不适合多线程并行执行。The prior art discloses a high-speed, high-stability data synchronization method based on the HTTP protocol, which includes a central system data synchronization module and a sub-node data synchronization module. The data entry management module of the central system is used to process the updated business data, and call the data update notification module of the central system after processing and storage. The notification module sends a notification to the data update monitoring module of the child node through the message queue link, and the data update monitoring module The module then calls the data synchronization request module of the child node to initiate a request to the synchronization data download module of the central system, obtains the requested version data, and recycles the request until the maximum version number of the business data of the child node and the maximum version number of the central system Stop data synchronization after consistency, wait for the next update notification from the central system or arrive at the timing period, and then request new version data from the synchronization data download module of the central system. Its technical solution can quickly, efficiently and stably perform incremental data synchronization between a central system and multiple sub-nodes even when the network is unstable. However, its technical solution cannot realize the control of the granularity of resuming transmission from breakpoints, and is not suitable for multi-threaded parallel execution.

发明内容Contents of the invention

本发明所要解决的技术问题:目前缺乏支持断点续传且适合多线程并行执行的数据库同步方案的技术问题。提出了一种数据库全量同步断点续传方法、系统及存储介质,能够实现断点续传,且断点粒度可控,能够适合多线程并行执行。The technical problem to be solved by the present invention is the lack of a database synchronization solution that supports resumable transfer from breakpoints and is suitable for multi-thread parallel execution. A method, system, and storage medium for full synchronous breakpoint resume transmission of databases are proposed, which can realize breakpoint resume transmission, and the breakpoint granularity is controllable, which is suitable for multi-thread parallel execution.

为解决上述技术问题,本发明采用如下技术方案:一种数据库全量同步断点续传方法,包括以下步骤:In order to solve the above-mentioned technical problems, the present invention adopts the following technical solution: a method for resuming data from a breakpoint in full synchronization of a database, comprising the following steps:

选定所述数据库的每个表的切分字段,依据切分字段将表数据切分为若干个批次,将所述批次排序;Select the segmentation field of each table of the database, divide the table data into several batches according to the segmentation field, and sort the batches;

记录每个所述批次对应切分字段的起始值和结束值,将批次的排序序号及对应的表名、起始值和结束值持久化存储到外部存储;Record the start value and end value of each batch corresponding to the segmentation field, and store the sort number of the batch and the corresponding table name, start value and end value in an external storage persistently;

建立子任务线程池和与子任务线程的数量匹配的发送窗口,将处于所述发送窗口内的批次提交子任务线程进行同步,所述批次同步完成后更新发送窗口的位置;Set up a subtask thread pool and a sending window matched with the number of subtask threads, and submit subtask threads within the sending window for synchronization, and update the sending window position after the batch synchronization is completed;

当出现断点恢复时,从外部存储读取对应批次的切分字段的起始值和结束值;When a breakpoint recovery occurs, read the start value and end value of the segmentation field of the corresponding batch from the external storage;

构建查询条件,从所述数据库对应表中查询满足所述切分字段的起始值和结束值的全部数据,作为恢复的批次,将恢复的所述批次提交给子任务线程进行同步。Construct a query condition, query all data satisfying the start value and end value of the segmentation field from the corresponding table of the database, and submit the recovered batch to the subtask thread for synchronization as a restored batch.

作为优选,选定切分字段的方法包括:Preferably, the method for selecting the segmentation field includes:

若数据库的表为单主键表,则选定所述表的切分字段为所述主键;If the table of the database is a single primary key table, then select the segmentation field of the table as the primary key;

若数据库的表为联合主键,则选择区分度最高的字段作为所述表的切分字段。If the table of the database is a joint primary key, the field with the highest degree of discrimination is selected as the splitting field of the table.

作为优选,计算字段的区分度的方法为:区分度=count(distinct 字段名)/count(字段名),count(distinct 字段名)指将表中字段值去重后的表的条目数量,count(字段名)指表的条目数量。As a preference, the method for calculating the degree of discrimination of a field is: degree of discrimination=count(distinct field name)/count(field name), count(distinct field name) refers to the number of entries in the table after deduplicating the field values in the table, count (field name) refers to the number of entries in the table.

作为优选,依据切分字段将表数据切分为若干个批次的方法包括:Preferably, the method of dividing the table data into several batches according to the division field includes:

设定划分批次数量为n,读取切分字段的最小数值变化步长m;Set the number of divided batches to n, and read the minimum value change step m of the segmented field;

计算划分步长step,step=max(count(切分字段)/n,m);Calculate the division step step, step=max(count(segmentation field)/n,m);

按照划分步长step将切分字段的取值划分为若干个区间,每个区间对应的表的条目作为一个批次。Divide the value of the segmentation field into several intervals according to the division step step, and the table entries corresponding to each interval are regarded as a batch.

作为优选,建立发送窗口的方法包括:Preferably, the method for establishing the sending window includes:

设置所述发送窗口的大小等于所述子任务线程的数量;Setting the size of the sending window equal to the number of subtask threads;

将所述发送窗口的起始位置设置在第一个批次,建立ack指针指向所述发送窗口的起始位置,所述ack指针持久化存储在外部存储。The starting position of the sending window is set in the first batch, and an ack pointer is established to point to the starting position of the sending window, and the ack pointer is persistently stored in an external storage.

作为优选,所述外部存储还存储由批次的同步状态,所述同步状态初始值为等待调度,建立子任务线程池后,将提交给子任务线程的批次的同步状态设置为发送中,所述批次同步完成后,更新对应批次的同步状态为成功,所述批次同步中断,则立即重试同步直至同步完成,更新所述发送窗口的位置的方法为:当所述发送窗口起始位置的批次同步成功后,将ack指针自增1,直至所述发送窗口的结束位置到达最后一个批次。As a preference, the external storage also stores the synchronization state of the batches, the initial value of the synchronization state is waiting for scheduling, after the subtask thread pool is established, the synchronization state of the batches submitted to the subtask threads is set as sending, After the batch synchronization is completed, update the synchronization status of the corresponding batch as successful, and if the batch synchronization is interrupted, retry the synchronization immediately until the synchronization is completed. The method for updating the position of the sending window is: when the sending window After the batch synchronization at the starting position is successful, the ack pointer is incremented by 1 until the end position of the sending window reaches the last batch.

作为优选,当出现断点恢复时,根据所述ack指针恢复所述发送窗口的位置,根据恢复的所述发送窗口的位置确定当前需要恢复的批次,将恢复的所述批次提交给子任务线程进行同步。Preferably, when a breakpoint recovery occurs, restore the position of the sending window according to the ack pointer, determine the current batch that needs to be restored according to the restored position of the sending window, and submit the restored batch to the child Task threads are synchronized.

一种数据库全量同步断点续传系统,用于执行如前述的一种数据库全量同步断点续传方法,包括管理节点集群、协调器集群和同步节点集群,所述管理节点集群包括至少一个管理节点,所述协调器集群包括至少一个协调器,所述同步节点集群包括至少两个同步节点,A system for resuming full-scale synchronous breakpoint transfer of databases, which is used to perform the above-mentioned method for resuming full-scale synchronous breakpoint transfer of databases, including a cluster of management nodes, a cluster of coordinators, and a cluster of synchronization nodes. The cluster of management nodes includes at least one management nodes, the coordinator cluster includes at least one coordinator, the sync node cluster includes at least two sync nodes,

所述管理节点选定所数据源的数据库的每个表的切分字段,依据切分字段将表数据切分为若干个批次,将所述批次排序,记录每个所述批次对应切分字段的起始值和结束值,将批次的排序序号及对应的表名、起始值和结束值持久化存储到外部存储,The management node selects the segmentation field of each table of the database of the data source, divides the table data into several batches according to the segmentation field, sorts the batches, and records the corresponding Segment the start value and end value of the field, and store the batch sort number and the corresponding table name, start value, and end value persistently to the external storage.

两个所述同步节点运行分别在源数据库及目标数据库所在主机,所述源数据库对应的所述同步节点建立子任务线程池和与子任务线程的数量匹配的发送窗口,将处于所述发送窗口内的批次提交子任务线程进行同步,所述批次同步完成后所述源数据库对应的所述同步节点更新发送窗口的位置,当出现断点恢复时,所述源数据库对应的所述同步节点从外部存储读取对应批次的切分字段的起始值和结束值,所述源数据库对应的所述同步节点构建查询条件,从所述数据库对应表中查询满足所述切分字段的起始值和结束值的全部数据,作为恢复的批次,所述源数据库对应的所述同步节点将恢复的所述批次提交给子任务线程进行同步,The two synchronization nodes run respectively on the host where the source database and the target database are located, and the synchronization node corresponding to the source database establishes a subtask thread pool and a sending window matching the number of subtask threads, and will be in the sending window The batch submission subtask thread in the batch synchronization is performed. After the batch synchronization is completed, the synchronization node corresponding to the source database updates the position of the sending window. When a breakpoint recovery occurs, the synchronization node corresponding to the source database The node reads the start value and end value of the segmentation field corresponding to the batch from the external storage, and the synchronization node corresponding to the source database constructs a query condition, and queries the data satisfying the segmentation field from the corresponding table of the database. All the data of the start value and the end value are used as a restored batch, and the synchronization node corresponding to the source database submits the restored batch to the subtask thread for synchronization,

所述协调器在所述管理节点及若干个所述同步节点之间建立网络连接。The coordinator establishes a network connection between the management node and several synchronization nodes.

作为优选,所述源数据库对应的所述同步节点提取批次的数字指纹发送给所述目标数据库对应的所述同步节点,所述目标数据库对应的所述同步节点的子任务线程完成批次的传输后,核对所述数字指纹,若核对一致,则反馈所述批次同步成功,反之,若核对不一致,则反馈所述批次同步失败。Preferably, the synchronization node corresponding to the source database extracts the digital fingerprint of the batch and sends it to the synchronization node corresponding to the target database, and the subtask thread of the synchronization node corresponding to the target database completes the batch After the transmission, the digital fingerprints are checked, and if the verification is consistent, it is fed back that the batch synchronization is successful; otherwise, if the verification is inconsistent, it is fed back that the batch synchronization fails.

作为优选,所述管理节点选定所数据源的数据库的每个表的切分字段时,执行以下步骤:As a preference, when the management node selects the segmentation field of each table of the database of the data source, the following steps are performed:

若数据库的表为单主键表,则选定所述表的切分字段为所述主键;If the table of the database is a single primary key table, then select the segmentation field of the table as the primary key;

若数据库的表为联合主键,则选择区分度最高的字段作为所述表的切分字段,其中计算字段的区分度的方法为:区分度=count(distinct 字段名)/count(字段名),count(distinct 字段名)指将表中字段值去重后的表的条目数量,count(字段名)指表的条目数量。If the table of the database is a joint primary key, select the field with the highest degree of discrimination as the segmentation field of the table, and the method of calculating the degree of discrimination of the field is: degree of discrimination=count(distinct field name)/count(field name), count(distinct field name) refers to the number of entries in the table after deduplicating the field values in the table, and count(field name) refers to the number of entries in the table.

作为优选,所述管理节点依据切分字段将表数据切分为若干个批次时,执行以下步骤:Preferably, when the management node divides the table data into several batches according to the segmentation fields, the following steps are performed:

设定划分批次数量为n,读取切分字段的最小数值变化步长m;Set the number of divided batches to n, and read the minimum value change step m of the segmented field;

计算划分步长step,step=max(count(切分字段)/n,m);Calculate the division step step, step=max(count(segmentation field)/n,m);

按照划分步长step将切分字段的取值划分为若干个区间,每个区间对应的表的条目作为一个批次。Divide the value of the segmentation field into several intervals according to the division step step, and the table entries corresponding to each interval are regarded as a batch.

一种计算机系统,所述计算机系统包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述计算机程序被所述处理器执行时实现如前述的一种数据库全量同步断点续传方法。A computer system, the computer system includes a memory, a processor, and a computer program stored in the memory and operable on the processor, when the computer program is executed by the processor, the aforementioned one A database full synchronization breakpoint resume upload method.

一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如前述的一种数据库全量同步断点续传方法。A computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the above-mentioned method for full-volume synchronous resumable download of a database is implemented.

本发明的有益技术效果包括:全量同步时切分得到的每批数据的数据量大致相等,能按照字符型字段进行切分,且不受字符集的影响;能够合理地控制重传的粒度,达到重传数据量和并发量之间的平衡;只需要存储数据切分和少量状态信息到外部存储系统,断点恢复时速度快,易于实践。The beneficial technical effects of the present invention include: the data volume of each batch of data obtained by segmentation during full synchronization is approximately equal, and can be segmented according to character fields without being affected by character sets; the granularity of retransmission can be reasonably controlled, Reach the balance between the amount of retransmitted data and the amount of concurrency; only need to store data segmentation and a small amount of state information to the external storage system, the speed of recovery from breakpoints is fast, and it is easy to practice.

本发明的其他特点和优点将会在下面的具体实施方式、附图中详细的揭露。Other features and advantages of the present invention will be disclosed in detail in the following specific embodiments and drawings.

附图说明Description of drawings

下面结合附图对本发明做进一步的说明:Below in conjunction with accompanying drawing, the present invention will be further described:

图1为本发明实施例同步断点续传方法流程示意图。FIG. 1 is a schematic flowchart of a method for synchronous breakpoint resume transmission according to an embodiment of the present invention.

图2为本发明实施例将表数据切分为若干个批次方法流程示意图。FIG. 2 is a schematic flowchart of a method for dividing table data into several batches according to an embodiment of the present invention.

图3为本发明实施例同步断点续传系统结构示意图。FIG. 3 is a schematic structural diagram of a synchronous breakpoint resume transmission system according to an embodiment of the present invention.

图4为本发明实施例计算机系统结构示意图。FIG. 4 is a schematic structural diagram of a computer system according to an embodiment of the present invention.

其中:10、管理节点集群,20、协调器集群,30、同步节点集群,40、计算机系统,41、存储器,42、计算机程序,43、处理器。Among them: 10. Management node cluster, 20. Coordinator cluster, 30. Synchronization node cluster, 40. Computer system, 41. Memory, 42. Computer program, 43. Processor.

具体实施方式Detailed ways

下面结合本发明实施例的附图对本发明实施例的技术方案进行解释和说明,但下述实施例仅为本发明的优选实施例,并非全部。基于实施方式中的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得其他实施例,都属于本发明的保护范围。The technical solutions of the embodiments of the present invention will be explained and described below in conjunction with the accompanying drawings of the embodiments of the present invention, but the following embodiments are only preferred embodiments of the present invention, not all of them. Based on the examples in the implementation manners, other examples obtained by those skilled in the art without making creative efforts all belong to the protection scope of the present invention.

在下文描述中,出现诸如术语“内”、“外”、“上”、“下”、“左”、“右”等指示方位或者位置关系仅是为了方便描述实施例和简化描述,而不是指示或暗示所指的装置或者元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本发明的限制。In the following description, terms such as "inner", "outer", "upper", "lower", "left", "right" etc. appearing to indicate orientation or positional relationship are only for the convenience of describing the embodiment and simplifying the description, rather than Nothing indicating or implying that a referenced device or element must have a particular orientation, be constructed, and operate in a particular orientation should therefore not be construed as limiting the invention.

一种数据库全量同步断点续传方法,请参阅附图1,包括以下步骤:A method for resuming a full-volume synchronization of a database, please refer to Figure 1, comprising the following steps:

步骤A01)选定数据库的每个表的切分字段,依据切分字段将表数据切分为若干个批次,将批次排序;Step A01) Select the segmentation field of each table of the database, divide the table data into several batches according to the segmentation field, and sort the batches;

步骤A02)记录每个批次对应切分字段的起始值和结束值,将批次的排序序号及对应的表名、起始值和结束值持久化存储到外部存储;Step A02) Record the start value and end value of each batch corresponding to the segmentation field, and store the batch sort sequence number and corresponding table name, start value, and end value persistently in external storage;

步骤A03)建立子任务线程池和与子任务线程的数量匹配的发送窗口,将处于发送窗口内的批次提交子任务线程进行同步,批次同步完成后更新发送窗口的位置;Step A03) Establishing a subtask thread pool and a sending window matching the number of subtask threads, synchronizing the batch submission subtask threads within the sending window, and updating the position of the sending window after the batch synchronization is completed;

步骤A04)当出现断点恢复时,从外部存储读取对应批次的切分字段的起始值和结束值;Step A04) When a breakpoint recovery occurs, read the start value and end value of the segmentation field of the corresponding batch from the external storage;

步骤A05)构建查询条件,从数据库对应表中查询满足切分字段的起始值和结束值的全部数据,作为恢复的批次,将恢复的批次提交给子任务线程进行同步。构造一批起止区间信息并持久化存储到外部系统。每个区间代表一个可并发执行的同步子任务。从而实现断点续传的粒度可控,并适合多线程并行执行的数据同步方法。每个线程可以负责多个子任务的同步,在时间上将子任务进行排序即可。Step A05) Construct the query condition, query all the data satisfying the start value and end value of the segmentation field from the database corresponding table, and submit the restored batch to the subtask thread for synchronization as a restored batch. Construct a batch of start-stop interval information and store it persistently in an external system. Each interval represents a synchronous subtask that can be executed concurrently. In this way, the granularity of resuming transmission from breakpoints can be controlled, and it is suitable for data synchronization methods executed in parallel by multiple threads. Each thread can be responsible for the synchronization of multiple subtasks, and the subtasks can be sorted in time.

选定切分字段的方法包括:若数据库的表为单主键表,则选定表的切分字段为主键;若数据库的表为联合主键,则选择区分度最高的字段作为表的切分字段。The method of selecting the split field includes: if the database table is a single primary key table, select the split field of the table as the primary key; if the database table is a joint primary key, select the field with the highest degree of discrimination as the split field of the table .

计算字段的区分度的方法为:区分度=count(distinct 字段名)/count(字段名),count(distinct 字段名)指将表中字段值去重后的表的条目数量,count(字段名)指表的条目数量。The method of calculating the discrimination degree of a field is: discrimination degree=count(distinct field name)/count(field name), count(distinct field name) refers to the number of entries in the table after deduplicating the field values in the table, count(field name ) refers to the number of entries in the table.

请参阅附图2,依据切分字段将表数据切分为若干个批次的方法包括:Please refer to Figure 2, the method of dividing the table data into several batches according to the segmentation field includes:

步骤B01)设定划分批次数量为n,读取切分字段的最小数值变化步长m;Step B01) Set the number of divided batches to n, and read the minimum value change step m of the segmented field;

步骤B02)计算划分步长step,step=max(count(切分字段)/n,m);Step B02) Calculate the division step step, step=max(count(segmentation field)/n,m);

步骤B03)按照划分步长step将切分字段的取值划分为若干个区间,每个区间对应的表的条目作为一个批次。Step B03) Divide the value of the segmentation field into several intervals according to the division step step, and the table entries corresponding to each interval are regarded as a batch.

以数字型字段的切分为例:Take the segmentation of numeric fields as an example:

以100条记录为例,该字段的值分别为1,2,3...99,5000,最小值为1,最大值为5000,以切分5份为目标。传统或其他专利方法:计算字段最大最小值并构造等差数列来切分数据,切分结果为[1,1000],[1001,2000],[2001,3000],[3001,4000],[4001,5000]。本实施例方法的数据切分:结果为[1,20],[21,40],[41,60],[61,80],[81,5000]。Taking 100 records as an example, the values of this field are 1, 2, 3...99,5000, the minimum value is 1, and the maximum value is 5000. The goal is to split into 5 parts. Traditional or other patented methods: Calculate the maximum and minimum values of the field and construct an arithmetic sequence to segment the data. The segmentation results are [1,1000],[1001,2000],[2001,3000],[3001,4000],[ 4001,5000]. The data segmentation of the method in this embodiment: the result is [1,20], [21,40], [41,60], [61,80], [81,5000].

可见在表数据分布不均匀的情况下,传统或其他专利方法的切分存在较大误差,会导致大量数据堆积在某个批次,而有些批次的数据量又极少,无法做到数据量上分配的相对均衡。本实施例方法中的数据切分方法能够解决数据倾斜问题。It can be seen that in the case of uneven distribution of table data, there are large errors in the segmentation of traditional or other patented methods, which will cause a large amount of data to accumulate in a certain batch, and the amount of data in some batches is so small that it is impossible to achieve data The relative balance of distribution. The data segmentation method in the method of this embodiment can solve the problem of data skew.

以字符型字段的切分为例:Take the segmentation of character fields as an example:

以1000条随机产生的6位只包含0-9,a-z的字符串记录为例,该字段的值按字典序最小值为001222,最大值为ffe732,以切分10份为目标。传统方法:无法切分。开源数据库全量同步软件dataX的切分方法:做字符串映射,具体过程为从左到右遍历字符串,字符串长度为k,第i位字符串的acsii值与128^(k-i)的乘积构造一个实数,然后累加转化为数字型字段进行切分,切分完成后反向映射回字符串。本实施例方法:利用索引的天然有序性,在无需回表的情况下扫描索引进行快速切分。Take 1000 randomly generated 6-digit string records containing only 0-9, a-z as an example. The minimum value of this field is 001222 and the maximum value is ffe732 in lexicographical order. The goal is to divide into 10 parts. Traditional method: cannot be segmented. Segmentation method of the open source database full synchronization software dataX: do string mapping, the specific process is to traverse the string from left to right, the length of the string is k, and the product construction of the acsii value of the i-th string and 128^(k-i) A real number, then accumulated and transformed into a numeric field for segmentation, and reversely mapped back to a string after the segmentation is completed. The method in this embodiment: use the natural order of the index to scan the index for fast segmentation without returning to the table.

从切分结果可以看到,dataX采用的切分方法存在字符串到数字之间的映射,得到的区间映射回字符串后在不同的字符集下排序不一定比前一个区间的字串小,即受到字符集影响,存在区间重复和数据倾斜问题,本实施例方法切分结果与字符集无关,且可视化好。As can be seen from the segmentation results, the segmentation method used by dataX has a mapping between strings and numbers, and the resulting intervals are mapped back to strings and sorted under different character sets. The strings in the previous interval are not necessarily smaller than the strings in the previous interval. That is, affected by the character set, there are interval repetitions and data skew problems. The segmentation result of the method in this embodiment has nothing to do with the character set, and the visualization is good.

数据切分方法充分利用了数据库的索引,在只扫描索引的情况下不用回表,速度非常快。相比较于计算字段最大最小值并构造等差数列来切分数据的方式,本实施例的数据切分方法能在表数据分布不均衡的场景下避免数据倾斜问题,实现每个批次中的数据量大致相等;相比较于传统方法,解决了无法根据字符串类型字段进行切分的问题;相比较于将字符串映射为数字再进行切分的方式,本实施例利用了索引天然有序的特性,解决了受到字符集影响而不准确的问题。The data segmentation method makes full use of the index of the database, and does not need to return to the table when only scanning the index, and the speed is very fast. Compared with the method of calculating the maximum and minimum values of fields and constructing an arithmetic sequence to segment data, the data segmentation method in this embodiment can avoid the problem of data skew in the case of unbalanced table data distribution, and realize the data in each batch. The amount of data is roughly equal; compared with the traditional method, it solves the problem that it cannot be segmented according to the string type field; compared with the way of mapping strings to numbers and then segmenting, this embodiment uses the natural order of the index The feature solves the problem of being inaccurately affected by the character set.

其中,本实施例建立发送窗口的方法包括:设置发送窗口的大小等于子任务线程的数量;将发送窗口的起始位置设置在第一个批次,建立ack指针指向发送窗口的起始位置,ack指针持久化存储在外部存储。Wherein, the method for setting up the sending window in this embodiment includes: setting the size of the sending window equal to the number of subtask threads; setting the starting position of the sending window in the first batch, setting up the ack pointer to point to the starting position of the sending window, The ack pointer is persistently stored in external storage.

外部存储还存储由批次的同步状态,同步状态初始值为等待调度,建立子任务线程池后,将提交给子任务线程的批次的同步状态设置为发送中,批次同步完成后,更新对应批次的同步状态为成功,批次同步中断,则立即重试同步直至同步完成,更新发送窗口的位置的方法为:当发送窗口起始位置的批次同步成功后,将ack指针自增1,直至发送窗口的结束位置到达最后一个批次。The external storage also stores the synchronization state of the batch. The initial value of the synchronization state is waiting for scheduling. After the subtask thread pool is established, the synchronization state of the batch submitted to the subtask thread is set as sending. After the batch synchronization is completed, update The synchronization status of the corresponding batch is successful, if the batch synchronization is interrupted, retry the synchronization immediately until the synchronization is completed. The method of updating the position of the sending window is: when the batch synchronization at the starting position of the sending window is successful, the ack pointer will be incremented 1, until the end of the sending window reaches the last batch.

同步过程中根据设定的子任务任务线程池大小,构造同等大小的发送窗口,并发地将发送窗口内对应编号的子任务提交到线程池。子任务若出现失败或者超时,则重新加入任务线程池进行重试,每当有子任务完成时更新发送窗口内编号数组的标志,并尝试更新ack指针并将发送窗口向后滑动,ack指针当前及之前的位置表示已结束的任务编号。每次更新ack指针后将指针的位置持久化到存储系统。During the synchronization process, according to the set subtask task thread pool size, a sending window of the same size is constructed, and the corresponding numbered subtasks in the sending window are submitted to the thread pool concurrently. If the subtask fails or times out, it will rejoin the task thread pool for retrying. Whenever a subtask is completed, update the sign of the number array in the sending window, and try to update the ack pointer and slide the sending window backward. The ack pointer is currently The position before and after indicates the task number that has been completed. The location of the pointer is persisted to the storage system after each update of the ack pointer.

当出现断点恢复时,根据ack指针恢复发送窗口的位置,根据恢复的发送窗口的位置确定当前需要恢复的批次,将恢复的批次提交给子任务线程进行同步。When a breakpoint recovery occurs, the position of the sending window is restored according to the ack pointer, the batches to be restored are determined according to the position of the restored sending window, and the restored batches are submitted to the subtask thread for synchronization.

本实施例的有益技术效果包括:全量同步时切分得到的每批数据的数据量大致相等,能按照字符型字段进行切分,且不受字符集的影响;能够合理地控制重传的粒度,达到重传数据量和并发量之间的平衡;只需要存储数据切分和少量状态信息到外部存储系统,断点恢复时速度快,易于实践。The beneficial technical effects of this embodiment include: the data volume of each batch of data obtained during full synchronization is approximately equal, and can be segmented according to character fields without being affected by character sets; the granularity of retransmission can be reasonably controlled , to achieve a balance between the amount of retransmitted data and the amount of concurrency; only need to store data segmentation and a small amount of state information to the external storage system, the speed of recovery from breakpoints is fast, and it is easy to practice.

一种数据库全量同步断点续传系统,用于执行如前述的一种数据库全量同步断点续传方法,请参阅附图3,包括管理节点集群10、协调器集群20和同步节点集群30,管理节点集群10包括至少一个管理节点,协调器集群20包括至少一个协调器,同步节点集群30包括至少两个同步节点,A database full-volume synchronous resuming breakpoint transfer system, used to execute the above-mentioned full-scale database synchronous breakpoint resuming method, please refer to the accompanying drawing 3, including a management node cluster 10, a coordinator cluster 20 and a synchronization node cluster 30, The management node cluster 10 includes at least one management node, the coordinator cluster 20 includes at least one coordinator, and the synchronization node cluster 30 includes at least two synchronization nodes,

管理节点选定所数据源的数据库的每个表的切分字段,依据切分字段将表数据切分为若干个批次,将批次排序,记录每个批次对应切分字段的起始值和结束值,将批次的排序序号及对应的表名、起始值和结束值持久化存储到外部存储,The management node selects the segmentation field of each table in the database of the data source, divides the table data into several batches according to the segmentation field, sorts the batches, and records the start of each batch corresponding to the segmentation field value and end value, to store the batch sort number and corresponding table name, start value and end value persistently to external storage,

两个同步节点运行分别在源数据库及目标数据库所在主机,源数据库对应的同步节点建立子任务线程池和与子任务线程的数量匹配的发送窗口,将处于发送窗口内的批次提交子任务线程进行同步,批次同步完成后源数据库对应的同步节点更新发送窗口的位置,当出现断点恢复时,源数据库对应的同步节点从外部存储读取对应批次的切分字段的起始值和结束值,源数据库对应的同步节点构建查询条件,从数据库对应表中查询满足切分字段的起始值和结束值的全部数据,作为恢复的批次,源数据库对应的同步节点将恢复的批次提交给子任务线程进行同步,Two synchronization nodes run on the host where the source database and the target database are located respectively. The synchronization node corresponding to the source database establishes a subtask thread pool and a sending window matching the number of subtask threads, and submits the batches within the sending window to the subtask thread After the batch synchronization is completed, the synchronization node corresponding to the source database updates the position of the sending window. When a breakpoint recovery occurs, the synchronization node corresponding to the source database reads the initial value and The end value, the synchronization node corresponding to the source database constructs the query condition, and queries all the data satisfying the start value and end value of the segmentation field from the corresponding table of the database. As the restored batch, the synchronization node corresponding to the source database will restore the batch submitted to the subtask thread for synchronization,

协调器在管理节点及若干个同步节点之间建立网络连接。The coordinator establishes network connections between the management node and several synchronization nodes.

源数据库对应的同步节点提取批次的数字指纹发送给目标数据库对应的同步节点,目标数据库对应的同步节点的子任务线程完成批次的传输后,核对数字指纹,若核对一致,则反馈批次同步成功,反之,若核对不一致,则反馈批次同步失败。The synchronization node corresponding to the source database extracts the digital fingerprint of the batch and sends it to the synchronization node corresponding to the target database. After the subtask thread of the synchronization node corresponding to the target database completes the batch transmission, it checks the digital fingerprint. If the verification is consistent, the batch is fed back If the synchronization is successful, otherwise, if the verification is inconsistent, the feedback batch synchronization fails.

管理节点选定所数据源的数据库的每个表的切分字段时,执行以下步骤:When the management node selects the segmentation fields of each table in the database of the data source, perform the following steps:

若数据库的表为单主键表,则选定表的切分字段为主键;If the database table is a single primary key table, the split field of the selected table is the primary key;

若数据库的表为联合主键,则选择区分度最高的字段作为表的切分字段,其中计算字段的区分度的方法为:区分度=count(distinct 字段名)/count(字段名),count(distinct 字段名)指将表中字段值去重后的表的条目数量,count(字段名)指表的条目数量。If the table of the database is a joint primary key, select the field with the highest degree of discrimination as the splitting field of the table, and the method of calculating the degree of discrimination of the field is: degree of discrimination=count(distinct field name)/count(field name), count( distinct field name) refers to the number of entries in the table after deduplicating the field values in the table, and count (field name) refers to the number of entries in the table.

管理节点依据切分字段将表数据切分为若干个批次时,执行以下步骤:When the management node divides the table data into several batches according to the segmentation field, perform the following steps:

设定划分批次数量为n,读取切分字段的最小数值变化步长m;Set the number of divided batches to n, and read the minimum value change step m of the segmented field;

计算划分步长step,step=max(count(切分字段)/n,m);Calculate the division step step, step=max(count(segmentation field)/n,m);

按照划分步长step将切分字段的取值划分为若干个区间,每个区间对应的表的条目作为一个批次。Divide the value of the segmentation field into several intervals according to the division step step, and the table entries corresponding to each interval are regarded as a batch.

另一方面,本实施例提供了数据库全量同步断点续传方法的验证实例。On the other hand, this embodiment provides a verification example of the method for resuming the breakpoint and full synchronization of the database.

源数据库为mysql数据库,目标据库业务系统为postgresql类型数据库,demo库test.table表的建表语句如下:The source database is a mysql database, the target database business system is a postgresql type database, and the table creation statement of the test.table table in the demo library is as follows:

CREATE TABLE `test_table` (CREATE TABLE `test_table` (

  `sid` varchar (6) NOT NULL,`sid` varchar (6) NOT NULL,

  `name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,

  `type` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`type` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,

  `city` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`city` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,

  PRIMARY KEY (`id`)PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

随机产生1000条6位只包含0-9,a-z的字符串记录作为字段sid的值,其他字段内容为任意随机值。sid字段的值按字典序最小值为001222,最大值为ffe732,设置切分预划分批次数为10,最小步长为10000,同步子任务线程池大小为5。现对此表进行全量同步操作。Randomly generate 1000 6-digit string records containing only 0-9, a-z as the value of the field sid, and the contents of other fields are random values. The value of the sid field is lexicographically the minimum value is 001222, the maximum value is ffe732, the number of split pre-batches is set to 10, the minimum step size is 10000, and the thread pool size of the synchronization subtask is 5. Now perform a full synchronization operation on this table.

实施步骤:Implementation steps:

1. 使用本专利系统进行表数据同步,填入相对应的源库信息和目标服务器信息,就可以一键自动化生成同步映射关系;1. Use this patented system to synchronize table data, fill in the corresponding source database information and target server information, and then automatically generate a synchronization mapping relationship with one click;

2.选定切分字段,由于本张表为单主键表,因此选定sid字段作为切分字段。2. Select the segmentation field. Since this table is a single primary key table, select the sid field as the segmentation field.

3.执行字符型切分算法,第一次查询执行 select sid from test_table wheresid > '001222' limit 100,1; 得到结果1b52a7 ,第二次查询执行select sid fromtest_table where sid > '1b52a7' limit 100,1; 以此类推,迭代直至结果集为空。收集每次查询的结果,构造同步区间[001222,1b52a7),[1b52a7,332ab6),[332ab6,50799a),[50799a,69a3d2),[69a3d2,84bcfb),[84bcfb,a1a75d),[a1a75d,b8398c),[b8398c,d35be2),[d35be2,ecae00),[ecae00,ffe732]。编号分别为1,2...,10,将区间信息持久化到外部存储系统。每个同步区间代表一个可以并发执行的子任务,子任务独立负责本区间范围内的数据同步。3. Execute the character segmentation algorithm, execute select sid from test_table wheresid > '001222' limit 100,1 for the first query; get the result 1b52a7, and execute select sid from test_table where sid > '1b52a7' limit 100,1 for the second query ; And so on, iterating until the result set is empty. Collect the results of each query and construct the synchronization interval [001222,1b52a7),[1b52a7,332ab6),[332ab6,50799a),[50799a,69a3d2),[69a3d2,84bcfb),[84bcfb,a1a75d),[a1a75d,b8398c ),[b8398c,d35be2),[d35be2,ecae00),[ecae00,ffe732]. The numbers are 1, 2..., 10 respectively, and the interval information is persisted to the external storage system. Each synchronization interval represents a subtask that can be executed concurrently, and the subtasks are independently responsible for data synchronization within the interval.

4.初始化ack指针到0的位置,构造大小为5的发送窗口,并发地将5个子任务提交到线程池进行全量同步,子任务若出现失败或者超时,则重新加入任务线程池进行重试,每当有子任务完成时更新发送窗口内编号数组的标志,并尝试更新ack指针并将发送窗口向后滑动,ack指针当前及之前的位置表示已结束的任务编号。每次更新ack指针后将指针的位置持久化到存储系统。经过t1秒后编号为3的子任务完成了同步,此时ack指针不满足更新条件,仍然指向0的位置。经过t2秒后编号为2的子任务出现失败,则该子任务被重新加入任务线程池进行重试。经过t3秒后编号为2的子任务完成了同步,此时ack指针不满足更新条件,仍然指向0的位置。经过t4秒后编号为1的子任务完成了同步, 此时ack指针满足更新条件,将指针移动到3的位置,并持久化指针位置到外部存储。将发送窗口向后滑动3个单位。4. Initialize the ack pointer to the position of 0, construct a sending window with a size of 5, and concurrently submit 5 subtasks to the thread pool for full synchronization. If a subtask fails or times out, it will rejoin the task thread pool for retrying. Whenever a subtask is completed, update the sign of the number array in the sending window, and try to update the ack pointer and slide the sending window backward. The current and previous positions of the ack pointer indicate the number of the task that has ended. The location of the pointer is persisted to the storage system after each update of the ack pointer. After t1 seconds, the subtask numbered 3 completes the synchronization. At this time, the ack pointer does not meet the update condition and still points to the position of 0. If the subtask numbered 2 fails after t2 seconds, the subtask will be added to the task thread pool for retrying. After t3 seconds, the subtask numbered 2 has completed the synchronization. At this time, the ack pointer does not meet the update condition and still points to the position of 0. After t4 seconds, the subtask numbered 1 completes the synchronization. At this time, the ack pointer meets the update condition, the pointer is moved to the position of 3, and the pointer position is persisted to the external storage. Slide the sending window back 3 units.

5.经过t5秒后,同步系统所在的机器发生断电事件,同步被中断。5. After t5 seconds, a power failure event occurs on the machine where the synchronization system is located, and the synchronization is interrupted.

6.机器重启后,再次发起全量同步,从外部存储系统读取区间信息和ack指针的位置,重新构造运行状态数组,重新构造后的发送窗口起始位置位于编号4,ack指针的位置位于编号3。并发地将编号为4-8的共计5个子任务提交到子任务线程池,然后重复“同步及状态标记”过程,直至所有批次的数据同步完成。6. After the machine restarts, initiate full synchronization again, read the interval information and the position of the ack pointer from the external storage system, and reconstruct the running status array. The starting position of the reconstructed sending window is at number 4, and the position of the ack pointer is at number 3. Concurrently submit a total of 5 subtasks numbered 4-8 to the subtask thread pool, and then repeat the process of "synchronization and status marking" until the data synchronization of all batches is completed.

另一方面,本申请实施例提供了一种计算机系统,请参阅附图4,计算机系统40包括存储器41、处理器43以及存储在存储器41中并可在处理器43上运行的计算机程序42,计算机程序42被处理器43执行时实现如前述的一种数据库全量同步断点续传方法。On the other hand, the embodiment of the present application provides a computer system, referring to FIG. 4 , the computer system 40 includes a memory 41, a processor 43, and a computer program 42 stored in the memory 41 and operable on the processor 43, When the computer program 42 is executed by the processor 43, it realizes the above-mentioned method for full data synchronization and resuming transmission from breakpoints.

计算机系统可以是一个通用计算机系统或一个专用计算机系统。在具体实现中,计算机系统可以是包括有多个服务器的服务器集群,如可以是包括有多个节点的区块链系统。本领域技术人员可以理解,图4仅仅是计算机系统的举例,并不构成对计算机系统的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,比如还可以包括输入输出设备、网络接入设备等。The computer system can be a general purpose computer system or a special purpose computer system. In a specific implementation, the computer system may be a server cluster including multiple servers, such as a blockchain system including multiple nodes. Those skilled in the art can understand that FIG. 4 is only an example of a computer system, and does not constitute a limitation to the computer system. It may include more or less components than those shown in the illustration, or combine certain components, or different components, such as It may also include input and output devices, network access devices, etc.

处理器43可以是中央处理单元(Central Processing Unit,CPU),处理器43还可以是其他通用处理器43、数字信号处理器43(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器43可以是微处理器43或者也可以是任何常规的处理器43。The processor 43 can be a central processing unit (Central Processing Unit, CPU), and the processor 43 can also be other general processors 43, a digital signal processor 43 (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general purpose processor 43 may be a microprocessor 43 or any conventional processor 43 .

存储器41在一些实施例中可以是计算机系统的内部存储单元,比如计算机系统的硬盘或内存。存储器41在另一些实施例中也可以是计算机系统的外部存储设备,比如计算机系统上配备的插接式硬盘、智能存储卡(Smart Media Card,SMC)、安全数字(SecureDigital,SD)卡、闪存卡(Flash Card)等。进一步地,存储器41还可以既包括计算机系统的内部存储单元也包括外部存储设备。存储器41用于存储操作系统、应用程序、引导装载程序(Boot Loader)、数据以及其他程序等。存储器41还可以用于暂时地存储已经输出或者将要输出的数据。The memory 41 may be an internal storage unit of the computer system in some embodiments, such as a hard disk or memory of the computer system. The memory 41 may also be an external storage device of the computer system in other embodiments, such as a plug-in hard disk equipped on the computer system, a smart memory card (Smart Media Card, SMC), a secure digital (SecureDigital, SD) card, a flash memory Card (Flash Card), etc. Further, the memory 41 may also include both an internal storage unit of the computer system and an external storage device. The memory 41 is used to store operating system, application program, boot loader (Boot Loader), data and other programs, etc. The memory 41 can also be used to temporarily store data that has been output or will be output.

另一方面,本申请实施例提供了一种计算机可读存储介质,计算机可读存储介质存储有计算机程序42,计算机程序42被处理器43执行时实现如前述的一种数据库全量同步断点续传方法。On the other hand, the embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program 42. When the computer program 42 is executed by the processor 43, it realizes the above-mentioned full-scale database synchronization breakpoint continuation. pass method.

以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,熟悉该本领域的技术人员应该明白本发明包括但不限于附图和上面具体实施方式中描述的内容。任何不偏离本发明的功能和结构原理的修改都将包括在权利要求书的范围中。The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto, and those skilled in the art should understand that the present invention includes but is not limited to the accompanying drawings and the description in the above specific embodiments content. Any modifications that do not depart from the functional and structural principles of the present invention will be included in the scope of the claims.

Claims (13)

1.一种数据库全量同步断点续传方法,其特征在于,1. A method for full synchronous breakpoint resume transmission of a database, characterized in that, 包括以下步骤:Include the following steps: 选定所述数据库的每个表的切分字段,依据切分字段将表数据切分为若干个批次,将所述批次排序;Select the segmentation field of each table of the database, divide the table data into several batches according to the segmentation field, and sort the batches; 记录每个所述批次对应切分字段的起始值和结束值,将批次的排序序号及对应的表名、起始值和结束值持久化存储到外部存储;Record the start value and end value of each batch corresponding to the segmentation field, and store the sort number of the batch and the corresponding table name, start value and end value in an external storage persistently; 建立子任务线程池和与子任务线程的数量匹配的发送窗口,将处于所述发送窗口内的批次提交子任务线程进行同步,所述批次同步完成后更新发送窗口的位置;Set up a subtask thread pool and a sending window matched with the number of subtask threads, and submit subtask threads within the sending window for synchronization, and update the sending window position after the batch synchronization is completed; 当出现断点恢复时,从外部存储读取对应批次的切分字段的起始值和结束值;When a breakpoint recovery occurs, read the start value and end value of the segmentation field of the corresponding batch from the external storage; 构建查询条件,从所述数据库对应表中查询满足所述切分字段的起始值和结束值的全部数据,作为恢复的批次,将恢复的所述批次提交给子任务线程进行同步。Construct a query condition, query all data satisfying the start value and end value of the segmentation field from the corresponding table of the database, and submit the recovered batch to the subtask thread for synchronization as a restored batch. 2.根据权利要求1所述的一种数据库全量同步断点续传方法,其特征在于,2. a kind of database full amount synchronous breakpoint resume transmission method according to claim 1, is characterized in that, 选定切分字段的方法包括:Methods for selecting split fields include: 若数据库的表为单主键表,则选定所述表的切分字段为所述主键;If the table of the database is a single primary key table, then select the segmentation field of the table as the primary key; 若数据库的表为联合主键,则选择区分度最高的字段作为所述表的切分字段。If the table of the database is a joint primary key, the field with the highest degree of discrimination is selected as the splitting field of the table. 3.根据权利要求2所述的一种数据库全量同步断点续传方法,其特征在于,3. a kind of database full amount synchronous breakpoint resume transmission method according to claim 2, is characterized in that, 计算字段的区分度的方法为:区分度=count(distinct 字段名)/count(字段名),count(distinct 字段名)指将表中字段值去重后的表的条目数量,count(字段名)指表的条目数量。The method of calculating the discrimination degree of a field is: discrimination degree=count(distinct field name)/count(field name), count(distinct field name) refers to the number of entries in the table after deduplicating the field values in the table, count(field name ) refers to the number of entries in the table. 4.根据权利要求1至3任一项所述的一种数据库全量同步断点续传方法,其特征在于,4. A method for resuming transmission of a full amount of database synchronously according to any one of claims 1 to 3, characterized in that, 依据切分字段将表数据切分为若干个批次的方法包括:The methods of dividing table data into several batches according to the division fields include: 设定划分批次数量为n,读取切分字段的最小数值变化步长m;Set the number of divided batches to n, and read the minimum value change step m of the segmented field; 计算划分步长step,step=max(count(切分字段)/n,m);Calculate the division step step, step=max(count(segmentation field)/n,m); 按照划分步长step将切分字段的取值划分为若干个区间,每个区间对应的表的条目作为一个批次。Divide the value of the segmentation field into several intervals according to the division step step, and the table entries corresponding to each interval are regarded as a batch. 5.根据权利要求1至3任一项所述的一种数据库全量同步断点续传方法,其特征在于,5. A method for resuming transmission of a full amount of database synchronously according to any one of claims 1 to 3, characterized in that, 建立发送窗口的方法包括:Methods for establishing a send window include: 设置所述发送窗口的大小等于所述子任务线程的数量;Setting the size of the sending window equal to the number of subtask threads; 将所述发送窗口的起始位置设置在第一个批次,建立ack指针指向所述发送窗口的起始位置,所述ack指针持久化存储在外部存储。The starting position of the sending window is set in the first batch, and an ack pointer is established to point to the starting position of the sending window, and the ack pointer is persistently stored in an external storage. 6.根据权利要求5所述的一种数据库全量同步断点续传方法,其特征在于,6. a kind of database full amount synchronous breakpoint resume transmission method according to claim 5, is characterized in that, 所述外部存储还存储由批次的同步状态,所述同步状态初始值为等待调度,建立子任务线程池后,将提交给子任务线程的批次的同步状态设置为发送中,所述批次同步完成后,更新对应批次的同步状态为成功,所述批次同步中断,则立即重试同步直至同步完成,更新所述发送窗口的位置的方法为:当所述发送窗口起始位置的批次同步成功后,将ack指针自增1,直至所述发送窗口的结束位置到达最后一个批次。The external storage also stores the synchronization state of the batch, the initial value of the synchronization state is waiting for scheduling, after the subtask thread pool is established, the synchronization state of the batch submitted to the subtask thread is set as sending, and the batch After the second synchronization is completed, update the synchronization status of the corresponding batch as successful, and if the batch synchronization is interrupted, retry the synchronization immediately until the synchronization is completed. The method for updating the position of the sending window is: when the starting position of the sending window After the batch synchronization is successful, the ack pointer is incremented by 1 until the end position of the sending window reaches the last batch. 7.根据权利要求5所述的一种数据库全量同步断点续传方法,其特征在于,7. a kind of database full amount synchronous breakpoint resume transmission method according to claim 5, is characterized in that, 当出现断点恢复时,根据所述ack指针恢复所述发送窗口的位置,根据恢复的所述发送窗口的位置确定当前需要恢复的批次,将恢复的所述批次提交给子任务线程进行同步。When a breakpoint recovery occurs, restore the position of the sending window according to the ack pointer, determine the batches that currently need to be restored according to the restored position of the sending window, and submit the restored batches to the subtask thread for execution. Synchronize. 8.一种数据库全量同步断点续传系统,用于执行如权利要求1至7任一项所述的一种数据库全量同步断点续传方法,其特征在于,8. A system for full-volume synchronous breakpoint resuming of databases, which is used to execute a method for full-scale synchronous breakpoint resuming of databases according to any one of claims 1 to 7, wherein: 包括管理节点集群、协调器集群和同步节点集群,所述管理节点集群包括至少一个管理节点,所述协调器集群包括至少一个协调器,所述同步节点集群包括至少两个同步节点,Including a management node cluster, a coordinator cluster and a synchronization node cluster, the management node cluster includes at least one management node, the coordinator cluster includes at least one coordinator, and the synchronization node cluster includes at least two synchronization nodes, 所述管理节点选定所数据源的数据库的每个表的切分字段,依据切分字段将表数据切分为若干个批次,将所述批次排序,记录每个所述批次对应切分字段的起始值和结束值,将批次的排序序号及对应的表名、起始值和结束值持久化存储到外部存储,The management node selects the segmentation field of each table of the database of the data source, divides the table data into several batches according to the segmentation field, sorts the batches, and records the corresponding Segment the start value and end value of the field, and store the batch sort number and the corresponding table name, start value, and end value persistently to the external storage. 两个所述同步节点运行分别在源数据库及目标数据库所在主机,所述源数据库对应的所述同步节点建立子任务线程池和与子任务线程的数量匹配的发送窗口,将处于所述发送窗口内的批次提交子任务线程进行同步,所述批次同步完成后所述源数据库对应的所述同步节点更新发送窗口的位置,当出现断点恢复时,所述源数据库对应的所述同步节点从外部存储读取对应批次的切分字段的起始值和结束值,所述源数据库对应的所述同步节点构建查询条件,从所述数据库对应表中查询满足所述切分字段的起始值和结束值的全部数据,作为恢复的批次,所述源数据库对应的所述同步节点将恢复的所述批次提交给子任务线程进行同步,The two synchronization nodes run respectively on the host where the source database and the target database are located, and the synchronization node corresponding to the source database establishes a subtask thread pool and a sending window matching the number of subtask threads, and will be in the sending window The batch submission subtask thread in the batch synchronization is performed. After the batch synchronization is completed, the synchronization node corresponding to the source database updates the position of the sending window. When a breakpoint recovery occurs, the synchronization node corresponding to the source database The node reads the start value and end value of the segmentation field corresponding to the batch from the external storage, and the synchronization node corresponding to the source database constructs a query condition, and queries the data satisfying the segmentation field from the corresponding table of the database. All the data of the start value and the end value are used as a restored batch, and the synchronization node corresponding to the source database submits the restored batch to the subtask thread for synchronization, 所述协调器在所述管理节点及若干个所述同步节点之间建立网络连接。The coordinator establishes a network connection between the management node and several synchronization nodes. 9.根据权利要求8所述的一种数据库全量同步断点续传系统,其特征在于,9. A kind of database full amount synchronous breakpoint resume transmission system according to claim 8, is characterized in that, 所述源数据库对应的所述同步节点提取批次的数字指纹发送给所述目标数据库对应的所述同步节点,所述目标数据库对应的所述同步节点的子任务线程完成批次的传输后,核对所述数字指纹,若核对一致,则反馈所述批次同步成功,反之,若核对不一致,则反馈所述批次同步失败。The synchronization node corresponding to the source database extracts the digital fingerprint of the batch and sends it to the synchronization node corresponding to the target database, and after the subtask thread of the synchronization node corresponding to the target database completes the batch transmission, The digital fingerprints are checked, and if the verification is consistent, it is fed back that the batch synchronization is successful; otherwise, if the verification is inconsistent, the batch synchronization failure is fed back. 10.根据权利要求8或9所述的一种数据库全量同步断点续传系统,其特征在于,10. A kind of database full synchronization breakpoint resume transmission system according to claim 8 or 9, characterized in that, 所述管理节点选定所数据源的数据库的每个表的切分字段时,执行以下步骤:When the management node selects the segmentation fields of each table of the database of the data source, the following steps are performed: 若数据库的表为单主键表,则选定所述表的切分字段为所述主键;If the table of the database is a single primary key table, then select the segmentation field of the table as the primary key; 若数据库的表为联合主键,则选择区分度最高的字段作为所述表的切分字段,其中计算字段的区分度的方法为:区分度=count(distinct 字段名)/count(字段名),count(distinct 字段名)指将表中字段值去重后的表的条目数量,count(字段名)指表的条目数量。If the table of the database is a joint primary key, select the field with the highest degree of discrimination as the segmentation field of the table, and the method of calculating the degree of discrimination of the field is: degree of discrimination=count(distinct field name)/count(field name), count(distinct field name) refers to the number of entries in the table after deduplicating the field values in the table, and count(field name) refers to the number of entries in the table. 11.根据权利要求10所述的一种数据库全量同步断点续传系统,其特征在于,11. A kind of database full amount synchronous breakpoint resume transmission system according to claim 10, is characterized in that, 所述管理节点依据切分字段将表数据切分为若干个批次时,执行以下步骤:When the management node divides the table data into several batches according to the segmentation fields, the following steps are performed: 设定划分批次数量为n,读取切分字段的最小数值变化步长m;Set the number of divided batches to n, and read the minimum value change step m of the segmented field; 计算划分步长step,step=max(count(切分字段)/n,m);Calculate the division step step, step=max(count(segmentation field)/n,m); 按照划分步长step将切分字段的取值划分为若干个区间,每个区间对应的表的条目作为一个批次。Divide the value of the segmentation field into several intervals according to the division step step, and the table entries corresponding to each interval are regarded as a batch. 12.一种计算机系统,其特征在于,所述计算机系统包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述计算机程序被所述处理器执行时实现如权利要求1至7任一项所述的一种数据库全量同步断点续传方法。12. A computer system, characterized in that the computer system comprises a memory, a processor, and a computer program stored in the memory and operable on the processor, the computer program being executed by the processor Realize a kind of database full amount synchronous breakpoint resume transmission method as described in any one of claims 1 to 7 at the same time. 13.一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1至7任一项所述的一种数据库全量同步断点续传方法。13. A computer-readable storage medium, characterized in that, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, one of the methods described in any one of claims 1 to 7 is implemented. The method of resuming the breakpoint and full synchronization of the database.
CN202211681968.6A 2022-12-27 2022-12-27 Database full synchronization breakpoint continuous transmission method, system and storage medium Pending CN116010412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211681968.6A CN116010412A (en) 2022-12-27 2022-12-27 Database full synchronization breakpoint continuous transmission method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211681968.6A CN116010412A (en) 2022-12-27 2022-12-27 Database full synchronization breakpoint continuous transmission method, system and storage medium

Publications (1)

Publication Number Publication Date
CN116010412A true CN116010412A (en) 2023-04-25

Family

ID=86026124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211681968.6A Pending CN116010412A (en) 2022-12-27 2022-12-27 Database full synchronization breakpoint continuous transmission method, system and storage medium

Country Status (1)

Country Link
CN (1) CN116010412A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119537472A (en) * 2024-11-08 2025-02-28 成都虚谷伟业科技有限公司 A data synchronization method, system, medium and device based on Xugu database

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119537472A (en) * 2024-11-08 2025-02-28 成都虚谷伟业科技有限公司 A data synchronization method, system, medium and device based on Xugu database

Similar Documents

Publication Publication Date Title
CN111338766B (en) Transaction processing method, apparatus, computer equipment and storage medium
US10732836B2 (en) Remote one-sided persistent writes
US11341115B2 (en) Multi-database log with multi-item transaction support
CN106462449B (en) Multi-database logging with multi-project transaction support
JP2023546249A (en) Transaction processing methods, devices, computer equipment and computer programs
US10802766B2 (en) Database with NVDIMM as persistent storage
CN111949633A (en) ICT system operation log analysis method based on parallel stream processing
WO2014008495A2 (en) Managing dependencies between operations in a distributed system
US12111817B2 (en) Log execution method and apparatus, computer device and storage medium
CN107818129B (en) query restartability
US11210277B2 (en) Distributing and processing streams over one or more networks for on-the-fly schema evolution
WO2022048358A1 (en) Data processing method and device, and storage medium
CN113032477B (en) GTID-based long-distance data synchronization method, device and computing equipment
CN113672556A (en) Method and device for migrating batch files
US20250200070A1 (en) Consensus Protocol For Asynchronous Database Transaction Replication With Fast, Automatic Failover, Zero Data Loss, Strong Consistency, Full SQL Support And Horizontal Scalability
US11494347B1 (en) Efficient update-anywhere replication of queue operations on a replicated message queue
WO2024051454A1 (en) Method and apparatus for processing transaction log
CN116010412A (en) Database full synchronization breakpoint continuous transmission method, system and storage medium
CN112689831B (en) Method, apparatus and system for non-destructive upgrade of a distributed coordination engine in a distributed computing environment
CN114444719A (en) Model updating method, apparatus, storage medium and electronic device
CN118689888A (en) Database concurrent transaction management method, device, management system and medium
US12277140B2 (en) Consensus protocol for asynchronous database transaction replication with fast, automatic failover, zero data loss, strong consistency, full SQL support and horizontal scalability
CN113326268B (en) Data writing and reading method and device
EP4400982A1 (en) Method and apparatus for processing photovoltaic data, and system for managing photovoltaic data
US20250139163A1 (en) Estimating Graph Size And Memory Consumption Of Distributed Graph For Efficient Resource Management

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination