CN119513205B - Data synchronization method and device

Info

Publication number: CN119513205B
Application number: CN202411793442.6A
Authority: CN
Inventors: 曹鑫; 邵先凯; 尹迎昭
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2024-12-06
Filing date: 2024-12-06
Publication date: 2025-08-19
Anticipated expiration: 2044-12-06
Also published as: CN119513205A

Abstract

The invention discloses a data synchronization method and device, and relates to the technical field of computers. The method comprises the steps of obtaining a service change log in real time, determining one or more first change data corresponding to the service change log, determining a timestamp of each first change data, determining a first minimum timestamp corresponding to a current preset time period based on each timestamp, storing the first change data corresponding to the current preset time period to an intermediate database according to the timestamp in response to the first minimum timestamp not reaching a preset time node, obtaining second change data corresponding to a historical preset time period before the current preset time period in the intermediate database in response to the first minimum timestamp reaching the preset time node, and carrying out zipper operation on the second change data and the historical data in a data warehouse. The embodiment realizes the streaming processing and simultaneously ensures the data integrity in the data zipper process.

Description

Data synchronization method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for data synchronization.

Background

The data warehouse zipper technology is a method for tracking data changes. It records data by storing entries with different timestamps and data states in one table, so that historical versions of the data are preserved while the storage cost is reduced.

However, in the existing zipper process for data synchronization, it is often necessary to compare and correlate the full data tables of different dates, identify the validity period of each piece of data, and write the data back to the data directory, which makes the data zipper process time-consuming. In addition, since the consistency of data from multiple data sources at the same time point cannot be ensured, the full data tables that are zipped together can only correspond to a single data source. For example, the zipper operation can only be performed on commodity price tables of different dates to obtain a zipped commodity price table; it cannot be performed on commodity price and logistics information at the same time.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a method and an apparatus for data synchronization. By acquiring the first change data corresponding to the service change log in the current preset time period, the full data zipper process can be split into zipper processes for the change data of multiple time periods, which reduces the peak of resource processing and achieves streaming processing as the service change log is acquired in real time. In addition, the integrity of the second change data in the data zipper process is ensured by comparing the first minimum timestamp corresponding to the current preset time period with the preset time node, which prevents zipper errors caused by data delay in abnormal scenarios.

To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a method of data synchronization.

The data synchronization method comprises the steps of receiving a service change log in real time, determining one or more first change data corresponding to the service change log, determining a time stamp of each first change data, determining a first minimum time stamp corresponding to a current preset time period based on each time stamp, storing the first change data corresponding to the current preset time period to an intermediate database according to the time stamp when the first minimum time stamp does not reach a preset time node, obtaining second change data corresponding to a historical preset time period located before the current preset time period in the intermediate database when the first minimum time stamp reaches the preset time node, and carrying out zipper operation on the second change data and historical data in a data warehouse so as to carry out data synchronization on the historical data by using the second change data.

Optionally, the determining of the one or more first change data corresponding to the service change log includes determining one or more target service data tables corresponding to the service change log from a service database, wherein the service data tables corresponding to different data sources are stored in the service database, and the first change data corresponding to the service change log is obtained from each target service data table.

Optionally, after determining the first minimum time stamp corresponding to the current preset time period based on each time stamp, the method further comprises the steps of writing the first minimum time stamp corresponding to the current preset time period into a metadata file of a data lake, wherein second minimum time stamps corresponding to a plurality of historical time periods are stored in the metadata file, monitoring the metadata file in real time, determining one or more target second minimum time stamps with time earlier than the first minimum time stamp from the second minimum time stamps, and taking the historical time periods corresponding to the one or more target second minimum time stamps as the historical preset time periods.

Optionally, performing the zipper operation on the second change data and the historical data in the data warehouse comprises: sorting the historical preset time periods by time, and, for each historical preset time period in order from earliest to latest, updating a source code field and a cursor field corresponding to the historical data according to the second change data corresponding to that historical preset time period, updating the latest state of the historical data according to the source code field, and updating the historical track of the historical data according to the cursor field.

Optionally, the step of performing zipper operation on the second change data and the historical data in the data warehouse comprises the steps of positioning the modification position of the second change data in the historical data based on an index constructed in the data warehouse in advance, and performing zipper operation on the historical data according to a positioning result.

Optionally, the index comprises two levels, wherein the first-level index is a bucket index of the data warehouse, and the second-level index is a step length index that corresponds to each bucket and indicates a plurality of step length coding sections in that bucket.

Optionally, positioning the modification position of the second change data in the historical data based on the index constructed in the data warehouse in advance comprises: determining a target bucket corresponding to the second change data from a plurality of buckets according to the first-level index, determining a target step length coding section corresponding to the second change data from a plurality of step length coding sections corresponding to the target bucket according to the second-level index, and determining the modification position from the target step length coding section according to the file name of the second change data.

To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided an apparatus for data synchronization.

The data synchronization device comprises an acquisition module, a determination module, an updating module and a data synchronization module, wherein the acquisition module is used for receiving a service change log in real time and determining one or more first change data corresponding to the service change log, the determination module is used for determining a time stamp of each first change data and determining a first minimum time stamp corresponding to a current preset time period based on each time stamp, the updating module is used for storing the first change data corresponding to the current preset time period into an intermediate database according to the time stamp when the first minimum time stamp does not reach a preset time node, and acquiring second change data corresponding to a historical preset time period before the current preset time period in the intermediate database and carrying out zipper operation on the second change data and the historical data in a data warehouse so as to carry out data synchronization on the historical data by using the second change data when the first minimum time stamp reaches the preset time node.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic device for data synchronization.

The electronic equipment for data synchronization comprises one or more processors and a storage device, wherein the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the method for data synchronization of the embodiment of the invention.

To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium.

A computer readable storage medium of an embodiment of the present invention has stored thereon a computer program which, when executed by a processor, implements a method of data synchronization of an embodiment of the present invention.

The embodiment of the invention has the advantages that the whole data zipper process can be disassembled into the zipper process of the change data in a plurality of time periods by acquiring the first change data corresponding to the service change log in the current preset time period, the peak value of resource processing is reduced, and the effect of stream processing is realized along with the real-time acquisition of the service change log. In addition, the data integrity of the second change data in the data zipper process is ensured through the comparison between the first minimum timestamp corresponding to the current preset time period and the preset time node, and the zipper error caused by data delay in an abnormal scene is prevented.

Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic flow diagram of a method of data synchronization according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a main flow for acquiring first change data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the setup of a message queue according to an embodiment of the invention;

FIG. 4 is a schematic diagram of interactions between multiple systems involved in data synchronization according to an embodiment of the invention;

FIG. 5 is a schematic diagram of the interaction process between a compute operator and a commit operator according to an embodiment of the invention;

FIG. 6 is a schematic diagram of a main flow of determining a historical preset time period according to an embodiment of the present invention;

FIG. 7 is a flow diagram of a particular embodiment of data synchronization according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a main flow of performing the zipper operation on a plurality of second change data corresponding respectively to a plurality of historical preset time periods, according to an embodiment of the present invention;

FIG. 9 is a diagram of the resulting data structure of a data zipper in accordance with an embodiment of the present invention;

FIG. 10 is a main flow diagram of another method of data synchronization according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of a main flow for index-based positioning according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of the main modules of an apparatus for data synchronization according to an embodiment of the present invention;

FIG. 13 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

FIG. 14 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments of the present invention and the technical features in the embodiments may be combined with each other provided that they do not conflict.

It should be noted that, in the technical solution of the present disclosure, the collection, updating, analysis, processing, use, transmission, storage and other handling of users' personal information all comply with the relevant laws and regulations, serve lawful purposes, and do not violate public order and good morals. Necessary measures are taken to protect users' personal information, to prevent illegal access to it, and to maintain users' personal information security, network security and national security.

For ease of understanding, a specific scenario to which the embodiments of the present invention apply is described first. Offline data synchronization is usually performed in units of days, that is, all data of day T is synchronized to the data warehouse on day T+1, so that the data in the data warehouse is aligned with the data of day T. Specifically, many pieces of data are generated at different times during day T, so performing a one-time zipper operation on all of the data of day T after day T has ended consumes a large amount of time, and the data synchronization efficiency is low. Moreover, the data generated on day T may come from different data sources. For example, the name of a commodity is modified at 0:00 (the corresponding data source is the commodity name data table in the service system), the price of the commodity is modified at 12:00 (the corresponding data source is the commodity price data table in the service system), and, as the commodity continues to be sold, its remaining inventory is last modified at 18:00 (the corresponding data source is the commodity inventory data table in the service system). In the data synchronization process, the commodity name data table, the commodity price data table and the commodity inventory data table in the service system are compared in turn with the corresponding data tables in the data warehouse, and the data tables in the data warehouse are updated with the difference data obtained by the comparison. Each data source can therefore only be updated independently, for example, the commodity inventory data table in the data warehouse is updated using the commodity inventory data table in the service system, which further affects the data synchronization efficiency.

Therefore, the embodiment of the present invention provides a new data synchronization method, which can split the full data to be updated for day T before day T ends. That is, the whole of day T is divided into a plurality of current preset time periods, and the data in each current preset time period is processed in a streaming manner, which reduces the peak of resource processing. However, splitting the one-time full-data zipper synchronization into a plurality of current preset time periods inevitably raises a data consistency problem, namely that some data may be written late because of delay. The embodiment of the present invention therefore ensures data integrity during the data zipper process by determining the first minimum timestamp corresponding to each current preset time period and comparing that first minimum timestamp with the preset time node.

Fig. 1 is a schematic diagram of main steps of a method of data synchronization according to an embodiment of the present invention.

As shown in fig. 1, the method for data synchronization according to the embodiment of the present invention mainly includes the following steps:

Step S101, receiving a service change log in real time and determining one or more first change data corresponding to the service change log;

Step S102, determining a timestamp of each first change data, and determining a first minimum timestamp corresponding to a current preset time period based on each timestamp;

Step S103, in response to the first minimum timestamp not reaching the preset time node, storing the first change data corresponding to the current preset time period to an intermediate database according to the timestamp;

Step S104, in response to the first minimum timestamp reaching the preset time node, acquiring second change data corresponding to a historical preset time period before the current preset time period from the intermediate database, and performing a zipper operation on the second change data and the historical data in the data warehouse, so as to synchronize the historical data using the second change data.
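By way of illustration only, the branching between steps S103 and S104 can be sketched as follows. This is a minimal sketch and not the implementation of the embodiment: the function name `process_period`, the use of plain Python lists for the intermediate database and the data warehouse, and the choice to keep the current period's own data in the intermediate store after a zipper run are all assumptions made for the example. The zipper operation itself is reduced to a placeholder that appends records in time order.

```python
from datetime import datetime

def process_period(changes, time_node, intermediate_db, warehouse):
    """Handle one current preset time period (hypothetical helper).

    changes: list of (timestamp, payload) pairs observed in the period.
    Returns "idle", "stored" or "zipped".
    """
    if not changes:
        return "idle"
    ordered = sorted(changes, key=lambda c: c[0])
    first_min_ts = ordered[0][0]                      # step S102
    if first_min_ts < time_node:                      # step S103: node not reached
        intermediate_db.extend(ordered)
        return "stored"
    # Step S104: node reached, so every period before the node is complete.
    history = sorted((c for c in intermediate_db if c[0] < time_node),
                     key=lambda c: c[0])
    warehouse.extend(history)                         # placeholder for the zipper operation
    # Assumption: data at or after the node stays in the intermediate store.
    intermediate_db[:] = [c for c in intermediate_db if c[0] >= time_node]
    intermediate_db.extend(ordered)
    return "zipped"
```

Called once per preset time period, the function only touches the warehouse when the earliest timestamp of the period has passed the preset time node, which is the condition the method uses to conclude that no delayed data for earlier periods is still outstanding.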

The service change log refers to a log file generated when the service is changed, that is, a log file generated each time a user performs and submits an operation. For example, the user fills in the name, price, quantity and size of a commodity on the commodity management page, and after the user clicks the confirm-and-submit button, the background generates a corresponding service change log, which records in detail the content modified by the operation, the specific modification time, the operation type and the like. In an exemplary embodiment, a user lists a new commodity on the commodity management page and fills in the name "commodity 1", a selling price of 100 yuan, and an inventory quantity of 100. After the user clicks the confirm-and-submit button, a service change log including all of the operation information is generated, in which each modified item of content is one first change data. Specifically, in practical applications, the service data is stored in a MySQL database, and the generated service change log is correspondingly stored in the Binlog file of the MySQL database, which records the events that change the database. The Binlog file is the core of data replication and data recovery in the MySQL database, and records all DDL and DML operations (e.g., INSERT, UPDATE, DELETE).

The first change data refers to each individual change recorded in the service change log. For example, on the commodity management page the user changes the name of the commodity from A to B, adjusts the price to 100 yuan, adjusts the size to 50 cm × 30 cm, and clicks the confirm-and-submit button after the settings are completed. The generated service change log then actually contains three first change data: the name changed from A to B, the price adjusted to 100 yuan, and the size adjusted to 50 cm × 30 cm. In the data storage process, these three first change data are stored in three source data tables, namely a commodity name table, a commodity price table and a commodity size table.

Thus, the process of acquiring the first change data in step S101 may, as shown in fig. 2, include:

Step S201, acquiring a service change log in real time;

Step S202, determining one or more target service data tables corresponding to the service change log from a service database, wherein the service data tables respectively corresponding to different data sources are stored in the service database;

Step S203, first change data corresponding to the service change log are respectively obtained from each target service data table.

As can be seen from the above-mentioned process, in the service database, the service data tables (source data tables) corresponding to different services (i.e., different data sources) are stored, and besides the above-mentioned commodity name table, commodity price table, and commodity size table, the service data tables may also include, for example, a commodity logistics transportation table, a commodity after-sales state table, and the like, and only the corresponding target service data table needs to be accessed to obtain the first change data along with the different specific operation contents included in the service change log, and the service data table irrelevant to the actual change operation does not need to be accessed.
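Steps S201 to S203 can be pictured with the short sketch below, offered only as an illustration. The mapping `FIELD_TO_TABLE`, the table names, the shape of the change log record and the helper names are hypothetical and are not taken from the embodiment. The point shown is that only the tables named by the change log are read.

```python
# Hypothetical mapping from a changed field to its source data table.
FIELD_TO_TABLE = {
    "name": "commodity_name_table",
    "price": "commodity_price_table",
    "size": "commodity_size_table",
    "logistics": "commodity_logistics_table",
}

def target_tables(change_log):
    """Step S202: tables touched by this change log, and no others."""
    return sorted({FIELD_TO_TABLE[field] for field in change_log["fields"]})

def first_change_data(change_log, service_db):
    """Step S203: read one first change data from each target table."""
    result = []
    for table in target_tables(change_log):
        value = service_db[table][change_log["commodity_id"]]
        result.append({
            "table": table,
            "timestamp": change_log["timestamp"],
            "value": value,
        })
    return result
```

A change log that only mentions the name and the price yields two first change data, and the logistics and size tables are never opened.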

In an alternative embodiment, the timestamp in step S102 may be the operation time or change time of the first change data. Taking the adjustment of the commodity name, price and size as an example again, since the three first change data are submitted together when the user clicks the confirm-and-submit button, their timestamps are the same, namely the time at which the user clicked the button. However, if the user makes the three changes in separate operations, the timestamp of each first change data is different. For example, if the user changes the commodity name and submits at 10:00, then changes the commodity price and submits again at 10:01, and then changes the commodity size and submits at 10:02, the timestamp of the first change data corresponding to the commodity name is 10:00, that of the first change data corresponding to the commodity price is 10:01, and that of the first change data corresponding to the commodity size is 10:02.

In an alternative embodiment, the obtained first change data may be placed into different message queues according to the service they belong to, and subsequent asynchronous consumption of the first change data is implemented through the message queues. Illustratively, as shown in fig. 3, when the service change log indicates that at 23:50 the user created a commodity with a size of 20×30, a price of 60, and the name "commodity 1", the commodity basic information table, the commodity attribute information table and the commodity price information table corresponding to the service change log are read from the service database to obtain the corresponding first change data. Since the data sources of the three first change data are different, they may be placed in their respective message queues (from top to bottom in fig. 3: message queue 1 for the commodity basic information table, message queue 2 for the commodity attribute information table, and message queue 3 for the commodity price information table), where the message queues may be Kafka queues. The user then performs another operation at 23:52 to adjust the size of commodity 1; in the same way as before, the corresponding first change data is obtained from the commodity attribute information table and placed into message queue 2. It will be appreciated that data at different time points in the same message queue is typically processed in time order on a first-in-first-out basis.
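The routing described for fig. 3 can be illustrated with in-memory queues. The sketch below is an assumption-laden stand-in: it uses `collections.deque` rather than Kafka, and the record fields (`source`, `time`, `value`) and the function name `route` are hypothetical. It shows one queue per data source and first-in-first-out consumption within a queue.

```python
from collections import defaultdict, deque

def route(changes):
    """Place each first change data into the queue of its data source."""
    queues = defaultdict(deque)
    for change in changes:
        queues[change["source"]].append(change)  # append keeps arrival order
    return queues

# The two operations from the example: creation at 23:50, size change at 23:52.
events = [
    {"source": "basic", "time": "23:50", "value": "commodity 1"},
    {"source": "attribute", "time": "23:50", "value": "20x30"},
    {"source": "price", "time": "23:50", "value": 60},
    {"source": "attribute", "time": "23:52", "value": "50x30"},
]
queues = route(events)
oldest_attribute_change = queues["attribute"].popleft()  # FIFO: the 23:50 entry
```

After routing, the attribute queue holds both size values in arrival order, while the basic information and price queues each hold one entry.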

Regarding the current preset time period and the first minimum timestamp in step S102: as described above, the preset time period is a division of the whole of day T, and may be set according to the actual requirements of the user, for example to 1 minute, 5 minutes, 10 minutes, and so on. However, in order to synchronize the service change log acquired in real time in a streaming manner as far as possible, it is not recommended to set the preset time period too long; 1 to 5 minutes is preferred. That is, for day T, 0:00-0:01 is one preset time period, 0:01-0:02 is the next, 0:02-0:03 is the one after that, and so on, so that the whole of day T is divided into a plurality of preset time periods. As the current time advances, different first change data fall into different current preset time periods; for example, data with a timestamp of 0:00 belongs to the current preset time period 0:00-0:01. In the present invention the current preset time period is closed on the left and open on the right, that is, 0:00 ∈ [0:00, 0:01). Depending on the configuration, it may instead be closed on the right and open on the left, as long as the current preset time periods are contiguous and together cover the whole of day T, so that no first change data falls outside every preset time period. The first minimum timestamp may be understood as the minimum timestamp among the one or more first change data in the preset time period, that is, the timestamp of the earliest of those first change data. It is obtained simply by comparing the timestamps of the one or more first change data that fall within the current preset time period, which is not explained further here.
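The left-closed, right-open division of day T and the first minimum timestamp can be sketched as follows. The function names `period_of` and `first_min_timestamp` are hypothetical, and the sketch assumes the period length in minutes divides the 1440 minutes of a day evenly.

```python
from datetime import datetime, timedelta

def period_of(ts, minutes=1):
    """Return the [start, end) preset time period that contains ts."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    width = timedelta(minutes=minutes)
    index = (ts - midnight) // width      # timedelta // timedelta gives an int
    start = midnight + index * width
    return start, start + width

def first_min_timestamp(timestamps, period):
    """Earliest timestamp inside the period, or None if the period is empty."""
    start, end = period
    inside = [t for t in timestamps if start <= t < end]   # left-closed, right-open
    return min(inside, default=None)
```

With a one-minute period, a timestamp of exactly 0:01 is assigned to [0:01, 0:02) rather than [0:00, 0:01), so consecutive periods never share a record and no record is left unassigned.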

In a further alternative embodiment, the determination of the first minimum timestamp and the subsequent storing of the first change data to the intermediate database may be implemented using the open-source Flink stream processing framework. Specifically, as shown in fig. 4, the Flink stream processing framework first obtains one or more first change data from the service database according to the service change log and places them into the corresponding message queues, and then, in each current preset time period, calculates the first minimum timestamp corresponding to that period from the timestamps of the first change data in the message queues. The Flink stream processing framework contains a number of operators with different functions, such as read operators, calculation operators and submission operators. The read operator reads the timestamps of the first change data from the message queue; the calculation operator calculates the minimum timestamp of each field in the current preset time period according to a preset calculation rule (for example, a minimum-time rule) and sends it to the submission operator; and the submission operator submits and stores the first minimum timestamp.

Illustratively, the preset time period may be set to a minute level, for example, 1min, 3min, 5min, or 10min, etc. The interaction process between the calculation operator and the submission operator is shown in fig. 5, for each calculation operator, a certain field is continuously monitored in a current preset time period corresponding to the current time, and the time stamps of the field are continuously compared according to the service change log to obtain the minimum time stamp of each field. And after the current preset time period is finished, each calculation operator sends the minimum timestamp of the corresponding field to the unique submitting operator, and the submitting operator calculates to obtain the first minimum timestamp corresponding to all the fields.
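The division of labour between the calculation operators and the single submission operator can be illustrated in plain Python. This is not Flink API code; the function names `compute_operator` and `commit_operator` and the dictionary keyed by field name are assumptions made for the example.

```python
from datetime import datetime

def compute_operator(field_timestamps):
    """One calculation operator: minimum timestamp seen for a single field."""
    return min(field_timestamps, default=None)

def commit_operator(per_field_minimums):
    """The unique submission operator: minimum over all fields' minimums."""
    candidates = [m for m in per_field_minimums.values() if m is not None]
    return min(candidates, default=None)

# Timestamps observed per field during one current preset time period.
observed = {
    "name": [datetime(2024, 12, 6, 10, 2), datetime(2024, 12, 6, 10, 0)],
    "price": [datetime(2024, 12, 6, 10, 1)],
    "size": [],  # no change to this field in the period
}
minimums = {field: compute_operator(ts) for field, ts in observed.items()}
first_minimum_timestamp = commit_operator(minimums)
```

Each per-field minimum can be computed independently, and only one value per field has to reach the submission operator at the end of the period.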

It should be noted that the Flink stream processing framework is essentially a calculation engine and has no storage function of its own. Therefore, in an alternative embodiment, after the first minimum timestamp is obtained, it is stored in a metadata file of the data lake; the historical preset time periods that can be used for the data zipper are determined by monitoring the metadata file, and the subsequent zipper process between those historical preset time periods and the historical data in the data warehouse is scheduled. Specifically, the process of obtaining the historical preset time period is shown in fig. 6, and includes:

Step S601, acquiring a service change log in real time;

Step S602, determining one or more target service data tables corresponding to the service change log from a service database;

Step S603, obtaining the first change data corresponding to the service change log from each target service data table, and putting each first change data into the corresponding message queue;

Step S604, sequentially acquiring time stamps corresponding to a plurality of first change data in a message queue in a current preset time period, and determining a first minimum time stamp;

Step S605, writing a first minimum timestamp corresponding to a current preset time period into a metadata file of a data lake, wherein the metadata file stores second minimum timestamps corresponding to a plurality of historical time periods respectively;

Step S606, monitoring the metadata file in real time, and determining one or more target second minimum timestamps that are earlier than the first minimum timestamp from the second minimum timestamps;

Step S607, taking the historical time periods corresponding to the one or more target second minimum timestamps as the historical preset time periods.
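Steps S606 and S607 amount to a filter over the stored minimum timestamps. The sketch below is illustrative only: it represents the metadata file as a Python dictionary from a period label to its second minimum timestamp, which is an assumption about the layout, and the function name `historical_periods` is hypothetical.

```python
from datetime import datetime

def historical_periods(metadata, first_min_ts):
    """Periods whose stored minimum timestamp is earlier than first_min_ts,
    returned from earliest to latest."""
    selected = [p for p, ts in metadata.items() if ts < first_min_ts]
    return sorted(selected, key=lambda p: metadata[p])

metadata = {
    "0:10-0:20": datetime(2024, 12, 6, 0, 12),
    "0:00-0:10": datetime(2024, 12, 6, 0, 3),
    "0:20-0:30": datetime(2024, 12, 6, 0, 25),
}
ready = historical_periods(metadata, datetime(2024, 12, 6, 0, 25))
```

The comparison is strict, so the period whose minimum timestamp equals the first minimum timestamp (here the current period itself) is not treated as historical.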

It will be appreciated that the current preset time period changes continuously as time passes. Taking a preset time period of 10 min as an example, when the time is 0:08 the current preset time period is 0:00-0:10, and when the time is 0:12 the current preset time period is 0:10-0:20, at which point 0:00-0:10 has become a historical preset time period. Therefore, as long as the first minimum timestamps corresponding to the successive current preset time periods have not reached the preset time node, each of those periods becomes a historical preset time period once it has passed, and its first change data is stored in the intermediate database.

In the embodiment of the invention, the first minimum timestamp is submitted through the commit operator of Flink, so the metadata file may be a commit file, whose storage format is similar to JSON. The data lake refers to the Hudi data management framework, which is designed for streaming and batch data processing in large data lakes. In a further alternative embodiment, the metadata file may be monitored according to a preconfigured monitoring rule; illustratively, the type of file to be monitored, the data table mode, the data path, the metadata file name and the like may be set in advance.

It will be appreciated that, in order to reduce the peak load of centralized data processing, the preset time period may be set at the minute level so that the service change log obtained in real time is processed as it arrives. However, the essence of the data zipper technology is to reduce the number of zipper operations and improve data processing efficiency by handling the changes in a single zipper pass. Thus, in the embodiment of the present invention, the preset time node may be set according to the date, i.e., 00:00 each day. Taking a preset time period of 10 min as an example, a first minimum timestamp is submitted by the commit operator in the Flink stream processing framework every 10 min, and a metadata file is generated accordingly, where each metadata file indicates the modification times and modification contents recorded by the service change log within the current 10 min. That is, before 00:00 on day T+1 there are a plurality of historical preset time periods on day T, namely 0:00-0:10, 0:10-0:20, 0:20-0:30, ..., 23:40-23:50, 23:50-24:00, and the zipper operation on the data of these historical preset time periods is performed once the first current preset time period after 00:00 is processed.

For ease of understanding, a specific description is given below with reference to fig. 7:

In fig. 7, the user performs three modification operations, at 23:50, at 23:52, and at 00:01 of the next day; the preset time period is set to 1 minute, and the preset time node is set to 00:00 of each day.

For the current preset time period 23:50-23:51, the first minimum timestamp is determined to be 23:50 based on the service change log of 23:50. Comparison with the preset time node shows that the node has not been reached, so the first change data corresponding to 23:50-23:51 (commodity 1 Src Map [ ] Cur Map [20 x30, 60 ]) is stored in the intermediate database.

For the current preset time period 23:52-23:53, the first minimum timestamp is determined to be 23:52 based on the service change log of 23:52. Comparison with the preset time node shows that the node has not been reached, so the first change data corresponding to 23:52-23:53 (commodity 1 Src Map [ ] Cur Map [50 x 30 ]) is stored in the intermediate database. It will be appreciated that 23:50-23:51 is now a historical preset time period, since it precedes the current preset time period.

For the current preset time period 00:01-00:02, the first minimum timestamp is determined to be 00:01 based on the service change log of 00:01. Comparison with the preset time node shows that the node has been reached, which means that the data of the previous day has been completely stored in the service database and that data synchronization between the service database and the data warehouse can be executed. The zipper operation is then performed on the second change data in the intermediate database corresponding to the historical preset time periods before 00:01-00:02, which completes the synchronization of the previous day's service data. It will be appreciated that 23:50-23:51 and 23:52-23:53 are both historical preset time periods relative to 00:01-00:02.

In a further alternative embodiment, when the first minimum timestamp reaches the preset time node, the first change data still needs to be stored in the intermediate database, and the second change data corresponding to the historical preset time period is synchronously deleted from the intermediate database; that is, the intermediate database always holds the change data that has not yet been synchronized.
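The walk-through of fig. 7 can be modeled with a short Python sketch. This is a toy model under several assumptions: the intermediate database is an in-memory dictionary, the zipper operation is represented by appending to a list, "reaching" the node is taken to mean "at or after" it, and a single fixed time node is used rather than one that advances daily. The class name `Synchronizer` and its members are hypothetical.

```python
from datetime import datetime


class Synchronizer:
    """Toy model: buffer change data per period, zip once the time node is reached."""

    def __init__(self, time_node):
        self.time_node = time_node  # preset time node, e.g. 00:00 of the next day
        self.intermediate = {}      # period label -> list of (timestamp, payload)
        self.zipped = []            # change data already handed to the zipper operation

    def process_period(self, label, changes):
        """Handle the first change data (timestamp, payload) of one current period."""
        first_min = min(ts for ts, _ in changes)
        if first_min >= self.time_node:
            # Node reached: zip every buffered historical period, earliest first,
            # and delete it from the intermediate store.
            ordered = sorted(self.intermediate, key=lambda k: self.intermediate[k][0][0])
            for old_label in ordered:
                self.zipped.extend(self.intermediate.pop(old_label))
        # In both branches the current period's data is buffered for a later zipper run.
        self.intermediate[label] = sorted(changes, key=lambda c: c[0])
        return first_min


sync = Synchronizer(time_node=datetime(2024, 12, 6, 0, 0))
sync.process_period("23:50-23:51", [(datetime(2024, 12, 5, 23, 50), "change A")])
sync.process_period("23:52-23:53", [(datetime(2024, 12, 5, 23, 52), "change B")])
sync.process_period("00:01-00:02", [(datetime(2024, 12, 6, 0, 1), "change C")])
```

After the third call, the two periods of the previous day have been zipped in chronological order, and only the not-yet-synchronized data of 00:01-00:02 remains in the intermediate store.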

It should be noted that, when actual service data is stored, data may be delayed because of some anomaly; that is, after the user performs the relevant operation, the first change data is not obtained from the service database within the current preset time period in which the operation occurred, but within a later current preset time period. In this situation, the data synchronization method provided by the invention can still perform data synchronization smoothly and avoids performing the data zipper on incomplete data. Specifically, again taking the embodiment of fig. 7 as an example, if delayed first change data with a timestamp of 23:58 is received within the current preset time period 00:01-00:02, the first minimum timestamp calculated for that period is 23:58, which has not reached the preset time node. The data zipper is therefore not performed in the current preset time period 00:01-00:02, even though that period falls after 00:00; instead, the preceding historical preset time periods are zipped in a subsequent current preset time period, for example 00:02-00:03.
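The effect of delayed data on the trigger condition can be checked with a few lines of Python. The sketch assumes that "reaching" the preset time node means the minimum timestamp is at or after the node; the helper name `node_reached` is hypothetical.

```python
from datetime import datetime

time_node = datetime(2024, 12, 6, 0, 0)  # assumed preset time node (00:00)


def node_reached(timestamps, node):
    """True only if even the earliest change in the period is at or after the node."""
    return min(timestamps) >= node


# Period 00:01-00:02 with only on-time data.
on_time = [datetime(2024, 12, 6, 0, 1)]
# The same period when a delayed change stamped 23:58 also arrives.
with_late = [datetime(2024, 12, 6, 0, 1), datetime(2024, 12, 5, 23, 58)]

trigger_on_time = node_reached(on_time, time_node)
trigger_with_late = node_reached(with_late, time_node)
```

Because the minimum is taken over all change data in the period, a single delayed record is enough to postpone the zipper operation to a later period.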

With reference to fig. 1 to fig. 7, the embodiment of the invention has described how, when the full data is divided by preset time periods, data synchronization is performed on the second change data corresponding to the historical preset time periods according to the first change data corresponding to the current preset time period. Since a plurality of second change data usually exist in practice, and unlike the prior art, in which the data zipper is performed by comparing metadata tables, the invention provides a new data zipper mode in which an atomized zipper process can be realized according to the times corresponding to the second change data.

In an alternative embodiment, when there are a plurality of historical preset time periods, the specific process of performing the data zipper may be as shown in fig. 8, including:

Step S801, sorting a plurality of historical preset time periods according to time;

Step S802, for each historical preset time period from early to late, sequentially executing the steps of updating a source code field and a cursor field corresponding to the historical data according to second change data corresponding to the historical preset time period;

Step S803, updating the latest state of the historical data according to the source code field and updating the historical track of the historical data according to the cursor field.

It will be appreciated that the data zipper process represents the course of data changes, so data synchronization needs to be performed in the chronological order of the historical preset time periods. A data table to be zipped is generally provided with a second-level data partition dp, a start date field, and an end date field. The dp partition takes two enumerated values, ACTIVE and EXPIRED, where ACTIVE stores the latest state of the data and EXPIRED stores the historical track of the data; the two partition values together form a closed loop over the data. Illustratively, as shown in fig. 9, sku_id represents the unique code of a commodity; for example, code 10000 represents a commodity of brand 01, and code 10001 represents a commodity of brand 02. As can be seen from fig. 9, the commodity 10000 of brand 01 was named A from January 1, 2022 to December 31, 2022, was renamed B from January 1, 2023 to December 31, 2023, and has been named C from January 1, 2024 to date, where EXPIRED indicates the historical track state and ACTIVE indicates the current state. Therefore, in the embodiment of the present invention, through steps S801 to S803, the source code field and the cursor field are updated for each historical preset time period in chronological order, so that the latest state and the historical track are updated and the effect of the data zipper is finally achieved.
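The ACTIVE/EXPIRED layout of fig. 9 and the time-ordered processing of steps S801 to S803 can be sketched in Python as follows. This is a simplified illustration, not the method of the embodiment: it does not model the source code field and the cursor field, and instead closes and opens rows directly. The names `Row`, `apply_change`, `zipper`, and the sentinel `OPEN_END` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "still valid"


@dataclass
class Row:
    sku_id: int
    name: str
    start_date: date
    end_date: date
    dp: str  # "ACTIVE" (latest state) or "EXPIRED" (historical track)


def apply_change(table, sku_id, new_name, change_date):
    """Close the ACTIVE row of sku_id, if any, and append the new ACTIVE row."""
    for row in table:
        if row.sku_id == sku_id and row.dp == "ACTIVE":
            row.end_date = change_date - timedelta(days=1)
            row.dp = "EXPIRED"
    table.append(Row(sku_id, new_name, change_date, OPEN_END, "ACTIVE"))


def zipper(table, periods):
    """periods: mapping of period start -> list of (sku_id, new_name, change_date)."""
    for period_start in sorted(periods):  # process periods from early to late
        for sku_id, name, day in periods[period_start]:
            apply_change(table, sku_id, name, day)
    return table


table = [Row(10000, "A", date(2022, 1, 1), OPEN_END, "ACTIVE")]
periods = {
    date(2024, 1, 1): [(10000, "C", date(2024, 1, 1))],
    date(2023, 1, 1): [(10000, "B", date(2023, 1, 1))],
}
zipper(table, periods)
```

Although the periods are supplied out of order, sorting them first yields the history A, B, C with exactly one ACTIVE row, which is why the chronological order matters.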

When data is synchronized in a data lake, incremental data is typically written to log files on its own, stock data is written to parquet files, and the incremental data and stock data are combined based on the Compaction table service and merging at read time (merge-on-read, MOR). In the prior art, an update is completed by overwriting old data with new data. In the embodiment of the invention, as shown in fig. 7, two maps, Src and Cur, are defined to store the current state and the historical track state respectively, so the data zipper can be realized simply by continuously updating the Src map and the Cur map. Compared with the prior art, no data overwriting is needed, so the data zipper is more efficient and faster.

The embodiment of the invention performs streaming data synchronization according to the service change log received in real time, so quickly and efficiently identifying where the change data belongs in the historical data is particularly important during data synchronization. In an alternative embodiment, a data synchronization method provided in the embodiment of the present invention is shown in fig. 10, and specifically includes:

Step S1001, acquiring a service change log in real time, and determining one or more first change data corresponding to the service change log;

Step S1002, determining a time stamp of each first change data, and determining a first minimum time stamp corresponding to a current preset time period based on each time stamp;

Step S1003, in response to the first minimum timestamp reaching a preset time node, acquiring second change data corresponding to a historical preset time period before the current preset time period in the intermediate database;

Step S1004, positioning the modification position of the second change data in the historical data based on an index constructed in the data warehouse in advance;

Step S1005, performing zipper operation on the second change data and the historical data in the data warehouse according to the positioning result.

To locate the modification position quickly, the invention optimizes the index in the data warehouse. Again taking the data lake as an example, existing data lake frameworks usually come with a bucket index: a plurality of buckets are provided, and each bucket stores data items having the same hash value. Locating the bucket gives a first-stage location of the stored data, after which the data is located within the bucket according to the file name. In the embodiment of the invention, the index comprises two levels: the first-level index is the bucket index of the data warehouse, and the second-level index is a step index corresponding to each bucket, which indicates a plurality of step coding intervals within that bucket. The step index may be understood as assigning a continuous code to each data item; for example, with a step coding interval of ten thousand for each bucket, the codes stored in the first bucket are 00001-10001, the codes stored in the second bucket are 10001-20001, and so on, so that the data in the plurality of buckets is coded sequentially.

In a further alternative embodiment, the process of locating based on the index is as shown in fig. 11, and includes:

Step S1101, determining a target storage bucket corresponding to the second change data from the plurality of storage buckets according to the first-level index;

Step S1102, determining a target step coding interval corresponding to the second change data from a plurality of step coding intervals corresponding to the target storage bucket according to the second-level index;

Step S1103, determining the modification position from the target step coding interval according to the file name of the second change data.

It is to be understood that, in step S1103, the modification position is actually determined by comparing the similarity between the file name of the second change data and the file names of the data stored in the database. Through the above process, the target storage bucket corresponding to the second change data can be located quickly through the first-level index, and the target step coding interval in which the second change data lies can then be located quickly within the target storage bucket through the second-level index. Compared with the prior art, in which only the first-level index is provided, the second-level index further narrows the range over which file names need to be compared, and comparing against the second-level index is far faster than comparing file names, so the positioning speed of the second change data is further improved.
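The two-level lookup of steps S1101 to S1103 can be sketched in Python as follows. The sketch makes several assumptions that the source does not specify: the bucket is chosen by a CRC32 hash of a record key, the step coding interval is obtained by integer division of a numeric code (for example, a self-increment ID), and the file name comparison is simplified to an exact match rather than a similarity comparison. The class `TwoLevelIndex` and its methods are hypothetical and do not correspond to any Hudi API.

```python
import bisect
import zlib


class TwoLevelIndex:
    def __init__(self, num_buckets, step_size):
        self.num_buckets = num_buckets
        self.step_size = step_size
        # bucket id -> sorted list of interval start codes present in that bucket
        self.interval_starts = {b: [] for b in range(num_buckets)}
        # (bucket id, interval start) -> file names indexed in that interval
        self.files = {}

    def bucket_of(self, key):
        """First level: stable hash of the record key modulo the bucket count."""
        return zlib.crc32(key.encode("utf-8")) % self.num_buckets

    def interval_of(self, code):
        """Second level: start code of the step interval that contains code."""
        return (code // self.step_size) * self.step_size

    def add(self, key, code, file_name):
        bucket, start = self.bucket_of(key), self.interval_of(code)
        starts = self.interval_starts[bucket]
        pos = bisect.bisect_left(starts, start)
        if pos == len(starts) or starts[pos] != start:
            starts.insert(pos, start)  # keep interval starts sorted
        self.files.setdefault((bucket, start), []).append(file_name)

    def locate(self, key, code, file_name):
        """Return (bucket, interval_start) if file_name is indexed there, else None."""
        bucket, start = self.bucket_of(key), self.interval_of(code)
        starts = self.interval_starts[bucket]
        pos = bisect.bisect_left(starts, start)  # binary search over the intervals
        if pos == len(starts) or starts[pos] != start:
            return None  # the bucket holds no such interval
        # Only the file names inside one interval are compared.
        return (bucket, start) if file_name in self.files[(bucket, start)] else None


index = TwoLevelIndex(num_buckets=4, step_size=10000)
index.add("sku-10000", 12345, "file-a.parquet")
hit = index.locate("sku-10000", 12345, "file-a.parquet")
miss = index.locate("sku-10000", 12345, "file-b.parquet")
```

The file name comparison, which is the expensive step, runs only over the entries of a single step coding interval instead of the whole bucket.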

It should be noted that coding and sorting different data consumes considerable resources, and that the data lake is not itself an engine, so an index dictionary for coding and sorting cannot be set up and maintained in memory. Therefore, in the embodiment of the invention, step coding is performed using the service self-increment ID, which achieves the sorting effect, and the step index code is associated with the data file. In addition, the first-level index needs to be maintained in the Step field, and the second-level index is preferably maintained in the partition field; otherwise, a large number of sorting operations are required to implement the indexing function.

According to the data synchronization method provided by the embodiment of the invention, the first change data corresponding to the service change log of the current preset time period is obtained, so that the whole data zipper process can be broken down into zipper processes for the change data of a plurality of time periods, the peak of resource usage is reduced, and stream processing is achieved as the service change log is acquired in real time. In addition, comparing the first minimum timestamp corresponding to the current preset time period with the preset time node ensures the integrity of the second change data in the data zipper process and prevents zipper errors caused by data delay in abnormal scenarios.

Fig. 12 is a schematic diagram of main modules of an apparatus for data synchronization according to an embodiment of the present invention.

As shown in fig. 12, an apparatus 1200 for data synchronization according to an embodiment of the present invention includes:

An obtaining module 1201, configured to receive a service change log in real time, and determine one or more first change data corresponding to the service change log;

A determining module 1202, configured to determine a timestamp of each first change data, and determine a first minimum timestamp corresponding to a current preset time period based on each timestamp;

The updating module 1203 is configured to store first change data corresponding to the current preset time period to an intermediate database according to the timestamp in response to the first minimum timestamp not reaching a preset time node, acquire second change data corresponding to a historical preset time period before the current preset time period in the intermediate database in response to the first minimum timestamp reaching the preset time node, and perform zipper operation on the second change data and historical data in a data warehouse to perform data synchronization on the historical data by using the second change data.

In an optional embodiment of the present invention, the obtaining module 1201 is further configured to determine one or more target service data tables corresponding to the service change log from a service database, where the service data tables corresponding to different data sources are stored in the service database, and obtain first change data corresponding to the service change log from each target service data table.

In an optional embodiment of the present invention, the updating module 1203 is further configured to, after determining, based on each of the timestamps, a first minimum timestamp corresponding to the current preset time period, write the first minimum timestamp corresponding to the current preset time period into a metadata file of a data lake, where a plurality of second minimum timestamps corresponding to each of a plurality of historical time periods are stored in the metadata file, monitor the metadata file in real time, determine one or more target second minimum timestamps with time earlier than the first minimum timestamp from the second minimum timestamps, and use the historical time periods corresponding to the one or more target second minimum timestamps as the historical preset time periods.

In an alternative embodiment of the present invention, the updating module 1203 is further configured to sort the plurality of preset time periods according to time, and sequentially execute, for each preset time period from early to late, updating a source code field and a cursor field corresponding to the history data according to second change data corresponding to the preset time period, updating an up-to-date state of the history data according to the source code field, and updating a history track of the history data according to the cursor field.

In an alternative embodiment of the present invention, the updating module 1203 is further configured to locate a modification position of the second change data in the history data based on an index previously constructed in the data warehouse, and perform a zipper operation on the second change data and the history data in the data warehouse according to a locating result.

In an alternative embodiment of the present invention, the indexes include two levels, wherein the first level index is a bucket index of the data warehouse, and the second level index is a step index corresponding to each bucket, and indicates a plurality of step coding intervals in each bucket.

In an optional embodiment of the present invention, the updating module 1203 is further configured to determine, according to the first level index, a target bucket corresponding to the second change data from a plurality of buckets, determine, according to the second level index, a target step length coding section corresponding to the second change data from a plurality of step length coding sections corresponding to the target bucket, and determine, according to a file name of the second change data, the modification position from the target step length coding sections.

According to the data synchronization device provided by the embodiment of the invention, the first change data corresponding to the service change log of the current preset time period is obtained, so that the whole data zipper process can be broken down into zipper processes for the change data of a plurality of time periods, the peak of resource usage is reduced, and stream processing is achieved as the service change log is acquired in real time. In addition, comparing the first minimum timestamp corresponding to the current preset time period with the preset time node ensures the integrity of the second change data in the data zipper process and prevents zipper errors caused by data delay in abnormal scenarios.

Fig. 13 illustrates an exemplary system architecture 1300 of a data synchronization method or apparatus to which embodiments of the present invention may be applied.

As shown in fig. 13, system architecture 1300 may include terminal devices 1301, 1302, 1303, a network 1304, and a server 1305. The network 1304 serves as a medium for providing communication links between the terminal devices 1301, 1302, 1303 and the server 1305. The network 1304 may include various types of connections, such as wired or wireless communication links, or fiber optic cables, among others.

A user may interact with the server 1305 through the network 1304 using the terminal devices 1301, 1302, 1303 to receive or transmit data, etc. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 1301, 1302, 1303.

The terminal devices 1301, 1302, 1303 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 1305 may be a server providing various services, for example, a background management server supporting service data modified by the user using the terminal devices 1301, 1302, 1303, and the background management server may analyze and process received data such as a service change log.

It should be noted that, the method for data synchronization provided in the embodiment of the present invention is generally performed by the server 1305, and accordingly, the device for data synchronization is generally disposed in the server 1305.

It should be understood that the number of terminal devices, networks and servers in fig. 13 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 14, there is illustrated a schematic diagram of a computer system 1400 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 14 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.

As shown in fig. 14, the computer system 1400 includes a Central Processing Unit (CPU) 1401, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1402 or a program loaded from a storage portion 1408 into a Random Access Memory (RAM) 1403. In the RAM 1403, various programs and data required for the operation of the system 1400 are also stored. The CPU 1401, ROM 1402, and RAM 1403 are connected to each other through a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.

The following components are connected to the I/O interface 1405: an input portion 1406 including a keyboard, a mouse, and the like; an output portion 1407 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 1408 including a hard disk and the like; and a communication portion 1409 including a network interface card such as a LAN card, a modem, and the like. The communication portion 1409 performs communication processing via a network such as the Internet. A drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1410 as needed, so that a computer program read therefrom is installed into the storage portion 1408 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1409 and/or installed from the removable medium 1411. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 1401.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example, a processor may be described as including an acquisition module, a determination module, and an update module. The names of these modules do not in any way limit the module itself, and the acquisition module may also be described as a "module that receives a service change log in real time and determines one or more first change data corresponding to the service change log", for example.

As a further aspect, the invention also provides a computer readable medium, which may be included in the device described in the above embodiments or may exist alone without being assembled into the device. The computer readable medium carries one or more programs which, when executed by the device, cause the device to perform the following: acquiring a service change log in real time, and determining one or more first change data corresponding to the service change log; determining a timestamp of each first change data, and determining a first minimum timestamp corresponding to a current preset time period based on each timestamp; when the first minimum timestamp does not reach a preset time node, storing the first change data corresponding to the current preset time period to an intermediate database according to the timestamp; and when the first minimum timestamp reaches the preset time node, acquiring second change data corresponding to a historical preset time period before the current preset time period in the intermediate database, and performing a zipper operation on the second change data and the historical data in a data warehouse, so as to perform data synchronization on the historical data by using the second change data.

According to the technical scheme provided by the embodiment of the invention, the first change data corresponding to the service change log of the current preset time period is obtained, so that the whole data zipper process can be disassembled into the zipper processes of the change data in a plurality of time periods, the peak value of resource processing is reduced, and the effect of stream processing is realized along with the real-time acquisition of the service change log. In addition, the data integrity of the second change data in the data zipper process is ensured through the comparison between the first minimum timestamp corresponding to the current preset time period and the preset time node, and the zipper error caused by data delay in an abnormal scene is prevented.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method of data synchronization, comprising:

acquiring a service change log in real time, and determining one or more first change data corresponding to the service change log;

determining a timestamp of each first change data, and determining a first minimum timestamp corresponding to a current preset time period based on each timestamp;

in response to the first minimum timestamp not reaching a preset time node, storing the first change data corresponding to the current preset time period to an intermediate database according to the timestamp; and

in response to the first minimum timestamp reaching the preset time node, acquiring second change data corresponding to a historical preset time period before the current preset time period in the intermediate database, and performing a zipper operation on the second change data and the historical data in a data warehouse, so as to perform data synchronization on the historical data by using the second change data.

2. The method of claim 1, wherein the determining one or more first change data corresponding to the business change log comprises:

determining one or more target service data tables corresponding to the service change log from a service database, wherein the service database stores service data tables corresponding to different data sources respectively; and

acquiring first change data corresponding to the service change log from each target service data table respectively.

3. The method of claim 1, further comprising, after said determining a first minimum timestamp corresponding to said current preset time period based on each of said timestamps:

Writing the first minimum time stamp corresponding to the current preset time period into a metadata file of a data lake, wherein the metadata file stores second minimum time stamps corresponding to a plurality of historical time periods respectively;

monitoring the metadata file in real time, and determining, from the second minimum timestamps, one or more target second minimum timestamps that are earlier than the first minimum timestamp; and

taking the historical time periods corresponding to the one or more target second minimum timestamps as the historical preset time periods.
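As a non-limiting illustration, the selection recited in claim 3 can be sketched as a filter over the recorded minimum timestamps. The sketch assumes the metadata file can be modelled as an in-memory mapping from a period identifier to that period's minimum timestamp; the function name `select_historical_periods` and the period identifiers are hypothetical, and real-time monitoring of the file is not modelled.

```python
def select_historical_periods(metadata: dict, current_period: str,
                              first_min_ts: int) -> list:
    """Record the current period's minimum timestamp and return the
    historical periods whose recorded minimum timestamp is earlier."""
    # Step 1: write the first minimum timestamp into the (modelled) metadata file.
    metadata[current_period] = first_min_ts
    # Step 2: keep only periods whose second minimum timestamp is earlier.
    targets = [
        period for period, min_ts in metadata.items()
        if period != current_period and min_ts < first_min_ts
    ]
    # Return them from earliest to latest so they can be processed in order.
    return sorted(targets, key=lambda period: metadata[period])
```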

4. The method of claim 3, wherein, in response to there being a plurality of historical preset time periods, the performing a zipper operation on the second change data and the historical data in the data warehouse comprises:

sorting the plurality of historical preset time periods by time; and

for each historical preset time period, in order from earliest to latest, sequentially performing the following steps:

updating a source code field and a cursor field corresponding to the historical data according to the second change data corresponding to that historical preset time period; and

updating the latest state of the historical data according to the source code field, and updating the historical track of the historical data according to the cursor field.
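The ordering requirement of claim 4 can be illustrated with a conventional zipper table. This is a hedged sketch only: the claim does not spell out the layout of the source code field and the cursor field, so the sketch does not model them and instead uses the common zipper layout in which each row carries a `start` and an `end` timestamp and the open (latest) row ends at a sentinel. The names `OPEN_END`, `zipper` and the row keys are assumptions.

```python
OPEN_END = 2**63 - 1  # hypothetical sentinel marking the currently valid row


def zipper(history: list, periods: dict) -> list:
    """Apply change data to a zipper table, one period at a time.

    history: rows shaped like {"key", "value", "start", "end"}.
    periods: period id -> list of (key, value, ts) change tuples.
    The history list is updated in place and also returned.
    """
    # Track the open row of each key so it can be closed when a change arrives.
    open_rows = {row["key"]: row for row in history if row["end"] == OPEN_END}
    for period in sorted(periods):  # earliest period first
        for key, value, ts in sorted(periods[period], key=lambda c: c[2]):
            current = open_rows.get(key)
            if current is not None:
                current["end"] = ts  # close the previous version (historical track)
            new_row = {"key": key, "value": value, "start": ts, "end": OPEN_END}
            history.append(new_row)  # the new open row is the latest state
            open_rows[key] = new_row
    return history
```

Processing the periods out of order would close rows with the wrong end timestamps, which is why the sketch sorts the periods before applying them.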

5. The method of claim 1, wherein the performing a zipper operation on the second change data and the historical data in the data warehouse comprises:

locating a modification position of the second change data in the historical data based on an index built in advance in the data warehouse; and

performing the zipper operation on the second change data and the historical data in the data warehouse according to the locating result.

6. The method of claim 5, wherein the index comprises two levels, wherein:

the first-level index is a bucket index of the data warehouse; and

the second-level index is a step index corresponding to each storage bucket, and indicates a plurality of step coding intervals in that storage bucket.

7. The method of claim 6, wherein the locating a modification position of the second change data in the historical data based on an index built in advance in the data warehouse comprises:

determining a target storage bucket corresponding to the second change data from a plurality of storage buckets according to the first-level index;

determining, according to the second-level index, a target step coding interval corresponding to the second change data from the plurality of step coding intervals corresponding to the target storage bucket; and

determining the modification position within the target step coding interval according to the file name of the second change data.
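By way of example only, the two-level lookup of claims 6 and 7 can be sketched as below. Several details are assumptions that the claims do not state: the bucket is chosen by a CRC32 hash of a record key, each bucket stores the sorted lower bounds of its step coding intervals, the step code is an integer, and each interval maps file names to positions. The names `bucket_of`, `locate`, `starts` and `intervals` are hypothetical.

```python
import bisect
import zlib


def bucket_of(record_key: str, n_buckets: int) -> int:
    """First-level index (assumed): stable hash of the key modulo the bucket count."""
    return zlib.crc32(record_key.encode("utf-8")) % n_buckets


def locate(index: list, record_key: str, step_code: int, file_name: str):
    """Return the modification position for a change record, or None if not found.

    index: one dict per bucket, shaped like
           {"starts": sorted interval lower bounds,
            "intervals": one {file name: position} dict per interval}.
    """
    # Level 1: pick the target storage bucket.
    bucket = index[bucket_of(record_key, len(index))]
    # Level 2: binary-search the step coding interval that contains step_code.
    i = bisect.bisect_right(bucket["starts"], step_code) - 1
    if i < 0:
        return None  # step_code lies below every interval in this bucket
    # Final step: resolve the position inside the interval by file name.
    return bucket["intervals"][i].get(file_name)
```

Narrowing the search to one bucket and then to one interval means only a small part of the historical data has to be examined for each change record.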

8. An apparatus for data synchronization, comprising:

an acquisition module, configured to acquire a service change log in real time and determine one or more pieces of first change data corresponding to the service change log;

a determining module, configured to determine a timestamp of each piece of first change data, and to determine, based on the timestamps, a first minimum timestamp corresponding to a current preset time period; and

an updating module, configured to: in response to the first minimum timestamp not reaching a preset time node, store the first change data corresponding to the current preset time period in an intermediate database according to the timestamps; and, in response to the first minimum timestamp reaching the preset time node, acquire, from the intermediate database, second change data corresponding to a historical preset time period before the current preset time period, and perform a zipper operation on the second change data and historical data in a data warehouse, so as to synchronize the historical data using the second change data.

9. An electronic device for data synchronization, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.

11. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-7.