CN119474089A - Data replication method, device, electronic device and storage medium

Info

Publication number: CN119474089A
Application number: CN202411514062.4A
Authority: CN
Inventors: 李培林
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2024-10-28
Filing date: 2024-10-28
Publication date: 2025-02-18

Abstract

The embodiments of the present application provide a data replication method, device, electronic device and storage medium, belonging to the field of big data technology. The method includes: obtaining configuration information, where the configuration information includes data information of a source data table and an identifier of a target library, and the data information includes a first table name and a table structure of the source data table; processing the table structure according to a first rule to obtain a first identifier; matching the first table name with a second table name in a metadata record table to obtain a target record item, where the metadata record table includes multiple record items, each record item includes a second table name and a second identifier, and the second identifier is obtained by processing the table structure corresponding to the second table name according to the first rule; if the second identifier of the target record item is different from the first identifier, creating a first target table for the source data table; and copying the data of the source data table to the first target table. By copying the data of the source data table to the first target table, data synchronization between the source data table and the first target table is achieved.

Description

Data copying method, device, electronic equipment and storage medium

Technical Field

The present application relates to the field of big data technologies, and in particular, to a data replication method, a device, an electronic apparatus, and a storage medium.

Background

Hive and Impala are two common tools in the big data field. In a common scheme that uses both, Hive serves as a data warehouse for data storage and batch processing, while Impala is used for real-time query and interactive analysis and obtains faster query response times by executing queries directly on Hadoop data nodes. The two tools can work cooperatively to provide a comprehensive data analysis solution. Creating Impala tables whose data file locations point to Hive files is a common derivative solution, but it is prone to data consistency problems when multiple Impala tables operate on the same Hive data.

Disclosure of Invention

The main purpose of the embodiments of the present application is to provide a data replication method, device, electronic device and storage medium that can solve the problem of data inconsistency when multiple Impala tables operate on the same Hive data.

To achieve the above object, a first aspect of an embodiment of the present application provides a data replication method, including:

acquiring configuration information, wherein the configuration information comprises data information of a source data table and identification of a target library, and the data information comprises a first table name and a table structure of the source data table;

processing the table structure according to a first rule to obtain a first identifier;

Matching the first table name with a second table name in the metadata record table to obtain a target record item, wherein the metadata record table comprises a plurality of record items, each record item comprises the second table name and a second identifier, the second identifier is obtained by processing a table structure corresponding to the second table name according to the first rule, and the second table name in the target record item is the same as the first table name;

If the second identifier of the target record item is different from the first identifier, a first target table is created for the source data table, wherein the first target table is a table in the target library, and the source data table and the first target table have the same table structure;

copying the data of the source data table into the first target table.

In some embodiments, if the second identifier of the target record is different from the first identifier, creating a first target table for the source data table includes:

If the second identifier of the target record item is different from the first identifier and the target record item meets a first condition, creating the first target table for the source data table, and a data partition corresponding to the first target table, wherein the table structure of the source data table is the same as that of the first target table, and the first condition comprises that the target record item does not comprise the identifier of the target table in the target library;

Recording the table name of the first target table and partition information of a data partition corresponding to the first target table in the metadata record table;

The copying the data of the source data table into the first target table comprises:

And copying the data in the data partition corresponding to the source data table into the data partition corresponding to the first target table.

In some embodiments, the recording, in the metadata record table, the table name of the first target table and partition information of the data partition corresponding to the first target table includes:

If the target record item does not comprise the identification of the table in the target library, recording the table name of the first target table and partition information of a data partition corresponding to the first target table in the target record item;

And if the target record item comprises the identifier of the table in the target library, a record item is newly added in the metadata record table, wherein the newly added record item comprises the first table name, the first identifier, the table name of the first target table and partition information of a data partition corresponding to the first target table.

In some embodiments, in the event that the target entry does not include an identification of a table in the target library, the table name of the first target table is the same as the first table name;

And under the condition that the target record item comprises the identification of the table in the target library, the table name of the first target table adopts a temporary table name, and the temporary table name is determined according to the first table name and the time for creating the first target table.

In some embodiments, after the copying the data of the source data table into the first target table, the method further comprises:

Recording partition information of the first target table in a partition information table in the case that the target record item does not include an identification of the target table in the target library;

In the case that the target record item includes an identification of a table in the target library, deleting partition information of the old table from the partition information table, deleting the target record item and the data stored in the data partition corresponding to the old table, modifying the temporary table name in the newly added record item in the metadata record table into the first table name, and recording partition information of the first target table in the partition information table;

the partition information table comprises a plurality of information items, and each information item is used for describing the storage address and/or the stored data of the corresponding data partition of one table.
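As a non-limiting illustration, the post-copy bookkeeping described above can be sketched as follows. The function name `finalize_after_copy`, the dictionary keys, and the use of in-memory dictionaries as stand-ins for the metadata record table, the partition information table and the stored partition data are all assumptions made for this sketch, not part of the described method.

```python
def finalize_after_copy(metadata, partition_table, storage,
                        target_record, new_record, first_table_name):
    """Update the bookkeeping tables after data has been copied.

    metadata:        list of record items (dicts), the metadata record table
    partition_table: dict mapping a table name to its partition information
    storage:         dict mapping partition information to stored rows
    new_record:      the newly added record item, or None when no old table existed
    """
    if new_record is None:
        # No table existed in the target library before: only record the
        # partition information of the first target table.
        name = target_record["target_table_name"]
        partition_table[name] = target_record["partition_info"]
        return

    # An old table existed: drop its partition information, its stored data
    # and the target record item that described it.
    partition_table.pop(target_record["target_table_name"], None)
    storage.pop(target_record["partition_info"], None)
    metadata.remove(target_record)

    # Rename the temporary table to the first table name and register
    # the partition information of the first target table.
    new_record["target_table_name"] = first_table_name
    partition_table[first_table_name] = new_record["partition_info"]


# Hypothetical example data: an old table plus a freshly copied temporary table.
metadata = [
    {"table_name": "user_basic", "identifier": "old",
     "target_table_name": "user_basic", "partition_info": "p_old"},
    {"table_name": "user_basic", "identifier": "new",
     "target_table_name": "user_basic_081256", "partition_info": "p_new"},
]
partition_table = {"user_basic": "p_old"}
storage = {"p_old": ["stale row"], "p_new": ["fresh row"]}

finalize_after_copy(metadata, partition_table, storage,
                    metadata[0], metadata[1], "user_basic")
```

In this sketch the rename happens only after the old record item and its data have been removed, so at no point do two record items carry the same target table name.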

In some embodiments, the target record further includes partition information of a data partition corresponding to a second target table, where the second target table is a table in the target library;

after the matching of the first table name with the second table name in the metadata record table to obtain the target record item, the method further includes:

if the second identifier of the target record item is the same as the first identifier, partition information of a data partition corresponding to the second target table is obtained from the target record item;

And copying the data in the data partition corresponding to the source data table into the data partition corresponding to the second target table.

In some embodiments, the source data table is a Hive table and the first target table is an Impala table.

To achieve the above object, a second aspect of an embodiment of the present application provides a data copying apparatus, including:

The first acquisition module is used for acquiring configuration information, wherein the configuration information comprises data information of a source data table and identification of a target library, and the data information comprises a first table name and a table structure of the source data table;

the second acquisition module is used for processing the table structure according to a first rule to obtain a first identifier;

The matching module is used for matching the first table name with a second table name in the metadata record table to obtain a target record item, the metadata record table comprises a plurality of record items, each record item comprises the second table name and a second identifier, the second identifier is obtained by processing a table structure corresponding to the second table name according to the first rule, and the second table name in the target record item is the same as the first table name;

The creating module is configured to create a first target table for the source data table if the second identifier of the target record item is different from the first identifier, where the first target table is a table in the target library, and the source data table and the first target table have the same table structure;

and the data copying module is used for copying the data of the source data table into the first target table.

To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, including a memory storing a computer program and a processor implementing the method according to the first aspect when the processor executes the computer program.

To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.

According to the data replication method, device, electronic device and storage medium provided by the present application, configuration information is obtained, where the configuration information includes data information of a source data table and an identifier of a target library, and the data information includes a first table name and a table structure of the source data table; the table structure is processed according to a first rule to obtain a first identifier; the first table name is matched with a second table name in the metadata record table to obtain a target record item, where the metadata record table includes a plurality of record items, each record item includes a second table name and a second identifier, the second identifier is obtained by processing the table structure corresponding to the second table name according to the first rule, and the second table name in the target record item is the same as the first table name; if the second identifier of the target record item is different from the first identifier, a first target table is created for the source data table, where the first target table is a table in the target library and the source data table and the first target table have the same table structure; and the data of the source data table is copied into the first target table. Through the above process, when the second identifier in the target record item is different from the first identifier (that is, when the table structure of the source data table has changed), the data of the source data table is copied into the first target table, so that data synchronization between the source data table and the first target table is achieved. When the scheme is applied to a big data setting that uses Hive and Impala, the problem of inconsistent data between the Hive table and the Impala table can be avoided.

Drawings

FIG. 1 is a flow chart of a data replication method according to an embodiment of the present application;

FIG. 2 is a flowchart of step S104 in FIG. 1 according to an embodiment of the present application;

FIG. 3 is a flowchart of step S1042 in FIG. 2 according to an embodiment of the present application;

FIG. 4 is another flowchart of a data replication method according to an embodiment of the present application;

FIG. 5 is a flow chart of a data replication method according to an embodiment of the present application;

FIG. 6 is a flow chart of a data replication method according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a data replication device according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.

First, several terms used in the present application are explained:

Hive: a Hadoop-based data warehouse tool for data extraction, transformation and loading, providing a mechanism to store, query and analyze large-scale data stored in Hadoop. Hive can map a structured data file to a database table, provide SQL query functions, and convert SQL statements into MapReduce tasks for execution. Its advantages are a low learning cost and the ability to implement MapReduce statistics quickly through SQL-like statements, which makes MapReduce simpler to use without developing a dedicated MapReduce application; Hive is therefore well suited to statistical analysis on a data warehouse;

Hive table: a table in Hive;

Impala: a massively parallel processing (Massively Parallel Processing, MPP) SQL query engine for processing large amounts of data stored in Hadoop clusters. Impala is a new query system developed by Cloudera that provides SQL semantics for querying big data stored in Hadoop's distributed file system (Hadoop Distributed File System, HDFS);

Impala table: a table in Impala.

Hive and Impala are two common tools in the big data field. In a common scheme that uses both, Hive serves as a data warehouse for data storage and batch processing, while Impala is used for real-time query and interactive analysis and obtains faster query response times by executing queries directly on Hadoop data nodes. The two tools can work cooperatively to provide a comprehensive data analysis solution. Creating an external table in Impala and pointing its data file location to a Hive file is a common derivative solution, but it easily causes data consistency problems when multiple Impala tables operate on the same Hive data. In addition, when Hive field information or metadata information changes, the change cannot be synchronized to Impala, and an Impala table often has to be built manually according to the metadata change. That is, the existing way of using Hive tables together with Impala tables is prone to data inconsistency.

In order to solve the above problems, the embodiments of the present application provide a data replication method, apparatus, electronic device, and storage medium.

The application is operational with numerous general purpose or special purpose computer system environments or configurations. Examples include personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

It should be noted that, in each specific embodiment of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.

Fig. 1 is a flowchart of a data replication method according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S105, where:

Step S101, configuration information is obtained, wherein the configuration information comprises data information of a source data table and identification of a target library, and the data information comprises a first table name and a table structure of the source data table.

The table structure may include field names and field types of the source data table. The source data table may be a table in a source database, which is a different database than the target database.

In the embodiment of the present application, the source data table may be a table in hive, and the first target table may be a table in impala.

Step S102, processing the table structure according to a first rule to obtain a first identifier.

The first rule may be set according to the actual situation and is not limited herein. For example, the first rule may be to process the table structure with message digest algorithm 5 (Message-Digest Algorithm 5, MD5); the hash value obtained after the processing is called the first identifier. Specifically, each field name and field type of the table structure may be spliced to obtain a spliced character string, and the spliced character string is processed with MD5 to obtain the first identifier.

The first rule may also be to process the table structure together with partition information of the data partition of the source data table by using MD5, where the data partition of the source data table refers to the storage area that stores the data of the source data table, and the partition information describes the storage address and/or the stored data of the data partition. For example, each field name, each field type and the partition information are spliced to obtain a spliced character string, and the spliced character string is processed with MD5 to obtain the first identifier.

In the foregoing, MD5 is a widely used cryptographic hash function that generates a 128-bit (16-byte) hash value (also called a digest), which can be used to check that information is transferred completely and consistently. In this embodiment, MD5 may be replaced by another algorithm for generating a hash value.

It should be noted that the first rule must take at least the table structure of the source data table as input, so that whether the table structure of the source data table has changed can be determined conveniently by comparing the generated hash values (i.e. the identifiers).
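As a non-limiting illustration, one possible form of the first rule can be sketched as follows. The function name `schema_fingerprint`, the `:`, `,` and `|` separators used when splicing, and the example column and partition values are assumptions made for this sketch; the description only requires that the field names and field types (and optionally the partition information) be spliced and hashed.

```python
import hashlib


def schema_fingerprint(columns, partition_info=None):
    """Return an MD5 hex digest computed from a table structure.

    columns:        list of (field_name, field_type) pairs in table order
    partition_info: optional string describing the data partition
    """
    # Splice every field name and field type into a single string.
    spliced = ",".join(f"{name}:{ftype}" for name, ftype in columns)
    # Optionally append the partition information (second variant of the rule).
    if partition_info is not None:
        spliced += "|" + partition_info
    return hashlib.md5(spliced.encode("utf-8")).hexdigest()


# Hypothetical table structures before and after a column is added.
before = schema_fingerprint([("id", "bigint"), ("name", "string")])
after = schema_fingerprint([("id", "bigint"), ("name", "string"), ("age", "int")])
with_partition = schema_fingerprint([("id", "bigint"), ("name", "string")],
                                    partition_info="/warehouse/user_basic")
```

Because the digest depends on the order in which fields are spliced, a sketch like this treats a reordering of columns as a structure change; whether that is desired depends on how the first rule is set in practice.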

Step S103, matching the first table name with a second table name in the metadata record table to obtain a target record item, wherein the metadata record table comprises a plurality of record items, each record item comprises the second table name and a second identifier, the second identifier is obtained by processing a table structure corresponding to the second table name according to the first rule, and the second table name in the target record item is identical to the first table name.

For convenience of description, the table name in a record item is referred to as the second table name, and the identifier obtained by processing the table structure corresponding to the second table name according to the first rule is referred to as the second identifier. It should be noted that a record item may also include an identification of a table in the target library (for example, the identification may be a table name, which may be the same as the second table name in the record item), but not every record item includes an identification of a table in the target library. For example, if a Hive table has a corresponding Impala table, the record item may include the table name of the Hive table, the second identifier, the table name of the Impala table, and partition information of the data partition of the Impala table.

The metadata record table is searched according to the first table name to find the record item containing a second table name identical to the first table name; this record item is called the target record item.
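As a non-limiting illustration, a record item and the lookup by first table name can be sketched as follows. The class name `RecordItem`, its field names, the function `find_target_record` and the example values are assumptions made for this sketch; an actual metadata record table would typically be a database table rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecordItem:
    second_table_name: str
    second_identifier: str
    # Absent when no table has been created in the target library yet.
    target_table_name: Optional[str] = None
    partition_info: Optional[str] = None


def find_target_record(records: List[RecordItem],
                       first_table_name: str) -> Optional[RecordItem]:
    """Return the first record item whose second table name equals the first table name."""
    for item in records:
        if item.second_table_name == first_table_name:
            return item
    return None


# Hypothetical metadata record table with two record items.
records = [
    RecordItem("orders", "a3f1", target_table_name="orders", partition_info="p_orders"),
    RecordItem("user_basic", "9c2e"),
]
target = find_target_record(records, "user_basic")
missing = find_target_record(records, "payments")
```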

Step S104, if the second identifier of the target record item is different from the first identifier, a first target table is created for the source data table, wherein the first target table is a table in the target library, and the source data table and the first target table have the same table structure.

After the target record item is found, the second identifier in the target record item is compared with the first identifier. If the two are the same, the table structure of the source data table is consistent with the record in the metadata record table, which indicates that the table structure of the source data table has not changed. If the two are different, the table structure of the source data table differs from the record in the metadata record table, which indicates that the table structure has changed and a first target table needs to be created for the source data table. For example, if the source data table is a Hive table and the second identifier in the target record item is different from the first identifier, an Impala table (namely the first target table) is created for the Hive table, and the table structure of the created Impala table is the same as that of the Hive table, including the same field names and field types.
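As a non-limiting illustration, the comparison of identifiers and the creation of a table with the same field names and field types can be sketched as follows. The function names, the library and table names, and the simplified `CREATE TABLE` statement are assumptions made for this sketch; a real statement would usually also carry storage format and partition clauses, and the sketch assumes that the field types can be used unchanged in the target library.

```python
def needs_new_target_table(first_identifier: str, second_identifier: str) -> bool:
    """A differing identifier means the table structure of the source data table changed."""
    return first_identifier != second_identifier


def build_create_table_sql(target_library: str, table_name: str, columns) -> str:
    """Build a simplified CREATE TABLE statement mirroring the source table structure.

    columns: list of (field_name, field_type) pairs taken from the source data table
    """
    column_sql = ", ".join(f"`{name}` {ftype}" for name, ftype in columns)
    return f"CREATE TABLE `{target_library}`.`{table_name}` ({column_sql})"


# Hypothetical identifiers and table structure.
source_columns = [("id", "bigint"), ("name", "string")]
changed = needs_new_target_table("9c2e", "a3f1")
sql = build_create_table_sql("impala_db", "user_basic", source_columns) if changed else None
```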

Step S105, copying the data of the source data table to the first target table.

The source data table and the first target table are respectively based on different HDFS, so that data coupling of the source data table and the first target table can be avoided. The data of the source data table may be business data, transaction data, payment data, etc. in the financial field, or may be data in other fields, which is not limited herein.

Specifically, the data of the source data table is stored in the data partition of the source data table, the first target table also has a corresponding data partition, and when the data is copied, the data in the data partition of the source data table is copied into the data partition of the first target table, so that the data copy of the source data table is realized.
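As a non-limiting illustration, the partition-to-partition copy can be sketched as follows. Local directories are used here as stand-ins for the data partitions, which is an assumption made purely so that the sketch can run without a cluster; the function name `copy_partition` and the file names are likewise hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_partition(source_partition: Path, target_partition: Path) -> int:
    """Copy every data file of the source partition into the target partition.

    Returns the number of files copied.
    """
    target_partition.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in sorted(source_partition.iterdir()):
        if path.is_file():
            shutil.copy2(path, target_partition / path.name)
            copied += 1
    return copied


with tempfile.TemporaryDirectory() as root:
    # Hypothetical data partitions of the source data table and the first target table.
    source_dir = Path(root) / "source_library" / "user_basic"
    target_dir = Path(root) / "target_library" / "user_basic"
    source_dir.mkdir(parents=True)
    (source_dir / "part-00000").write_text("1,alice\n2,bob\n")

    copied_files = copy_partition(source_dir, target_dir)
    copied_text = (target_dir / "part-00000").read_text()
```

Because the files are copied rather than referenced, later changes to the source partition do not affect the target partition until the next copy.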

In steps S101 to S105 illustrated in this embodiment, configuration information is obtained, where the configuration information includes data information of a source data table and an identifier of a target library, and the data information includes a first table name and a table structure of the source data table; the table structure is processed according to a first rule to obtain a first identifier; the first table name is matched with a second table name in the metadata record table to obtain a target record item, where the metadata record table includes a plurality of record items, each record item includes a second table name and a second identifier, the second identifier is obtained by processing the table structure corresponding to the second table name according to the first rule, and the second table name in the target record item is the same as the first table name; if the second identifier of the target record item is different from the first identifier, a first target table is created for the source data table, where the first target table is a table in the target library and the source data table and the first target table have the same table structure; and the data of the source data table is copied into the first target table. Through the above process, when the second identifier in the target record item is different from the first identifier (that is, when the table structure of the source data table has changed), the data of the source data table is copied into the first target table, so that data synchronization between the source data table and the first target table is achieved. When the scheme is applied to a big data setting that uses Hive and Impala, the problem of inconsistent data between the Hive table and the Impala table can be avoided.

Referring to fig. 2, in some embodiments, step S104 may include, but is not limited to, steps S1041 to S1042:

Step S1041, when the second identifier of the target record is different from the first identifier and the target record meets a first condition, creating the first target table for the source data table, and a data partition corresponding to the first target table, where the first condition includes that the target record does not include the identifier of the table in the target library, or that the target record includes the identifier of the table in the target library, and the table structure of the source data table is the same as the table structure of the first target table.

If the corresponding target table is created for the first table name corresponding to the source data table, the created identifier of the target table (which may also be a table name) is recorded in the target record item, in which case the target record item includes the identifier of the target table in the target library, and if the corresponding target table is not created for the first table name corresponding to the source data table, the target table corresponding to the first table name is not recorded in the target record item, in which case the target record item does not include the identifier of the target table in the target library.

When the first target table is created, the table structure of the first target table is created according to the table structure of the source data table, and the table name of the first target table can be set to be the first table name, so that a user can use the source data table and the first target table which are located in different databases according to the same table name.

The data partition corresponding to the first target table is used for storing the data of the first target table.

In step S1042, the table name of the first target table and the partition information of the data partition corresponding to the first target table are recorded in the metadata record table.

The partition information of the data partition corresponding to the first target table is used for describing the storage address and/or the stored data of the data partition. The partition information of the data partition corresponding to the first target table may include an identifier of the data partition, and a storage address of the data partition may be determined according to the identifier.

Accordingly, step S105, the copying the data of the source data table to the first target table includes:

step S1051, copying the data in the data partition corresponding to the source data table to the data partition corresponding to the first target table.

In steps S1041 to S1042 and step S1051 illustrated in this embodiment, when the second identifier of the target record item is different from the first identifier and the target record item satisfies the first condition, the table structure of the source data table has changed. In this case, regardless of whether a corresponding table has already been created for the source data table, a first target table needs to be created for the source data table, where the first target table is the table in the target library corresponding to the source data table. The data in the data partition corresponding to the source data table is then copied into the data partition corresponding to the first target table, so that data synchronization between the source data table and the first target table is achieved. When the scheme is applied to a big data setting that uses Hive and Impala, the problem of inconsistent data between the Hive table and the Impala table can be avoided.

Referring to fig. 3, in some embodiments, step S1042 of recording, in the metadata record table, the table name of the first target table and partition information of the data partition corresponding to the first target table may include, but is not limited to, steps S10421 to S10422:

In step S10421, if the target record item does not include the identifier of the table in the target library, the table name of the first target table and the partition information of the data partition corresponding to the first target table are recorded in the target record item.

The target record item does not include the identifier of the table in the target library, which indicates that the corresponding table has not been created for the source data table, in this case, after the corresponding first target table is created for the source data table, the table name of the first target table and the partition information of the data partition corresponding to the first target table are recorded in the target record item, and then when the target record item is found in the metadata record table according to the first table name again, the history record of the first target table created for the first table name can be known, which is favorable for data consistency.

In step S10422, if the target record item includes the identifier of a table in the target library, a record item is newly added in the metadata record table, where the newly added record item includes the first table name, the first identifier, the table name of the first target table, and partition information of the data partition corresponding to the first target table.

If the target record item includes the identifier of a table in the target library, a corresponding table has already been created for the source data table. In this case, because the table structure of the source data table has changed, the previously created table can no longer be used, and a new first target table needs to be created for the source data table. A record item is then added to the metadata record table, and the newly added record item includes the first table name, the first identifier, the table name of the first target table, and the partition information of the data partition corresponding to the first target table. When the metadata record table is later searched according to the first table name, the history of the first target table created for the first table name can be obtained, which helps maintain data consistency.
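The two recording branches of steps S10421 and S10422 can be sketched roughly as follows. This is a minimal illustration rather than the implementation of the application: the `RecordItem` class, its field names, and the in-memory list standing in for the metadata record table are all hypothetical, and refreshing the stored identifier in the S10421 branch is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecordItem:
    """One record item of the metadata record table (hypothetical field names)."""
    source_table: str                    # second table name
    structure_id: str                    # second identifier
    target_table: Optional[str] = None   # identifier of the table in the target library
    partitions: List[str] = field(default_factory=list)


def record_first_target_table(records, target_item, first_id, target_table, partitions):
    """Apply step S10421 or S10422 and return the record item that was written."""
    if target_item.target_table is None:
        # S10421: no table exists yet, so the matched record item is filled in.
        # Assumption: the stored identifier is refreshed at the same time.
        target_item.structure_id = first_id
        target_item.target_table = target_table
        target_item.partitions = list(partitions)
        return target_item
    # S10422: a table already exists, so a new record item is appended
    # and the old one is kept as history until it is deleted later.
    new_item = RecordItem(target_item.source_table, first_id, target_table, list(partitions))
    records.append(new_item)
    return new_item
```

Keeping the old record item in place in the second branch mirrors the description above, where the old table's record is only removed after the data has been copied.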

In the foregoing, if the target record item does not include the identifier of a table in the target library, no table in the target library has previously been created for the first table name. In this case, the first target table is created directly, and its table name is set to be the same as the first table name, so that the user can use the source data table and the first target table, which are located in different databases, by the same table name.

If the target record item includes the identifier of a table in the target library, a table in the target library has already been created for the first table name. In this case, because the table structure of the source data table has changed, the previously created table can no longer be used, and a corresponding table needs to be created again for the source data table. To distinguish it from the previously created table (whose table name is the same as the first table name), the first target table created here is given a temporary table name, which is determined according to the first table name and the time at which the first target table is created. For example, if the first table name is "user basic table" and the first target table is created at "8:12:56", the two are concatenated to obtain the temporary table name "user basic table+8:12:56". After the target record item is deleted (i.e., after the information about the previously created table is deleted from the metadata record table), the temporary table name may be changed to the first table name, so that the user can use the source data table and the first target table, which are located in different databases, by the same table name.
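The naming rule described above can be illustrated with a short sketch. The function name, the underscore separator and the `%H%M%S` time format are assumptions made for the example; the application only states that the temporary name is derived from the first table name and the creation time.

```python
from datetime import datetime
from typing import Optional


def first_target_table_name(first_table_name: str,
                            has_old_table: bool,
                            created_at: Optional[datetime] = None) -> str:
    """Pick the name of the first target table.

    Without an old table the first table name is reused; otherwise a
    temporary name is built from the first table name and the creation time.
    """
    if not has_old_table:
        return first_table_name
    created_at = created_at or datetime.now()
    # Assumed format: <first table name>_<HHMMSS>
    return f"{first_table_name}_{created_at.strftime('%H%M%S')}"


print(first_target_table_name("user_basic", True, datetime(2024, 10, 28, 8, 12, 56)))
# prints user_basic_081256
```

A time-only suffix can collide across days; a real deployment would probably include the date as well, which is a design choice the application leaves open.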

Referring to fig. 4, after copying the data of the source data table into the first target table in step S105, the method further includes step S106 or step S106':

Step S106, in the case that the target record item does not include the identifier of a table in the target library, recording the partition information of the first target table in a partition information table;

Step S106', in the case that the target record item includes the identifier of a table in the target library, deleting the partition information of the old table from the partition information table, deleting the target record item and the data stored in the data partition corresponding to the old table, modifying the temporary table name in the newly added record item in the metadata record table to the first table name, and recording the partition information of the first target table in the partition information table. The old table is the table in the target library included in the target record item, that is, the table most recently created for the first table name; deleting it may include deleting its definition, for example its table name and table structure.

The partition information table includes a plurality of information items. Each information item may include a unique identifier of a table (for example, the name of the database where the table is located plus the table name) and the partition information of the table. The partition information records the storage address of each data partition of the corresponding table and/or information about the data stored at each storage address, such as data size and storage time.

Partition information of the source data table is also recorded in the partition information table, and the data of the source data table can be obtained according to this partition information.

As described above, the partition information of the first target table is recorded in the partition information table. When the data of the first target table is later acquired, its data storage address can be determined from the record in the partition information table, and the data is read from that address.
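The bookkeeping of steps S106 and S106' might look like the sketch below. It is an illustration under assumptions: the metadata record table is modelled as a list of dicts, the partition information table as a dict keyed by "database name.table name", and all key names are hypothetical. Physically dropping the old table and its data files is outside the sketch.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]


def finish_replication(meta_records: List[Record],
                       parts_info: Dict[str, List[str]],
                       target_db: str,
                       first_table_name: str,
                       old_item: Optional[Record],
                       new_item: Optional[Record],
                       partitions: List[str]) -> None:
    """Update both tables after the data copy (S106 or S106')."""
    key = f"{target_db}.{first_table_name}"
    if old_item is None:
        # S106: no old table existed, just record the new partition information.
        parts_info[key] = list(partitions)
        return
    # S106': remove the old table's partition information and its record item.
    parts_info.pop(f"{target_db}.{old_item['target_table']}", None)
    meta_records.remove(old_item)
    # Replace the temporary table name with the first table name.
    new_item["target_table"] = first_table_name
    parts_info[key] = list(partitions)
```

Here `old_item` is passed as `None` when the matched record item held no target table before replication started, which is how the sketch distinguishes the two cases.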

Referring to fig. 5, in some embodiments of the present application, the target record item further includes partition information of a data partition corresponding to a second target table, where the second target table is a table in the target library. Accordingly, after the first table name is matched with the second table name in the metadata record table in step S103 to obtain the target record item, the method further includes step S107 and step S108:

Step S107, if the second identifier of the target record item is the same as the first identifier, obtaining partition information of a data partition corresponding to the second target table from the target record item;

Step S108, copying the data in the data partition corresponding to the source data table into the data partition corresponding to the second target table.

If the second identifier of the target record item is the same as the first identifier, the table structure of the source data table has not changed, and the data can be copied directly. The data partition corresponding to the source data table may be determined according to the partition information table so that the data can be read, and the partition information of the data partition corresponding to the second target table is obtained from the target record item. The data of the source data table is then copied into the data partition corresponding to the second target table, which synchronizes the data of the source data table and the second target table.
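Taken together, the comparison of the two identifiers and the first condition select one of three paths. The sketch below only names the path; the function name and the returned labels are hypothetical. The application does not describe the case in which the identifiers match but no table exists in the target library, so the sketch treats matching identifiers as the direct-copy path.

```python
from typing import Optional


def choose_action(first_id: str, second_id: str, target_table: Optional[str]) -> str:
    """Return which branch of the method applies to a matched record item."""
    if second_id == first_id:
        # Steps S107 and S108: structure unchanged, copy into the second target table.
        return "copy_to_second_target_table"
    if target_table is None:
        # Structure differs and no table exists yet: create it under the first table name.
        return "create_first_target_table"
    # Structure differs and an old table exists: create under a temporary name.
    return "create_temporary_first_target_table"


print(choose_action("abc", "abc", "user_basic"))  # prints copy_to_second_target_table
```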

Taking data export from a hive table to an impala table as an example, the data replication method provided by the embodiment of the present application is described below. Fig. 6 is a flowchart of the data replication method provided by the embodiment of the present application:

(1) A spark execution environment is created, the configuration information is read, and whether the configuration information is correct or not is checked, for example, whether a data source is hive or not, and whether a written target library is impala or not.

(2) Querying the table structure of the hive table, parsing all field names, field types and partition information, and calculating an MD5 value (i.e., the first identifier) after concatenating and sorting the field names, field types and partition information. Querying the MD5 value (i.e., the second identifier) corresponding to the table name of the hive table from the metadata record table (table_meta_info), and judging whether the two MD5 values are consistent:

1) If the two values are consistent, the table structure of the hive table is unchanged, and the existing impala table does not need to be adjusted;

2) If the two values are inconsistent and the hive table has no corresponding impala table, writing the current table metadata (for example, the table name of the hive table) into the metadata record table, creating an impala table and a data partition corresponding to the hive table, and setting the metadata change flag update_flag=true;

3) If the two values are inconsistent and an impala table corresponding to the hive table exists, the table structure of the hive table has changed; creating an impala temporary table and a data partition corresponding to the hive table (the temporary table is named with the hive table name plus a time suffix), and setting the metadata change flag update_flag=true;

(3) Copying the partition files of the hive table to the HDFS partition directory corresponding to the impala table;

(4) Judging whether the metadata is updated according to the update_flag, wherein the method specifically comprises the following steps:

1) If the metadata has changed (update_flag=true), deleting the historical partition synchronization records of the hive table (i.e., the partition information of the old table) from the partition information table (table_parts_info), deleting the old impala table and its data files by using a drop command, then using an alter rename command to change the table name of the impala temporary table to the formal table name (i.e., the first table name), recording the partition information corresponding to the impala table in the partition information table, and refreshing the metadata of the impala table (i.e., updating the records in the metadata record table and modifying the temporary table name recorded in the metadata record table to the first table name).

2) If the metadata has not changed (update_flag=false), querying the partition information table (table_parts_info) to determine whether a record of the current partition of the table being synchronized (i.e., the partition information of the impala table) exists. If it exists, the metadata is unchanged and the partition information of the table is already recorded, so no operation is needed; if it does not exist, the partition information corresponding to the impala table is updated and the metadata of the impala table is refreshed.
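The MD5 calculation in step (2) of the example above can be sketched as follows. The exact concatenation format is not given in the application, so the `name:type` encoding, the `partition:` prefix, the comma separator and the function name are assumptions; only the use of MD5 over the sorted, concatenated field names, field types and partition information comes from the description.

```python
import hashlib
from typing import Iterable, Tuple


def structure_md5(fields: Iterable[Tuple[str, str]],
                  partition_columns: Iterable[str]) -> str:
    """Compute an identifier for a table structure.

    fields: (field name, field type) pairs of the table.
    partition_columns: names of the partition columns.
    """
    parts = [f"{name}:{ftype}" for name, ftype in fields]
    parts += [f"partition:{column}" for column in partition_columns]
    # Sorting makes the identifier independent of the order in which
    # the fields are listed by the metadata query.
    joined = ",".join(sorted(parts))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


old_id = structure_md5([("id", "bigint"), ("name", "string")], ["dt"])
new_id = structure_md5([("id", "bigint"), ("name", "varchar(64)")], ["dt"])
print(old_id != new_id)  # prints True
```

Because the parts are sorted before hashing, reordering columns alone does not change the identifier under this sketch; if column order matters for the target table, the sort would have to be dropped.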

In addition, each time data is exported, the scheme automatically detects metadata change information of the hive table (i.e., detects whether the table structure and partition information of the hive table have changed). If a change is detected, the corresponding impala table is re-created according to the change information without manual intervention, which ensures data consistency while improving the efficiency of maintaining it.
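For the table swap in step (4) 1) of the example, the drop and rename commands could be generated as in the sketch below. The application only mentions a drop command and an alter rename command; the exact statement text, the `IF EXISTS` clause and the function name are assumptions, and identifier quoting and statement execution are left out.

```python
from typing import List


def swap_statements(target_db: str, formal_name: str, temp_name: str) -> List[str]:
    """Build the SQL that replaces the old impala table with the temporary one."""
    return [
        # Remove the old table that still uses the formal table name.
        f"DROP TABLE IF EXISTS {target_db}.{formal_name}",
        # Give the temporary table the formal table name.
        f"ALTER TABLE {target_db}.{temp_name} RENAME TO {target_db}.{formal_name}",
    ]


for statement in swap_statements("impala_db", "user_basic", "user_basic_081256"):
    print(statement)
```

The drop has to run before the rename, since both tables would otherwise claim the same name in the target library.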

Referring to fig. 7, an embodiment of the present application further provides a data replication device, which can implement the above data replication method, where the data replication device 900 includes:

a first obtaining module 901, configured to obtain configuration information, where the configuration information includes data information of a source data table and an identifier of a target library, and the data information includes a first table name and a table structure of the source data table;

a second obtaining module 902, configured to process the table structure according to a first rule to obtain a first identifier;

the matching module 903 is configured to match the first table name with a second table name in the metadata record table to obtain a target record item, where the metadata record table includes a plurality of record items, each record item includes a second table name and a second identifier, the second identifier is obtained by processing a table structure corresponding to the second table name according to the first rule, and the second table name in the target record item is the same as the first table name;

a creating module 904, configured to create a first target table for the source data table if the second identifier of the target record item is different from the first identifier, where the first target table is a table in the target library, and the source data table and the first target table have the same table structure;

a first data copying module 905, configured to copy the data of the source data table into the first target table.

In some embodiments, the creation module 904 includes:

The creating sub-module is used for creating the first target table for the source data table and a data partition corresponding to the first target table when the second identifier of the target record item is different from the first identifier and the target record item meets a first condition, wherein the first condition comprises that the target record item does not comprise the identifier of the table in the target library or that the target record item comprises the identifier of the table in the target library, and the table structure of the source data table is the same as that of the first target table;

A recording sub-module, configured to record, in the metadata record table, a table name of the first target table and partition information of a data partition corresponding to the first target table;

the first data copying module 905 is specifically configured to copy data in a data partition corresponding to the source data table to a data partition corresponding to the first target table.

In some embodiments, the recording sub-module includes:

The first recording unit is used for recording the table name of the first target table and the partition information of the data partition corresponding to the first target table in the target record item if the target record item does not comprise the identification of the table in the target library;

And the second recording unit is used for adding a record item in the metadata record table if the target record item comprises the identifier of the table in the target library, wherein the added record item comprises the first table name, the first identifier, the table name of the first target table and partition information of a data partition corresponding to the first target table.

In some embodiments, the data replication device 900 further comprises:

the first processing module is used for recording partition information of the first target table in a partition information table under the condition that the target record item does not comprise the identifier of a table in the target library;

The second processing module is configured to, under the condition that the target record item comprises the identifier of a table in the target library, delete partition information of an old table from the partition information table, delete the target record item and the data stored in the data partition corresponding to the old table, change the temporary table name in the newly added record item in the metadata record table into the first table name, and record partition information of the first target table in the partition information table, where the old table is the table in the target library included in the target record item;

The data replication device 900 further includes:

The third acquisition module is used for acquiring partition information of a data partition corresponding to the second target table from the target record item if the second identifier of the target record item is the same as the first identifier;

And the second data copying module is used for copying the data in the data partition corresponding to the source data table into the data partition corresponding to the second target table.

The data replication device 900 can implement the data replication method and achieve the same technical effects, and the description in the embodiment of the data replication method may be specifically referred to, which is not described herein again.

The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the data copying method when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.

Referring to fig. 8, fig. 8 is a schematic hardware structure of an electronic device according to an embodiment of the present application, where the electronic device includes:

The processor 1001 may be implemented by a general purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing related programs to implement the technical solutions provided by the embodiments of the present application;

Memory 1002 may be implemented in the form of read-only memory (Read Only Memory, ROM), static storage, dynamic storage, or random access memory (Random Access Memory, RAM). The memory 1002 may store an operating system and other application programs. When the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 1002 and invoked by the processor 1001 to execute the data replication method of the embodiments of the present application;

an input/output interface 1003 for implementing information input and output;

the communication interface 1004 is configured to implement communication interaction between the present device and other devices, and may implement communication in a wired manner (such as USB, network cable, etc.), or may implement communication in a wireless manner (such as mobile network, Wi-Fi, Bluetooth, etc.);

a bus 1005 for transferring information between the various components of the device (e.g., the processor 1001, memory 1002, input/output interface 1003, and communication interface 1004);

Wherein the processor 1001, the memory 1002, the input/output interface 1003, and the communication interface 1004 realize communication connection between each other inside the device through the bus 1005.

The embodiment of the application also provides a computer readable storage medium, which stores a computer program, and the computer program realizes the data copying method when being executed by a processor.

The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The present application also provides a computer program product comprising computer programs/instructions which when executed by one or more processors implement the steps of the data replication method of any of the embodiments of the present application.

The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.

The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or a similar expression means any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b or c may represent a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or plural.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. The storage medium includes various media capable of storing programs, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims

1. A data replication method, characterized in that the method comprises:

Acquiring configuration information, the configuration information including data information of a source data table and an identifier of a target library, the data information including a first table name and a table structure of the source data table;

Processing the table structure according to a first rule to obtain a first identifier;

Matching the first table name with a second table name in a metadata record table to obtain a target record item, wherein the metadata record table includes multiple record items, each record item includes a second table name and a second identifier, the second identifier is obtained by processing a table structure corresponding to the second table name according to the first rule, and the second table name in the target record item is the same as the first table name;

If the second identifier of the target record item is different from the first identifier, a first target table is created for the source data table, the first target table is a table in the target library, and the source data table and the first target table have the same table structure;

The data of the source data table is copied to the first target table.

2. The data replication method according to claim 1, wherein if the second identifier of the target record item is different from the first identifier, creating a first target table for the source data table comprises:

When the second identifier of the target record item is different from the first identifier and the target record item satisfies a first condition, the first target table and a data partition corresponding to the first target table are created for the source data table, and the table structure of the source data table is the same as the table structure of the first target table, wherein the first condition includes: the target record item does not include the identifier of the table in the target library; or the target record item includes the identifier of the table in the target library;

Recording the table name of the first target table and the partition information of the data partition corresponding to the first target table in the metadata record table;

The step of copying the data in the source data table to the first target table includes:

The data in the data partition corresponding to the source data table is copied to the data partition corresponding to the first target table.

3. The data replication method according to claim 2, characterized in that recording the table name of the first target table and the partition information of the data partition corresponding to the first target table in the metadata record table comprises:

If the target record item does not include the identifier of the table in the target library, recording the table name of the first target table and the partition information of the data partition corresponding to the first target table in the target record item;

If the target record item includes the identifier of the table in the target library, a new record item is added to the metadata record table, and the newly added record item includes the first table name, the first identifier, the table name of the first target table, and the partition information of the data partition corresponding to the first target table.

4. The data replication method according to claim 3, characterized in that, when the target record item does not include the identifier of the table in the target library, the table name of the first target table is the same as the first table name;

In the case where the target record item includes an identifier of a table in the target library, the table name of the first target table adopts a temporary table name, and the temporary table name is determined according to the first table name and the time when the first target table is created.

5. The data replication method according to claim 4, characterized in that after copying the data of the source data table to the first target table, the method further comprises:

In a case where the target record item does not include an identifier of a table in the target library, recording partition information of the first target table in a partition information table;

In the case where the target record item includes an identifier of a table in the target library, the partition information of the old table in the partition information table is deleted, the target record item and the data stored in the data partition corresponding to the old table are deleted, the temporary table name in the newly added record item in the metadata record table is modified to the first table name, and the partition information of the first target table is recorded in the partition information table, where the old table is the table in the target library included in the target record item;

The partition information table includes a plurality of information items, each of which is used to describe the storage address and/or stored data of a data partition corresponding to a table.

6. The data replication method according to claim 1, characterized in that the target record item further includes partition information of a data partition corresponding to a second target table, and the second target table is a table in the target library;

After matching the first table name with the second table name in the metadata record table to obtain the target record item, the method further includes:

If the second identifier of the target record item is the same as the first identifier, obtaining partition information of the data partition corresponding to the second target table from the target record item;

The data in the data partition corresponding to the source data table is copied to the data partition corresponding to the second target table.

7. The data replication method according to any one of claims 1 to 6, characterized in that the source data table is a hive table, and the first target table is an impala table.

8. A data replication device, characterized in that the device comprises:

A first acquisition module, configured to acquire configuration information, wherein the configuration information includes data information of a source data table and an identifier of a target library, and the data information includes a first table name and a table structure of the source data table;

A second acquisition module, configured to process the table structure according to a first rule to obtain a first identifier;

a matching module, configured to match the first table name with a second table name in a metadata record table to obtain a target record item, wherein the metadata record table includes a plurality of record items, each record item includes a second table name and a second identifier, the second identifier is obtained by processing a table structure corresponding to the second table name according to the first rule, and the second table name in the target record item is the same as the first table name;

a creation module, configured to create a first target table for the source data table if the second identifier of the target record item is different from the first identifier, wherein the first target table is a table in the target library, and the source data table and the first target table have the same table structure;

A data replication module is used to replicate the data in the source data table to the first target table.

9. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory stores a computer program, and the processor implements the data copying method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data replication method according to any one of claims 1 to 7.