CN118069633A - Table data cleaning method and device based on distributed system and server - Google Patents
Table data cleaning method and device based on distributed system and server
- Publication number
- CN118069633A CN118069633A CN202410241359.1A CN202410241359A CN118069633A CN 118069633 A CN118069633 A CN 118069633A CN 202410241359 A CN202410241359 A CN 202410241359A CN 118069633 A CN118069633 A CN 118069633A
- Authority
- CN
- China
- Prior art keywords
- target
- job
- cleaning
- data
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/252—Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
Abstract
This specification provides a table data cleaning method, device and server based on a distributed system, which can be used in the technical field of big data. Based on the method, configuration data of a target job group and a target cleaning strategy can first be acquired; a system metadata base is initialized and the target job group is configured according to the configuration data and the target cleaning strategy, and a target ending job is generated for the target job group; through interaction with a scheduling server and according to preset processing rules and the system metadata base, a target cluster is determined and utilized to execute the target job group using a target data warehouse tool; after the target job group has been executed and the batch data processing task is complete, the target cluster executes the target ending job according to the target cleaning strategy; and through the target ending job, the table data generated through use of the target data warehouse tool is automatically cleaned in the case that a preset cleaning trigger condition is determined to be met. In this way, automatic cleaning of the table data of the target data warehouse tool can be achieved efficiently and accurately.
Description
Technical Field
This specification belongs to the technical field of big data, and in particular relates to a table data cleaning method, device and server based on a distributed system.
Background
In the big data field, the distributed scheduling systems of business platforms (e.g., trading platforms) often need to use a target data warehouse tool to batch-process large amounts of business data (e.g., transaction data).
During such batch data processing, a large amount of invalid table data is generated through use of the target data warehouse tool, which places a serious burden on the data storage of the platform system.
Under existing methods, technicians usually have to manually and periodically review and clean up the invalid table data associated with the target data warehouse tool. This increases the technicians' workload on the one hand; on the other hand, because the result depends on factors such as the individual technician's skill and experience, errors easily occur during table data cleaning, which in turn affects the safety of the platform system's data processing.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The specification provides a table data cleaning method, device and server based on a distributed system, which can efficiently and accurately realize automatic cleaning of table data of a target data warehouse tool.
The specification provides a table data cleaning method based on a distributed system, which is applied to a system server and comprises the following steps:
Acquiring configuration data of a target job group and a target cleaning strategy related to a target data warehouse tool; wherein the target job group includes a plurality of jobs, and the target job group is used to implement a corresponding batch data processing task through the target data warehouse tool based on the distributed system;
Initializing a system metadata base and configuring the target job group according to the configuration data of the target job group and the target cleaning strategy; generating a target ending job for the target job group;
determining, through interaction with a scheduling server and according to a preset processing rule and the system metadata base, a target cluster, and utilizing the target cluster to sequentially execute the target job group and the target ending job using the target data warehouse tool; wherein, after the target cluster executes the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy; and through the target ending job, cleaning the table data generated through use of the target data warehouse tool in the case that a preset cleaning trigger condition is determined to be met.
In one embodiment, the configuration data includes at least one of: job parameters, job dependency relationships, job types;
The target cleaning strategy includes at least one of: an identifier of the database to be cleaned, the table name of the data table to be cleaned, the cleaning partition, a cleaning flag, a cleaning frequency, a cleaning mode type, a cleaning index parameter and a cleaning index threshold.
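The cleaning-strategy fields listed above can be sketched as a small record type. This is a minimal illustration only; the field names and example values are assumptions, not taken from the patent's implementation.

```python
from dataclasses import dataclass

# Illustrative representation of the target cleaning strategy fields; all
# names and sample values are assumptions made for this sketch.
@dataclass
class CleaningPolicy:
    database_id: str          # identifier of the database to be cleaned
    table_name: str           # name of the data table to be cleaned
    partition: str            # partition to clean, e.g. a date partition
    cleaning_flag: bool       # whether cleaning is enabled for this table
    frequency: str            # cleaning frequency, e.g. "daily"
    mode_type: str            # cleaning mode type, e.g. "drop_partition"
    index_parameter: str      # metric checked by the trigger condition
    index_threshold: float    # threshold the metric is compared against

policy = CleaningPolicy(
    database_id="db_trade", table_name="tmp_stats_d",
    partition="dt=20240101", cleaning_flag=True, frequency="daily",
    mode_type="drop_partition", index_parameter="table_size_gb",
    index_threshold=10.0,
)
```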
In one embodiment, obtaining a target cleaning policy for a target job group with respect to a target data warehouse tool includes:
acquiring a target history record, historical time period service scene data and current time period service scene data when a target job group is executed in a historical time period;
According to the target history record and the historical-period business scene data, determining the association between the table data generated through use of the target data warehouse tool when the target job group was executed in the historical period and the historical-period business scene data, and establishing a corresponding association change model based on that association;
and determining a corresponding target cleaning strategy according to the business scene data of the current time period by utilizing the association change model.
In one embodiment, determining and utilizing a target cluster to execute the target job set using a target data warehouse tool according to preset processing rules and a system metadata base by interacting with a scheduling server, comprises:
the system server distributes the target job group, the target ending job and the target cleaning strategy to the scheduling server;
The scheduling server determines a matched target cluster according to the target job set; and sending a target scheduling request for executing the target job group and the target ending job to the target cluster;
And the target cluster responds to the target scheduling request, and executes the target job group by using a target data warehouse tool according to a preset processing rule and a system metadata base.
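The dispatch flow in this embodiment — system server hands the job group, ending job and cleaning strategy to the scheduling server, which picks a matching cluster and issues a scheduling request — can be sketched as follows. All class names, the capacity-based matching rule, and the request shape are assumptions for illustration.

```python
# Hedged sketch of the dispatch flow; not the patent's implementation.
class SchedulingServer:
    def __init__(self, clusters):
        # mapping of cluster name -> free job slots (toy capacity model)
        self.clusters = clusters

    def pick_cluster(self, job_group):
        # toy matching rule: first cluster with enough free slots
        needed = len(job_group["jobs"])
        for name, free_slots in self.clusters.items():
            if free_slots >= needed:
                return name
        raise RuntimeError("no matching cluster")

    def dispatch(self, job_group, ending_job, policy):
        cluster = self.pick_cluster(job_group)
        # the scheduling request asks the cluster to run the job group
        # and then the appended target ending job
        return {"cluster": cluster,
                "jobs": job_group["jobs"] + [ending_job],
                "policy": policy}

scheduler = SchedulingServer({"cluster_a": 2, "cluster_b": 8})
request = scheduler.dispatch({"jobs": ["job1", "job2", "job3"]},
                             "ending_job", {"cleaning_flag": True})
```

The point of the sketch is the ordering: the ending job always travels with, and runs after, its job group.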
In one embodiment, the target cluster executing the target job group using a target data warehouse tool in response to the target scheduling request according to a preset processing rule and system metadata base, comprising:
The target cluster acquires and uses the connection metadata by querying a system metadata base according to a preset processing rule, and creates a target connection pool; wherein the target connection pool comprises tool connections for connecting target data warehouse tools;
the target cluster determines a plurality of consumption nodes and production nodes matched with the target operation group; creating a target working pool and a target thread pool aiming at a target job group;
And the target cluster executes a target job group to complete batch data processing tasks by utilizing a system metadata base, a target connection pool and a target thread pool through the production node and the plurality of consumption nodes.
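The pool setup described above — a target connection pool created from connection metadata queried out of the system metadata base, plus a target thread pool for job threads — might look like the following. A real implementation would open Hive/JDBC connections; here plain dicts stand in for tool connections, and the host/port values are invented for the sketch.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Illustrative pool creation; connection objects are simulated as dicts.
def create_connection_pool(connection_metadata, size):
    pool = queue.Queue(maxsize=size)
    for i in range(size):
        # stand-in for a real data-warehouse-tool connection opened
        # from the metadata read out of the system metadata base
        pool.put({"conn_id": i, **connection_metadata})
    return pool

conn_meta = {"host": "hive.example.internal", "port": 10000}  # assumed values
target_connection_pool = create_connection_pool(conn_meta, size=4)
target_thread_pool = ThreadPoolExecutor(max_workers=4)
```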
In one embodiment, the target cluster performs a target job group to complete a batch data processing task by using a system metadata base, a target connection pool, a target thread pool, and a target work pool through the production node and the plurality of consumption nodes, including:
The current batch job in the target job set is executed as follows:
The target cluster queries through the production node and determines the current batch of operation according to the operation dependency relationship in the system metadata base;
The target cluster determines job execution parameters of the current batch job through the production node; updating the job execution parameters of the current batch job into a target working pool;
The target cluster obtains the job execution parameters of the current batch of jobs from the target working pool through the consumption node, obtains the corresponding job threads from the target thread pool, and obtains the corresponding tool connection from the target connection pool;
The target cluster calls a corresponding working thread through the consumption node, uses a target data warehouse tool through corresponding tool connection, and executes the current batch job according to corresponding job execution parameters so as to complete batch data processing tasks corresponding to the current batch job.
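The produce/consume loop of this embodiment can be compressed into a single-process sketch: the production node publishes job execution parameters for the current batch into the work pool; consumption nodes take a parameter set, borrow a tool connection, and run the job on a worker thread. Executing a job is faked here as recording its name; everything else (names, batch contents) is illustrative.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Compressed sketch of the production/consumption flow; not the patent's code.
def execute_job(params, conn, conn_pool, executed):
    executed.append(params["job"])   # stand-in for running via the warehouse tool
    conn_pool.put(conn)              # return the tool connection to the pool

def run_batches(jobs_by_batch, work_pool, conn_pool, executed):
    with ThreadPoolExecutor(max_workers=2) as thread_pool:
        for batch in jobs_by_batch:              # production node: ready batch
            for job in batch:                    # publish execution parameters
                work_pool.put({"job": job})
            futures = []
            while not work_pool.empty():         # consumption nodes drain the pool
                params = work_pool.get()
                conn = conn_pool.get()           # borrow a tool connection
                futures.append(thread_pool.submit(
                    execute_job, params, conn, conn_pool, executed))
            for f in futures:                    # wait before the next batch
                f.result()

conn_pool = queue.Queue()
conn_pool.put({"conn_id": 0})
conn_pool.put({"conn_id": 1})
executed = []
run_batches([["job1"], ["job2", "job3"]], queue.Queue(), conn_pool, executed)
```

Waiting on the futures before moving on is what enforces the job dependency relationship between batches.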
In one embodiment, while the target cluster performs the target job group to complete the batch data processing task through the production node and the plurality of consumption nodes using the system metadata base, the target connection pool, and the target thread pool, the method further comprises:
the target cluster monitors the execution progress of the target job group;
In the case that the execution of the target job group is determined to be completed, the target cluster switches to execute the target ending job for the target job group.
In one embodiment, after the target cluster switch performs the target closeout job for the target job group, the method further comprises:
the target cluster obtains a target cleaning strategy by inquiring a system metadata base through the production node;
The target cluster analyzes through the production node and detects whether a preset cleaning trigger condition is met or not according to a target cleaning strategy;
and in the case that the preset cleaning trigger condition is determined not to be currently met, the target cluster ends the target ending job.
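The trigger check in the ending job reduces to: read the cleaning strategy, compare the configured metric against its threshold, and end early when the condition is not met. The following sketch assumes a size-based metric; the field names and the comparison rule are illustrative, not from the patent.

```python
# Illustrative trigger-condition check for the target ending job.
def should_clean(policy, current_metrics):
    if not policy["cleaning_flag"]:
        return False  # cleaning disabled for this table: end the ending job
    value = current_metrics.get(policy["index_parameter"], 0.0)
    return value >= policy["index_threshold"]

policy = {"cleaning_flag": True, "index_parameter": "table_size_gb",
          "index_threshold": 10.0}
below = should_clean(policy, {"table_size_gb": 3.5})   # threshold not met
met = should_clean(policy, {"table_size_gb": 12.0})    # threshold met: clean
```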
In one embodiment, after the target cluster is parsed by the production node and whether a preset cleaning trigger condition is currently satisfied is detected according to the target cleaning policy, the method further includes:
Under the condition that the current meeting of the preset cleaning triggering conditions is determined, the target cluster determines current batch cleaning execution parameters through the production nodes according to a target cleaning strategy; updating the current batch cleaning execution parameters into a target working pool;
The target cluster obtains current batch cleaning execution parameters from a target working pool through a consumption node, obtains corresponding operation threads from a target thread pool and obtains corresponding tool connection from a target connection pool;
The target cluster calls the corresponding working thread through the consumption node and, using the corresponding tool connection, cleans the table data related to the target data warehouse tool according to the current batch cleaning execution parameters.
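One way the cleaning execution parameters could be turned into warehouse-tool statements is sketched below. The `DROP TABLE` and `ALTER TABLE ... DROP PARTITION` shapes are standard HiveQL, but how the patent's implementation actually issues them is an assumption, as are the parameter names.

```python
# Illustrative translation of cleaning execution parameters into HiveQL.
def build_cleaning_statement(params):
    table = f'{params["database_id"]}.{params["table_name"]}'
    if params["mode_type"] == "drop_table":
        return f"DROP TABLE IF EXISTS {table}"
    if params["mode_type"] == "drop_partition":
        return (f"ALTER TABLE {table} DROP IF EXISTS "
                f'PARTITION ({params["partition"]})')
    raise ValueError(f'unknown cleaning mode: {params["mode_type"]}')

stmt = build_cleaning_statement({"database_id": "db_trade",
                                 "table_name": "tmp_stats_d",
                                 "mode_type": "drop_partition",
                                 "partition": "dt='20240101'"})
```

A real worker thread would hand `stmt` to the borrowed tool connection for execution.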
In one embodiment, after the target cluster invokes the corresponding worker thread via the consuming node, clears the table data associated with the target data warehouse tool according to the current batch clear execution parameters using the corresponding tool connection, the method further comprises:
the target cluster receives and monitors the execution progress of the target ending operation according to the feedback information of the operation thread;
in the case where it is determined that the execution of the target ending job is completed, the target cluster ends the target ending job.
In one embodiment, the method further comprises:
Acquiring current service scene data at intervals of a preset time period;
and updating the target cleaning strategy according to the current business scene data.
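The periodic refresh above can be sketched as a loop that, every preset interval, fetches the current business-scene data and rebuilds the cleaning strategy from it. The interval, the scene-data shape, and the volume-based derivation rule are all assumptions; a real deployment would run this under a scheduler rather than a plain loop.

```python
import time

# Illustrative periodic refresh of the target cleaning strategy.
def refresh_policy(fetch_scene_data, derive_policy, interval_s, rounds):
    policies = []
    for _ in range(rounds):
        scene = fetch_scene_data()          # current business scene data
        policies.append(derive_policy(scene))
        time.sleep(interval_s)              # preset time period between refreshes
    return policies

# toy scene feed: business volume grows, so the derived frequency tightens
volumes = iter([100, 10_000])
history = refresh_policy(
    fetch_scene_data=lambda: {"daily_rows": next(volumes)},
    derive_policy=lambda s: {"frequency": "hourly" if s["daily_rows"] > 1000 else "daily"},
    interval_s=0, rounds=2)
```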
The specification also provides a table data cleaning device based on a distributed system, which is applied to a system server and comprises:
The acquisition module is used for acquiring configuration data of the target job group and a target cleaning strategy of the target data warehouse tool; wherein the target job set includes a plurality of jobs; the target operation group is used for realizing corresponding batch data processing tasks through a target data warehouse tool based on the distributed system;
the processing module is used for initializing a system metadata base according to the configuration data of the target job set and the target cleaning strategy and configuring the target job set; generating a target ending job for the target job group;
The job module is used for determining and utilizing a target cluster to sequentially execute the target job group and the target ending job by using a target data warehouse tool according to a preset processing rule and a system metadata base through interaction with the scheduling server; after the target cluster executes the target job group to complete the batch data processing task, the target cluster also executes target ending operation according to a target cleaning strategy; and cleaning the table data generated by the target data warehouse tool when in use under the condition that the preset cleaning trigger condition is met through the target ending operation.
The present specification also provides a server comprising a processor and a memory for storing processor-executable instructions that when executed implement the relevant steps of the distributed system based table data cleansing method.
The present specification also provides a computer readable storage medium having stored thereon computer instructions which when executed by a processor implement the relevant steps of the distributed system based table data cleansing method.
The present specification also provides a computer program product comprising a computer program which, when executed by a processor, implements the relevant steps of the distributed system based table data cleaning method.
According to the table data cleaning method, device and server based on the distributed system provided by this specification, when a target job group needs to be executed using a target data warehouse tool based on the distributed system to implement a batch data processing task, configuration data of the target job group and a target cleaning strategy related to the target data warehouse tool can first be acquired; a system metadata base is initialized and the target job group is configured according to the configuration data and the target cleaning strategy, and a target ending job is automatically generated for the target job group; then, through interaction with a scheduling server and according to preset processing rules and the system metadata base, a matching target cluster is determined and utilized to sequentially execute the target job group and the target ending job using the target data warehouse tool; after the target job group has been executed and the batch data processing task is complete, the target cluster automatically executes the target ending job according to the target cleaning strategy; and through the target ending job, the table data generated through use of the target data warehouse tool is automatically cleaned in the case that the preset cleaning trigger condition is determined to be met. In this way, automatic cleaning of the table data of the target data warehouse tool can be achieved efficiently and accurately, the data storage burden of the system and the workload of users are effectively reduced, errors in the data cleaning process are reduced, and the safety and reliability of data processing based on the distributed system are ensured.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure, the drawings that are required for the embodiments will be briefly described below, and the drawings described below are only some embodiments described in the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow diagram of a distributed system-based table data cleansing method provided in one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of one embodiment of a distributed system-based table data cleaning method provided by embodiments of the present disclosure, in one example scenario;
FIG. 3 is a schematic diagram of one embodiment of a distributed system-based table data cleaning method provided by embodiments of the present disclosure, in one example scenario;
FIG. 4 is a schematic diagram of one embodiment of a distributed system-based table data cleaning method provided by embodiments of the present disclosure, in one example scenario;
FIG. 5 is a schematic diagram of one embodiment of a distributed system-based table data cleansing method provided by embodiments of the present disclosure, in one example scenario;
FIG. 6 is a schematic diagram of one embodiment of a distributed system-based table data cleaning method provided by embodiments of the present disclosure, in one example scenario;
FIG. 7 is a schematic diagram of one embodiment of a distributed system-based table data cleaning method provided by embodiments of the present disclosure, in one example scenario;
FIG. 8 is a schematic diagram of the structural composition of a server provided in one embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the structural composition of a table data cleaning device based on a distributed system according to one embodiment of the present disclosure;
FIG. 10 is a schematic diagram of one embodiment of a distributed system-based table data cleansing method provided by embodiments of the present disclosure, in one example scenario;
FIG. 11 is a schematic diagram of one embodiment of a distributed system-based table data cleansing method provided by embodiments of the present disclosure, in one example scenario.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides a table data cleaning method based on a distributed system. The method can be applied to the system server side. In particular implementations, the method may include the following:
S101: acquiring configuration data of a target job group and a target cleaning strategy related to a target data warehouse tool; wherein the target job set includes a plurality of jobs; the target operation group is used for realizing corresponding batch data processing tasks through a target data warehouse tool based on the distributed system;
S102: initializing a system metadata base according to configuration data of a target job set and a target cleaning strategy, and configuring the target job set; generating a target ending job for the target job group;
S103: determining, through interaction with a scheduling server and according to a preset processing rule and the system metadata base, a target cluster, and utilizing the target cluster to sequentially execute the target job group and the target ending job using the target data warehouse tool; wherein, after the target cluster executes the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy; and through the target ending job, cleaning the table data generated through use of the target data warehouse tool in the case that a preset cleaning trigger condition is determined to be met.
The target job set may include only one job, or may include a plurality of jobs belonging to the same batch data processing task (or the same batch data processing logic).
Further, the plurality of jobs may be different types of jobs. Such as insert jobs, filter jobs, statistics jobs, delete jobs, and the like.
Specifically, for example, when executing a batch data processing task of "counting class-C data in table A and table B of the same day", the corresponding target job group may include the following jobs: job 1, creating a temporary table D; job 2, retrieving and extracting class-C data from table A; job 3, retrieving and extracting class-C data from table B; job 4, inserting the class-C data extracted from table A into table D; job 5, inserting the class-C data extracted from table B into table D; and job 6, counting the class-C data in table D and outputting the statistical result. Here, jobs 2 and 3 are of the same type, and jobs 4 and 5 are of the same type, while jobs 1, 2, 4 and 6 are of different types.
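The six-job example above can be written down as a small dependency graph and ordered with a topological sort; the dependency edges follow the description (the inserts need the temporary table and the extracted data, and the count needs both inserts). This is purely an illustration of the job dependency relationship, not the patent's implementation.

```python
from graphlib import TopologicalSorter

# Dependency graph for the example target job group; names are illustrative.
deps = {
    "job1_create_d": set(),                            # create temporary table D
    "job2_extract_a": set(),                           # extract class-C data from A
    "job3_extract_b": set(),                           # extract class-C data from B
    "job4_insert_a": {"job1_create_d", "job2_extract_a"},
    "job5_insert_b": {"job1_create_d", "job3_extract_b"},
    "job6_count": {"job4_insert_a", "job5_insert_b"},  # count class-C data in D
}
order = list(TopologicalSorter(deps).static_order())
```

Any valid execution order places the count job last, which is exactly the batch ordering the production node derives from the job dependency relationships in the system metadata base.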
The distributed system can be particularly a distributed system for processing batch business data by a transaction platform in a big data scene. Such as a settlement system for financial institutions, or a transaction system for shopping websites, etc.
Specifically, the distributed system may be a Hadoop-based distributed system. The Hadoop may specifically refer to a distributed system infrastructure. Based on the distributed system architecture, a user can develop a distributed program without knowing the details of a distributed bottom layer, and high-speed operation and storage are realized by fully utilizing the performance of a cluster.
The above-described target data warehouse tool may be understood in particular as a data warehouse tool application that matches a distributed system. In particular, specific statement instructions can be executed by calling the target data warehouse tool to make full use of the performance advantages of the distributed system to efficiently implement completion of relevant data processing operations such as querying, counting, inserting, etc. of database tables.
In the case where the distributed system is a Hadoop-based distributed system, the target data warehouse tool may be Hive. Hive may specifically refer to a data warehouse tool that performs operations such as data extraction, transformation and loading. Using the Hive data warehouse tool, structured data files can be mapped into database tables and provided with SQL (Structured Query Language) query functions; the SQL statements are executed by being converted into MapReduce tasks.
It should be noted that table data (e.g., hive table data, etc.) related to the target data warehouse tool may also be generated when a specific job is executed using the target data warehouse tool based on the distributed system. Most of this type of table data is useless data and is typically stored in a cluster server of the distributed system. Further, when the amount of business data related to the job is large or when the job is continued a plurality of times, the amount of table data generated is also large, and thus the data storage burden of the system is caused.
The above-mentioned target ending job may be specifically understood as an additional job that the system server automatically generates and configures for the target job group. The target ending job described above may be added after the target job group. After the execution of the target job set is completed, the execution of the target ending job is automatically triggered. When the target ending operation is executed specifically, whether a preset cleaning triggering condition is met or not can be automatically detected and judged according to a target cleaning strategy; and under the condition that the preset cleaning triggering condition is met, automatically cleaning invalid table data generated by the target data warehouse tool when in use.
The above system server may be specifically understood as a server responsible for job distribution and processing based on a distributed system. The above-mentioned scheduling server (or application server, which may be referred to as the etl server) may specifically be understood as a server that assists the system server in scheduling the corresponding distributed cluster of servers to execute a specific job.
Based on the above embodiment, the system server may first acquire the configuration data of the target job group and the target cleaning strategy related to the target data warehouse tool; initialize the system metadata base and configure the target job group according to the configuration data and the target cleaning strategy; and at the same time automatically generate a corresponding target ending job and append it after the target job group. The system server can then determine a suitable target cluster through interaction with the scheduling server, and distribute the target job group and the target ending job to the target cluster for processing. The target cluster executes the target job group using the target data warehouse tool according to the preset processing rule and the system metadata base; after the target cluster completes the batch data processing task by executing the target job group, it further automatically executes the target ending job according to the target cleaning strategy; and through the target ending job, cleans the table data generated through use of the target data warehouse tool in the case that the preset cleaning trigger condition is determined to be met. In this way, automatic cleaning of the table data of the target data warehouse tool can be achieved efficiently and accurately, users no longer need to manually clean the related table data, the data storage burden of the system and the workload of users are effectively reduced, errors in the data cleaning process are reduced, and the overall operational stability and reliability of the distributed system are ensured.
In this embodiment, the system server, the scheduling server, and the cluster server may specifically include a background server applied to a side of a service platform (e.g., a transaction platform, etc.), and capable of implementing functions such as data transmission and data processing. Specifically, the system server, the scheduling server, and the cluster server may be, for example, an electronic device having a data operation function, a storage function, and a network interaction function. Or the system server, the scheduling server and the cluster server can also be software programs which are operated in the electronic equipment and provide support for data processing, storage and network interaction. In the present embodiment, the number of servers included in the system server, the scheduling server, and the cluster server is not particularly limited. The system server, the scheduling server and the cluster server can be one server, or can be several servers or a server cluster formed by several servers.
In some embodiments, the target job group may further include a plurality of job groups independent of each other; accordingly, the target ending job may include a plurality of ending jobs respectively corresponding to the plurality of job groups.
In some embodiments, when a user (e.g., a technician or developer, etc.) needs to perform a specific batch data processing task, configuration data for a target job group corresponding to the batch data processing task may be configured through a user terminal; generating a corresponding target job request; wherein, the target job request at least carries configuration data of the target job group; and then the target job request is sent to a system server through the user terminal. Accordingly, the system server may obtain the configuration parameters of the target job set.
In this embodiment, the user terminal may specifically include a front end applied to a user side and capable of implementing functions such as data acquisition and data transmission. Specifically, the user terminal may be, for example, an electronic device such as a desktop computer, a tablet computer, a notebook computer, a smart phone, and the like. Or the user terminal may be a software application capable of running in the electronic device described above. For example, the user login application may be installed and run on a desktop computer.
Further, the target job request may further carry a target cleaning policy set by a user in a user-defined manner. The target cleaning policy described above may be used, among other things, to indicate how to automatically clean up table data associated with a target data warehouse tool.
Correspondingly, the system server can acquire the target cleaning strategy set by user definition by analyzing the received target job request.
In some cases, the target cleaning policy may also be automatically generated by the system server.
In some embodiments, the configuration data may specifically include at least one of: job parameters, job dependency relationships, job types, etc.;
The target cleaning strategy may specifically include at least one of: an identifier of the database to be cleaned, the table name of the data table to be cleaned, the cleaning partition, a cleaning flag, a cleaning frequency, a cleaning mode type, a cleaning index parameter, a cleaning index threshold and the like.
Specifically, the job parameters may include a job name, an index parameter at the time of job execution (for example, a job execution time period, a code statement related to job execution, etc.), a common script name related to job execution, etc.
The job dependency relationship may specifically refer to the interaction relationships among different jobs in the target job group when they are executed. For example, job 2 may depend on the execution result of job 1, and can therefore be executed only after job 1 completes.
The job types may specifically include at least one of: insert jobs, delete jobs, statistics jobs, and the like.
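Purely as an illustrative sketch (the field names and this Python representation are assumptions for exposition, not part of this embodiment), the configuration data and target cleaning policy described above might be modeled as:

```python
from dataclasses import dataclass, field

@dataclass
class JobConfig:
    # Job parameters: name, type, associated common script, dependencies
    job_name: str
    job_type: str              # e.g. "insert", "delete", "statistics"
    script_name: str = ""
    depends_on: list = field(default_factory=list)  # job dependency relationship

@dataclass
class CleaningPolicy:
    # Target cleaning strategy fields listed in this embodiment
    database_id: str           # identifier of the database to be cleaned
    table_name: str            # table name of the data table to be cleaned
    clean_partitions: list     # cleaning partitions in scope
    clean_flag: str            # cleaning mark carried by rows to be cleaned
    clean_frequency_days: int  # cleaning frequency (trigger-check interval)
    clean_mode: str            # cleaning mode type
    index_parameter: str       # cleaning index parameter, e.g. cumulative row count
    index_threshold: int       # cleaning index threshold

policy = CleaningPolicy("db01", "t_trade_detail", ["p2", "p3"], "D",
                        5, "partition_drop", "cumulative_row_count", 1_000_000)
```

A job group is then simply a list of `JobConfig` entries plus one `CleaningPolicy`.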
Based on the above embodiment, the system server can obtain richer configuration data and target cleaning policies, and use them to accurately configure and execute the specific target job group and target ending job.
In some embodiments, referring to fig. 2, acquiring the target cleaning policy of the target job group with respect to the target data warehouse tool may, when implemented, include the following:
S1: acquiring a target history record of executing the target job group in a historical time period, business scenario data of the historical time period, and business scenario data of the current time period;
S2: according to the target history record and the historical-period business scenario data, determining the association relationship between the table data generated by using the target data warehouse tool when the target job group was executed in the historical time period and the historical-period business scenario data, and establishing a corresponding association change model from that relationship;
S3: determining a corresponding target cleaning policy from the current-period business scenario data by using the association change model.
Based on the above embodiment, the system server can automatically generate an effective and well-targeted cleaning policy, which further reduces the difficulty of operation for the user and simplifies user operations.
In some embodiments, the historical time period business scenario data may specifically include: attribute information of service data (e.g., data amount of service data, service type to which the service data belongs, participant information of the service data, etc.) involved in executing the target job group in the history period, and environment information (e.g., data amount of service data accessed by the entire transaction platform at the time, operation state of the entire background system of the transaction platform at the time, public opinion information outside at the time, etc.) at the time of executing the target job group in the history period. Wherein the historical time period may be, for example, the last year or the like.
Correspondingly, the service scene data of the current time period specifically may include: attribute information of service data involved in executing the target job in the current time period, and environment information in executing the target job in the current time period. The current time period may be, for example, the current month or the like.
In implementation, the processing logic of the batch data processing task to be realized by the target job group, and/or the combination of job types contained in the target job group, may first be used as an index to query the historical database of the transaction platform and find the application log of the corresponding target data warehouse tool; this log serves as the target history record for executing the target job group in the historical time period. The target history record contains at least information about the table data generated when the target data warehouse tool was used to execute the target job group in the historical period. Then, using the historical time period indicated by the target history record as an index, the historical database is queried again to obtain the business scenario data of the historical time period corresponding to the target history record. Meanwhile, the business scenario data of the current time period can be collected.
In specific implementation, the target history record and the historical-period business scenario data can be used jointly, and an association change model reflecting the internal association between the table data generated by executing the target job group with the target data warehouse tool and the business scenario data can be determined through data fitting analysis. Further, the association change model can be used to process the current-period business scenario data and predict the data characteristics of the table data generated when the target job group is executed with the target data warehouse tool in the current period (e.g., the total amount of table data, how the amount of table data changes over time, and the impact of that amount on the overall performance of the transaction platform). A matched cleaning policy is then determined from the predicted data characteristics and taken as the target cleaning policy.
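The data fitting analysis mentioned above could be as simple as a least-squares fit. The following Python sketch (the feature choice, function names, and sample numbers are assumptions, not from the disclosure) illustrates one way an association change model might map a business-scenario feature to a predicted table-data volume:

```python
def fit_association_model(history):
    """Fit a least-squares line relating a business-scenario feature x
    (e.g. daily transaction volume) to the table-data volume y generated
    when the job group ran. `history` is a list of (x, y) pairs taken
    from the target history record."""
    n = len(history)
    sx = sum(x for x, _ in history)
    sy = sum(y for _, y in history)
    sxx = sum(x * x for x, _ in history)
    sxy = sum(x * y for x, y in history)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda x: slope * x + intercept

# Historical periods: scenario feature -> table rows generated (sample data)
model = fit_association_model([(10, 1000), (20, 2000), (30, 3000)])
predicted_rows = model(40)   # apply the model to the current period's feature
```

The predicted data characteristics would then be matched against the preset cleaning policy set, as the text describes.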
When specifically determining the matched target cleaning policy, the predicted information about the table data can first be matched against a preset cleaning policy set, and the matched preset cleaning policy is determined as the target cleaning policy. The preset cleaning policy set may include a plurality of preset cleaning policies, each corresponding to at least one table data situation. The target cleaning policy may specifically be a file in CSV format.
For example, a table data situation may be: the data volume of the table grows slowly in the early stage (e.g., the first month), while in the later stage it grows sharply and frequently exceeds the storage performance alert value of the transaction platform. The corresponding preset cleaning policy may be: in the early stage, detect every five days whether the cumulative data amount of the table reaches a larger first threshold, and determine that the preset cleaning trigger condition is met when it does; when the trigger condition is met, clean the table data carrying cleaning marks in all cleaning partitions except partition A. In the later stage, detect every two days whether the cumulative data amount reaches a second threshold smaller than the first threshold, and determine that the trigger condition is met when it does; when it is met, clean the table data carrying cleaning marks in all cleaning partitions.
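The two-phase trigger in this example policy can be sketched as follows (phase length, detection frequencies, and threshold values are illustrative, drawn only from the example above):

```python
def should_trigger_cleaning(day, cumulative_amount, policy):
    """Two-phase preset cleaning trigger from the example policy:
    early phase checks a larger first threshold every 5 days; the later
    phase checks a smaller second threshold every 2 days."""
    if day < policy["early_phase_days"]:
        period, threshold = 5, policy["first_threshold"]
    else:
        period, threshold = 2, policy["second_threshold"]
    if day % period != 0:
        return False              # not a detection day for this phase
    return cumulative_amount >= threshold

policy = {"early_phase_days": 30, "first_threshold": 1_000_000,
          "second_threshold": 400_000}
triggered = should_trigger_cleaning(10, 1_200_000, policy)
```

Which partitions are then cleaned (all partitions, or all except partition A) would depend on the phase, as described in the text.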
The preset cleaning policy may specifically be a template policy generated by combining expert experience in advance and performing clustering processing on a large number of historical cleaning policies.
Of course, in specific implementation, the data characteristics of the history table data and the corresponding history cleaning strategies can be utilized to perform model training on the initial model based on the neural network, so as to obtain a strategy generation model capable of automatically generating the corresponding cleaning strategies based on the data characteristics of the table data; and then the strategy generation model can be utilized to automatically generate and output a corresponding cleaning strategy as a target cleaning strategy according to the data characteristics of the table data generated when the target operation group is executed in the input current time period and uses the target data warehouse tool.
The system metadata base may be a relational database, such as an Oracle database or a MySQL database.
In some embodiments, initializing the system metadata base according to the configuration data and target cleaning policy of the target job group, and configuring the target job group, may be implemented as follows: the configuration data of the target job group and the identification information of the target cleaning policy are respectively written, at initialization, into the system database responsible for scheduling in the distributed system; a common script for executing each specific job is determined according to the job parameters and job types in the configuration data of the target job group; and the calling relationships among the corresponding common scripts are configured according to the job dependency relationships in the configuration data, so as to complete the configuration of the target job group.
When specifically generating the target ending job for the target job group, a virtual script for executing the target cleaning operation can first be determined according to the target cleaning policy; the target ending job is then appended to the tail of the target job group according to the job dependency relationships in the configuration data, and is configured to be triggered only after the jobs contained in the target job group have been executed. Meanwhile, the target ending job can be named after the target job group according to the corresponding naming rule, thereby completing the generation and configuration of the target ending job.
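Appending the target ending job to the tail of the group so that it triggers only after every job finishes amounts to adding one node that depends on all the others in the dependency graph. A minimal Python sketch (job names are hypothetical):

```python
from graphlib import TopologicalSorter

def append_ending_job(jobs, deps, ending_job):
    """Make the ending job depend on every job in the group so it is
    triggered only after all of them finish; return an execution order.
    `deps` maps a job name to the list of jobs it depends on."""
    graph = {j: set(deps.get(j, ())) for j in jobs}
    graph[ending_job] = set(jobs)   # ending job depends on the whole group
    return list(TopologicalSorter(graph).static_order())

order = append_ending_job(["job1", "job2"], {"job2": ["job1"]}, "group_end")
```

The topological order guarantees the ending job always comes last, regardless of how the group's internal dependencies are arranged.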
In some embodiments, referring to fig. 3, determining a target cluster through interaction with the scheduling server and using it to execute the target job group with the target data warehouse tool, according to the preset processing rules and the system metadata base, may include the following steps when implemented:
S1: the system server distributes the target job group, the target ending job, and the target cleaning policy to the scheduling server;
S2: the scheduling server determines a matched target cluster according to the target job group, and sends the target cluster a target scheduling request for executing the target job group and the target ending job;
S3: the target cluster responds to the target scheduling request and executes the target job group using the target data warehouse tool, according to the preset processing rules and the system metadata base.
Based on the above embodiment, the system server can determine an appropriate target cluster by interacting with the scheduling server; the target cluster can execute the target job group efficiently, so that after the target job group finishes, the target ending job for it can subsequently be executed automatically.
In implementation, the system server may first send the target job group, the target ending job, and the target cleaning policy to the scheduling server and store them under its designated directory. The scheduling server can then estimate the workload of the target job group; from the distributed clusters currently in an idle state, determine as the target cluster one whose job bearing capacity matches that workload; and send the target cluster a target scheduling request for executing the target job group and the target ending job. Accordingly, the target cluster receives and responds to the target scheduling request, and uses the target data warehouse tool to execute the jobs contained in the target job group according to the preset processing rules and the system metadata base, so as to complete the corresponding batch data processing task.
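The disclosure does not fix the matching rule, but one plausible sketch (cluster fields and capacity units are assumptions) is to pick, among the idle clusters, the smallest one whose bearing capacity still covers the estimated workload:

```python
def pick_target_cluster(clusters, estimated_workload):
    """From clusters reported as idle, pick the one whose job bearing
    capacity is the smallest that still covers the estimated workload
    of the target job group; return None if no idle cluster fits."""
    candidates = [c for c in clusters
                  if c["state"] == "idle" and c["capacity"] >= estimated_workload]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c["capacity"])

clusters = [
    {"name": "clusterA", "state": "idle", "capacity": 50},
    {"name": "clusterB", "state": "busy", "capacity": 200},
    {"name": "clusterC", "state": "idle", "capacity": 120},
]
target = pick_target_cluster(clusters, estimated_workload=100)
```

Choosing the smallest sufficient cluster keeps larger idle clusters free for heavier job groups.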
In some embodiments, referring to fig. 4, the target cluster responding to the target scheduling request and executing the target job group with the target data warehouse tool, according to the preset processing rules and the system metadata base, may include the following when implemented:
S1: the target cluster obtains connection metadata by querying the system metadata base according to the preset processing rules, and uses it to create a target connection pool, where the target connection pool contains tool connections for connecting to the target data warehouse tool;
S2: the target cluster determines a plurality of consumption nodes and a production node matched with the target job group, and creates a target work pool and a target thread pool for the target job group;
S3: through the production node and the plurality of consumption nodes, the target cluster executes the target job group to complete the batch data processing task, using the system metadata base, the target connection pool, the target work pool, and the target thread pool.
The connection metadata may be understood, in particular, as metadata stored in advance in a system metadata base for creating tool connections with the target data warehouse tool.
Based on the above embodiment, the matched target cluster can be invoked to use the target data warehouse tool, making full use of the performance advantages of the distributed system and efficiently executing the target job group to complete the corresponding batch data processing task.
In some embodiments, when implementing, the target cluster server may, according to the preset processing rules, query and obtain as connection metadata the job parameters and job dependencies of the target job group, the user name of the target data warehouse tool (e.g., a Hive user name), and the path and interface information of the target data warehouse tool (e.g., keytab path, krb path, ZKIp, ZKPort); and create the corresponding target connection pool (e.g., a Hive connection pool) using that connection metadata.
When specifically creating the target connection pool, a plurality of tool connections (e.g., Hive connections) can be established with the target data warehouse tool (e.g., Hive) through a corresponding database connection technology (e.g., Java Database Connectivity, JDBC); meanwhile, a corresponding tool connection pool (e.g., a Hive connection pool queue) is created, and the tool connections are cached into it in turn by executing the corresponding thread in a loop (e.g., poolSize times), thereby establishing the target connection pool. The target connection pool contains a plurality of tool connections, which can be used during implementation to access the target data warehouse tool.
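The disclosure describes a Hive/JDBC pool built by looping poolSize times; the following Python sketch substitutes a placeholder connection factory for a real Hive connection, purely to illustrate the caching loop and acquire/release cycle:

```python
import queue

class ToolConnectionPool:
    """Cache a fixed number of tool connections in a thread-safe queue.
    `connect` stands in for establishing a real tool connection
    (Hive over JDBC in the original)."""
    def __init__(self, connect, pool_size):
        self._pool = queue.Queue(maxsize=pool_size)
        for _ in range(pool_size):   # loop poolSize times, cache each connection
            self._pool.put(connect())

    def acquire(self):
        # Blocks until a cached tool connection is available
        return self._pool.get()

    def release(self, conn):
        # Return the tool connection to the pool for reuse
        self._pool.put(conn)

pool = ToolConnectionPool(connect=lambda: object(), pool_size=4)
conn = pool.acquire()
pool.release(conn)
```

Consumption nodes would call `acquire` before executing a batch job and `release` afterwards, so a fixed number of expensive tool connections is shared across job threads.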
In implementation, the target cluster may determine, from its node servers and according to the jobs contained in the target job group, the matched node servers to serve as the consumption nodes and the production node. In addition, a corresponding consumption node list (e.g., a consumer list) may be established based on the consumption nodes. Further, a matching number of threads can be created according to the number of consumption nodes to construct the corresponding target thread pool.
When specifically creating the target work pool (e.g., a worker pool), a ring buffer may be created first; a sequence barrier (e.g., over a RingBuffer structure) is then created on top of the ring buffer through the corresponding instruction program, and the sequence barrier can be used to construct the target work pool. Further, the specific job logic (comprising production logic and consumption logic) can be decomposed according to the target job group and the target ending job, in combination with the job dependency relationships; the job logic is then stored in the work pool, yielding a target work pool matched with the target job group and the target ending job.
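A Disruptor-style ring buffer with sequence counters, which the production node publishes into and consumption nodes claim from, might look like this simplified single-threaded Python sketch (a real implementation would add cross-thread sequence barriers and waiting strategies):

```python
class RingBufferWorkPool:
    """Fixed-size ring buffer: the production node publishes job
    parameters, consumption nodes claim them. A simplified stand-in for
    the ring-buffer/sequence-barrier work pool described in the text."""
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.write_seq = 0   # next slot the producer will publish
        self.read_seq = 0    # next slot a consumer will claim

    def publish(self, item):
        if self.write_seq - self.read_seq >= self.size:
            raise RuntimeError("ring buffer full: producer must wait")
        self.slots[self.write_seq % self.size] = item
        self.write_seq += 1

    def consume(self):
        if self.read_seq == self.write_seq:
            return None      # barrier: nothing published beyond read_seq yet
        item = self.slots[self.read_seq % self.size]
        self.read_seq += 1
        return item

wp = RingBufferWorkPool(size=8)
wp.publish({"job": "job1", "params": {}})
claimed = wp.consume()
```

The sequence counters are what make the structure a bounded queue over reusable slots, which is the performance rationale behind the ring-buffer design.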
In some embodiments, referring to fig. 5, when the target cluster executes the target job group through the production node and the plurality of consumption nodes, using the system metadata base, the target connection pool, the target thread pool, and the target work pool to complete the batch data processing task, the current batch job in the target job group may be executed in the following manner:
S1: the target cluster, through the production node, queries the job dependency relationships in the system metadata base and determines the current batch job;
S2: the target cluster determines the job execution parameters of the current batch job through the production node, and writes those parameters into the target work pool;
S3: the target cluster, through a consumption node, obtains the job execution parameters of the current batch job from the target work pool, obtains a corresponding job thread from the target thread pool, and obtains a corresponding tool connection from the target connection pool;
S4: the target cluster, through the consumption node, invokes the corresponding job thread, uses the target data warehouse tool via the corresponding tool connection, and executes the current batch job according to the corresponding job execution parameters, so as to complete the batch data processing task corresponding to the current batch job.
Based on the above embodiment, each batch of jobs in the target job group can be completed step by step according to the job dependency relationships, and the batch data processing task corresponding to the target job group can be executed accurately and efficiently.
In some embodiments, when the production node and the consumption nodes are specifically used to execute the target job group, the target work pool can be started first, and the production node determines the current batch job in the target job group according to the corresponding production logic and the job dependency relationships; it then queries the system metadata base to obtain the job parameters, job types, and other data of the current batch job as its job execution parameters, and stores those job execution parameters in the target work pool as production data.
Correspondingly, a consumption node can obtain the job execution parameters of the current batch job from the target work pool, obtain a corresponding job thread from the target thread pool, and obtain a corresponding tool connection from the target connection pool; the job thread can then be invoked to use the target data warehouse tool via the tool connection and execute the current batch job according to the corresponding job execution parameters, completing the batch data processing task corresponding to the current batch job with multiple threads in parallel.
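The producer/consumer execution of one batch can be sketched with a work queue standing in for the target work pool and a thread pool standing in for the target thread pool (the `execute` callable stands in for "acquire a tool connection and run the job"; all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
import queue

def run_batch(job_params, execute, num_consumers=3):
    """Production side puts job execution parameters into the work pool;
    consumer threads drain it and execute the jobs in parallel."""
    work_pool = queue.Queue()
    for p in job_params:            # production node fills the work pool
        work_pool.put(p)

    results = []
    def consumer():
        while True:
            try:
                p = work_pool.get_nowait()
            except queue.Empty:
                return              # work pool drained: this consumer is done
            results.append(execute(p))

    with ThreadPoolExecutor(max_workers=num_consumers) as pool:  # target thread pool
        for _ in range(num_consumers):
            pool.submit(consumer)
    return results

done = run_batch([1, 2, 3, 4], execute=lambda p: p * 10)
```

Completion of the batch (the executor draining) is the point at which the next batch in the dependency order would be produced.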
In some embodiments, while the target cluster executes the target job group through the production node and the plurality of consumption nodes, using the system metadata base, the target connection pool, and the target thread pool to complete the batch data processing task, the method may further include the following when implemented:
S1: the target cluster monitors the execution progress of the target job group;
S2: when it is determined that execution of the target job group is complete, the target cluster switches to executing the target ending job for the target job group.
Based on the above embodiment, the target cluster can actively monitor the execution progress of the target job group and, upon detecting that execution is complete, switch in time to executing the target ending job for the target job group.
In specific implementation, the target cluster can determine the execution progress of each batch of jobs by monitoring the feedback information of the job threads during execution, thereby monitoring the execution progress of the target job group as a whole.
In some embodiments, referring to fig. 6, after the target cluster switches to executing the target ending job for the target job group, the method may further include the following when implemented:
S1: the target cluster obtains the target cleaning policy by querying the system metadata base through the production node;
S2: the target cluster, through the production node, parses the target cleaning policy and detects whether the preset cleaning trigger condition is currently met;
S3: when the preset cleaning trigger condition is not currently met, the target cluster ends the target ending job.
Based on the above embodiment, the target cluster can automatically switch to executing the target ending job corresponding to the target job group once the target job group has been executed.
In specific implementation, the production node may acquire the identification information of the target cleaning policy (e.g., the path address of the target cleaning policy) by querying a system metadata base; and then the appointed catalogue of the scheduling server can be queried according to the identification information of the target cleaning strategy, and the corresponding target cleaning strategy is obtained. Then, the production node can detect and judge whether a preset cleaning trigger condition is met or not according to the target cleaning strategy.
Specifically, for example, the production node may collect the current cleaning index parameter for the table data (e.g., the cumulative amount of table data) according to the target cleaning policy, and then detect whether the current cleaning index parameter reaches the corresponding cleaning index threshold (e.g., a cumulative-amount threshold for table data). When the current cleaning index parameter is smaller than the cleaning index threshold, it is determined that the preset cleaning trigger condition is not currently met; the target ending job can then be ended, and the next round of the target job group awaited. Conversely, when the current cleaning index parameter is greater than or equal to the cleaning index threshold, it is determined that the preset cleaning trigger condition is met, and table data cleaning can be performed automatically by executing the target ending job.
In some embodiments, after the target cluster parses the target cleaning policy through the production node and detects whether the preset cleaning trigger condition is currently met, referring to fig. 7, the method may further include the following when implemented:
S1: when it is determined that the preset cleaning trigger condition is currently met, the target cluster, through the production node, determines the current batch cleaning execution parameters according to the target cleaning policy, and writes them into the target work pool;
S2: the target cluster, through a consumption node, obtains the current batch cleaning execution parameters from the target work pool, obtains a corresponding job thread from the target thread pool, and obtains a corresponding tool connection from the target connection pool;
S3: the target cluster, through the consumption node, invokes the corresponding job thread and, using the corresponding tool connection, cleans the table data related to the target data warehouse tool according to the current batch cleaning execution parameters.
Based on the above embodiment, under the condition that the preset cleaning triggering condition is determined to be met currently, the target cluster can efficiently and accurately realize automatic cleaning of the table data of the target data warehouse tool through cooperation of the production node and the consumption node.
In specific implementation, the production node can circularly query and detect table data related to the target data warehouse tool according to the target cleaning strategy so as to determine invalid table data to be cleaned; generating corresponding current batch cleaning execution parameters according to the related information of the table data; and storing the current batch cleaning execution parameters into a target working pool.
Correspondingly, a consumption node can first obtain the current batch cleaning execution parameters from the target work pool, while obtaining a corresponding job thread from the target thread pool and a corresponding tool connection from the target connection pool; it can then invoke the job thread, use the tool connection, locate the corresponding cleaning partition according to the corresponding execution parameters, and clean the table data carrying the cleaning mark in the specified cleaning mode, thereby completing automatic cleaning of the table data.
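Cleaning only the configured partitions, and within them only the rows carrying the cleaning mark, can be sketched as follows (the in-memory `table` dict stands in for partitioned warehouse storage; the flag value "D" is hypothetical):

```python
def clean_marked_rows(table, partitions_to_clean, clean_flag="D"):
    """Locate the configured cleaning partitions and delete the rows
    carrying the cleaning mark, per the batch cleaning execution
    parameters. `table` maps partition name -> list of (row_id, flag)."""
    removed = 0
    for part in partitions_to_clean:
        rows = table.get(part, [])
        kept = [r for r in rows if r[1] != clean_flag]
        removed += len(rows) - len(kept)
        table[part] = kept          # rows outside the mark are preserved
    return removed

table = {"p1": [(1, "D"), (2, "")], "p2": [(3, "D")]}
n = clean_marked_rows(table, ["p2"])   # only partition p2 is in scope
```

Partitions outside the configured scope (here `p1`) are untouched even if they contain marked rows, matching the "all partitions except partition A" style of policy described earlier.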
In some embodiments, after the target cluster invokes the corresponding job thread through the consumption node and uses the corresponding tool connection to clean the table data related to the target data warehouse tool according to the current batch cleaning execution parameters, the method may further include the following when implemented:
S1: the target cluster receives the feedback information of the job thread and monitors the execution progress of the target ending job accordingly;
S2: when it is determined that execution of the target ending job is complete, the target cluster ends the target ending job.
Based on the above embodiment, the target cluster may actively monitor the execution progress of the target ending job according to the feedback information of the job thread when the target ending job is executed, and normally end the target ending job when it is monitored that the execution of the target ending job is completed.
When the target cluster finds, from the feedback information of the job threads during execution of the target ending job, that an abnormality has occurred, it can first gather the job threads that were abnormal during execution as abnormal job threads, together with the cleaning operations corresponding to those threads; the execution information of the abnormal job threads is then collected. According to that execution information, after targeted correction and adjustment, the corresponding updated job thread can be invoked through the updated consumption node, and the cleaning operation re-executed using the corresponding tool connection, so as to complete the cleaning of the related table data.
In some embodiments, the method may further include the following when implemented:
S1: acquiring the current business scenario data at intervals of a preset time period;
S2: updating the target cleaning policy according to the current business scenario data.
Based on the above embodiment, the latest business scenario data can be obtained at preset intervals, and the current data characteristics of the table data analyzed and determined from it; the target cleaning policy can then be optimized and updated in a targeted way according to those characteristics, yielding an updated target cleaning policy that better matches the current data processing scenario. Correspondingly, the target ending job of the target job group can be executed accurately according to the updated target cleaning policy, achieving a better cleaning effect.
In some embodiments, after it is determined that the target job group and the target ending job have finished executing, the resources related to them may be released automatically; for example, the previously created consumption nodes, production node, target thread pool, target work pool, and target connection pool related to the target job group and/or the target ending job may be released.
In some embodiments, after it is determined that the target job group and the target ending job have finished executing, the method may further include the following when implemented: the scheduling server may generate and initiate a corresponding check instruction to trigger a check, by querying the table data related to the target data warehouse tool according to the target cleaning policy, of whether the target ending job has accurately and completely cleaned that table data as required. The check instruction may be an HDFS command.
When it is found that the target ending job has not accurately and completely cleaned the table data related to the target data warehouse tool as required, the run log of the common script related to the target ending job can be obtained (or the run-state data related to the target ending job can be queried in the system metadata base); an error-field search is then performed on the run log of the common script to determine whether an error was reported while the common script was running during execution of the target ending job.
If it is determined that an error was reported while the common script was running during execution of the target ending job, it can be concluded that the failure to accurately and completely clean the table data related to the target data warehouse tool as required was caused by abnormal running of the common script. In this case, the job associated with the common script can be identified as the erroneous job within the target ending job; another normal common script can then be configured and invoked in place of the original one to re-execute the erroneous job, so that the target ending job can accurately and completely clean the table data related to the target data warehouse tool as required.
If it is determined that no error was reported while the common script was running during execution of the target ending job, it can be concluded that the failure to accurately and completely clean the table data as required was caused by an unreasonable setting of the target cleaning policy. In this case, the current target cleaning policy can be marked as an abnormal cleaning policy, and anomaly detection can then be performed on it.
Specifically, anomaly detection can be performed on the abnormal cleaning policy from the two dimensions of logic and character format, to obtain a corresponding anomaly detection result; the target cleaning policy is modified according to that result; and the target ending job can then be reconfigured and executed according to the modified target cleaning policy.
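Since the target cleaning policy may be a CSV file, anomaly detection over the two dimensions named here, character format and logic, could look like the following sketch (the required column names and validation rules are assumptions for illustration):

```python
import csv, io

REQUIRED_FIELDS = ["table_name", "clean_partition", "index_threshold"]

def detect_policy_anomalies(policy_csv):
    """Check an (assumed) CSV-format cleaning policy along two dimensions:
    character format (parseable CSV, required columns present) and
    logic (e.g. the cleaning index threshold must be a positive integer)."""
    issues = []
    try:
        rows = list(csv.DictReader(io.StringIO(policy_csv)))
    except csv.Error as e:
        return [f"format: unparseable CSV ({e})"]
    for i, row in enumerate(rows):
        for f in REQUIRED_FIELDS:
            if not row.get(f):
                issues.append(f"format: row {i} missing field '{f}'")
        threshold = row.get("index_threshold") or ""
        if not threshold.isdigit() or int(threshold) <= 0:
            issues.append(f"logic: row {i} has non-positive threshold")
    return issues

bad = detect_policy_anomalies(
    "table_name,clean_partition,index_threshold\nt1,p1,-5\n")
```

The resulting issue list could be attached to the policy error prompt sent to the user terminal for checking and modification.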
In implementation, a corresponding policy error prompt can also be generated, carrying the abnormal cleaning policy, and sent to the user terminal for the user to check and modify. Correspondingly, the modified target cleaning policy fed back by the user terminal can be received, and the target ending job reconfigured and executed according to it.
As can be seen from the above, according to the distributed-system-based table data cleaning method provided in the embodiments of this specification, when a batch data processing task needs to be completed by executing a target job group with a target data warehouse tool based on a distributed system, the configuration data of the target job group and the target cleaning policy relating to the target data warehouse tool can first be acquired; the system metadata base is initialized and the target job group configured according to that configuration data and target cleaning policy; a target ending job for the target job group is generated automatically; then, through interaction with the scheduling server, a target cluster is determined and used to execute the target job group and the target ending job in sequence with the target data warehouse tool, according to the preset processing rules and the system metadata base; after the target job group has completed the batch data processing task, the target cluster automatically executes the target ending job according to the target cleaning policy; and, through the target ending job, the table data generated by use of the target data warehouse tool is cleaned automatically whenever the preset cleaning trigger condition is met. In this way, automatic cleaning of the table data of the target data warehouse tool can be achieved efficiently and accurately, the data storage burden of the system and the workload of users are effectively reduced, errors in the cleaning process are reduced, and the safety and reliability of overall data processing based on the distributed system are ensured.
The embodiments of the present specification also provide a server, which includes a processor and a memory storing instructions executable by the processor, where the processor, when executing the instructions, can perform the following steps: acquiring configuration data of a target job group, and a target cleaning strategy relating to a target data warehouse tool; wherein the target job group includes a plurality of jobs; the target job group is used for implementing a corresponding batch data processing task through the target data warehouse tool based on the distributed system; initializing a system metadata base according to the configuration data of the target job group and the target cleaning strategy, and configuring the target job group; generating a target ending job for the target job group; determining and utilizing, through interaction with a scheduling server, a target cluster to sequentially execute the target job group and the target ending job using the target data warehouse tool according to preset processing rules and the system metadata base; where, after executing the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy; and cleaning, through the target ending job, the table data generated when the target data warehouse tool is used, in the case that the preset cleaning trigger condition is met.
In order to complete the above instructions more accurately, referring to fig. 8, this embodiment of the present disclosure provides another specific server, where the server includes a network communication port 801, a processor 802, and a memory 803, and the above structures are connected by internal cables so that each structure can perform specific data interaction.
The network communication port 801 may be specifically configured to acquire configuration data of a target job group, and a target cleaning strategy relating to a target data warehouse tool; wherein the target job group includes a plurality of jobs; the target job group is used for implementing a corresponding batch data processing task through the target data warehouse tool based on the distributed system.
The processor 802 may be specifically configured to initialize a system metadata base according to the configuration data of the target job group and the target cleaning strategy, and configure the target job group; generate a target ending job for the target job group; determine and utilize, through interaction with a scheduling server, a target cluster to sequentially execute the target job group and the target ending job using the target data warehouse tool according to preset processing rules and the system metadata base; where, after executing the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy; and clean, through the target ending job, the table data generated when the target data warehouse tool is used, in the case that the preset cleaning trigger condition is met.
The memory 803 may be used for storing a corresponding program of instructions.
In this embodiment, the network communication port 801 may be a virtual port bound to different communication protocols, so that different data can be sent or received. For example, the network communication port may be a port responsible for web data communication, a port responsible for FTP data communication, or a port responsible for mail data communication. The network communication port may also be a physical communication interface or a communication chip. For example, it may be a wireless mobile network communication chip, such as a GSM or CDMA chip; it may also be a Wi-Fi chip; it may also be a Bluetooth chip.
In this embodiment, the processor 802 may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, embedded microcontrollers, and the like. The description is not intended to be limiting.
In this embodiment, the memory 803 may include multiple levels. In a digital system, anything that can hold binary data may serve as memory; in an integrated circuit, a circuit with a storage function but without physical form is also called a memory, such as a RAM or a FIFO; in a system, a storage device in physical form is also called a memory, such as a memory bank or a TF card.
The embodiments of the present specification also provide, based on the above table data cleaning method based on a distributed system, a computer-readable storage medium storing computer program instructions that, when executed, implement the following steps: acquiring configuration data of a target job group, and a target cleaning strategy relating to a target data warehouse tool; wherein the target job group includes a plurality of jobs; the target job group is used for implementing a corresponding batch data processing task through the target data warehouse tool based on the distributed system; initializing a system metadata base according to the configuration data of the target job group and the target cleaning strategy, and configuring the target job group; generating a target ending job for the target job group; determining and utilizing, through interaction with a scheduling server, a target cluster to sequentially execute the target job group and the target ending job using the target data warehouse tool according to preset processing rules and the system metadata base; where, after executing the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy; and cleaning, through the target ending job, the table data generated when the target data warehouse tool is used, in the case that the preset cleaning trigger condition is met.
In the present embodiment, the storage medium includes, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), a cache (Cache), a hard disk drive (Hard Disk Drive, HDD), or a memory card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects of the program instructions stored in the computer readable storage medium may be explained in comparison with other embodiments, and are not described herein.
The embodiments of the present specification also provide a computer program product comprising at least a computer program which, when executed by a processor, performs the following method steps: acquiring configuration data of a target job group, and a target cleaning strategy relating to a target data warehouse tool; wherein the target job group includes a plurality of jobs; the target job group is used for implementing a corresponding batch data processing task through the target data warehouse tool based on the distributed system; initializing a system metadata base according to the configuration data of the target job group and the target cleaning strategy, and configuring the target job group; generating a target ending job for the target job group; determining and utilizing, through interaction with a scheduling server, a target cluster to sequentially execute the target job group and the target ending job using the target data warehouse tool according to preset processing rules and the system metadata base; where, after executing the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy; and cleaning, through the target ending job, the table data generated when the target data warehouse tool is used, in the case that the preset cleaning trigger condition is met.
Referring to fig. 9, on a software level, the embodiment of the present disclosure further provides a table data cleaning device based on a distributed system, where the device may specifically include the following structural modules:
The acquiring module 901 may be specifically configured to acquire configuration data of a target job group, and a target cleaning strategy relating to a target data warehouse tool; wherein the target job group includes a plurality of jobs; the target job group is used for implementing a corresponding batch data processing task through the target data warehouse tool based on the distributed system;
the processing module 902 may be specifically configured to initialize a system metadata base according to the configuration data of the target job group and the target cleaning strategy, and configure the target job group; and to generate a target ending job for the target job group;
The job module 903 may be specifically configured to determine and utilize, through interaction with the scheduling server, a target cluster to sequentially execute the target job group and the target ending job using the target data warehouse tool according to preset processing rules and the system metadata base; where, after executing the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy; and to clean, through the target ending job, the table data generated when the target data warehouse tool is used, in the case that the preset cleaning trigger condition is met.
In some embodiments, the configuration data may specifically include at least one of: job parameters, job dependency relationships, job types, and the like;
The target cleaning strategy may specifically include at least one of: an identifier of the database to be cleaned, table names of the data tables to be cleaned, cleaning partitions, cleaning identifiers, cleaning frequencies, cleaning mode types, cleaning index parameters, cleaning index thresholds, and the like.
In some embodiments, the acquiring module 901, when implemented, may acquire the target cleaning strategy of the target job group with respect to the target data warehouse tool in the following manner: acquiring a target history record of the target job group executed in a historical time period, the business scene data of the historical time period, and the business scene data of the current time period; determining, according to the target history record and the historical-period business scene data, the association relationship between the table data generated when the target job group used the target data warehouse tool in the historical time period and the historical-period business scene data, and establishing a corresponding association change model; and determining the corresponding target cleaning strategy according to the current-period business scene data by using the association change model.
In some embodiments, the job module 903 may, when implemented, determine and utilize a target cluster through interaction with the scheduling server to execute the target job group using the target data warehouse tool according to preset processing rules and the system metadata base in the following manner: the system server distributes the target job group, the target ending job and the target cleaning strategy to the scheduling server; the scheduling server determines a matched target cluster according to the target job group, and sends a target scheduling request for executing the target job group and the target ending job to the target cluster; and the target cluster, in response to the target scheduling request, executes the target job group using the target data warehouse tool according to the preset processing rules and the system metadata base.
In some embodiments, the target cluster responding to the target scheduling request and executing the target job group using the target data warehouse tool according to the preset processing rules and the system metadata base may, when implemented, include: the target cluster acquires connection metadata by querying the system metadata base according to the preset processing rules, and uses it to create a target connection pool; wherein the target connection pool contains tool connections for connecting to the target data warehouse tool; the target cluster determines a production node and a plurality of consumption nodes matched with the target job group, and creates a target work pool and a target thread pool for the target job group; and the target cluster executes the target job group through the production node and the plurality of consumption nodes by using the system metadata base, the target connection pool and the target thread pool, so as to complete the batch data processing task.
In some embodiments, the target cluster executing the batch data processing task through the production node and the plurality of consumption nodes by using the system metadata base, the target connection pool, the target thread pool and the target work pool may, when implemented, include executing the current batch of jobs in the target job group in the following manner: the target cluster, through the production node, queries the job dependency relationships in the system metadata base and determines the current batch of jobs; the target cluster determines the job execution parameters of the current batch of jobs through the production node, and updates the job execution parameters of the current batch of jobs into the target work pool; the target cluster, through the consumption nodes, obtains the job execution parameters of the current batch of jobs from the target work pool, obtains corresponding job threads from the target thread pool, and obtains corresponding tool connections from the target connection pool; and the target cluster, through the consumption nodes, invokes the corresponding worker threads, uses the target data warehouse tool through the corresponding tool connections, and executes the current batch of jobs according to the corresponding job execution parameters, so as to complete the batch data processing tasks corresponding to the current batch of jobs.
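The production/consumption mechanism described above can be sketched with standard Java concurrency primitives (this is a simplified illustration only: a BlockingQueue stands in for the target work pool, job parameters are reduced to strings, and the executeJob body, where the tool connection would be used, is a placeholder):

```java
import java.util.List;
import java.util.concurrent.*;

public class JobGroupRunner {
    // Work pool: the production node publishes job execution parameters here.
    private final BlockingQueue<String> workPool = new LinkedBlockingQueue<>();
    // Thread pool: one worker thread per consumption node.
    private final ExecutorService threadPool;
    private final CountDownLatch done;

    public JobGroupRunner(int consumers, int jobCount) {
        this.threadPool = Executors.newFixedThreadPool(consumers);
        this.done = new CountDownLatch(jobCount);
    }

    // Production node: resolve the current batch from the dependency data and enqueue it.
    public void produce(List<String> currentBatch) {
        workPool.addAll(currentBatch);
    }

    // Consumption nodes: take job parameters from the work pool and execute them.
    public void startConsumers(int consumers) {
        for (int i = 0; i < consumers; i++) {
            threadPool.submit(() -> {
                try {
                    while (true) {
                        String job = workPool.take();
                        executeJob(job);
                        done.countDown();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // shut down cleanly
                }
            });
        }
    }

    protected void executeJob(String jobParams) {
        // Placeholder for: obtain a tool connection, run the job via the warehouse tool.
    }

    public boolean awaitCompletion(long seconds) throws InterruptedException {
        boolean ok = done.await(seconds, TimeUnit.SECONDS);
        threadPool.shutdownNow();
        return ok;
    }
}
```

In the actual system the "jobs" would carry full execution parameters, and the consumers would additionally borrow a tool connection from the target connection pool before running each job.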
In some embodiments, while the target cluster executes the target job group through the production node and the plurality of consumption nodes by using the system metadata base, the target connection pool and the target thread pool to complete the batch data processing task, the apparatus may, when implemented, be further configured to: have the target cluster monitor the execution progress of the target job group; and, in the case that it is determined that execution of the target job group is completed, have the target cluster switch to executing the target ending job for the target job group.
In some embodiments, after the target cluster switches to executing the target ending job for the target job group, the following may be included when implemented: the target cluster obtains the target cleaning strategy by querying the system metadata base through the production node; the target cluster, through the production node, parses the target cleaning strategy and detects whether the preset cleaning trigger condition is currently met; and, in the case that the preset cleaning trigger condition is not currently met, the target cluster ends the target ending job.
In some embodiments, after the target cluster parses the target cleaning strategy through the production node and detects whether the preset cleaning trigger condition is currently met, the apparatus may, when implemented, be further configured to: in the case that it is determined that the preset cleaning trigger condition is currently met, have the target cluster determine the current batch of cleaning execution parameters through the production node according to the target cleaning strategy, and update the current batch of cleaning execution parameters into the target work pool; have the target cluster obtain the current batch of cleaning execution parameters from the target work pool through the consumption nodes, obtain corresponding job threads from the target thread pool, and obtain corresponding tool connections from the target connection pool; and have the target cluster, through the consumption nodes, invoke the corresponding worker threads and, using the corresponding tool connections, clean the table data related to the target data warehouse tool according to the current batch of cleaning execution parameters.
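A minimal sketch of such a trigger-condition check might look as follows (the day-based frequency semantics, the field names and the use of a boolean cleaning flag are illustrative assumptions; an actual strategy may combine further cleaning index parameters and thresholds):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class CleaningTrigger {
    /**
     * Returns true when the preset cleaning trigger condition is met:
     * the cleaning identifier is enabled and at least frequencyDays days
     * have elapsed since the last cleaning run.
     */
    public static boolean shouldClean(boolean cleanFlag, LocalDate lastCleaned,
                                      int frequencyDays, LocalDate today) {
        if (!cleanFlag) {
            return false; // table not marked for cleaning
        }
        if (lastCleaned == null) {
            return true;  // never cleaned before
        }
        return ChronoUnit.DAYS.between(lastCleaned, today) >= frequencyDays;
    }
}
```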
In some embodiments, after the target cluster invokes the corresponding worker threads through the consumption nodes and, using the corresponding tool connections, cleans the table data related to the target data warehouse tool according to the current batch of cleaning execution parameters, the apparatus may, when implemented, be further configured to: have the target cluster receive feedback information from the worker threads and monitor the execution progress of the target ending job accordingly; and, in the case that it is determined that execution of the target ending job is completed, have the target cluster end the target ending job.
In some embodiments, the apparatus, when embodied, may also be used to: acquiring current service scene data at intervals of a preset time period; and updating the target cleaning strategy according to the current business scene data.
It should be noted that, the units, devices, or modules described in the above embodiments may be implemented by a computer chip or entity, or may be implemented by a product having a certain function. For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, when the present description is implemented, the functions of each module may be implemented in the same piece or pieces of software and/or hardware, or a module that implements the same function may be implemented by a plurality of sub-modules or a combination of sub-units, or the like. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
From the above, with the table data cleaning device based on a distributed system provided in the embodiments of the present specification, automatic cleaning of the table data of the target data warehouse tool can be realized efficiently and accurately, the data storage burden of the system and the workload of the user are effectively reduced, errors in the cleaning process are reduced, and the overall running stability and reliability of the distributed system are ensured.
In a specific scene example, the table data cleaning method based on the distributed system provided by the specification can be applied to realize automatic cleaning of Hive table data based on a Hadoop batch scheduling system.
In this scenario example, consider that Hive (e.g., the target data warehouse tool) based on a Hadoop big data platform (e.g., the distributed system) is performing a batch processing service (e.g., batch data processing tasks); such a service typically requires extraction, conversion and loading of data by a batch job scheduling system. Usually, a user writes HQL jobs manually, configures the dependency relationships among the jobs, and completes the processing of Hadoop data in batches through the job scheduling system. However, as the business keeps expanding, the volume of business data keeps increasing, and the data processed by the Hive batch services of the Hadoop big data platform grows; accordingly, the volume of data stored in the Hive tables also keeps increasing, and the storage resources of the big data platform become more and more strained. With the existing method, the user is required to manually clean up invalid data in the Hive tables; for example, the user needs to log in to Beeline through a Hadoop client and execute an HQL cleaning statement to delete the invalid partition data. However, this method has a low degree of automation in practice; only a single table partition can be cleaned at a time, so the cleaning efficiency is low; and, depending as it does on manual user intervention, the error rate is high, which affects the safety and reliability of the data.
In this scenario example, in order to overcome the above problems, the table data cleaning method of the distributed system provided in the present specification may be used to automatically complete Hive table data cleaning according to the configuration after the batch scheduling system has finished processing the Hive jobs. Specifically, by configuring a scheduling job and using a shell script to invoke a jar package, the Hive table data can be automatically cleaned periodically according to the initialized cleaning partitions and cleaning frequencies. Thus, compared with manual operation, the degree of automation and the efficiency of data cleaning can be improved on the premise of ensuring data safety. Specifically, referring to fig. 10, the following steps may be included.
Step S1001: Batch scheduling system jobs are configured (e.g., the configuration data of the target job group is obtained). The scheduling system executes different types of jobs through different public scripts, so that different types of jobs with the same processing logic are grouped into a job group. According to the system requirements, HQL is used to implement the batch data processing logic, and the relevant public script parameters are configured.
Step S1002: Batch job dependencies are configured. The dependency relationships between the jobs in step S1001 are configured through an Excel template, and the public scripts corresponding to the jobs are matched.
Step S1003: Batch job cleaning policies (e.g., the target cleaning strategy) are configured. This mainly consists of configuring the cleaning of temporary or invalid data and files generated during the execution of the batch jobs, including configuring the CSV files that define the Hive table cleaning policies.
Step S1004: the trigger initializes the scheduling system metadata base (e.g., initializes the system metadata base).
When step S1004 is specifically implemented, the following may be included:
S1004-1: Refreshing job dependencies. By executing a script, the job dependency relationships configured in step S1002 are initialized into the scheduling system metadata base, and an ending job (a job whose name coincides with the job group name and which is the last to be executed in each job group, i.e., the target ending job) is automatically generated for each job group (e.g., the target job group); a virtual public script processes the ending job, i.e., the target ending job, and the job dependencies are maintained. The Hive table clean-up logic is maintained in the ending job, so that the Hive table data is cleaned each time the jobs in the job group finish executing.
S1004-2: File distribution. By executing a script, the jobs configured in step S1001 and the cleaning policies configured in step S1003 are distributed to a designated directory of the ETL scheduling server (e.g., the scheduling server) according to the system rules, for use during the execution of the subsequent batch jobs.
Step S1005: the scheduling system schedules the operation of each job group according to the job dependency relationship.
Step S1006: The batch processing jobs in the job group are run. At this time, the scheduling system public scripts schedule the operation of the jobs of step S1001.
Step S1007: The ending job in the job group is run. After the jobs of step S1001 have finished, the ending job automatically generated in step S1004-1 is executed; the scheduling system public script corresponding to the ending job calls the Hive table cleaning shell script, and the Hive table data cleaning is completed automatically according to the cleaning policy configured in step S1003. The specific cleaning logic flow may refer to step S200 shown in fig. 11, and may specifically include the following steps.
Step S201: The scheduling system metadata base (a relational database such as Oracle or MySQL). During the initialization of step S1004, the metadata required at job run time has been initialized into the system metadata base, for example: job group, job name, job dependencies, Hive user, keytab path, krb path, ZKIp, ZKPort, etc.
Step S202: Metadata is obtained from the metadata base to create the Hive connections (Hive user, keytab path, krb path, ZKIp, ZKPort, etc.).
Wherein a Hive connection pool (e.g., a target connection pool) may be created based on step S202.
Specifically, when creating the Hive connection pool, the method may include:
(1) Hive is connected via JDBC. For example: `Connection con = DriverManager.getConnection("jdbc:hive2://...");`
(2) A Hive connection pool queue (e.g., the target connection pool) is created. For example, the queue may be created using the following statement: `BlockingQueue<Connection> connections = new LinkedBlockingQueue<>(config.poolSize);`
(3) Loop poolSize times, adding each Hive connection to the connection pool queue, e.g., `connections.put(con);`.
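Putting steps (1)–(3) together, the pool-construction pattern can be sketched generically as follows (a sketch only: placeholder objects stand in for real Hive JDBC connections, so the JDBC URL and Kerberos details are omitted, and the class and method names are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

/** A minimal fixed-size connection pool built on a blocking queue. */
public class SimplePool<C> {
    private final BlockingQueue<C> connections;

    public SimplePool(int poolSize, Supplier<C> connectionFactory) throws InterruptedException {
        this.connections = new LinkedBlockingQueue<>(poolSize);
        // Loop poolSize times, creating a connection and adding it to the queue.
        for (int i = 0; i < poolSize; i++) {
            connections.put(connectionFactory.get());
        }
    }

    /** Borrow a connection, blocking until one is available. */
    public C borrow() throws InterruptedException {
        return connections.take();
    }

    /** Return a borrowed connection to the pool. */
    public void release(C conn) throws InterruptedException {
        connections.put(conn);
    }

    public int available() {
        return connections.size();
    }
}
```

In the real flow the factory would call `DriverManager.getConnection(...)` with the Hive JDBC URL assembled from the metadata of step S202.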
Step S203: A list of multiple consumers is created. Each consumer inherits WorkHandler<HQLEvent> and implements the onEvent() method. The onEvent() method obtains a Hive connection from the Hive connection pool, generates a PreparedStatement, and executes the update SQL operation: "ALTER TABLE " + dbName + "." + tblName + " DROP IF EXISTS " + parts.
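The cleaning statement assembled in onEvent() can be factored into a small helper (a sketch; the class name is illustrative, `parts` is assumed to carry the PARTITION clause, and quoting/validation of identifiers is deliberately omitted):

```java
public class DropPartitionSql {
    /**
     * Builds the HQL used by the consumer to drop invalid partitions.
     * parts is assumed to be e.g. "PARTITION (dt='20240101')".
     */
    public static String build(String dbName, String tblName, String parts) {
        return "ALTER TABLE " + dbName + "." + tblName + " DROP IF EXISTS " + parts;
    }
}
```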
Step S204: A WorkerPool (e.g., the target work pool) is created.
Specifically, creating the WorkerPool may include:
(1) A ring buffer is constructed. For example, a ring buffer may be created using the following statement: `final RingBuffer<HQLEvent> ringBuffer = RingBuffer.createSingleProducer(...);`
(2) A sequence barrier is created. For example: `final SequenceBarrier sequenceBarrier = ringBuffer.newBarrier();`
(3) The WorkerPool is created: `workerPool = new WorkerPool<>(ringBuffer, sequenceBarrier, new EventExceptionHandler(), consumers);`
Step S205: A thread pool with the same number of threads as consumers (e.g., the target thread pool) is created. For example: `ExecutorService es = Executors.newFixedThreadPool(consumerNum);`
Step S206: The WorkerPool is started. Once the Hive table cleaning work pool is started, whenever the producer publishes data to a sequence of the RingBuffer, the multiple consumers process the Hive table cleaning work in parallel: `workerPool.start(es);`
Step S207: The producer is created: `Producer producer = new Producer(ringBuffer);`
Step S208: The metadata base is read to obtain the path of the CSV configuration file.
Step S209: The CSV file is parsed to obtain the metadata of the Hive tables to be cleaned, including: library name, table name, partition name, cleaning identifier, cleaning frequency, and the like.
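The parsing in this step can be sketched as follows (the column order dbName,tblName,partitionName,cleanFlag,frequency and the use of "1" as the enabled cleaning identifier are assumptions for illustration, not the actual file layout):

```java
import java.util.ArrayList;
import java.util.List;

/** One row of the Hive cleaning-policy CSV. Column order is assumed for illustration. */
public class CleaningPolicyRow {
    public final String dbName, tblName, partitionName;
    public final boolean cleanFlag;
    public final int frequencyDays;

    public CleaningPolicyRow(String db, String tbl, String part, boolean flag, int freq) {
        this.dbName = db; this.tblName = tbl; this.partitionName = part;
        this.cleanFlag = flag; this.frequencyDays = freq;
    }

    /** Parses CSV lines of the form: dbName,tblName,partitionName,cleanFlag,frequency */
    public static List<CleaningPolicyRow> parse(List<String> lines) {
        List<CleaningPolicyRow> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.isBlank()) continue; // skip empty lines
            String[] f = line.split(",");
            rows.add(new CleaningPolicyRow(f[0].trim(), f[1].trim(), f[2].trim(),
                    "1".equals(f[3].trim()), Integer.parseInt(f[4].trim())));
        }
        return rows;
    }
}
```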
Step S210: Loop over the Hive table metadata.
Step S211: Whether the table needs to be cleaned is judged according to the cleaning identifier. If cleaning is required, step S212 is performed; otherwise, return to step S210.
Step S212: The list of partitions to be cleaned is acquired according to metadata information such as the Hive library name and table name.
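Determining which of a table's partitions fall outside the retention window might be sketched as follows (date-named partitions of the form dt=yyyyMMdd are an illustrative assumption; such strings compare correctly in lexicographic order):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PartitionSelector {
    /**
     * Returns the partitions to clean: those whose dt value sorts strictly
     * before the cutoff date (yyyyMMdd strings compare lexicographically).
     */
    public static List<String> toClean(List<String> partitions, String cutoff) {
        return partitions.stream()
                .filter(p -> p.compareTo("dt=" + cutoff) < 0)
                .collect(Collectors.toList());
    }
}
```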
Step S213: The producer publishes the data to the next sequence of the RingBuffer.
Step S214: consumer consumption is automatically triggered when RingBuffer has data.
Step S215: After the cleaning is finished, the various resources are released.
The specific resource release may include:
(1) Release the buffer pool resources. For example, after the producer has finished processing the data, execute: `workerPool.drainAndHalt();`
(2) Close the thread pool. For example: `es.shutdown();`
(3) Loop over the Hive connection pool and close each Hive connection. For example: `conn.close();`
This scenario example verifies that the table data cleaning method based on a distributed system provided in the present specification can help users such as business developers automatically complete the cleaning of invalid Hive table data through unified batch scheduling system jobs, improving the degree of automation and the efficiency of data cleaning while ensuring data safety.
Although the present description provides method operational steps as described in the examples or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented by an apparatus or client product in practice, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment, or even in a distributed data processing environment). The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, it is not excluded that additional identical or equivalent elements may be present in a process, method, article, or apparatus that comprises a described element. The terms first, second, etc. are used to denote a name, but not any particular order.
Those skilled in the art will also appreciate that, in addition to implementing the controller in a pure computer readable program code, it is well possible to implement the same functionality by logically programming the method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller can be regarded as a hardware component, and means for implementing various functions included therein can also be regarded as a structure within the hardware component. Or even means for achieving the various functions may be regarded as either software modules implementing the methods or structures within hardware components.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer-readable storage media including memory storage devices.
From the above description of the embodiments, it will be apparent to those skilled in the art that this specification may be implemented by software plus a necessary general-purpose hardware platform. Based on such an understanding, the technical solutions of this specification may essentially be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and which includes several instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to perform the methods described in the various embodiments, or in portions of the embodiments, of this specification.
The various embodiments in this specification are described in a progressive manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. The specification is operational with numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
Although this specification has been described by way of embodiments, those skilled in the art will appreciate that many variations and modifications may be made without departing from its spirit, and it is intended that the appended claims encompass such variations and modifications.
Claims (15)
1. A table data cleaning method based on a distributed system, applied to a system server, the method comprising:
acquiring configuration data of a target job group and a target cleaning strategy for a target data warehouse tool; wherein the target job group includes a plurality of jobs and is used to complete a corresponding batch data processing task through the target data warehouse tool based on the distributed system;
initializing a system metadata base according to the configuration data of the target job group and the target cleaning strategy, configuring the target job group, and generating a target ending job for the target job group;
determining, through interaction with a scheduling server, a target cluster, and using the target cluster to sequentially execute the target job group and the target ending job with the target data warehouse tool according to preset processing rules and the system metadata base; wherein, after the target cluster executes the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy, and through the target ending job cleans the table data generated by the target data warehouse tool during use when a preset cleaning trigger condition is met.
2. The method of claim 1, wherein the configuration data comprises at least one of: job parameters, job dependency relationships, and job types;
and the target cleaning strategy comprises at least one of: an identifier of the database to be cleaned, table names of the data tables to be cleaned, cleaning partitions, a cleaning mark, a cleaning frequency, a cleaning mode type, cleaning index parameters, and cleaning index thresholds.
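As a purely illustrative sketch, a target cleaning strategy record covering the fields listed in claim 2 might look like the following once loaded from the system metadata base; every field name and value here is an assumption for the example, not taken from the patent itself.

```python
# Hypothetical cleaning-strategy record; field names and values are
# illustrative stand-ins for the items enumerated in claim 2.
target_cleaning_strategy = {
    "database_id": "dw_tmp",                     # identifier of the database to be cleaned
    "table_name": "stage_orders",                # data table to be cleaned
    "clean_partition": "dt<=T-7",                # cleaning-partition expression
    "clean_flag": True,                          # cleaning mark: enabled or not
    "clean_frequency": "daily",                  # how often cleaning may run
    "clean_mode": "drop_partition",              # cleaning mode type
    "clean_index_parameter": "partition_count",  # monitored cleaning index
    "clean_index_threshold": 30,                 # threshold that triggers cleaning
}

# A simple completeness check before the strategy is written to the
# system metadata base.
required = {"database_id", "table_name", "clean_flag", "clean_frequency"}
missing = required - target_cleaning_strategy.keys()
```

A record of this shape could then be parsed by the production node when the target ending job runs.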
3. The method of claim 2, wherein obtaining the target cleaning strategy of the target job group for the target data warehouse tool comprises:
acquiring a target history record of the target job group executed in a historical time period, together with business scenario data for the historical time period and for the current time period;
determining, from the target history record and the historical business scenario data, the association between the historical business scenario data and the table data generated while the target job group executed in the historical time period and used the target data warehouse tool, and establishing a corresponding association change model;
determining the corresponding target cleaning strategy from the current business scenario data by using the association change model.
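The association change model of claim 3 is not specified further in the claims; one minimal way to realize the idea is a least-squares fit linking historical business-scenario load to the volume of table data the warehouse tool produced, from which a cleaning frequency is chosen for the current scenario. The function names, the linear-model choice, and the thresholds below are all assumptions for illustration.

```python
# Hypothetical sketch of claim 3: fit an association between historical
# scenario load and generated table-data volume, then derive a cleaning
# strategy for the current period. All names/thresholds are illustrative.

def fit_association_model(history):
    """history: list of (scenario_load, table_rows_generated) pairs.
    Returns (slope, intercept) of a least-squares line."""
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in history)
    var = sum((x - mean_x) ** 2 for x, _ in history) or 1.0
    slope = cov / var
    return slope, mean_y - slope * mean_x

def derive_cleaning_strategy(model, current_load, row_threshold=1_000_000):
    slope, intercept = model
    predicted_rows = slope * current_load + intercept
    # Heavier predicted table-data output -> clean more often.
    frequency = "hourly" if predicted_rows > row_threshold else "daily"
    return {"clean_flag": True,
            "clean_frequency": frequency,
            "clean_index_threshold": row_threshold}
```

Any model relating scenario data to generated table data would fit the claim equally well; the linear fit is only the simplest concrete instance.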
4. The method of claim 1, wherein determining, through interaction with the scheduling server, a target cluster and using the target cluster to execute the target job group with the target data warehouse tool according to the preset processing rules and the system metadata base comprises:
the system server distributing the target job group, the target ending job, and the target cleaning strategy to the scheduling server;
the scheduling server determining a matching target cluster according to the target job group, and sending to the target cluster a target scheduling request for executing the target job group and the target ending job;
the target cluster, in response to the target scheduling request, executing the target job group with the target data warehouse tool according to the preset processing rules and the system metadata base.
5. The method of claim 4, wherein the target cluster, in response to the target scheduling request, executing the target job group with the target data warehouse tool according to the preset processing rules and the system metadata base comprises:
the target cluster querying the system metadata base according to the preset processing rules to acquire connection metadata, and using the connection metadata to create a target connection pool, wherein the target connection pool comprises tool connections for connecting to the target data warehouse tool;
the target cluster determining a production node and a plurality of consumption nodes matching the target job group, and creating a target work pool and a target thread pool for the target job group;
the target cluster executing the target job group through the production node and the plurality of consumption nodes, using the system metadata base, the target connection pool, the target work pool, and the target thread pool, to complete the batch data processing task.
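The target connection pool of claim 5 can be sketched as a fixed-size pool pre-filled from connection metadata, from which worker threads later borrow and return tool connections. The `FakeToolConnection` class and the metadata shape below stand in for a real data-warehouse client and are purely illustrative.

```python
# Minimal connection-pool sketch for the setup step in claim 5; the
# connection class and DSN format are invented for the example.
import queue

class FakeToolConnection:
    """Stand-in for a real data-warehouse tool connection."""
    def __init__(self, dsn):
        self.dsn = dsn

class TargetConnectionPool:
    def __init__(self, conn_metadata, size=4):
        # Pre-create `size` tool connections from the connection metadata.
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(FakeToolConnection(conn_metadata["dsn"]))

    def acquire(self, timeout=5.0):
        # Blocks until a connection is free (bounds concurrent tool use).
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Connection metadata as it might be read from the system metadata base.
metadata = {"dsn": "warehouse://cluster-a/batch"}
pool = TargetConnectionPool(metadata, size=2)
```

Bounding the pool size caps how many tool connections the consumption nodes can hold at once, which is the usual reason to pool connections rather than open one per job.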
6. The method of claim 5, wherein the target cluster executing the target job group through the production node and the plurality of consumption nodes, using the system metadata base, the target connection pool, the target thread pool, and the target work pool, to complete the batch data processing task comprises executing the current batch of jobs in the target job group as follows:
the target cluster, through the production node, querying the job dependency relationships in the system metadata base to determine the current batch of jobs;
the target cluster, through the production node, determining the job execution parameters of the current batch of jobs and updating the job execution parameters into the target work pool;
the target cluster, through the consumption nodes, acquiring the job execution parameters of the current batch of jobs from the target work pool, acquiring the corresponding working threads from the target thread pool, and acquiring the corresponding tool connections from the target connection pool;
the target cluster, through the consumption nodes, invoking the corresponding working threads and, over the corresponding tool connections, using the target data warehouse tool to execute the current batch of jobs according to the corresponding job execution parameters, so as to complete the batch data processing tasks corresponding to the current batch of jobs.
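The producer/consumer flow of claim 6 can be sketched as follows: a production step resolves the next batch from job dependencies, publishes execution parameters to a work pool, and a consumption step drains the pool and runs each job on a thread pool. The dependency data, job bodies, and parameter shape below are invented for the example; the real jobs would run through the warehouse-tool connections.

```python
# Sketch of the claim-6 batch loop; job names and parameters are illustrative.
import queue
from concurrent.futures import ThreadPoolExecutor

def produce_batch(job_deps, done, work_pool):
    """Publish every not-yet-run job whose dependencies are all satisfied."""
    batch = [j for j, deps in job_deps.items()
             if j not in done and all(d in done for d in deps)]
    for job in batch:
        work_pool.put({"job_id": job, "params": {"retries": 1}})
    return batch

def consume_all(work_pool, executor, run_job):
    """Drain the work pool, running each job on the thread pool."""
    futures = []
    while not work_pool.empty():
        item = work_pool.get()
        futures.append(executor.submit(run_job, item["job_id"], item["params"]))
    return [f.result() for f in futures]

job_deps = {"extract": [], "transform": ["extract"], "load": ["transform"]}
work_pool = queue.Queue()
executed = []

def run_job(job_id, params):
    executed.append(job_id)  # a real job would use a tool connection here
    return job_id

with ThreadPoolExecutor(max_workers=2) as ex:
    done = set()
    while len(done) < len(job_deps):
        produce_batch(job_deps, done, work_pool)
        done.update(consume_all(work_pool, ex, run_job))
```

Each loop iteration corresponds to one "current batch": jobs become runnable only once everything they depend on has completed.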
7. The method of claim 5, wherein, while the target cluster executes the target job group through the production node and the plurality of consumption nodes, using the system metadata base, the target connection pool, and the target thread pool, to complete the batch data processing task, the method further comprises:
the target cluster monitoring the execution progress of the target job group;
the target cluster, upon determining that execution of the target job group is complete, switching to execute the target ending job for the target job group.
8. The method of claim 7, wherein, after the target cluster switches to execute the target ending job for the target job group, the method further comprises:
the target cluster, through the production node, querying the system metadata base to obtain the target cleaning strategy;
the target cluster, through the production node, parsing the target cleaning strategy and determining whether the preset cleaning trigger condition is currently met;
the target cluster, upon determining that the preset cleaning trigger condition is not currently met, terminating the target ending job.
9. The method of claim 8, wherein, after the target cluster parses the target cleaning strategy through the production node and determines whether the preset cleaning trigger condition is currently met, the method further comprises:
the target cluster, upon determining that the preset cleaning trigger condition is currently met, determining the current batch of cleaning execution parameters through the production node according to the target cleaning strategy, and updating the current batch of cleaning execution parameters into the target work pool;
the target cluster, through the consumption nodes, acquiring the current batch of cleaning execution parameters from the target work pool, acquiring the corresponding working threads from the target thread pool, and acquiring the corresponding tool connections from the target connection pool;
the target cluster, through the consumption nodes, invoking the corresponding working threads and, over the corresponding tool connections, cleaning the table data related to the target data warehouse tool according to the current batch of cleaning execution parameters.
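The end-job flow of claims 8 and 9 reduces to: parse the cleaning strategy, check the trigger condition, and, if it is met, build cleaning execution parameters and apply them to the table data. The sketch below uses a partition-count trigger and an in-memory partition list; the strategy fields, trigger choice, and retention rule are assumptions for illustration.

```python
# Illustrative trigger-check and cleaning step for claims 8-9; all
# strategy fields and the partition data are invented for the example.

def trigger_met(strategy, table_stats):
    """True when the monitored cleaning index crosses its threshold."""
    current = table_stats[strategy["clean_index_parameter"]]
    return strategy["clean_flag"] and current >= strategy["clean_index_threshold"]

def build_clean_params(strategy):
    return {"database": strategy["database_id"],
            "table": strategy["table_name"],
            "keep_partitions": strategy["clean_partition_keep"]}

def execute_clean(params, partitions):
    """Drop all but the newest `keep_partitions` partitions."""
    keep = params["keep_partitions"]
    return sorted(partitions)[-keep:]

strategy = {"clean_flag": True, "database_id": "dw_tmp",
            "table_name": "stage_orders", "clean_partition_keep": 2,
            "clean_index_parameter": "partition_count",
            "clean_index_threshold": 3}
partitions = ["2024-03-01", "2024-03-02", "2024-03-03", "2024-03-04"]

# Production-node side: check the trigger; consumption-node side: clean.
if trigger_met(strategy, {"partition_count": len(partitions)}):
    partitions = execute_clean(build_clean_params(strategy), partitions)
```

When the trigger is not met, nothing is dropped and the ending job simply terminates, matching the branch in claim 8.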
10. The method of claim 9, wherein, after the target cluster invokes the corresponding working threads through the consumption nodes and cleans the table data related to the target data warehouse tool over the corresponding tool connections according to the current batch of cleaning execution parameters, the method further comprises:
the target cluster receiving feedback information from the working threads and monitoring the execution progress of the target ending job accordingly;
the target cluster, upon determining that execution of the target ending job is complete, terminating the target ending job.
11. The method of claim 1, wherein the method further comprises:
acquiring current business scenario data at preset time intervals;
updating the target cleaning strategy according to the current business scenario data.
12. A table data cleaning device based on a distributed system, applied to a system server, the device comprising:
an acquisition module, configured to acquire configuration data of a target job group and a target cleaning strategy for a target data warehouse tool, wherein the target job group includes a plurality of jobs and is used to complete a corresponding batch data processing task through the target data warehouse tool based on the distributed system;
a processing module, configured to initialize a system metadata base according to the configuration data of the target job group and the target cleaning strategy, configure the target job group, and generate a target ending job for the target job group;
a job module, configured to determine, through interaction with a scheduling server, a target cluster and use the target cluster to sequentially execute the target job group and the target ending job with the target data warehouse tool according to preset processing rules and the system metadata base; wherein, after the target cluster executes the target job group to complete the batch data processing task, the target cluster further executes the target ending job according to the target cleaning strategy, and through the target ending job cleans the table data generated by the target data warehouse tool during use when a preset cleaning trigger condition is met.
13. A server comprising a processor and a memory for storing processor-executable instructions, which when executed by the processor implement the steps of the method of any one of claims 1 to 11.
14. A computer readable storage medium, having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 11.
15. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410241359.1A CN118069633A (en) | 2024-03-04 | 2024-03-04 | Table data cleaning method and device based on distributed system and server |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118069633A true CN118069633A (en) | 2024-05-24 |
Family
ID=91105607
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410241359.1A Pending CN118069633A (en) | 2024-03-04 | 2024-03-04 | Table data cleaning method and device based on distributed system and server |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118069633A (en) |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8938421B2 (en) | Method and a system for synchronizing data | |
| US20190317754A1 (en) | Continuous software deployment | |
| CN112749056A (en) | Application service index monitoring method and device, computer equipment and storage medium | |
| CN113448812A (en) | Monitoring alarm method and device under micro-service scene | |
| CN110519365A (en) | A kind of method and business change system changing appliance services | |
| CN111930355A (en) | Web back-end development framework and construction method thereof | |
| WO2007036932A2 (en) | Data table management system and methods useful therefor | |
| JP2017515180A (en) | Processing data sets in big data repositories | |
| CN113760677B (en) | Abnormal link analysis method, device, equipment and storage medium | |
| CN111061788A (en) | Multi-source heterogeneous data conversion integration system based on cloud architecture and implementation method thereof | |
| US10061678B2 (en) | Automated validation of database index creation | |
| CN113485999A (en) | Data cleaning method and device and server | |
| WO2021236278A1 (en) | Automatic tuning of incident noise | |
| CN113791586A (en) | Novel industrial APP and identification registration analysis integration method | |
| US12045125B2 (en) | Alert aggregation and health issues processing in a cloud environment | |
| CN102609789A (en) | Information monitoring and abnormality predicting system for library | |
| CN113010310A (en) | Job data processing method and device and server | |
| CN118377768A (en) | Data ETL method, device, equipment and medium based on service flow | |
| CN119292810A (en) | Fault alarm self-healing system and method | |
| CN113568892A (en) | Method and equipment for carrying out data query on data source based on memory calculation | |
| US10318911B1 (en) | Persistenceless business process management system and method | |
| US20080065588A1 (en) | Selectively Logging Query Data Based On Cost | |
| Yao et al. | Probabilistic consistency guarantee in partial quorum-based data store | |
| CN120631980A (en) | A configuration management database cross-platform asset synchronization method and system | |
| Viswanathan et al. | Predictive provisioning: Efficiently anticipating usage in azure sql database |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||