CN114610803A - Data processing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN114610803A (application number CN202210158830.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- information
- interface
- configuration
- data processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44505—Configuring for program initiating, e.g. using registry, configuration files
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/02—Banking, e.g. interest calculation or account maintenance
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the invention provides a data processing method, a data processing apparatus, an electronic device and a storage medium. The method comprises: acquiring data format information for service data; inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data; and exporting a big data ETL job configuration file according to the job configuration template, where the big data ETL job configuration file is used to configure ETL job operations on the service data. With this embodiment, the development of a big data ETL job can be completed without writing code, simply by inputting the organized data format information into the pre-configured data processing tool, which reduces the technical difficulty of developing big data ETL jobs directly.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium.
Background
Hadoop is an open source framework that stores massive data and runs distributed analytics applications on a distributed server cluster. Because technologies in the Hadoop ecosystem iterate quickly, there is currently no automated production tool for data ETL (Extract-Transform-Load) jobs on compute engines such as Hive, Spark and MR. In addition, because big data processing logic and optimization strategies differ greatly from those of traditional ETL tools, development tools based on TD (Test Director, test management), DS (Data Stage, data integration) and the like are difficult to apply in the Hadoop ecosystem, which makes developing big data jobs directly technically difficult.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a data processing method, where the method includes:
acquiring data format information for service data;
inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data;
exporting a big data ETL job configuration file according to the job configuration template; the big data ETL job configuration file is used for configuring ETL job operations on the service data.
Optionally, the data format information includes data interface information; the inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data includes:
inputting the data interface information into a pre-configured data processing tool, and configuring the data interface information by the data processing tool to obtain data interface configuration information; the data interface configuration information is configuration information used for selecting a plurality of data tables from the service data to be combined, or selecting a plurality of partitions and/or bucket fields to be combined.
Optionally, the data format information further includes interface list information; the inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data further includes:
inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain interface list configuration information; the interface list configuration information is configuration information used for performing data authority control on the service data or filtering invalid data in the service data.
Optionally, the interface list information includes information for configuring an interface; the data processing tool comprises a source layer configuration module; inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain available interface list configuration information, including:
importing information for configuring an interface into the source layer configuration module, and exporting source layer configuration information through the source layer configuration module interface; the source layer configuration information comprises at least one of external table data loading configuration information, internal table data processing configuration information, data quality management configuration information, data acquisition configuration information and interface export configuration information.
Optionally, the interface list information includes information for configuring a data model; the data processing tool includes a common process layer configuration module; the inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain the configuration information of the available interface list, further comprising:
importing information for configuring a data model into a common processing layer configuration module, and exporting common processing layer configuration information through a common processing layer configuration module interface; the common processing layer configuration information includes at least one of data common processing configuration information and interface structure configuration information.
Optionally, the interface list information includes information for interface subscription; the data processing tool comprises an interface subscription management configuration module; the inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain the configuration information of the available interface list, further comprising:
importing information for interface subscription into an interface subscription management configuration module, and exporting interface subscription management configuration information by the interface subscription management configuration module; the interface subscription management configuration information includes at least one of data offload configuration information and data distribution configuration information.
Optionally, before the step of inputting the data format information into a pre-configured data processing tool for processing to obtain a target job configuration template, the method further includes:
verifying the information for configuration in the data format information according to predefined verification logic;
if the information for configuration does not conform to the verification logic, issuing a verification failure prompt;
and if the information for configuration conforms to the verification logic, inputting the data format information into a pre-configured data processing tool for processing to obtain a target job configuration template.
The embodiment of the invention also discloses a data processing device, which comprises:
the information acquisition module is used for acquiring data format information for the service data;
the data processing module is used for inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data;
the export module is used for exporting the big data ETL job configuration file according to the job configuration template; the big data ETL job configuration file is used for configuring ETL job operations on the service data.
Optionally, the data format information includes data interface information; the data processing module comprises:
the data interface configuration submodule is used for inputting the data interface information into a pre-configured data processing tool, and the data processing tool configures the data interface information to obtain data interface configuration information; the data interface configuration information is configuration information used for selecting a plurality of data tables from the service data to be combined, or selecting a plurality of partitions and/or bucket fields to be combined.
Optionally, the data format information further includes interface list information; the data processing module further comprises:
the interface list configuration submodule is used for inputting the interface list information into the data processing tool, and the data processing tool configures the interface list information to obtain interface list configuration information; the interface list configuration information is configuration information used for performing data authority control on the service data or filtering invalid data in the service data.
Optionally, the interface list information includes information for configuring an interface; the data processing tool comprises a source layer configuration module; the interface list configuration submodule includes:
the configuration interface information import unit is used for importing information for configuring an interface into the source layer configuration module and exporting the source layer configuration information through the source layer configuration module interface; the source layer configuration information comprises at least one of external table data loading configuration information, internal table data processing configuration information, data quality management configuration information, data acquisition configuration information and interface export configuration information.
Optionally, the interface list information includes information for configuring a data model; the data processing tool includes a common process layer configuration module; the interface list configuration sub-module further includes:
the configuration data model information import unit is used for importing information for configuring a data model into the common processing layer configuration module, and exporting the common processing layer configuration information through the common processing layer configuration module interface; the common processing layer configuration information includes at least one of data common processing configuration information and interface structure configuration information.
Optionally, the interface list information includes information for interface subscription; the data processing tool comprises an interface subscription management configuration module; the interface list configuration sub-module further includes:
the interface subscription information importing unit is used for importing the information for interface subscription into the interface subscription management configuration module, and exporting the interface subscription management configuration information by the interface subscription management configuration module; the interface subscription management configuration information includes at least one of data offload configuration information and data distribution configuration information.
Optionally, the apparatus further comprises:
the verification module is used for verifying the information for configuration in the data format information according to predefined verification logic;
the verification failure module is used for issuing a verification failure prompt if the information for configuration does not conform to the verification logic;
and the verification success module is used for inputting the data format information into a pre-configured data processing tool for processing to obtain a target job configuration template if the information for configuration conforms to the verification logic.
The embodiment of the invention also discloses an electronic device, which comprises: a processor, a memory and a computer program stored on the memory and capable of running on the processor, which computer program, when executed by the processor, carries out the steps of the data processing method as described above.
The embodiment of the invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and the computer program realizes the steps of the data processing method when being executed by a processor.
The embodiment of the invention has the following advantages:
By acquiring data format information for service data, inputting it into a pre-configured data processing tool for processing, and exporting, according to the resulting job configuration template, a big data ETL job configuration file for configuring ETL job operations on the service data, the development of a big data ETL job can be completed without writing code, simply by inputting the organized data format information into the pre-configured data processing tool. This reduces the technical difficulty of developing big data ETL jobs directly.
Drawings
FIG. 1 is a flow chart of steps of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating steps of another data processing method according to an embodiment of the present invention;
FIG. 3 is a functional architecture diagram of a data processing tool according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of process management using a data processing tool according to an embodiment of the present invention;
FIG. 5 is a block diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Because big data processing logic and optimization strategies differ greatly from those of traditional ETL tools, development tools based on TD, DS and the like are difficult to apply. ETL job development is therefore cumbersome, and developing big data jobs directly is technically difficult.
The core concept of the embodiment of the invention is to acquire data format information for service data, input the data format information into a pre-configured data processing tool for processing, and export, according to the resulting job configuration template, a big data ETL job configuration file for configuring ETL job operations on the service data. The development of a big data ETL job can thus be completed without writing code, simply by inputting the organized data format information into the pre-configured data processing tool, which reduces the technical difficulty of developing big data ETL jobs directly.
Referring to fig. 1, a flowchart illustrating steps of a data processing method according to an embodiment of the present invention is shown, where the method specifically includes the following steps:
Step 101, acquiring data format information for the service data.
The data format information may be information describing the rules by which data is stored in a file or record. The data format may be a text format in character form or a compressed format in binary form. The data format information organized for the service data may be recorded in an interface Excel sheet. Before big data job development begins, an analyst may organize the data format of the service data, and the server may then obtain the data format information organized for the service data.
Step 102, inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data.
After the data format information for the service data is acquired, it can be entered into a pre-configured data processing tool, and the data processing tool performs configuration according to the data format information to obtain a big data job configuration template for the service data.
Step 103, exporting a big data ETL job configuration file according to the job configuration template; the big data ETL job configuration file is used for configuring ETL job operations on the service data.
A configuration file is a computer file that sets parameters and initial settings for a computer program. It may consist of two parts, comment content and configuration item content. Comment content explains whatever needs explaining; for example, # may be used to comment one line in the file. Configuration item content may be records of key-value pairs, stored in key/value form.
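As a minimal sketch of the file layout described above (comment lines introduced by `#`, configuration items stored as key/value pairs), the following Python renders such a file and reads it back. The `key=value` separator, the function names and the item names are assumptions made for illustration; the embodiment does not fix a concrete syntax.

```python
def render_config(items, comments=None):
    """Render key/value items as text, with optional '#' comment lines."""
    comments = comments or {}
    lines = []
    for key, value in items.items():
        if key in comments:
            lines.append(f"# {comments[key]}")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def parse_config(text):
    """Parse the text back into a dict, skipping blank lines and '#' lines."""
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        items[key.strip()] = value.strip()
    return items


# Hypothetical ETL job settings, for illustration only.
job = {"job.name": "load_customer", "job.engine": "hive"}
text = render_config(job, comments={"job.name": "name of the ETL job"})
```

Round-tripping `text` through `parse_config` returns the same items, because comment lines are ignored on reading.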
In the embodiment of the present invention, based on the job configuration template obtained from the data processing tool, the corresponding big data ETL job configuration file can be exported. The file configures ETL job operations on the service data, such as data import, data cleansing, data synchronization, data processing and data distribution, which completes the development of the corresponding big data ETL job.
In the embodiment of the invention, data format information for service data is acquired and input into the pre-configured data processing tool for processing, and a big data ETL job configuration file for configuring ETL job operations on the service data is exported according to the resulting job configuration template. The development of a big data ETL job can thus be completed without writing code, simply by inputting the organized data format information into the pre-configured data processing tool, which reduces the technical difficulty of developing big data ETL jobs directly.
Referring to fig. 2, a flowchart illustrating steps of another data processing method provided in an embodiment of the present invention is shown, where the method may specifically include the following steps:
Step 201, acquiring data format information for the service data.
Step 202, verifying the information for configuration in the data format information according to predefined verification logic.
The information for configuration may be ETL configuration information. Specifically, the server may run automatic checks according to the predefined verification logic, such as interface field correctness checking, interface field type checking, primary key information checking, large/small table association checking and cyclic dependency checking, thereby completing automatic verification of the ETL configuration information.
Step 203, if the information for configuration does not conform to the verification logic, issuing a verification failure prompt.
The verification failure prompt may state the reason for the failure. In the embodiment of the present invention, if the information for configuration does not pass the verification performed according to the verification logic, a prompt giving the reason for the verification failure may be issued.
Step 204, if the information for configuration conforms to the verification logic, inputting the data format information into a pre-configured data processing tool for processing to obtain a target job configuration template.
In the embodiment of the present invention, if the information for configuration passes the verification work performed according to the verification logic, a step of inputting the data format information into a pre-configured data processing tool for processing to obtain the target job configuration template may be performed.
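Some of the automatic checks named above (field type checking, primary key checking, cyclic dependency checking) could be sketched as follows. This is an illustration under assumptions, not the tool's actual verification logic: the allowed type set, the dictionary layout of a field and the function names are all hypothetical. The function returns the list of failure reasons, so an empty list means the configuration may proceed to the data processing tool.

```python
ALLOWED_TYPES = {"string", "int", "bigint", "decimal", "date"}  # assumed set


def find_cycle(dependencies):
    """Return one dependency cycle as a list of names, or None.

    `dependencies` maps a job name to the names it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}
    stack = []

    def visit(node):
        state[node] = GREY
        stack.append(node)
        for dep in dependencies.get(node, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is on the current path: a cycle has been found
                return stack[stack.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = BLACK
        return None

    for node in dependencies:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def validate(fields, primary_key, dependencies):
    """Return a list of failure reasons; an empty list means the check passed."""
    errors = []
    names = [f["name"] for f in fields]
    for f in fields:
        if f["type"] not in ALLOWED_TYPES:
            errors.append(f"field {f['name']}: unknown type {f['type']}")
    for key in primary_key:
        if key not in names:
            errors.append(f"primary key {key} is not an interface field")
    cycle = find_cycle(dependencies)
    if cycle:
        errors.append("cyclic dependency: " + " -> ".join(cycle))
    return errors
```

Returning reasons rather than a bare pass/fail flag matches the requirement that the failure prompt can state why verification failed.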
Step 205, inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data.
The data format information that has passed verification and has been organized for the service data may be recorded in an interface Excel sheet, and the interface Excel sheet is input into the pre-configured data processing tool for processing to obtain a job configuration template for the service data.
In an alternative embodiment, the data format information includes data interface information; the step 205 comprises the following sub-step S11:
substep S11, inputting the data interface information into a pre-configured data processing tool, and configuring the data interface information by the data processing tool to obtain data interface configuration information; the data interface configuration information is configuration information used for selecting a plurality of data tables from the service data to be combined, or selecting a plurality of partitions and/or bucket fields to be combined.
The data interface information may be detailed information of each interface, and may include at least one of field name information, field type information, and field size information. The detailed information of each interface can be input into a pre-configured data processing tool, and the data processing tool configures the data interface information to obtain configuration information for selecting a plurality of data tables from the business data to be combined, or obtain configuration information for selecting a plurality of partitions and/or bucket fields from the business data to be combined.
In one example, the business time may be chosen as the partition field. Compared with partitioning by the time field, partitioning by business time distributes the data relatively evenly and lowers the probability of data skew. The distribution characteristics of the data should be considered when choosing the partition field. Suppose partitioning is done by business field A, where the start date in a zipper table serves as the partition field. The start date may be the date on which the data was delivered, but because a huge historical stock exists when the source layer table goes live, the distribution of business field A is heavily skewed. For example, field A has 1000 distinct values in total, but 50% of the rows hold the value 0; partitioning by field A then puts 50% of the whole table's data into a single partition, which makes SQL workloads inefficient. The partition field can therefore be processed so that the historical stock data is spread across historical dates, keeping the data volume roughly balanced across partitions and avoiding the situation where most partitions hold no data or very little.
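The uneven distribution described above can be measured before a partition field is chosen. The sketch below computes the share of rows in the most populated partition and shows one possible way of spreading stock rows over historical dates. The round-robin reassignment is an assumption chosen only to keep the example short; the text does not say how the historical stock is distributed, and both function names are hypothetical.

```python
from collections import Counter


def largest_partition_share(partition_values):
    """Fraction of rows that fall into the most populated partition."""
    counts = Counter(partition_values)
    return max(counts.values()) / len(partition_values)


def spread_stock(start_dates, go_live_date, historical_dates):
    """Reassign rows stamped with the go-live date round-robin over
    historical dates (assumed strategy), leaving other rows unchanged."""
    out = []
    i = 0
    for d in start_dates:
        if d == go_live_date:
            out.append(historical_dates[i % len(historical_dates)])
            i += 1
        else:
            out.append(d)
    return out


# Six stock rows all stamped with the go-live date, plus two later rows.
dates = ["20220301"] * 6 + ["20220302", "20220303"]
balanced = spread_stock(dates, "20220301", ["20220101", "20220201", "20220301"])
```

In this toy input the largest partition holds 75% of the rows before spreading and 25% afterwards.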
In one example, the number of partitions may be configured according to business characteristics, while ensuring that hot and cold data are not mixed too much within a partition. For example, if SQL workloads mostly operate on 2 to 3 months of data, one partition may cover 3 months.
The partition field may use range partitioning, such as partitioning by day, by month or by year. The partition range may be chosen according to the data volume. If the volume of appended data is large, the table may be partitioned so that the data is spread across the partitions. After the partition field is chosen, the selectable interval for the number of partitions can be determined from the data volume, and the number of partitions may be kept within a few dozen.
In one example, the bucket field may be chosen from the primary key. Highly dispersed fields such as customer number, ID, identity card number, account and serial number may be chosen as the bucket field, which avoids the data skew caused by choosing fields with low dispersion, such as address type or blacklist type.
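One simple way to compare candidate fields by dispersion is the ratio of distinct values to rows in a sample. The sketch below uses that ratio; it is an illustrative measure chosen here, not one prescribed by the text, and the column names and sample values are hypothetical.

```python
def dispersion(values):
    """Ratio of distinct values to total rows; closer to 1 is more dispersed."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


def pick_bucket_field(columns):
    """Return the column name with the highest dispersion.

    `columns` maps a column name to a sample of its values.
    """
    return max(columns, key=lambda name: dispersion(columns[name]))


sample = {
    "customer_no": ["c1", "c2", "c3", "c4"],
    "address_type": ["home", "home", "work", "home"],
}
```

Here `customer_no` has dispersion 1.0 and `address_type` 0.5, so the customer number is preferred as the bucket field.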
In one example, the number of buckets may be configured according to the data type of the table, calculated on the basis of the most recent three years of data. Specifically, the number of buckets for full data may be configured as full data volume / 200M; the number of buckets for incremental data may be configured as (full data volume + incremental data volume × 365 × 3) / 200M; and the number of buckets for appended data may be configured as appended data volume × 365 × 3 / 200M. Different bucket sizes may be configured for different storage types; for example, the bucket size may be kept within 200M for an ordinary ORC table and within 100M for an ORC transaction table, and the number of records may be limited to the order of millions.
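The three bucket-count formulas above can be written down directly. The sketch below makes several assumptions that the text leaves open: "200M" is read as 200 MiB, the incremental and appended volumes are read as daily volumes multiplied by 365 × 3 days, results are rounded up, and at least one bucket is always returned. The function names are hypothetical.

```python
import math

MB = 1024 * 1024
TARGET_BUCKET_BYTES = 200 * MB  # "200M" in the text, read here as 200 MiB
DAYS = 365 * 3                  # three years of daily loads


def buckets_full(full_bytes):
    """Full data: full data volume / 200M."""
    return max(1, math.ceil(full_bytes / TARGET_BUCKET_BYTES))


def buckets_incremental(full_bytes, daily_increment_bytes):
    """Incremental data: (full volume + daily increment * 365 * 3) / 200M."""
    total = full_bytes + daily_increment_bytes * DAYS
    return max(1, math.ceil(total / TARGET_BUCKET_BYTES))


def buckets_append(daily_append_bytes):
    """Appended data: daily appended volume * 365 * 3 / 200M."""
    return max(1, math.ceil(daily_append_bytes * DAYS / TARGET_BUCKET_BYTES))
```

For example, a 1000 MiB full table needs 5 buckets, and the same table growing by 1 MiB per day needs 11 once three years of increments are included.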
When configuring the number of buckets, whether the table is already partitioned should be taken into account. For a partitioned table, the number of buckets can be estimated from the size and number of the individual partitions. For example, if a data table is both partitioned and bucketed and the number of buckets is too large, each bucket file may be only tens of KB; a single Task then executes inefficiently, and the excessive number of Tasks easily wastes system resources. In that case the original number of buckets may be reduced by an order of magnitude according to the actual situation.
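For a partitioned table, the adjustment described above can be sketched as a loop that divides the bucket count by ten while the bucket files in one partition stay too small. The 1 MiB minimum file size is an assumed threshold for illustration only; the text gives no concrete value, and the function name is hypothetical.

```python
MIB = 1024 * 1024


def reduce_buckets(partition_bytes, bucket_count, min_bucket_bytes=1 * MIB):
    """Lower the bucket count by an order of magnitude at a time until each
    bucket file in a partition reaches min_bucket_bytes or one bucket is left."""
    while bucket_count > 1 and partition_bytes / bucket_count < min_bucket_bytes:
        bucket_count = max(1, bucket_count // 10)
    return bucket_count
```

With a 50 MiB partition and 1000 buckets, each file would be about 51 KB; the loop settles on 10 buckets of roughly 5 MiB each.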
In one example, the use of the substr, left and right functions may be optimized. Illustratively, M_CM_INST_LDAP_ORG_NUM_NMA is partitioned by DC_START_DATE, and the original statement used to query the data for January 2017 is as follows:
select count(*) from M_CM_INST_LDAP_ORG_NUM_NMA where substr(DC_START_DATE,1,6)='201701';
With the original statement, the partition information cannot be used, which easily causes a full-table scan.
The statement can be optimized as follows:
select count(*) from M_CM_INST_LDAP_ORG_NUM_NMA where DC_START_DATE>='20170101' and DC_START_DATE<='20170131';
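The rewrite above replaces a function applied to the partition column with a closed range on the column itself. A small helper can derive the range bounds for any month so that the last day is not hard-coded. This is a sketch: the helper names are hypothetical, dates are assumed to be stored as `YYYYMMDD` strings as in the example, and the table and column names are interpolated directly, so they must come from trusted configuration rather than user input.

```python
import calendar


def month_bounds(yyyymm):
    """Return the first and last day of a month as YYYYMMDD strings."""
    year, month = int(yyyymm[:4]), int(yyyymm[4:6])
    last_day = calendar.monthrange(year, month)[1]
    return f"{yyyymm}01", f"{yyyymm}{last_day:02d}"


def month_count_sql(table, date_column, yyyymm):
    """Build a count query that filters on the bare partition column."""
    low, high = month_bounds(yyyymm)
    return (f"select count(*) from {table} "
            f"where {date_column} >= '{low}' and {date_column} <= '{high}'")


sql = month_count_sql("M_CM_INST_LDAP_ORG_NUM_NMA", "DC_START_DATE", "201701")
```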
When truncating a string, left(column,2) or right(column,2) may raise a null pointer error; substr may be used instead, for example substr(column,0,2) or substr(column,length(column)-2,2).
Through the selection of the partition fields, the configuration of the number of the partitions, the selection of the bucket fields, the configuration of the number of the buckets, and the optimization of functions such as substr functions, left functions, right functions and the like, the optimization of the partitions and the buckets can be performed according to the characteristics of each technical component of Hadoop, so that the technical difficulty of directly developing big data operation is reduced.
In an optional embodiment, the data format information further includes interface list information; the step 205 further comprises the following sub-step S12:
substep S12, inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain interface list configuration information; the interface list configuration information is configuration information used for performing data authority control on the service data or filtering invalid data in the service data.
The interface list information may include at least one of target system information, target interface information, transmission information, scheduling information, data source information, data cleansing configuration information, and estimated data amount information. The data processing tool configured in advance can be used for configuring according to the interface list information to obtain configuration information for performing authority control on the business data or obtain configuration information for filtering invalid data in the business data.
In an alternative embodiment, the interface list information includes information for configuring an interface; the data processing tool comprises a pasting layer configuration module; the sub-step S12 includes: importing the information for configuring an interface into the pasting layer configuration module, and exporting pasting layer configuration information through the pasting layer configuration module interface; the pasting layer configuration information comprises at least one of configuration information for external-table data loading, configuration information for internal-table data processing, data quality management configuration information, data acquisition configuration information and interface export configuration information.
In an alternative embodiment, the interface list information includes information for configuring a data model; the data processing tool includes a common process layer configuration module; the sub-step S12 further includes: importing information for configuring a data model into a common processing layer configuration module, and exporting common processing layer configuration information through a common processing layer configuration module interface; the common processing layer configuration information includes at least one of data common processing configuration information and interface structure configuration information.
In an alternative embodiment, the interface list information includes information for interface subscriptions; the data processing tool comprises an interface subscription management configuration module; the sub-step S12 further includes: importing information for interface subscription into an interface subscription management configuration module, and exporting interface subscription management configuration information by the interface subscription management configuration module; the interface subscription management configuration information includes at least one of data offload configuration information and data distribution configuration information.
Fig. 3 is a schematic diagram of a functional architecture of a data processing tool according to an embodiment of the present invention. As shown, the data processing tool may include a pasting layer configuration module, a common process layer configuration module, and an interface subscription management configuration module.
The pasting layer configuration module can comprise an interface management submodule, an interface display submodule and an interface export submodule. The interface management submodule may include a file check (EXCEL interface) unit and a file import (EXCEL interface) unit. The interface display submodule may include an interface query unit and an interface statistics unit. The interface export submodule may include a data loading (external table) unit, a data processing (internal table) unit, a CPS (Cyber-Physical Systems) signal unit, a MOIA job unit, a data quality management unit, a data acquisition (FTP, File Transfer Protocol / HDFS, Hadoop Distributed File System) unit, an interface export (EXCEL) unit, and a data acquisition unit. After the information for configuring the interface is imported into the pasting layer configuration module, at least one of configuration information for external-table data loading, configuration information for internal-table data processing, data quality management configuration information, data acquisition configuration information and interface export configuration information can be exported through the pasting layer configuration module interface.
The common processing layer configuration module can comprise an interface management submodule, an interface display submodule and an interface export submodule. The interface management submodule may include a file check (EXCEL interface) unit and a file import (EXCEL interface) unit. The interface display submodule may include an interface query unit and an interface statistics unit. The interface export submodule may include a data processing unit, a CPS signal unit, a MOIA job unit, an interface structure (EXCEL) unit, and a data acquisition unit. After the information for configuring the data model is imported into the common processing layer configuration module, at least one of the data common processing configuration information and the interface structure configuration information may be exported through the common processing layer configuration module interface.
The interface subscription management configuration module can comprise a subscription subject submodule, an interface display submodule and a subscription export submodule. The subscription subject submodule may include an interface subscription (common processing layer) unit and an interface subscription (pasting layer) unit. The interface display submodule may include an interface query unit and a personalized customization unit. The subscription export submodule may include a data offload (HDFS) unit, a data distribution (FTP/HDFS) unit, a CPS signal unit, a MOIA job unit, an interface structure (EXCEL) unit, and a data acquisition unit. After the information for the interface subscription is imported into the interface subscription management configuration module, at least one of the data offload configuration information and the data distribution configuration information may be exported by the interface subscription management configuration module.
Through the interface query unit, the data and interfaces maintained in the tool can be queried, for example for version information, data table structures, interface tables and detailed field information.
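The relationship between the three configuration modules and the configuration information each one exports can be summarized in a small lookup table. The sketch below is an assumption-laden illustration: the identifiers in `EXPORTS` and the function `export_configs` are hypothetical, and the returned dictionaries are placeholders rather than the real configuration formats.

```python
from typing import Dict, List

# Module name -> kinds of configuration information named in the description.
EXPORTS: Dict[str, List[str]] = {
    "pasting_layer": [
        "data_loading",
        "data_processing",
        "data_quality_management",
        "data_acquisition",
        "interface_export",
    ],
    "common_processing_layer": ["data_common_processing", "interface_structure"],
    "interface_subscription_management": ["data_offload", "data_distribution"],
}


def export_configs(module: str, interface_name: str, wanted: List[str]) -> Dict[str, dict]:
    """Return one placeholder configuration per requested kind for a module."""
    supported = EXPORTS[module]  # raises KeyError for an unknown module
    unknown = [kind for kind in wanted if kind not in supported]
    if unknown:
        raise ValueError(f"{module} cannot export: {unknown}")
    return {
        kind: {"module": module, "interface": interface_name, "kind": kind}
        for kind in wanted
    }


print(export_configs("interface_subscription_management", "IF_001", ["data_offload"]))
```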
According to the functional architecture designed for the data processing tool, after the interface EXCEL table recording the data format information is input into the data processing tool, big data jobs such as Inceptor SQL, Spark jobs and MOIA jobs can be generated, so that functions such as automatic creation, automatic verification, scheduling management and flow management can be realized through the pre-configured data processing tool. For the unified scheduling tool MOIA, the data processing tool can automatically generate a corresponding MOIA configuration file according to the job execution information configured in the tool, such as pre-checks and job dependencies, and this file can be imported directly into MOIA to complete the configuration of the scheduling tool.
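Generating a scheduling configuration from job execution information can be sketched as below. The MOIA configuration file format is not given in the text, so the JSON layout here is purely hypothetical, as are the names `build_schedule`, `pre_check` and `depends_on`; the sketch only shows ordering jobs by their dependencies with the standard-library `graphlib` module (Python 3.9 or later).

```python
import json
from graphlib import TopologicalSorter


def build_schedule(jobs: dict) -> str:
    """jobs maps a job name to {'pre_check': str, 'depends_on': [job names]}."""
    graph = {name: set(spec.get("depends_on", [])) for name, spec in jobs.items()}
    missing = {dep for deps in graph.values() for dep in deps} - graph.keys()
    if missing:
        raise ValueError(f"undefined dependencies: {sorted(missing)}")
    # static_order() yields every job after all of its dependencies and
    # raises graphlib.CycleError if the dependencies are circular.
    order = list(TopologicalSorter(graph).static_order())
    entries = [
        {
            "job": name,
            "pre_check": jobs[name].get("pre_check"),
            "depends_on": sorted(graph[name]),
        }
        for name in order
    ]
    return json.dumps({"jobs": entries}, indent=2)


jobs = {
    "distribute": {"depends_on": ["clean"]},
    "clean": {"depends_on": ["load"]},
    "load": {"pre_check": "source_file_exists", "depends_on": []},
}
print(build_schedule(jobs))
```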
In the embodiment of the present invention, based on the job configuration template obtained from the data processing tool, the corresponding big data ETL job configuration file can be exported. This file configures ETL job operations on the business data, such as data import, data cleaning, data synchronization, data processing and data distribution, thereby completing the development of the corresponding big data ETL job.
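The step from a job configuration template to a job file can be illustrated with plain template substitution. This is a sketch under assumptions: the template text, the placeholder names and the function `render_job` are hypothetical, and the actual templates produced by the tool are not described in the text.

```python
from string import Template

# Hypothetical job template; the placeholders are filled from the arranged
# data format information (table names, partition field, column list).
JOB_TEMPLATE = Template(
    "-- job: $job_name\n"
    "insert overwrite table $target_table partition ($partition_field='$partition_value')\n"
    "select $columns from $source_table;\n"
)


def render_job(fields: dict) -> str:
    """Fill the template; a missing placeholder value raises KeyError."""
    values = dict(fields)
    values["columns"] = ", ".join(fields["columns"])
    return JOB_TEMPLATE.substitute(values)


print(render_job({
    "job_name": "load_customer",
    "target_table": "dw_customer",
    "source_table": "src_customer",
    "partition_field": "dt",
    "partition_value": "20170131",
    "columns": ["id", "name"],
}))
```

`Template.substitute` is used rather than `safe_substitute` so that incomplete input fails immediately instead of producing a job file with unresolved placeholders.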
Fig. 4 is a schematic flowchart of process management performed by the data processing tool according to an embodiment of the present invention. As shown in the figure, in a banking business scenario, important processing steps for customer-facing and regulatory scenario data can be given an additional safeguard in the form of manual review. The content under manual review may be manually entered information, and may include at least one of a data issuing mode, a data access date, a data partition table, a job name and a scheduling queue. Process management using the data processing tool may include:
(1) The analyst arranges the data format information according to the service source data. The data format information may include interface-layer information such as field names, field types and field lengths, and may also include data-source-layer information such as table names, storage locations, full/incremental mode and synchronization frequency.
(2) The data format information is input into the data processing tool as a data source, and the data source is submitted for manual review.
(3) After the data format information passes manual review, an available data source is obtained; the available data source is customized to obtain a target template, and the target template is submitted for manual review. The customization can include customization of the interface layer and customization of the data layer: customization of the interface layer can be selecting some of the fields or combining a plurality of data tables, and customization of the data layer can be performing authority control on data or filtering invalid data.
(4) After the target template passes manual review, an available target template is obtained.
(5) The service source data is acquired, and the big data ETL job configuration file is exported to a downstream system according to the available target template, so as to complete the development of the corresponding big data ETL job.
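Steps (1) to (5) can be read as a small state machine with two manual review gates. The sketch below is one possible reading, not the tool's implementation: the state and action names are hypothetical, and the "reject" transitions are an assumption, since the text does not say what happens when a review fails.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"                            # step 1: data format information arranged
    SOURCE_REVIEW = "source_review"            # step 2: data source under manual review
    SOURCE_AVAILABLE = "source_available"      # step 3: available data source
    TEMPLATE_REVIEW = "template_review"        # step 3: target template under manual review
    TEMPLATE_AVAILABLE = "template_available"  # step 4: available target template
    EXPORTED = "exported"                      # step 5: ETL job configuration exported


TRANSITIONS = {
    (State.DRAFT, "submit"): State.SOURCE_REVIEW,
    (State.SOURCE_REVIEW, "approve"): State.SOURCE_AVAILABLE,
    (State.SOURCE_REVIEW, "reject"): State.DRAFT,  # assumption
    (State.SOURCE_AVAILABLE, "customize"): State.TEMPLATE_REVIEW,
    (State.TEMPLATE_REVIEW, "approve"): State.TEMPLATE_AVAILABLE,
    (State.TEMPLATE_REVIEW, "reject"): State.SOURCE_AVAILABLE,  # assumption
    (State.TEMPLATE_AVAILABLE, "export"): State.EXPORTED,
}


def advance(state: State, action: str) -> State:
    """Apply one action; anything not listed in TRANSITIONS is refused."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"action {action!r} is not allowed in state {state.name}") from None


state = State.DRAFT
for action in ["submit", "approve", "customize", "approve", "export"]:
    state = advance(state, action)
print(state.name)
```

Because export is reachable only from `TEMPLATE_AVAILABLE`, a configuration file cannot be produced before both reviews have passed.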
In the embodiment of the invention, the functional architecture of the data processing tool is configured, data format information for the business data is obtained and input into the pre-configured data processing tool for processing, and a big data ETL job configuration file, which is used to configure ETL job operations on the business data, is exported according to the resulting job configuration template. The pre-configured data processing tool is thus adapted to the Hadoop technical architecture and big data ETL jobs are generated automatically: the development of a big data ETL job is completed simply by inputting the arranged data format information into the pre-configured data processing tool, without writing code, which reduces the technical difficulty of directly developing big data ETL jobs.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the embodiments of the present invention are not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the embodiments of the present invention.
Referring to fig. 5, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
an information obtaining module 501, configured to obtain data format information for service data;
a data processing module 502, configured to input the data format information into a pre-configured data processing tool for processing, so as to obtain a job configuration template for the service data;
an export module 503, configured to export a big data ETL job configuration file according to the job configuration template; the big data ETL job configuration file is used for configuring ETL job operation aiming at the business data.
In an alternative embodiment, the data format information includes data interface information; the data processing module comprises:
the data interface configuration submodule is used for inputting the data interface information into a pre-configured data processing tool, and the data processing tool configures the data interface information to obtain data interface configuration information; the data interface configuration information is configuration information used for selecting a plurality of data tables from the service data to be combined, or selecting a plurality of partitions and/or bucket fields to be combined.
In an optional embodiment, the data format information further includes interface list information; the data processing module further comprises:
the interface list configuration submodule is used for inputting the interface list information into the data processing tool, and the data processing tool configures the interface list information to obtain interface list configuration information; the interface list configuration information is configuration information used for performing data authority control on the service data or filtering invalid data in the service data.
In an alternative embodiment, the interface list information includes information for configuring an interface; the data processing tool comprises a pasting layer configuration module; the interface list configuration submodule includes:
the configuration interface information import unit is used for importing information for configuring an interface into the pasting layer configuration module and exporting the pasting layer configuration information through the pasting layer configuration module interface; the pasting layer configuration information comprises at least one of configuration information for external-table data loading, configuration information for internal-table data processing, data quality management configuration information, data acquisition configuration information and interface export configuration information.
In an alternative embodiment, the interface list information includes information for configuring a data model; the data processing tool includes a common process layer configuration module; the interface list configuration sub-module further includes:
the configuration data model information import unit is used for importing information for configuring a data model into the common processing layer configuration module, and exporting the common processing layer configuration information through the common processing layer configuration module interface; the common processing layer configuration information includes at least one of data common processing configuration information and interface structure configuration information.
In an alternative embodiment, the interface list information includes information for interface subscriptions; the data processing tool comprises an interface subscription management configuration module; the interface list configuration sub-module further includes:
the interface subscription information importing unit is used for importing the information for interface subscription into the interface subscription management configuration module, and exporting the interface subscription management configuration information by the interface subscription management configuration module; the interface subscription management configuration information includes at least one of data offload configuration information and data distribution configuration information.
In an optional embodiment, the apparatus further comprises:
the verification module is used for verifying the information for configuration in the data format information according to predefined verification logic;
the verification failure module is used for sending out prompt information of verification failure if the information for configuration does not accord with the verification logic;
and the verification success module is used for inputting the data format information into a pre-configured data processing tool for processing to obtain a target job configuration template if the information for configuration conforms to the verification logic.
In the embodiment of the invention, the functional architecture of the data processing tool is configured, data format information for the business data is obtained and input into the pre-configured data processing tool for processing, and a big data ETL job configuration file, which is used to configure ETL job operations on the business data, is exported according to the resulting job configuration template. The pre-configured data processing tool is thus adapted to the Hadoop technical architecture and big data ETL jobs are generated automatically: the development of a big data ETL job is completed simply by inputting the arranged data format information into the pre-configured data processing tool, without writing code, which reduces the technical difficulty of directly developing big data ETL jobs.
Since the apparatus embodiment is basically similar to the method embodiment, it is described briefly; for relevant details, refer to the corresponding description of the method embodiment.
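The verification flow handled by these three modules (check, fail with a prompt, or continue to processing) can be sketched as follows. The predefined verification logic is not spelled out in the text, so the rules in `RULES`, the set `KNOWN_TYPES` and the function names are hypothetical examples only.

```python
from typing import Callable, List, Tuple

KNOWN_TYPES = {"string", "int", "decimal", "date"}  # assumed type vocabulary

# Each rule is a (message, predicate) pair applied to one field description.
RULES: List[Tuple[str, Callable[[dict], bool]]] = [
    ("field_name must be non-empty", lambda f: bool(f.get("field_name"))),
    ("field_type must be a known type", lambda f: f.get("field_type") in KNOWN_TYPES),
    (
        "field_length must be a positive integer",
        lambda f: isinstance(f.get("field_length"), int) and f["field_length"] > 0,
    ),
]


def verify(fields: List[dict]) -> List[str]:
    """Return one message per violated rule; an empty list means success."""
    return [
        f"field {index}: {message}"
        for index, item in enumerate(fields)
        for message, predicate in RULES
        if not predicate(item)
    ]


def process_if_valid(fields: List[dict], process: Callable[[List[dict]], dict]) -> dict:
    """Run the processing step only when verification passes."""
    errors = verify(fields)
    if errors:
        # Verification failure: report prompt information instead of processing.
        return {"ok": False, "errors": errors}
    return {"ok": True, "template": process(fields)}


fields = [{"field_name": "id", "field_type": "int", "field_length": 10}]
print(process_if_valid(fields, lambda fs: {"columns": [f["field_name"] for f in fs]}))
```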
An embodiment of the present invention further provides an electronic device, including:
a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements each process of the above data processing method embodiment and achieves the same technical effect; to avoid repetition, the details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the data processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The data processing method, the data processing apparatus, the electronic device, and the storage medium according to the present invention are described in detail above, and a specific example is applied in the description to explain the principles and embodiments of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (10)
1. A method of data processing, the method comprising:
acquiring data format information aiming at service data;
inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template aiming at the service data;
exporting a big data ETL job configuration file according to the job configuration template; the big data ETL job configuration file is used for configuring ETL job operation aiming at the business data.
2. The method of claim 1, wherein the data format information comprises data interface information; the inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data includes:
inputting the data interface information into a pre-configured data processing tool, and configuring the data interface information by the data processing tool to obtain data interface configuration information; the data interface configuration information is configuration information used for selecting a plurality of data tables from the service data to be combined, or selecting a plurality of partitions and/or bucket fields to be combined.
3. The method of claim 2, wherein the data format information further includes interface list information; the inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template for the service data, further includes:
inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain interface list configuration information; the interface list configuration information is configuration information used for performing data authority control on the service data or filtering invalid data in the service data.
4. The method of claim 3, wherein the interface list information includes information for configuring an interface; the data processing tool comprises a pasting layer configuration module; inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain available interface list configuration information, including:
importing information for configuring an interface into a pasting layer configuration module, and exporting pasting layer configuration information through the pasting layer configuration module interface; the pasting layer configuration information comprises at least one of configuration information for external-table data loading, configuration information for internal-table data processing, data quality management configuration information, data acquisition configuration information and interface export configuration information.
5. The method of claim 4, wherein the interface list information includes information for configuring a data model; the data processing tool includes a common process layer configuration module; the inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain the configuration information of the available interface list, further comprising:
importing information for configuring a data model into a common processing layer configuration module, and exporting common processing layer configuration information through a common processing layer configuration module interface; the common processing layer configuration information includes at least one of data common processing configuration information and interface structure configuration information.
6. The method of claim 5, wherein the interface list information includes information for interface subscriptions; the data processing tool comprises an interface subscription management configuration module; the inputting the interface list information into the data processing tool, and configuring the interface list information by the data processing tool to obtain the configuration information of the available interface list, further comprising:
importing information for interface subscription into an interface subscription management configuration module, and exporting interface subscription management configuration information by the interface subscription management configuration module; the interface subscription management configuration information includes at least one of data offload configuration information and data distribution configuration information.
7. The method according to claim 1, wherein before the step of inputting the data format information into a pre-configured data processing tool for processing to obtain a target job configuration template, further comprising:
verifying information for configuration in the data format information according to predefined verification logic;
if the information for configuration does not accord with the verification logic, sending out prompt information of verification failure;
and if the information for configuration conforms to the verification logic, inputting the data format information into a pre-configured data processing tool for processing to obtain a target job configuration template.
8. A data processing apparatus, characterized in that the apparatus comprises:
the information acquisition module is used for acquiring data format information aiming at the service data;
the data processing module is used for inputting the data format information into a pre-configured data processing tool for processing to obtain a job configuration template aiming at the service data;
the export module is used for exporting the big data ETL job configuration file according to the job configuration template; the big data ETL job configuration file is used for configuring ETL job operation aiming at the business data.
9. An electronic device, comprising: processor, memory and a computer program stored on the memory and being executable on the processor, the computer program, when executed by the processor, implementing the steps of the data processing method according to any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the data processing method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210158830.1A CN114610803A (en) | 2022-02-21 | 2022-02-21 | Data processing method and device, electronic equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114610803A | 2022-06-10 |
Family
ID=81860037
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210158830.1A Pending CN114610803A (en) | 2022-02-21 | 2022-02-21 | Data processing method and device, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114610803A (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150019303A1 (en) * | 2013-07-11 | 2015-01-15 | Bank Of America Corporation | Data quality integration |
| US20170011135A1 (en) * | 2015-07-06 | 2017-01-12 | IGATE Global Solutions Ltd. | Conversion Automation through Data and Object Importer |
| CN107103448A (en) * | 2016-02-23 | 2017-08-29 | 上海御行信息技术有限公司 | Data integrated system based on workflow |
| CN108846076A (en) * | 2018-06-08 | 2018-11-20 | 山大地纬软件股份有限公司 | The massive multi-source ETL process method and system of supporting interface adaptation |
| CN111080243A (en) * | 2019-12-05 | 2020-04-28 | 北京百度网讯科技有限公司 | Service processing method, device, system, electronic equipment and storage medium |
| CN113360474A (en) * | 2020-03-06 | 2021-09-07 | 阿里巴巴集团控股有限公司 | Data processing method and device, electronic equipment and computer readable medium |
| CN113626507A (en) * | 2021-05-14 | 2021-11-09 | 深圳市广电信义科技有限公司 | Method, system and storage medium for generating visual presentation file |
Non-Patent Citations (1)
| Title |
|---|
| 王战英; 王占宏: "Research and Implementation of a Metadata-Based Distributed General Query System", 微型电脑应用 (Microcomputer Applications), no. 08, 20 August 2017 (2017-08-20) * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116795664A (en) * | 2023-08-25 | 2023-09-22 | 四川省农村信用社联合社 | Automatic processing full-increment historical data storage method |
| CN116795664B (en) * | 2023-08-25 | 2023-10-31 | 四川省农村信用社联合社 | Automatic processing full-increment historical data storage method |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10997142B2 (en) | | Cognitive blockchain automation and management |
| CN113590576B (en) | | Database parameter adjustment method, device, storage medium and electronic device |
| US10339038B1 (en) | | Method and system for generating production data pattern driven test data |
| CN112861496A (en) | | Report generation display method and device, computer equipment and readable storage medium |
| CN113254457B (en) | | Account checking method, account checking system and computer readable storage medium |
| CN110297840A (en) | | Data processing method, device, equipment and the storage medium of rule-based engine |
| US9830385B2 (en) | | Methods and apparatus for partitioning data |
| US11928083B2 (en) | | Determining collaboration recommendations from file path information |
| CN112035471A (en) | | Transaction processing method and computer equipment |
| CN113568982B (en) | | Page access information acquisition method and device |
| CN111125045B (en) | | Lightweight ETL processing platform |
| US20210124752A1 (en) | | System for Data Collection, Aggregation, Storage, Verification and Analytics with User Interface |
| CN109255587A (en) | | A kind of cooperative processing method and device of operational data |
| CN112800127A (en) | | Data mining analysis method and device based on transaction bill |
| CN115098738B (en) | | Business data extraction method, device, storage medium and electronic device |
| CN114610803A (en) | | Data processing method and device, electronic equipment and storage medium |
| CN112860954A (en) | | Real-time computing method and real-time computing system |
| CN115657901B (en) | | Service changing method and device based on unified parameters |
| CN106599244B (en) | | General original log cleaning device and method |
| CN114185536B (en) | | Credit data processing methods, devices, computer equipment and storage media |
| CN115329363A (en) | | Data desensitization method, electronic device and system |
| CN110851446B (en) | | Data table generation method and device, computer equipment and storage medium |
| CN114443742A (en) | | K line graph display method, device and equipment |
| CN110196877B (en) | | Data display method, device, computer equipment and storage medium |
| CN119597820B (en) | | Report generation method and system based on data model asset catalogue |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |