
CN114880405A - Data lake-based data processing method and system - Google Patents

Data lake-based data processing method and system

Info

Publication number
CN114880405A
Authority
CN
China
Prior art keywords
data
source
fields
layer
monitored
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210330525.6A
Other languages
Chinese (zh)
Inventor
徐银领
韩亮
陈佳
刘鲁清
吴家乐
韩杰娇
杜万波
孟子涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaneng Information Technology Co Ltd
Original Assignee
Huaneng Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaneng Information Technology Co Ltd filed Critical Huaneng Information Technology Co Ltd
Priority to CN202210330525.6A priority Critical patent/CN114880405A/en
Publication of CN114880405A publication Critical patent/CN114880405A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/25: Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data lake-based data processing method and system, applied to a platform comprising a data warehouse. All source data information is classified based on a data access specification, a source-pasting table is constructed in the source-pasting layer, and the data source files are imported into the data lake. Business requirements are analyzed according to the business application, dimensional modeling is performed based on those requirements, a dimension table and a fact table are created, data indexes are set according to the fact table, and mart topics are established in the mart layer based on the data indexes. The fields to be monitored, from the source-pasting table through the dimensional modeling, are verified. A summary table is constructed in the summary layer, and metadata of the dimension table, the fact table, and the summary table is collected and monitored. According to the business requirements, the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC. Data-source errors are thereby avoided, data processing efficiency is improved, and data quality is monitored accurately and in real time, so that problems are discovered promptly when they occur.

Description

Data lake-based data processing method and system
Technical Field
The application relates to the technical field of data processing, in particular to a data lake-based data processing method and system.
Background
In existing data lake processing technology, data sources frequently produce errors, allowing external data or other non-business data to enter the data lake. As a result, data quality cannot be monitored accurately, field quality is low, and the data processing efficiency of the data lake is reduced.
Improving the accuracy of data quality detection is therefore a pressing technical problem.
Disclosure of Invention
The invention provides a data lake-based data processing method to address the low accuracy of data quality detection in the prior art. The method is applied to a platform comprising a data warehouse and comprises the following steps:
classifying all source data information based on a data access specification, constructing a source-pasting table in the source-pasting layer, and importing the data source files into the data lake;
analyzing business requirements according to the business application, performing dimensional modeling based on the business requirements, creating a dimension table and a fact table, setting data indexes according to the fact table, and establishing mart topics in the mart layer based on the data indexes;
verifying the fields to be monitored, from the source-pasting table through the dimensional modeling;
constructing a summary table in the summary layer, and collecting and monitoring metadata of the dimension table, the fact table, and the summary table;
and exposing the data tables in the summary layer and the mart layer externally through a data interface and JDBC, according to the business requirements.
In some embodiments of the present application, the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the origin of the transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
In some embodiments of the present application, the method further comprises:
if fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking those fields as the fields to be monitored;
if no fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking all of those fields as the fields to be monitored;
wherein a field has high repeatability when the number of occurrences of a byte value within it exceeds a fixed value.
In some embodiments of the present application, verifying the fields to be monitored from the source-pasting table through the dimensional modeling specifically comprises:
if the repetition count and the null count of a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field as a low-quality field.
In some embodiments of the present application, the method further comprises:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart layer data at the preset scheduling time so that the mart layer data remains up to date.
Correspondingly, the application also provides a data lake-based data processing system, the system comprising:
an import module, configured to classify all source data information based on a data access specification, construct a source-pasting table in the source-pasting layer, and import the data source files into the data lake;
a creation module, configured to analyze business requirements according to the business application, perform dimensional modeling based on the business requirements, create a dimension table and a fact table, set data indexes according to the fact table, and establish mart topics in the mart layer based on the data indexes;
a verification module, configured to verify the fields to be monitored, from the source-pasting table through the dimensional modeling;
a monitoring module, configured to construct a summary table in the summary layer, and collect and monitor metadata of the dimension table, the fact table, and the summary table;
and an exposure module, configured to expose the data tables in the summary layer and the mart layer externally through a data interface and JDBC, according to the business requirements.
In some embodiments of the present application, the system further comprises a determination module configured to:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the origin of the transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
In some embodiments of the present application, the system further comprises an authentication module for:
if fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking those fields as the fields to be monitored;
if no fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking all of those fields as the fields to be monitored;
wherein a field has high repeatability when the number of occurrences of a byte value within it exceeds a fixed value.
In some embodiments of the present application, the verification module is specifically configured to:
if the repetition count and the null count of a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field as a low-quality field.
In some embodiments of the present application, the system further comprises an update module configured to:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart layer data at the preset scheduling time so that the mart layer data remains up to date.
By applying this technical scheme, all source data information is classified based on the data access specification, a source-pasting table is constructed in the source-pasting layer, and the data source files are imported into the data lake; business requirements are analyzed according to the business application, dimensional modeling is performed based on those requirements, a dimension table and a fact table are created, data indexes are set according to the fact table, and mart topics are established in the mart layer based on the data indexes; the fields to be monitored, from the source-pasting table through the dimensional modeling, are verified; a summary table is constructed in the summary layer, and metadata of the dimension table, the fact table, and the summary table is collected and monitored; and the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements. Data-source errors are thereby avoided, data processing efficiency is improved, and data quality is monitored accurately and in real time, so that problems are discovered promptly when they occur. The scheme supports large-scale clusters with large data volumes, meeting cluster-scale requirements for data volumes above 1 PB. It also supports high-concurrency interactive queries: human-machine interactive queries over the data in the data lake return within 2 seconds under hundred-level concurrency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart illustrating a data lake-based data processing method according to an embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of a data lake-based data processing system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application provides a data lake-based data processing method, as shown in FIG. 1, the method comprising:
step S101, classifying all source data information based on a data access specification, constructing a source-pasting table in the source-pasting layer, and importing the data source files into the data lake;
step S102, analyzing business requirements according to the business application, performing dimensional modeling based on the business requirements, creating a dimension table and a fact table, setting data indexes according to the fact table, and establishing mart topics in the mart layer based on the data indexes;
step S103, verifying the fields to be monitored, from the source-pasting table through the dimensional modeling;
step S104, constructing a summary table in the summary layer, and collecting and monitoring metadata of the dimension table, the fact table, and the summary table;
and step S105, exposing the data tables in the summary layer and the mart layer externally through a data interface and JDBC, according to the business requirements.
In step S101, all source data information is classified based on the data access specification, a source-pasting table is constructed in the source-pasting layer, and the data source files are imported into the data lake.
In this embodiment, all source data information is classified, according to the established data access specification, into source system information, source table basic information, source data characteristic information, and the like, so that the source data information is clearer and more transparent before entering the lake, which facilitates subsequent data processing. A source-pasting table is then constructed in the source-pasting layer, and the data source files are imported into the data lake.
To ensure the correctness of data sources, in some embodiments of the present application the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the origin of the transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
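The source check above can be sketched in a few lines of Python. This is an illustrative sketch only: the LAN range, the function name, and the argument shapes are assumptions, not part of the disclosed method.

```python
import ipaddress
from typing import Optional

# Assumed LAN subordinate to the data lake (illustrative range only).
LAKE_LAN = ipaddress.ip_network("10.20.0.0/16")

def should_import(source_type: str, sender_ip: Optional[str] = None) -> bool:
    """Return True if the source data may be imported into the data lake."""
    if source_type == "local_upload":
        # Local uploads are always admitted.
        return True
    if source_type == "online":
        # Online transmissions are admitted only from the lake's own LAN.
        return sender_ip is not None and ipaddress.ip_address(sender_ip) in LAKE_LAN
    return False
```

A gateway in front of the data integration module could call such a check before admitting each file, refusing external or non-business data at the door.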
In step S102, business requirements are analyzed according to the business application, dimensional modeling is performed based on the business requirements, a dimension table and a fact table are created, data indexes are set according to the fact table, and mart topics are established in the mart layer based on the data indexes.
In this embodiment, business requirements are analyzed according to the business application, dimensional modeling is performed based on those requirements, a dimension table and a fact table are created, and data indexes are set according to the fact table, where the data indexes include atomic indexes, derived indexes, and composite indexes. These indexes are summarized, and corresponding mart topics are established in the mart layer. An atomic index is an index without any modifier, also called a measure (typically an aggregatable field in a table, such as order quantity, user count, PV, or UV). A composite index is a computed index built on base indexes through some operation rule, such as average user transaction amount or asset-liability ratio. A derived index is generated by combining a base index or composite index with dimension members, statistical attributes, management attributes, and the like, such as the completed value, planned value, accumulated value, year-on-year ratio, month-on-month ratio, or proportion of a transaction amount.
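The three index kinds can be illustrated with a toy computation. All table contents and numbers below are made up for illustration and are not part of the disclosure.

```python
# Illustrative fact records (order facts with a region dimension attribute).
orders = [
    {"region": "north", "amount": 120.0},
    {"region": "north", "amount": 80.0},
    {"region": "south", "amount": 100.0},
]

# Atomic indexes: bare aggregations over the fact records (measures).
order_count = len(orders)
total_amount = sum(o["amount"] for o in orders)

# Composite index: other indexes combined through an operation rule
# (here, average order amount).
avg_order_amount = total_amount / order_count

# Derived index: a base index restricted by a dimension member
# (here, total amount for the "north" region).
north_amount = sum(o["amount"] for o in orders if o["region"] == "north")
```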
Dimensional modeling is a data modeling method used in data warehouse construction: a logical design method for structuring data that divides the objective world into measures and contexts. A dimension table can be viewed as a window through which a user analyzes data; it contains attributes of the fact records in the fact table. Some attributes provide descriptive information, some specify how the fact table data should be aggregated to provide useful information to the analyst, and attribute hierarchies help aggregate the data. The fact table, short for fact data table, is characterized by containing a large volume of data that can be summarized and recorded.
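A minimal star-schema sketch of this dimension/fact split follows; the tables and their contents are illustrative assumptions, not the patent's model.

```python
# Dimension table: descriptive context, keyed by a surrogate id.
dim_product = {
    1: {"name": "valve", "category": "hardware"},
    2: {"name": "sensor", "category": "electronics"},
}

# Fact table: measures keyed into the dimension table.
fact_sales = [
    {"product_id": 1, "quantity": 3, "amount": 30.0},
    {"product_id": 2, "quantity": 1, "amount": 99.0},
    {"product_id": 1, "quantity": 2, "amount": 20.0},
]

def amount_by_category():
    """Aggregate the fact measures through a dimension attribute."""
    totals = {}
    for row in fact_sales:
        category = dim_product[row["product_id"]]["category"]
        totals[category] = totals.get(category, 0.0) + row["amount"]
    return totals
```

The dimension attribute (`category`) acts as the "window" through which the fact measures are summarized.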
In step S103, the fields to be monitored, from the source-pasting table through the dimensional modeling, are verified.
In this embodiment, selected fields from the source-pasting table through the dimensional modeling are verified for enumeration values of classification types, field repetition, field nulls, date formats, and the like, so as to ensure data quality.
In some embodiments of the present application, the method further comprises:
if fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking those fields as the fields to be monitored;
if no fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking all of those fields as the fields to be monitored;
wherein a field has high repeatability when the number of occurrences of a byte value within it exceeds a fixed value.
It is understood that the fixed value can be adaptively adjusted according to the data situation and business requirements, which also falls within the protection scope of the present application.
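The monitored-field selection rule might be sketched as follows. The fixed value of 3 and the row/field shapes are illustrative assumptions.

```python
from collections import Counter

# Assumed fixed value: a field is "highly repeatable" when some value in it
# occurs more than this many times.
FIXED_VALUE = 3

def fields_to_monitor(rows, fixed_value=FIXED_VALUE):
    """rows: list of dicts sharing the same field names.

    Returns the highly repeatable fields if any exist,
    otherwise all fields (per the rule described above)."""
    if not rows:
        return []
    fields = list(rows[0])
    high_repeat = []
    for field in fields:
        counts = Counter(row[field] for row in rows)
        if max(counts.values()) > fixed_value:
            high_repeat.append(field)
    return high_repeat if high_repeat else fields
```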
In some embodiments of the present application, verifying the classification types of the fields to be monitored from the source-pasting table through the dimensional modeling specifically comprises:
if the repetition count and the null count of a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field as a low-quality field.
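A hedged sketch of this marking rule follows; the threshold value and the preset date standard (assumed here to be ISO `YYYY-MM-DD`) are adjustable assumptions, not values fixed by the disclosure.

```python
import re

# Assumed preset date standard: ISO "YYYY-MM-DD".
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_low_quality(values, dates, threshold=2):
    """Mark a monitored field low quality when its repetition count and
    null count both exceed the threshold and its date values miss the
    preset format (the conjunction described in the text above)."""
    non_null = [v for v in values if v not in (None, "")]
    null_count = len(values) - len(non_null)
    repeat_count = len(non_null) - len(set(non_null))
    date_ok = all(DATE_RE.match(d) for d in dates)
    return repeat_count > threshold and null_count > threshold and not date_ok
```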
In step S104, a summary table is constructed in the summary layer, and metadata of the dimension table, the fact table, and the summary table is collected and monitored.
In this embodiment, metadata collection and monitoring are performed on the dimension table, the fact table, and the summary table. Metadata, also called intermediary data or relay data, is data about data: information describing data attributes, used to support functions such as indicating storage locations, recording history, searching resources, and keeping file records. Metadata acts as an electronic catalog; to build such a catalog, the contents and features of the data must be described and collected, which in turn assists data retrieval.
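The kind of metadata record collected for each table might look like the following sketch; the attribute set is an illustrative assumption rather than the patent's schema.

```python
from datetime import datetime, timezone

def collect_metadata(table_name, layer, columns, row_count):
    """Build a catalog entry describing one table (dimension, fact,
    or summary), supporting lookup and quality monitoring."""
    return {
        "table": table_name,
        "layer": layer,              # e.g. "dimension", "fact", "summary"
        "columns": list(columns),
        "row_count": row_count,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
```

Periodic collection of such entries yields the electronic catalog described above.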
In some embodiments of the present application, the method further comprises:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart layer data at the preset scheduling time so that the mart layer data remains up to date.
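The preset-schedule check can be sketched as follows; the one-hour interval is an illustrative assumption.

```python
from datetime import datetime, timedelta

# Assumed preset scheduling interval for mart-layer synchronization.
SCHEDULE = timedelta(hours=1)

def is_refresh_due(last_refresh, now, interval=SCHEDULE):
    """Return True when the mart layer data should be synchronized again."""
    return now - last_refresh >= interval
```

A scheduler would call this periodically and trigger the synchronizing update whenever it returns True.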
In step S105, the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements.
In this embodiment, all the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements. A data interface is an interface through which data is output to a data connection line during data transmission; a common interface for wireless decoders is the RS-232 port, and the RS-232-C interface (also known as EIA RS-232-C) is one of the most commonly used serial communication interfaces. Java Database Connectivity (JDBC) is an application programming interface in the Java language that specifies how client programs access databases, providing methods for querying and updating data in a database.
It can be understood that the preset scheduling time, the fixed value, and the thresholds can all be adjusted according to actual requirements, which also falls within the protection scope of the present application.
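The layer gating of step S105 might be sketched as follows; the table registry and table names are illustrative assumptions, and a real deployment would serve rows through the data interface or JDBC rather than return a placeholder.

```python
# Only tables in the summary and mart layers are externally open.
OPEN_LAYERS = {"summary", "mart"}

# Hypothetical registry mapping table names to their warehouse layer.
TABLE_LAYERS = {
    "sum_daily_output": "summary",
    "mart_energy_topic": "mart",
    "ods_raw_feed": "source-pasting",
}

def serve_table(name):
    """Return a handle for an externally open table, or refuse."""
    layer = TABLE_LAYERS.get(name)
    if layer not in OPEN_LAYERS:
        raise PermissionError("table %r is not externally open" % name)
    return {"table": name, "layer": layer}  # placeholder for real row access
```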
By applying this technical scheme, all source data information is classified based on the data access specification, a source-pasting table is constructed in the source-pasting layer, and the data source files are imported into the data lake; business requirements are analyzed according to the business application, dimensional modeling is performed based on those requirements, a dimension table and a fact table are created, data indexes are set according to the fact table, and mart topics are established in the mart layer based on the data indexes; the fields to be monitored, from the source-pasting table through the dimensional modeling, are verified; a summary table is constructed in the summary layer, and metadata of the dimension table, the fact table, and the summary table is collected and monitored; and the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements. Data-source errors are thereby avoided, data processing efficiency is improved, and data quality is monitored accurately and in real time, so that problems are discovered promptly when they occur. Large-scale clusters are supported: data volumes are large, and cluster-scale requirements for data volumes above 1 PB can be met. High-concurrency interactive queries are supported: human-machine interactive queries over the data in the data lake return within 2 seconds under hundred-level concurrency. Update operations within the lake are supported: offline data processing usually involves update operations in addition to common query and insert operations, which is often referred to as lake-warehouse integration. A single stored copy of the data supports multiple kinds of analysis, offline processing, and interactive query, so that the data need not be stored redundantly.
Data permissions and resource isolation (multi-tenancy) are provided: multiple offline processing jobs run simultaneously and require distinct data permissions and resource scheduling, avoiding unauthorized access and resource preemption. The interfaces are open-source compatible, since customers often have existing offline processing applications that need to be migrated to the offline data lake. Multiple data sources and data loading modes are supported: source data resides in many kinds of stores, and many data types and data formats exist. Integration with third-party software (visualization, analysis and mining, reporting, metadata management, and the like) is supported, so that various third-party tools can conveniently further analyze and manage the data.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for enabling a computer device (such as a personal computer, a server, or a network device) to execute the method according to the implementation scenarios of the present invention.
In order to further illustrate the technical idea of the present invention, the technical solution of the present invention will now be described with reference to specific application scenarios.
The method comprises the following steps:
preparation work: and combing and classifying related source data information according to the data access specification, wherein the source data information comprises source system information, source table basic information, data characteristic information and the like.
Data integration: and constructing a pasting source table on a pasting source layer of the data warehouse, and importing the data source file into a data lake through a data integration module.
And (3) standard design: based on the business application analysis requirement, dimension modeling is carried out on a standard design module, and a dimension table and a fact table are designed and created. Based on the fact table, an atomic index, a derivative index and a composite index are designed in a data specification module. And establishing a corresponding market theme, and supporting the analysis and application construction of the service.
Data development: and (4) using the job development in the module to form a production line for corresponding data development steps, performing periodic scheduling, periodically synchronizing data, and updating the final market layer data.
Data quality: and (4) establishing data quality monitoring operation, and verifying enumeration values, field repetition values, field null values, date formats and the like of partial fields from the source pasting table to the dimension modeling and classification types.
Data asset: and (4) acquiring and monitoring metadata in the data asset module according to the constructed dimension table, the fact table and the summary table. And scheduling data acquisition tasks periodically and updating technical assets periodically.
Data service: and opening the data tables in the summary layer and the market layer according to requirements in the data service module in a data interface and JDBC mode.
In addition to the above steps, the present application further comprises:
and (3) data consumption: and providing final service consumption capacity such as visual display and the like according to service requirements.
Correspondingly, the present application also provides a data lake-based data processing system, as shown in FIG. 2. The system is applied to a platform comprising a data warehouse and comprises:
an import module 201, configured to classify all source data information based on a data access specification, construct a source-pasting table in the source-pasting layer, and import the data source files into the data lake;
a creation module 202, configured to analyze business requirements according to the business application, perform dimensional modeling based on the business requirements, create a dimension table and a fact table, set data indexes according to the fact table, and establish mart topics in the mart layer based on the data indexes;
a verification module 203, configured to verify the fields to be monitored, from the source-pasting table through the dimensional modeling;
a monitoring module 204, configured to construct a summary table in the summary layer, and collect and monitor metadata of the dimension table, the fact table, and the summary table;
and an exposure module 205, configured to expose the data tables in the summary layer and the mart layer externally through a data interface and JDBC, according to the business requirements.
In some embodiments of the present application, the system further comprises a determination module configured to:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the origin of the transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
In some embodiments of the present application, the system further comprises an authentication module for:
if fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking those fields as the fields to be monitored;
if no fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking all of those fields as the fields to be monitored;
wherein a field has high repeatability when the number of occurrences of a byte value within it exceeds a fixed value.
In some embodiments of the present application, the verification module 203 is specifically configured to:
if the repetition count and the null count of a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field as a low-quality field.
In some embodiments of the present application, the system further comprises an update module configured to:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart layer data at the preset scheduling time so that the mart layer data remains up to date.
Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications and substitutions do not make the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A data lake-based data processing method, applied to a platform comprising a data warehouse, the method comprising:
classifying all source data information based on a data access specification, constructing a source-pasting table in a source-pasting layer, and importing a data source file into a data lake;
analyzing business requirements according to a business application, performing dimensional modeling based on the business requirements, creating a dimension table and a fact table, setting a data index according to the fact table, and establishing a mart topic in a mart layer based on the data index;
checking the fields to be monitored from the source-pasting table to the dimensional modeling;
constructing a summary table in a summary layer, and collecting and monitoring metadata of the dimension table, the fact table, and the summary table;
and exposing the data tables in the summary layer and the mart layer externally through a data interface and JDBC according to the business requirements.
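Read as a pipeline, the steps of claim 1 form a layered flow: source-pasting layer, dimension/fact tables, summary layer, mart layer. A toy sketch of that flow (all class, method, and table names are illustrative assumptions, not from the patent):

```python
from dataclasses import dataclass, field

@dataclass
class LayeredWarehouse:
    """Toy model: source-pasting layer -> dimension/fact -> summary -> mart."""
    pasting_source: dict = field(default_factory=dict)
    dimensions: dict = field(default_factory=dict)
    facts: dict = field(default_factory=dict)
    summary: dict = field(default_factory=dict)
    mart: dict = field(default_factory=dict)

    def ingest(self, table: str, rows: list) -> None:
        # source-pasting layer: keep the raw rows as delivered
        self.pasting_source[table] = rows

    def model(self, dim: str, fact: str, table: str, dim_key: str) -> None:
        # dimensional modeling: split one source table into a
        # dimension table and a fact table
        rows = self.pasting_source[table]
        self.dimensions[dim] = sorted({r[dim_key] for r in rows})
        self.facts[fact] = rows

    def summarize(self, name: str, fact: str, measure: str) -> None:
        # summary layer: aggregate a fact measure into a data index
        self.summary[name] = sum(r[measure] for r in self.facts[fact])

    def publish(self, topic: str, name: str) -> None:
        # mart layer: expose a mart topic built on the index
        self.mart[topic] = self.summary[name]
```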
2. The method of claim 1, wherein prior to importing the data source file into the data lake, the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from online transmission, determining the source of the online transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
3. The method of claim 1, wherein the method further comprises:
if, among all fields from the source-pasting table to the dimensional modeling, there are fields with high repeatability, taking those fields as the fields to be monitored;
if there are no fields with high repeatability among the fields from the source-pasting table to the dimensional modeling, taking all of those fields as the fields to be monitored;
where a field has high repeatability when the number of occurrences of a byte value in the field exceeds a fixed value.
4. The method of claim 3, wherein checking the fields to be monitored from the source-pasting table to the dimensional modeling specifically comprises:
if the field repetition count and the field null count of a field to be monitored both exceed their respective thresholds, and the date format of the field to be monitored does not meet the preset standard, marking the field to be monitored as a low-quality field.
5. The method of claim 1, wherein the method further comprises:
setting a preset scheduling time based on the business requirement, and synchronously updating the data of the mart layer at the preset scheduling time, so that the mart-layer data is kept in the latest state.
6. A data lake-based data processing system, applied to a platform comprising a data warehouse, the system comprising:
an import module, configured to classify all source data information based on a data access specification, construct a source-pasting table in a source-pasting layer, and import a data source file into a data lake;
an establishment module, configured to analyze business requirements according to a business application, perform dimensional modeling based on the business requirements, create a dimension table and a fact table, set a data index according to the fact table, and establish a mart topic in a mart layer based on the data index;
a verification module, configured to check the fields to be monitored from the source-pasting table to the dimensional modeling;
a monitoring module, configured to construct a summary table in a summary layer, and to collect and monitor metadata of the dimension table, the fact table, and the summary table;
and an opening module, configured to expose the data tables in the summary layer and the mart layer externally through a data interface and JDBC according to the business requirements.
7. The system of claim 6, further comprising a decision module for:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from online transmission, determining the source of the online transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
8. The system of claim 6, further comprising an authentication module to:
if, among all fields from the source-pasting table to the dimensional modeling, there are fields with high repeatability, taking those fields as the fields to be monitored;
if there are no fields with high repeatability among the fields from the source-pasting table to the dimensional modeling, taking all of those fields as the fields to be monitored;
where a field has high repeatability when the number of occurrences of a byte value in the field exceeds a fixed value.
9. The system of claim 8, wherein the verification module is specifically configured to:
if the field repetition count and the field null count of a field to be monitored both exceed their respective thresholds, and the date format of the field to be monitored does not meet the preset standard, marking the field to be monitored as a low-quality field.
10. The system of claim 6, further comprising an update module to:
setting a preset scheduling time based on the business requirement, and synchronously updating the data of the mart layer at the preset scheduling time, so that the mart-layer data is kept in the latest state.
CN202210330525.6A 2022-03-31 2022-03-31 Data lake-based data processing method and system Pending CN114880405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210330525.6A CN114880405A (en) 2022-03-31 2022-03-31 Data lake-based data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210330525.6A CN114880405A (en) 2022-03-31 2022-03-31 Data lake-based data processing method and system

Publications (1)

Publication Number Publication Date
CN114880405A true CN114880405A (en) 2022-08-09

Family

ID=82669312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210330525.6A Pending CN114880405A (en) 2022-03-31 2022-03-31 Data lake-based data processing method and system

Country Status (1)

Country Link
CN (1) CN114880405A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115237925A * 2022-08-12 2022-10-25 Industrial and Commercial Bank of China Ltd. Data processing method, apparatus, equipment, storage medium and product
CN115374329A * 2022-10-25 2022-11-22 Hangzhou Bizhi Technology Co., Ltd. Method and system for managing enterprise business metadata and technical metadata
CN115526346A * 2022-08-29 2022-12-27 Electric Power Research Institute of Guangxi Power Grid Co., Ltd. Power grid data processing method and system
CN115712655A * 2022-09-30 2023-02-24 China Construction Bank Corp. Data processing method, apparatus, device, medium, and product
CN115829412A * 2022-12-21 2023-03-21 Sichuan Xinwang Bank Co., Ltd. Method, system, and medium for quantifying index data processing based on business process
CN115936296A * 2022-12-20 2023-04-07 Beijing Aerospace Smart Manufacturing Technology Development Co., Ltd. Production and manufacturing data cockpit system of discrete manufacturing enterprise based on industrial internet big data lake
CN116340885A * 2023-04-11 2023-06-27 Taiyuan University of Technology Multi-source heterogeneous data fusion method based on coal mine cyber-physical system
CN116431638A * 2023-04-12 2023-07-14 Inspur Smart Technology Co., Ltd. Index processing method, equipment and medium for water conservancy industry

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189764A * 2018-09-20 2019-01-11 Beijing Taohuadao Information Technology Co., Ltd. Hive-based layered design method for a university data warehouse
CN109669934A * 2018-12-11 2019-04-23 Jiangsu Ruizhong Data Co., Ltd. Data warehouse suited to electric power customer service and construction method thereof
CN111460045A * 2020-03-02 2020-07-28 Xinyi International Digital Medical System (Dalian) Co., Ltd. Modeling method, model, computer equipment and storage medium for data warehouse construction
CN112084182A * 2020-09-10 2020-12-15 Chongqing Fumin Bank Co., Ltd. Data modeling method for data mart and data warehouse
CN112328706A * 2020-11-03 2021-02-05 Chengdu Zhongke Daqi Software Co., Ltd. Dimensional modeling calculation method under a data warehouse system, computer equipment, and storage medium
CN112988900A * 2021-04-02 2021-06-18 Guangdong Mechanical and Electrical Polytechnic Data filling and error correction method and system based on multiple business scenarios
CN113312341A * 2021-04-28 2021-08-27 Shanghai Qifu Information Technology Co., Ltd. Data quality monitoring method and system and computer equipment


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115237925A * 2022-08-12 2022-10-25 Industrial and Commercial Bank of China Ltd. Data processing method, apparatus, equipment, storage medium and product
CN115526346A * 2022-08-29 2022-12-27 Electric Power Research Institute of Guangxi Power Grid Co., Ltd. Power grid data processing method and system
CN115712655A * 2022-09-30 2023-02-24 China Construction Bank Corp. Data processing method, apparatus, device, medium, and product
CN115374329A * 2022-10-25 2022-11-22 Hangzhou Bizhi Technology Co., Ltd. Method and system for managing enterprise business metadata and technical metadata
CN115936296A * 2022-12-20 2023-04-07 Beijing Aerospace Smart Manufacturing Technology Development Co., Ltd. Production and manufacturing data cockpit system of discrete manufacturing enterprise based on industrial internet big data lake
CN115829412A * 2022-12-21 2023-03-21 Sichuan Xinwang Bank Co., Ltd. Method, system, and medium for quantifying index data processing based on business process
CN116340885A * 2023-04-11 2023-06-27 Taiyuan University of Technology Multi-source heterogeneous data fusion method based on coal mine cyber-physical system
CN116340885B * 2023-04-11 2023-10-03 Taiyuan University of Technology Multi-source heterogeneous data fusion method based on coal mine cyber-physical system
CN116431638A * 2023-04-12 2023-07-14 Inspur Smart Technology Co., Ltd. Index processing method, equipment and medium for water conservancy industry
CN116431638B * 2023-04-12 2024-03-12 Inspur Smart Technology Co., Ltd. Index processing method, equipment and medium for water conservancy industry

Similar Documents

Publication Publication Date Title
CN114880405A (en) Data lake-based data processing method and system
US11409764B2 (en) System for data management in a large scale data repository
EP3513314B1 (en) System for analysing data relationships to support query execution
CN109522312B (en) A data processing method, device, server and storage medium
CN112199433A Data management system for a city-level data middle platform
EP3513313A1 (en) System for importing data into a data repository
CN111177134B (en) Data quality analysis method, device, terminal and medium suitable for mass data
US20170109636A1 (en) Crowd-Based Model for Identifying Executions of a Business Process
CN115329011A (en) Data model construction method, data query method, data model construction device and data query device, and storage medium
CN115640300A (en) Big data management method, system, electronic equipment and storage medium
CN119848765B (en) Building full life cycle data through fusion method
CN117171105B (en) Electronic archive management system based on knowledge graph
CN112817958A (en) Electric power planning data acquisition method and device and intelligent terminal
CN114281877A (en) A data management system and method
CN117909392A (en) Intelligent data asset inventory method and system
CN116662448A (en) Data automatic synchronization method, device, electronic equipment and storage medium
CN120336323A (en) A multi-caliber budget table processing method, system, device and medium
US11227288B1 (en) Systems and methods for integration of disparate data feeds for unified data monitoring
CN117312268B (en) Stream-batch integrated master data management method and device based on multi-source and multi-database
CN116578612B (en) Lithium battery finished product detection data asset construction method
CN119441196A (en) Method, device and equipment for building lightweight data warehouse based on MPP architecture
CN118820812A (en) A method, device and medium for building an intelligent audit model based on big data
CN117290183A (en) ETL-based cross-system exception monitoring processing method and device
CN117934186A (en) Financial data whole-flow management platform based on digitization
CN115689463A (en) Enterprise standing book database management system in rare earth industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220809