
CN114880405A - Data lake-based data processing method and system - Google Patents

Data lake-based data processing method and system

Info

Publication number
CN114880405A
Authority
CN
China
Prior art keywords
data
source
fields
layer
monitored
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210330525.6A
Other languages
Chinese (zh)
Inventor
徐银领
韩亮
陈佳
刘鲁清
吴家乐
韩杰娇
杜万波
孟子涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaneng Information Technology Co Ltd
Original Assignee
Huaneng Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaneng Information Technology Co Ltd filed Critical Huaneng Information Technology Co Ltd
Priority to CN202210330525.6A priority Critical patent/CN114880405A/en
Publication of CN114880405A publication Critical patent/CN114880405A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/25: Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data lake-based data processing method and system, applied to a platform comprising a data warehouse. All source data information is classified based on a data access specification, a source-pasting table is constructed in the source-pasting layer, and the data source files are imported into the data lake. Business requirements are analyzed according to the business application, dimensional modeling is performed based on those requirements, a dimension table and a fact table are created, data indexes are set according to the fact table, and mart topics are established in the mart layer based on the data indexes. The fields to be monitored, from the source-pasting table through the dimensional modeling, are verified. A summary table is constructed in the summary layer, and metadata of the dimension table, the fact table, and the summary table is collected and monitored. According to the business requirements, the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC. Data-source errors are thereby avoided, data processing efficiency is improved, and data quality is monitored accurately and in real time, so that problems are discovered promptly when they occur.

Description

Data lake-based data processing method and system
Technical Field
The application relates to the technical field of data processing, in particular to a data lake-based data processing method and system.
Background
In existing data lake processing technology, data sources frequently produce errors, allowing external data or other non-business data to enter the data lake. As a result, data quality cannot be monitored accurately, field quality is low, and the data processing efficiency of the data lake is reduced.
Improving the accuracy of data quality detection is therefore a pressing technical problem.
Disclosure of Invention
The invention provides a data lake-based data processing method to address the low accuracy of data quality detection in the prior art. The method is applied to a platform comprising a data warehouse and comprises the following steps:
classifying all source data information based on a data access specification, constructing a source-pasting table in the source-pasting layer, and importing the data source files into the data lake;
analyzing business requirements according to the business application, performing dimensional modeling based on the business requirements, creating a dimension table and a fact table, setting data indexes according to the fact table, and establishing mart topics in the mart layer based on the data indexes;
verifying the fields to be monitored, from the source-pasting table through the dimensional modeling;
constructing a summary table in the summary layer, and collecting and monitoring metadata of the dimension table, the fact table, and the summary table;
and exposing the data tables in the summary layer and the mart layer externally through a data interface and JDBC, according to the business requirements.
In some embodiments of the present application, the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the origin of the transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
In some embodiments of the present application, the method further comprises:
if fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking those fields as the fields to be monitored;
if no fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking all of those fields as the fields to be monitored;
wherein a field has high repeatability when the number of occurrences of a byte value within it exceeds a fixed value.
In some embodiments of the present application, verifying the fields to be monitored from the source-pasting table through the dimensional modeling specifically comprises:
if the repetition count and the null count of a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field as a low-quality field.
In some embodiments of the present application, the method further comprises:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart layer data at the preset scheduling time so that the mart layer data remains up to date.
Correspondingly, the application also provides a data lake-based data processing system, the system comprising:
an import module, configured to classify all source data information based on a data access specification, construct a source-pasting table in the source-pasting layer, and import the data source files into the data lake;
a creation module, configured to analyze business requirements according to the business application, perform dimensional modeling based on the business requirements, create a dimension table and a fact table, set data indexes according to the fact table, and establish mart topics in the mart layer based on the data indexes;
a verification module, configured to verify the fields to be monitored, from the source-pasting table through the dimensional modeling;
a monitoring module, configured to construct a summary table in the summary layer, and collect and monitor metadata of the dimension table, the fact table, and the summary table;
and an exposure module, configured to expose the data tables in the summary layer and the mart layer externally through a data interface and JDBC, according to the business requirements.
In some embodiments of the present application, the system further comprises a determination module configured to:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the origin of the transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
In some embodiments of the present application, the system further comprises an authentication module for:
if fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking those fields as the fields to be monitored;
if no fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking all of those fields as the fields to be monitored;
wherein a field has high repeatability when the number of occurrences of a byte value within it exceeds a fixed value.
In some embodiments of the present application, the verification module is specifically configured to:
if the repetition count and the null count of a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field as a low-quality field.
In some embodiments of the present application, the system further comprises an update module configured to:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart layer data at the preset scheduling time so that the mart layer data remains up to date.
By applying this technical scheme, all source data information is classified based on the data access specification, a source-pasting table is constructed in the source-pasting layer, and the data source files are imported into the data lake; business requirements are analyzed according to the business application, dimensional modeling is performed based on those requirements, a dimension table and a fact table are created, data indexes are set according to the fact table, and mart topics are established in the mart layer based on the data indexes; the fields to be monitored, from the source-pasting table through the dimensional modeling, are verified; a summary table is constructed in the summary layer, and metadata of the dimension table, the fact table, and the summary table is collected and monitored; and the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements. Data-source errors are thereby avoided, data processing efficiency is improved, and data quality is monitored accurately and in real time, so that problems are discovered promptly when they occur. The scheme supports large-scale clusters with large data volumes, meeting cluster-scale requirements for data volumes above 1 PB. It also supports high-concurrency interactive queries: human-machine interactive queries over the data in the data lake return within 2 seconds under hundred-level concurrency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart illustrating a data lake-based data processing method according to an embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of a data lake-based data processing system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application provides a data lake-based data processing method, as shown in FIG. 1, the method comprising:
step S101, classifying all source data information based on a data access specification, constructing a source-pasting table in the source-pasting layer, and importing the data source files into the data lake;
step S102, analyzing business requirements according to the business application, performing dimensional modeling based on the business requirements, creating a dimension table and a fact table, setting data indexes according to the fact table, and establishing mart topics in the mart layer based on the data indexes;
step S103, verifying the fields to be monitored, from the source-pasting table through the dimensional modeling;
step S104, constructing a summary table in the summary layer, and collecting and monitoring metadata of the dimension table, the fact table, and the summary table;
and step S105, exposing the data tables in the summary layer and the mart layer externally through a data interface and JDBC, according to the business requirements.
In step S101, all source data information is classified based on the data access specification, a source-pasting table is constructed in the source-pasting layer, and the data source files are imported into the data lake.
In this embodiment, all source data information is classified, according to the established data access specification, into source system information, source table basic information, source data characteristic information, and the like, so that the source data information is clearer and more transparent before entering the lake, which facilitates subsequent data processing. A source-pasting table is then constructed in the source-pasting layer, and the data source files are imported into the data lake.
To ensure the correctness of data sources, in some embodiments of the present application the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the origin of the transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
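The source check above can be sketched in a few lines of Python. This is an illustrative sketch only: the LAN range, the function name, and the argument shapes are assumptions, not part of the disclosed method.

```python
import ipaddress
from typing import Optional

# Assumed LAN subordinate to the data lake (illustrative range only).
LAKE_LAN = ipaddress.ip_network("10.20.0.0/16")

def should_import(source_type: str, sender_ip: Optional[str] = None) -> bool:
    """Return True if the source data may be imported into the data lake."""
    if source_type == "local_upload":
        # Local uploads are always admitted.
        return True
    if source_type == "online":
        # Online transmissions are admitted only from the lake's own LAN.
        return sender_ip is not None and ipaddress.ip_address(sender_ip) in LAKE_LAN
    return False
```

A gateway in front of the data integration module could call such a check before admitting each file, refusing external or non-business data at the door.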
In step S102, business requirements are analyzed according to the business application, dimensional modeling is performed based on the business requirements, a dimension table and a fact table are created, data indexes are set according to the fact table, and mart topics are established in the mart layer based on the data indexes.
In this embodiment, business requirements are analyzed according to the business application, dimensional modeling is performed based on those requirements, a dimension table and a fact table are created, and data indexes are set according to the fact table, where the data indexes include atomic indexes, derived indexes, and composite indexes. These indexes are summarized, and corresponding mart topics are established in the mart layer. An atomic index is an index without any modifier, also called a measure (typically an aggregatable field in a table, such as order quantity, user count, PV, or UV). A composite index is a computed index built on base indexes through some operation rule, such as average user transaction amount or asset-liability ratio. A derived index is generated by combining a base index or composite index with dimension members, statistical attributes, management attributes, and the like, such as the completed value, planned value, accumulated value, year-on-year ratio, month-on-month ratio, or proportion of a transaction amount.
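The three index kinds can be illustrated with a toy computation. All table contents and numbers below are made up for illustration and are not part of the disclosure.

```python
# Illustrative fact records (order facts with a region dimension attribute).
orders = [
    {"region": "north", "amount": 120.0},
    {"region": "north", "amount": 80.0},
    {"region": "south", "amount": 100.0},
]

# Atomic indexes: bare aggregations over the fact records (measures).
order_count = len(orders)
total_amount = sum(o["amount"] for o in orders)

# Composite index: other indexes combined through an operation rule
# (here, average order amount).
avg_order_amount = total_amount / order_count

# Derived index: a base index restricted by a dimension member
# (here, total amount for the "north" region).
north_amount = sum(o["amount"] for o in orders if o["region"] == "north")
```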
Dimensional modeling is a data modeling method used in data warehouse construction: a logical design method for structuring data that divides the objective world into measures and contexts. A dimension table can be viewed as a window through which a user analyzes data; it contains attributes of the fact records in the fact table. Some attributes provide descriptive information, some specify how the fact table data should be aggregated to provide useful information to the analyst, and attribute hierarchies help aggregate the data. The fact table, short for fact data table, is characterized by containing a large volume of data that can be summarized and recorded.
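A minimal star-schema sketch of this dimension/fact split follows; the tables and their contents are illustrative assumptions, not the patent's model.

```python
# Dimension table: descriptive context, keyed by a surrogate id.
dim_product = {
    1: {"name": "valve", "category": "hardware"},
    2: {"name": "sensor", "category": "electronics"},
}

# Fact table: measures keyed into the dimension table.
fact_sales = [
    {"product_id": 1, "quantity": 3, "amount": 30.0},
    {"product_id": 2, "quantity": 1, "amount": 99.0},
    {"product_id": 1, "quantity": 2, "amount": 20.0},
]

def amount_by_category():
    """Aggregate the fact measures through a dimension attribute."""
    totals = {}
    for row in fact_sales:
        category = dim_product[row["product_id"]]["category"]
        totals[category] = totals.get(category, 0.0) + row["amount"]
    return totals
```

The dimension attribute (`category`) acts as the "window" through which the fact measures are summarized.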
In step S103, the fields to be monitored, from the source-pasting table through the dimensional modeling, are verified.
In this embodiment, selected fields from the source-pasting table through the dimensional modeling are verified for enumeration values of classification types, field repetition, field nulls, date formats, and the like, so as to ensure data quality.
In some embodiments of the present application, the method further comprises:
if fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking those fields as the fields to be monitored;
if no fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking all of those fields as the fields to be monitored;
wherein a field has high repeatability when the number of occurrences of a byte value within it exceeds a fixed value.
It is understood that the fixed value can be adaptively adjusted according to the data situation and business requirements, which also falls within the protection scope of the present application.
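The monitored-field selection rule might be sketched as follows. The fixed value of 3 and the row/field shapes are illustrative assumptions.

```python
from collections import Counter

# Assumed fixed value: a field is "highly repeatable" when some value in it
# occurs more than this many times.
FIXED_VALUE = 3

def fields_to_monitor(rows, fixed_value=FIXED_VALUE):
    """rows: list of dicts sharing the same field names.

    Returns the highly repeatable fields if any exist,
    otherwise all fields (per the rule described above)."""
    if not rows:
        return []
    fields = list(rows[0])
    high_repeat = []
    for field in fields:
        counts = Counter(row[field] for row in rows)
        if max(counts.values()) > fixed_value:
            high_repeat.append(field)
    return high_repeat if high_repeat else fields
```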
In some embodiments of the present application, verifying the classification types of the fields to be monitored from the source-pasting table through the dimensional modeling specifically comprises:
if the repetition count and the null count of a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field as a low-quality field.
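A hedged sketch of this marking rule follows; the threshold value and the preset date standard (assumed here to be ISO `YYYY-MM-DD`) are adjustable assumptions, not values fixed by the disclosure.

```python
import re

# Assumed preset date standard: ISO "YYYY-MM-DD".
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_low_quality(values, dates, threshold=2):
    """Mark a monitored field low quality when its repetition count and
    null count both exceed the threshold and its date values miss the
    preset format (the conjunction described in the text above)."""
    non_null = [v for v in values if v not in (None, "")]
    null_count = len(values) - len(non_null)
    repeat_count = len(non_null) - len(set(non_null))
    date_ok = all(DATE_RE.match(d) for d in dates)
    return repeat_count > threshold and null_count > threshold and not date_ok
```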
In step S104, a summary table is constructed in the summary layer, and metadata of the dimension table, the fact table, and the summary table is collected and monitored.
In this embodiment, metadata collection and monitoring are performed on the dimension table, the fact table, and the summary table. Metadata, also called intermediary data or relay data, is data about data: information describing data attributes, used to support functions such as indicating storage locations, recording history, searching resources, and keeping file records. Metadata acts as an electronic catalog; to build such a catalog, the contents and features of the data must be described and collected, which in turn assists data retrieval.
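The kind of metadata record collected for each table might look like the following sketch; the attribute set is an illustrative assumption rather than the patent's schema.

```python
from datetime import datetime, timezone

def collect_metadata(table_name, layer, columns, row_count):
    """Build a catalog entry describing one table (dimension, fact,
    or summary), supporting lookup and quality monitoring."""
    return {
        "table": table_name,
        "layer": layer,              # e.g. "dimension", "fact", "summary"
        "columns": list(columns),
        "row_count": row_count,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
```

Periodic collection of such entries yields the electronic catalog described above.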
In some embodiments of the present application, the method further comprises:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart layer data at the preset scheduling time so that the mart layer data remains up to date.
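The preset-schedule check can be sketched as follows; the one-hour interval is an illustrative assumption.

```python
from datetime import datetime, timedelta

# Assumed preset scheduling interval for mart-layer synchronization.
SCHEDULE = timedelta(hours=1)

def is_refresh_due(last_refresh, now, interval=SCHEDULE):
    """Return True when the mart layer data should be synchronized again."""
    return now - last_refresh >= interval
```

A scheduler would call this periodically and trigger the synchronizing update whenever it returns True.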
In step S105, the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements.
In this embodiment, all the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements. A data interface is an interface through which data is output to a data connection line during data transmission; a common interface for wireless decoders is the RS-232 port, and the RS-232-C interface (also known as EIA RS-232-C) is one of the most commonly used serial communication interfaces. Java Database Connectivity (JDBC) is an application programming interface in the Java language that specifies how client programs access databases, providing methods for querying and updating data in a database.
It can be understood that the preset scheduling time, the fixed value, and the thresholds can all be adjusted according to actual requirements, which also falls within the protection scope of the present application.
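The layer gating of step S105 might be sketched as follows; the table registry and table names are illustrative assumptions, and a real deployment would serve rows through the data interface or JDBC rather than return a placeholder.

```python
# Only tables in the summary and mart layers are externally open.
OPEN_LAYERS = {"summary", "mart"}

# Hypothetical registry mapping table names to their warehouse layer.
TABLE_LAYERS = {
    "sum_daily_output": "summary",
    "mart_energy_topic": "mart",
    "ods_raw_feed": "source-pasting",
}

def serve_table(name):
    """Return a handle for an externally open table, or refuse."""
    layer = TABLE_LAYERS.get(name)
    if layer not in OPEN_LAYERS:
        raise PermissionError("table %r is not externally open" % name)
    return {"table": name, "layer": layer}  # placeholder for real row access
```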
By applying this technical scheme, all source data information is classified based on the data access specification, a source-pasting table is constructed in the source-pasting layer, and the data source files are imported into the data lake; business requirements are analyzed according to the business application, dimensional modeling is performed based on those requirements, a dimension table and a fact table are created, data indexes are set according to the fact table, and mart topics are established in the mart layer based on the data indexes; the fields to be monitored, from the source-pasting table through the dimensional modeling, are verified; a summary table is constructed in the summary layer, and metadata of the dimension table, the fact table, and the summary table is collected and monitored; and the data tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements. Data-source errors are thereby avoided, data processing efficiency is improved, and data quality is monitored accurately and in real time, so that problems are discovered promptly when they occur. Large-scale clusters are supported: data volumes are large, and cluster-scale requirements for data volumes above 1 PB can be met. High-concurrency interactive queries are supported: human-machine interactive queries over the data in the data lake return within 2 seconds under hundred-level concurrency. Update operations within the lake are supported: offline data processing usually involves update operations in addition to common query and insert operations, which is often referred to as lake-warehouse integration. A single stored copy of the data supports multiple kinds of analysis, offline processing, and interactive query, so that the data need not be stored redundantly.
Data permissions and resource isolation (multi-tenancy) are provided: multiple offline processing jobs run simultaneously and require distinct data permissions and resource scheduling, avoiding unauthorized access and resource preemption. The interfaces are open-source compatible, since customers often have existing offline processing applications that need to be migrated to the offline data lake. Multiple data sources and data loading modes are supported: source data resides in many kinds of stores, and many data types and data formats exist. Integration with third-party software (visualization, analysis and mining, reporting, metadata management, and the like) is supported, so that various third-party tools can conveniently further analyze and manage the data.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for enabling a computer device (such as a personal computer, a server, or a network device) to execute the method according to the implementation scenarios of the present invention.
In order to further illustrate the technical idea of the present invention, the technical solution of the present invention will now be described with reference to specific application scenarios.
The method comprises the following steps:
preparation work: and combing and classifying related source data information according to the data access specification, wherein the source data information comprises source system information, source table basic information, data characteristic information and the like.
Data integration: and constructing a pasting source table on a pasting source layer of the data warehouse, and importing the data source file into a data lake through a data integration module.
And (3) standard design: based on the business application analysis requirement, dimension modeling is carried out on a standard design module, and a dimension table and a fact table are designed and created. Based on the fact table, an atomic index, a derivative index and a composite index are designed in a data specification module. And establishing a corresponding market theme, and supporting the analysis and application construction of the service.
Data development: and (4) using the job development in the module to form a production line for corresponding data development steps, performing periodic scheduling, periodically synchronizing data, and updating the final market layer data.
Data quality: and (4) establishing data quality monitoring operation, and verifying enumeration values, field repetition values, field null values, date formats and the like of partial fields from the source pasting table to the dimension modeling and classification types.
Data asset: and (4) acquiring and monitoring metadata in the data asset module according to the constructed dimension table, the fact table and the summary table. And scheduling data acquisition tasks periodically and updating technical assets periodically.
Data service: and opening the data tables in the summary layer and the market layer according to requirements in the data service module in a data interface and JDBC mode.
In addition to the above steps, the present application further comprises:
and (3) data consumption: and providing final service consumption capacity such as visual display and the like according to service requirements.
Correspondingly, the present application also provides a data lake-based data processing system, as shown in FIG. 2. The system is applied to a platform comprising a data warehouse and comprises:
an import module 201, configured to classify all source data information based on a data access specification, construct a source-pasting table in the source-pasting layer, and import the data source files into the data lake;
a creation module 202, configured to analyze business requirements according to the business application, perform dimensional modeling based on the business requirements, create a dimension table and a fact table, set data indexes according to the fact table, and establish mart topics in the mart layer based on the data indexes;
a verification module 203, configured to verify the fields to be monitored, from the source-pasting table through the dimensional modeling;
a monitoring module 204, configured to construct a summary table in the summary layer, and collect and monitor metadata of the dimension table, the fact table, and the summary table;
and an exposure module 205, configured to expose the data tables in the summary layer and the mart layer externally through a data interface and JDBC, according to the business requirements.
In some embodiments of the present application, the system further comprises a determination module configured to:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the origin of the transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
In some embodiments of the present application, the system further comprises an authentication module for:
if fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking those fields as the fields to be monitored;
if no fields with high repeatability exist among the fields from the source-pasting table through the dimensional modeling, taking all of those fields as the fields to be monitored;
wherein a field has high repeatability when the number of occurrences of a byte value within it exceeds a fixed value.
In some embodiments of the present application, the verification module 203 is specifically configured to:
if the repetition count and the null count of a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field as a low-quality field.
In some embodiments of the present application, the system further comprises an update module configured to:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart layer data at the preset scheduling time so that the mart layer data remains up to date.
Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications and substitutions do not make the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A data lake-based data processing method, applied to a platform comprising a data warehouse, the method comprising:
classifying all source data information based on a data access specification, constructing a source-pasting table in a source-pasting layer, and importing a data source file into a data lake;
analyzing business requirements according to a business application, performing dimensional modeling based on the business requirements, creating a dimension table and a fact table, setting a data index according to the fact table, and establishing a mart topic in a mart layer based on the data index;
checking the fields to be monitored from the source-pasting table to the dimensional modeling;
constructing a summary table in a summary layer, and collecting and monitoring metadata of the dimension table, the fact table, and the summary table;
and exposing the data tables in the summary layer and the mart layer externally through a data interface and JDBC according to the business requirements.
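Read as a pipeline, the steps of claim 1 form a layered flow: source-pasting layer, dimension/fact tables, summary layer, mart layer. A toy sketch of that flow (all class, method, and table names are illustrative assumptions, not from the patent):

```python
from dataclasses import dataclass, field

@dataclass
class LayeredWarehouse:
    """Toy model: source-pasting layer -> dimension/fact -> summary -> mart."""
    pasting_source: dict = field(default_factory=dict)
    dimensions: dict = field(default_factory=dict)
    facts: dict = field(default_factory=dict)
    summary: dict = field(default_factory=dict)
    mart: dict = field(default_factory=dict)

    def ingest(self, table: str, rows: list) -> None:
        # source-pasting layer: keep the raw rows as delivered
        self.pasting_source[table] = rows

    def model(self, dim: str, fact: str, table: str, dim_key: str) -> None:
        # dimensional modeling: split one source table into a
        # dimension table and a fact table
        rows = self.pasting_source[table]
        self.dimensions[dim] = sorted({r[dim_key] for r in rows})
        self.facts[fact] = rows

    def summarize(self, name: str, fact: str, measure: str) -> None:
        # summary layer: aggregate a fact measure into a data index
        self.summary[name] = sum(r[measure] for r in self.facts[fact])

    def publish(self, topic: str, name: str) -> None:
        # mart layer: expose a mart topic built on the index
        self.mart[topic] = self.summary[name]
```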
2. The method of claim 1, wherein prior to importing the data source file into the data lake, the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from online transmission, determining the source of the online transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
3. The method of claim 1, wherein the method further comprises:
if, among all fields from the source-pasting table to the dimensional modeling, there are fields with high repeatability, taking those fields as the fields to be monitored;
if there are no fields with high repeatability among the fields from the source-pasting table to the dimensional modeling, taking all of those fields as the fields to be monitored;
where a field has high repeatability when the number of occurrences of a byte value in the field exceeds a fixed value.
4. The method of claim 3, wherein checking the fields to be monitored from the source-pasting table to the dimensional modeling specifically comprises:
if the field repetition count and the field null count of a field to be monitored both exceed their respective thresholds, and the date format of the field to be monitored does not meet the preset standard, marking the field to be monitored as a low-quality field.
5. The method of claim 1, wherein the method further comprises:
setting a preset scheduling time based on the business requirement, and synchronously updating the data of the mart layer at the preset scheduling time, so that the mart-layer data is kept in the latest state.
6. A data lake-based data processing system, applied to a platform comprising a data warehouse, the system comprising:
an import module, configured to classify all source data information based on a data access specification, construct a source-pasting table in a source-pasting layer, and import a data source file into a data lake;
an establishment module, configured to analyze business requirements according to a business application, perform dimensional modeling based on the business requirements, create a dimension table and a fact table, set a data index according to the fact table, and establish a mart topic in a mart layer based on the data index;
a verification module, configured to check the fields to be monitored from the source-pasting table to the dimensional modeling;
a monitoring module, configured to construct a summary table in a summary layer, and to collect and monitor metadata of the dimension table, the fact table, and the summary table;
and an opening module, configured to expose the data tables in the summary layer and the mart layer externally through a data interface and JDBC according to the business requirements.
7. The system of claim 6, further comprising a decision module for:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from online transmission, determining the source of the online transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
8. The system of claim 6, further comprising an authentication module to:
if, among all fields from the source-pasting table to the dimensional modeling, there are fields with high repeatability, taking those fields as the fields to be monitored;
if there are no fields with high repeatability among the fields from the source-pasting table to the dimensional modeling, taking all of those fields as the fields to be monitored;
where a field has high repeatability when the number of occurrences of a byte value in the field exceeds a fixed value.
9. The system of claim 8, wherein the verification module is specifically configured to:
if the field repetition count and the field null count of a field to be monitored both exceed their respective thresholds, and the date format of the field to be monitored does not meet the preset standard, marking the field to be monitored as a low-quality field.
10. The system of claim 6, further comprising an update module to:
setting a preset scheduling time based on the business requirement, and synchronously updating the data of the mart layer at the preset scheduling time, so that the mart-layer data is kept in the latest state.
CN202210330525.6A 2022-03-31 2022-03-31 Data lake-based data processing method and system Pending CN114880405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210330525.6A CN114880405A (en) 2022-03-31 2022-03-31 Data lake-based data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210330525.6A CN114880405A (en) 2022-03-31 2022-03-31 Data lake-based data processing method and system

Publications (1)

Publication Number Publication Date
CN114880405A true CN114880405A (en) 2022-08-09

Family

ID=82669312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210330525.6A Pending CN114880405A (en) 2022-03-31 2022-03-31 Data lake-based data processing method and system

Country Status (1)

Country Link
CN (1) CN114880405A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115237925A * 2022-08-12 2022-10-25 Industrial and Commercial Bank of China Ltd. Data processing method, apparatus, equipment, storage medium and product
CN115374329A * 2022-10-25 2022-11-22 Hangzhou Bizhi Technology Co., Ltd. Method and system for managing enterprise business metadata and technical metadata
CN115526346A * 2022-08-29 2022-12-27 Electric Power Research Institute of Guangxi Power Grid Co., Ltd. Power grid data processing method and system
CN115712655A * 2022-09-30 2023-02-24 China Construction Bank Corp. Data processing method, apparatus, device, medium, and product
CN115829412A * 2022-12-21 2023-03-21 Sichuan Xinwang Bank Co., Ltd. Method, system, and medium for quantifying index data processing based on business process
CN115936296A * 2022-12-20 2023-04-07 Beijing Aerospace Smart Manufacturing Technology Development Co., Ltd. Production and manufacturing data cockpit system of discrete manufacturing enterprise based on industrial internet big data lake
CN116340885A * 2023-04-11 2023-06-27 Taiyuan University of Technology Multi-source heterogeneous data fusion method based on coal mine cyber-physical system
CN116431638A * 2023-04-12 2023-07-14 Inspur Smart Technology Co., Ltd. Index processing method, equipment and medium for water conservancy industry

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189764A * 2018-09-20 2019-01-11 Beijing Taohuadao Information Technology Co., Ltd. Hive-based layered design method for a university data warehouse
CN109669934A * 2018-12-11 2019-04-23 Jiangsu Ruizhong Data Co., Ltd. Data warehouse suited to electric power customer service and construction method thereof
CN111460045A * 2020-03-02 2020-07-28 Xinyi International Digital Medical System (Dalian) Co., Ltd. Modeling method, model, computer equipment and storage medium for data warehouse construction
CN112084182A * 2020-09-10 2020-12-15 Chongqing Fumin Bank Co., Ltd. Data modeling method for data mart and data warehouse
CN112328706A * 2020-11-03 2021-02-05 Chengdu Zhongke Daqi Software Co., Ltd. Dimensional modeling calculation method under a data warehouse system, computer equipment, and storage medium
CN112988900A * 2021-04-02 2021-06-18 Guangdong Mechanical and Electrical Polytechnic Data filling and error correction method and system based on multiple business scenarios
CN113312341A * 2021-04-28 2021-08-27 Shanghai Qifu Information Technology Co., Ltd. Data quality monitoring method and system and computer equipment


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115237925A * 2022-08-12 2022-10-25 Industrial and Commercial Bank of China Ltd. Data processing method, apparatus, equipment, storage medium and product
CN115526346A * 2022-08-29 2022-12-27 Electric Power Research Institute of Guangxi Power Grid Co., Ltd. Power grid data processing method and system
CN115712655A * 2022-09-30 2023-02-24 China Construction Bank Corp. Data processing method, apparatus, device, medium, and product
CN115374329A * 2022-10-25 2022-11-22 Hangzhou Bizhi Technology Co., Ltd. Method and system for managing enterprise business metadata and technical metadata
CN115936296A * 2022-12-20 2023-04-07 Beijing Aerospace Smart Manufacturing Technology Development Co., Ltd. Production and manufacturing data cockpit system of discrete manufacturing enterprise based on industrial internet big data lake
CN115829412A * 2022-12-21 2023-03-21 Sichuan Xinwang Bank Co., Ltd. Method, system, and medium for quantifying index data processing based on business process
CN116340885A * 2023-04-11 2023-06-27 Taiyuan University of Technology Multi-source heterogeneous data fusion method based on coal mine cyber-physical system
CN116340885B * 2023-04-11 2023-10-03 Taiyuan University of Technology Multi-source heterogeneous data fusion method based on coal mine cyber-physical system
CN116431638A * 2023-04-12 2023-07-14 Inspur Smart Technology Co., Ltd. Index processing method, equipment and medium for water conservancy industry
CN116431638B * 2023-04-12 2024-03-12 Inspur Smart Technology Co., Ltd. Index processing method, equipment and medium for water conservancy industry

Similar Documents

Publication Publication Date Title
CN114880405A (en) Data lake-based data processing method and system
US11409764B2 (en) System for data management in a large scale data repository
EP3513314B1 (en) System for analysing data relationships to support query execution
CN109522312B (en) A data processing method, device, server and storage medium
CN112199433A Data management system for a city-level data middle platform
EP3513313A1 (en) System for importing data into a data repository
CN111177134B (en) Data quality analysis method, device, terminal and medium suitable for mass data
US20170109636A1 (en) Crowd-Based Model for Identifying Executions of a Business Process
CN115329011A (en) Data model construction method, data query method, data model construction device and data query device, and storage medium
CN115640300A (en) Big data management method, system, electronic equipment and storage medium
CN119848765B (en) Building full life cycle data through fusion method
CN117171105B (en) Electronic archive management system based on knowledge graph
CN112817958A (en) Electric power planning data acquisition method and device and intelligent terminal
CN114281877A (en) A data management system and method
CN117909392A (en) Intelligent data asset inventory method and system
CN116662448A (en) Data automatic synchronization method, device, electronic equipment and storage medium
CN120336323A (en) A multi-caliber budget table processing method, system, device and medium
US11227288B1 (en) Systems and methods for integration of disparate data feeds for unified data monitoring
CN117312268B (en) Stream-batch integrated master data management method and device based on multi-source and multi-database
CN116578612B (en) Lithium battery finished product detection data asset construction method
CN119441196A (en) Method, device and equipment for building lightweight data warehouse based on MPP architecture
CN118820812A (en) A method, device and medium for building an intelligent audit model based on big data
CN117290183A (en) ETL-based cross-system exception monitoring processing method and device
CN117934186A (en) Financial data whole-flow management platform based on digitization
CN115689463A (en) Enterprise standing book database management system in rare earth industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220809