CN114880405A - Data lake-based data processing method and system - Google Patents
- Publication number
- CN114880405A (application CN202210330525.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- source
- fields
- layer
- monitored
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data-lake-based data processing method and system, applied to a platform comprising a data warehouse. The method classifies all source data information based on a data access specification, constructs a source-aligned table in the source-aligned layer, and imports the data source files into the data lake; analyses business requirements according to the business application, performs dimensional modeling based on those requirements, creates dimension tables and fact tables, defines data metrics from the fact tables, and builds mart themes in the mart layer based on those metrics; verifies the fields to be monitored from the source-aligned table through the dimensional model; constructs summary tables in the summary layer and performs metadata collection and monitoring on the dimension, fact and summary tables; and, according to business requirements, exposes the tables in the summary layer and the mart layer externally through a data interface and JDBC. Erroneous data sources are thereby kept out of the lake, data processing efficiency is improved, data quality can be monitored accurately in real time, and problems can be discovered promptly when they occur.
Description
Technical Field
The application relates to the technical field of data processing, and in particular to a data-lake-based data processing method and system.
Background
In existing data-lake processing technology, data sources frequently produce errors, allowing external data or other non-business data to enter the data lake. As a result, data quality cannot be monitored accurately, field quality is low, and the data processing efficiency of the data lake is reduced.
Therefore, improving the accuracy of data quality detection is a technical problem that currently needs to be solved.
Disclosure of Invention
The invention provides a data-lake-based data processing method to address the low accuracy of data quality detection in the prior art. The method is applied to a platform comprising a data warehouse and comprises the following steps:
classifying all source data information based on a data access specification, constructing a source-aligned table in the source-aligned layer, and importing the data source files into the data lake;
analysing business requirements according to the business application, performing dimensional modeling based on the business requirements, creating dimension tables and fact tables, defining data metrics from the fact tables, and building mart themes in the mart layer based on the data metrics;
verifying the fields to be monitored from the source-aligned table through the dimensional model;
constructing summary tables in the summary layer, and performing metadata collection and monitoring on the dimension tables, fact tables and summary tables;
and, according to business requirements, exposing the tables in the summary layer and the mart layer externally through a data interface and JDBC.
In some embodiments of the present application, the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from online transmission, determining the origin of that transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
In some embodiments of the present application, the method further comprises:
if, among all fields from the source-aligned table through the dimensional model, there are fields with high repetitiveness, taking those fields as the fields to be monitored;
if there are no highly repetitive fields from the source-aligned table through the dimensional model, taking all of those fields as the fields to be monitored;
where a field is considered highly repetitive when the number of occurrences of a value in the field exceeds a fixed threshold.
In some embodiments of the present application, verifying the fields to be monitored from the source-aligned table through the dimensional model specifically comprises:
if the repeated values and null values in a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field to be monitored as a low-quality field.
In some embodiments of the present application, the method further comprises:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart-layer data at the preset scheduling time so that the mart-layer data remains in its latest state.
Correspondingly, the application also provides a data-lake-based data processing system, the system comprising:
an import module, configured to classify all source data information based on a data access specification, construct a source-aligned table in the source-aligned layer, and import the data source files into the data lake;
a creation module, configured to analyse business requirements according to the business application, perform dimensional modeling based on the business requirements, create dimension tables and fact tables, define data metrics from the fact tables, and build mart themes in the mart layer based on the data metrics;
a verification module, configured to verify the fields to be monitored from the source-aligned table through the dimensional model;
a monitoring module, configured to construct summary tables in the summary layer and perform metadata collection and monitoring on the dimension tables, fact tables and summary tables;
and an exposure module, configured to expose the tables in the summary layer and the mart layer externally through a data interface and JDBC according to the business requirements.
In some embodiments of the present application, the system further comprises a determination module configured to:
if the source data comes from a local upload, import the data into the data lake;
if the source data comes from online transmission, determine the origin of that transmission;
if the online transmission originates from a local area network subordinate to the data lake, import the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not import the data into the data lake.
In some embodiments of the present application, the system further comprises a selection module configured to:
if, among all fields from the source-aligned table through the dimensional model, there are fields with high repetitiveness, take those fields as the fields to be monitored;
if there are no highly repetitive fields from the source-aligned table through the dimensional model, take all of those fields as the fields to be monitored;
where a field is considered highly repetitive when the number of occurrences of a value in the field exceeds a fixed threshold.
In some embodiments of the present application, the verification module is specifically configured to:
if the repeated values and null values in a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, mark the field to be monitored as a low-quality field.
In some embodiments of the present application, the system further comprises an update module configured to:
set a preset scheduling time based on the business requirements, and synchronously update the mart-layer data at the preset scheduling time so that the mart-layer data remains in its latest state.
By applying this technical solution, all source data information is classified based on a data access specification, a source-aligned table is constructed in the source-aligned layer, and the data source files are imported into the data lake; business requirements are analysed according to the business application, dimensional modeling is performed on that basis, dimension tables and fact tables are created, data metrics are defined from the fact tables, and mart themes are built in the mart layer based on the data metrics; the fields to be monitored from the source-aligned table through the dimensional model are verified; summary tables are constructed in the summary layer, and metadata collection and monitoring are performed on the dimension, fact and summary tables; and, according to business requirements, the tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC. Erroneous data sources are thereby avoided, data processing efficiency is improved, data quality can be monitored accurately in real time, and problems can be discovered promptly when they occur. The solution supports large-scale clusters with large data volumes, meeting cluster-scale requirements for data volumes above 1 PB. It supports highly concurrent interactive queries: data in the data lake can be queried interactively within 2 seconds under hundreds of concurrent requests.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart illustrating a data lake-based data processing method according to an embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of a data lake-based data processing system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application provides a data-lake-based data processing method. As shown in FIG. 1, the method comprises:
Step S101: classifying all source data information based on a data access specification, constructing a source-aligned table in the source-aligned layer, and importing the data source files into the data lake;
Step S102: analysing business requirements according to the business application, performing dimensional modeling based on the business requirements, creating dimension tables and fact tables, defining data metrics from the fact tables, and building mart themes in the mart layer based on the data metrics;
Step S103: verifying the fields to be monitored from the source-aligned table through the dimensional model;
Step S104: constructing summary tables in the summary layer, and performing metadata collection and monitoring on the dimension tables, fact tables and summary tables;
Step S105: exposing the tables in the summary layer and the mart layer externally through a data interface and JDBC according to the business requirements.
In step S101, all source data information is classified based on the data access specification, a source-aligned table is constructed in the source-aligned layer, and the data source files are imported into the data lake.
In this embodiment, all source data information is classified into source system information, source table basic information, source data characteristic information and the like according to the established data access specification, so that the source data information is clearer and more transparent before entering the lake, facilitating subsequent data processing. A source-aligned table is then constructed in the source-aligned layer, and the data source files are imported into the data lake.
To ensure the correctness of data sources, in some embodiments of the present application the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from online transmission, determining the origin of that transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
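The import decision above can be sketched as a small predicate. This is a minimal illustration rather than the patented implementation; the function name `should_import`, the `origin` values and the `trusted_lan_prefix` parameter are hypothetical names chosen for the example:

```python
def should_import(source: dict, trusted_lan_prefix: str = "10.0.") -> bool:
    """Decide whether a data source file may enter the data lake.

    Local uploads are always admitted; online transmissions are admitted
    only when they originate from a LAN subordinate to the data lake.
    """
    if source.get("origin") == "local_upload":
        return True
    if source.get("origin") == "online":
        # Hypothetical check: the sender's address must fall inside the
        # data lake's own LAN address range.
        return str(source.get("sender_ip", "")).startswith(trusted_lan_prefix)
    # Unknown origins are rejected by default.
    return False
```

Any concrete notion of "subordinate LAN" (address range, VPN membership, etc.) would replace the prefix check in practice.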
In step S102, business requirements are analysed according to the business application, dimensional modeling is performed based on the business requirements, dimension tables and fact tables are created, data metrics are defined from the fact tables, and mart themes are built in the mart layer based on the data metrics.
In this embodiment, business requirements are analysed according to the business application, dimensional modeling is performed on that basis, dimension tables and fact tables are created, and data metrics are defined from the fact tables, where the data metrics comprise atomic metrics, derived metrics and composite metrics. The atomic, derived and composite metrics are summarised, and corresponding mart themes are built in the mart layer. An atomic metric is a metric without any modifier, also called a measure (typically an aggregated field in a table, such as order count, user count, PV or UV). A composite metric is a calculated metric built on top of base metrics through some arithmetic rule, such as average transaction amount per user or the asset-liability ratio. A derived metric is a metric produced by combining a base or composite metric with dimension members, statistical attributes, management attributes and the like, such as the completed value, planned value, cumulative value, year-over-year change, period-over-period change or share of the transaction amount.
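The three metric types can be illustrated with a toy fact table. The `orders` data and all variable names below are hypothetical; the point is only the relationship between atomic, composite and derived metrics:

```python
# A tiny stand-in for a fact table of orders.
orders = [
    {"amount": 120.0, "user": "a"},
    {"amount": 80.0,  "user": "b"},
    {"amount": 200.0, "user": "a"},
]

# Atomic metrics: plain aggregations over fact-table fields.
order_count = len(orders)
total_amount = sum(o["amount"] for o in orders)

# Composite metric: an arithmetic combination of atomic metrics.
avg_amount_per_order = total_amount / order_count

# Derived metric: an atomic metric qualified by a dimension member —
# here, the share of total amount contributed by user "a".
user_a_share = sum(o["amount"] for o in orders if o["user"] == "a") / total_amount
```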
Dimensional modeling is a data modeling method used in data warehouse construction — a logical design technique for structuring data that divides the observed world into measurements and context. A dimension table can be viewed as the window through which a user analyses data: it contains the attributes of the records in the fact table. Some attributes provide descriptive information, some specify how the fact-table data should be aggregated to provide useful information to the analyst, and attribute hierarchies help aggregate the data. A fact table (short for fact data table) is characterised by containing large volumes of data that can be summarised and aggregated.
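A minimal star schema with one dimension table and one fact table, as described above, might be sketched in SQL (run here through Python's built-in `sqlite3`; the table and column names are illustrative, not taken from the patent):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: descriptive context for each product.
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
-- Fact table: one measured event per row, keyed to the dimension.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    sale_date  TEXT,
    amount     REAL
);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'widget', 'hardware')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, '2022-03-30', 99.5)")

# Typical mart-style query: aggregate facts along a dimension attribute.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category
""").fetchall()
```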
In step S103, the fields to be monitored from the source-aligned table through the dimensional model are verified.
In this embodiment, selected fields from the source-aligned table through the dimensional model are verified by category — enumeration values, repeated values, null values, date formats and the like — to ensure data quality.
In some embodiments of the present application, the method further comprises:
if, among all fields from the source-aligned table through the dimensional model, there are fields with high repetitiveness, taking those fields as the fields to be monitored;
if there are no highly repetitive fields from the source-aligned table through the dimensional model, taking all of those fields as the fields to be monitored;
where a field is considered highly repetitive when the number of occurrences of a value in the field exceeds a fixed threshold.
It is understood that the fixed threshold can be adaptively adjusted according to the data situation and the business requirements, which also falls within the protection scope of the present application.
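The field-selection rule can be sketched as follows. The function name `fields_to_monitor` and its `fixed_value` parameter are hypothetical, and the repetition test is one simplified reading of "occurrences exceeding a fixed value":

```python
from collections import Counter

def fields_to_monitor(rows: list[dict], fixed_value: int = 2) -> list[str]:
    """Pick the fields whose most frequent value occurs more than
    `fixed_value` times; if no field is that repetitive, monitor all fields."""
    if not rows:
        return []
    fields = list(rows[0])
    repetitive = [
        f for f in fields
        # Count occurrences of each value of field f across all rows.
        if Counter(str(r.get(f)) for r in rows).most_common(1)[0][1] > fixed_value
    ]
    return repetitive or fields
```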
In some embodiments of the present application, verifying the fields to be monitored from the source-aligned table through the dimensional model specifically comprises:
if the repeated values and null values in a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, marking the field to be monitored as a low-quality field.
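The low-quality marking rule might look like the following sketch. The thresholds, the `%Y-%m-%d` date format and the function name are assumptions chosen for illustration:

```python
from datetime import datetime

def is_low_quality(values: list, *, repeat_threshold: int = 3,
                   null_threshold: int = 1, date_format: str = "%Y-%m-%d") -> bool:
    """Flag a monitored field as low quality when its repeated values and
    null values both exceed their thresholds AND some date value breaks
    the preset format (the conjunction mirrors the embodiment's wording)."""
    non_null = [v for v in values if v is not None]
    nulls = len(values) - len(non_null)
    max_repeat = max((non_null.count(v) for v in set(non_null)), default=0)

    def well_formed(v) -> bool:
        try:
            datetime.strptime(str(v), date_format)
            return True
        except ValueError:
            return False

    bad_dates = any(not well_formed(v) for v in non_null)
    return max_repeat > repeat_threshold and nulls > null_threshold and bad_dates
```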
In step S104, summary tables are constructed in the summary layer, and metadata collection and monitoring are performed on the dimension tables, fact tables and summary tables.
In this embodiment, metadata collection and monitoring are performed on the dimension, fact and summary tables. Metadata is data that describes other data ("data about data"), mainly information describing data attributes, used to support functions such as indicating storage locations, recording history, resource searching and file records. Metadata acts as an electronic catalogue: to build such a catalogue, the contents and features of the data must be described and collected, which in turn assists data retrieval.
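Metadata collection for one warehouse table could be sketched as below, again using `sqlite3` as a stand-in store; the catalogue fields gathered here (column names/types and row count) are illustrative choices, not the patent's specification:

```python
import sqlite3

def collect_table_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Gather simple descriptive metadata for one table: column names and
    types plus row count — the kind of catalogue entry the monitoring
    step would record for dimension, fact and summary tables."""
    columns = [
        {"name": name, "type": ctype}
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        for _, name, ctype, *_ in conn.execute(f"PRAGMA table_info({table})")
    ]
    (row_count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return {"table": table, "columns": columns, "row_count": row_count}
```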
In some embodiments of the present application, the method further comprises:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart-layer data at the preset scheduling time so that the mart-layer data remains in its latest state.
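The preset-scheduling logic can be sketched as pure functions over timestamps; the names and the idea of comparing "now" against the next scheduled run are assumptions for illustration:

```python
from datetime import datetime, timedelta

def next_refresh(last_run: datetime, schedule_interval: timedelta) -> datetime:
    """When the mart layer should next be synchronised."""
    return last_run + schedule_interval

def refresh_due(last_run: datetime, schedule_interval: timedelta,
                now: datetime) -> bool:
    """True once the preset scheduling time has been reached, so the
    mart-layer data can be brought back to its latest state."""
    return now >= next_refresh(last_run, schedule_interval)
```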
In step S105, the tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements.
In this embodiment, all tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC according to the business requirements. A data interface is the interface through which data is output onto the data connection during transmission; a common example is the RS-232 port, and the RS-232-C interface (also known as EIA RS-232-C) is one of the most widely used serial communication interfaces. Java Database Connectivity (JDBC) is an application programming interface in the Java language that specifies how client programs access databases, providing methods for querying and updating data in a database.
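The data-interface half of this step might be sketched as below, serialising an opened summary-layer table to a JSON payload (JDBC access would be the Java-side counterpart and is not shown); all names are illustrative:

```python
import json
import sqlite3

def serve_summary(conn: sqlite3.Connection, table: str) -> str:
    """Minimal stand-in for the data-interface side of step S105: read an
    exposed summary-layer table and return its rows as a JSON payload."""
    # Column names come from the table schema (index 1 of each PRAGMA row).
    cols = [d[1] for d in conn.execute(f"PRAGMA table_info({table})")]
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return json.dumps([dict(zip(cols, r)) for r in rows])
```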
It can be understood that the preset scheduling time, the fixed threshold and the other thresholds can all be adjusted according to actual requirements, which likewise falls within the protection scope of the present application.
By applying this technical solution, all source data information is classified based on a data access specification, a source-aligned table is constructed in the source-aligned layer, and the data source files are imported into the data lake; business requirements are analysed according to the business application, dimensional modeling is performed on that basis, dimension tables and fact tables are created, data metrics are defined from the fact tables, and mart themes are built in the mart layer based on the data metrics; the fields to be monitored from the source-aligned table through the dimensional model are verified; summary tables are constructed in the summary layer, and metadata collection and monitoring are performed on the dimension, fact and summary tables; and, according to business requirements, the tables in the summary layer and the mart layer are exposed externally through a data interface and JDBC. Erroneous data sources are thereby avoided, data processing efficiency is improved, data quality can be monitored accurately in real time, and problems can be discovered promptly when they occur. Large-scale clusters are supported: data volumes are large, and cluster-scale needs for data volumes above 1 PB can be met. Highly concurrent interactive queries are supported: data in the data lake can be queried interactively within 2 seconds under hundreds of concurrent requests. Update operations inside the lake are supported: besides the common query and insert operations, offline data processing usually involves updates, an arrangement often called lakehouse integration. A single stored copy of the data supports multiple kinds of analysis, offline processing and interactive querying, so data need not be stored redundantly.
Data permissions and resource isolation (multi-tenancy) are supported: multiple offline processing jobs run simultaneously and require separate data permissions and resource scheduling, avoiding unauthorised access and resource preemption. The interfaces are open-source compatible, since customers often have existing offline processing applications that need to be migrated to the offline data lake. Multiple data sources and multiple loading modes are supported: source data resides in many kinds of systems, in many types and formats. Integration with third-party software (visualisation, analysis and mining, reporting, metadata management and the like) is supported, making further analysis and management of the data convenient.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present invention.
In order to further illustrate the technical idea of the present invention, the technical solution of the present invention will now be described with reference to specific application scenarios.
The method comprises the following steps:
Preparation: the relevant source data information is sorted and classified according to the data access specification, including source system information, source table basic information, data characteristic information and the like.
Data integration: a source-aligned table is constructed in the source-aligned layer of the data warehouse, and the data source files are imported into the data lake through the data integration module.
Standard design: requirements are analysed based on the business application, dimensional modeling is performed in the standard design module, and dimension tables and fact tables are designed and created. Based on the fact tables, atomic, derived and composite metrics are designed in the data specification module. Corresponding mart themes are built to support business analysis and application construction.
Data development: job development in this module chains the corresponding data development steps into a pipeline, performs periodic scheduling, synchronises data periodically, and updates the final mart-layer data.
Data quality: data quality monitoring jobs are created, verifying enumeration values, repeated values, null values, date formats and the like, by category, for selected fields from the source-aligned table through the dimensional model.
Data assets: metadata collection and monitoring are performed in the data asset module on the constructed dimension tables, fact tables and summary tables. Data collection tasks are scheduled periodically, and technical assets are updated periodically.
Data services: the tables in the summary layer and the mart layer are exposed on demand in the data service module through a data interface and JDBC.
In addition to the above steps, the present application further comprises:
Data consumption: final consumption capabilities, such as visual display, are provided according to business requirements.
Correspondingly, the present application also provides a data-lake-based data processing system. As shown in FIG. 2, the system is applied to a platform comprising a data warehouse and comprises:
an importing module 201, configured to classify all source data information based on a data access specification, construct a source-aligned table in the source-aligned layer, and import the data source files into the data lake;
a creation module 202, configured to analyse business requirements according to the business application, perform dimensional modeling based on the business requirements, create dimension tables and fact tables, define data metrics from the fact tables, and build mart themes in the mart layer based on the data metrics;
a verification module 203, configured to verify the fields to be monitored from the source-aligned table through the dimensional model;
a monitoring module 204, configured to construct summary tables in the summary layer and perform metadata collection and monitoring on the dimension tables, fact tables and summary tables;
and an exposure module 205, configured to expose the tables in the summary layer and the mart layer externally through a data interface and JDBC according to the business requirements.
In some embodiments of the present application, the system further comprises a determination module configured to:
if the source data comes from a local upload, import the data into the data lake;
if the source data comes from online transmission, determine the origin of that transmission;
if the online transmission originates from a local area network subordinate to the data lake, import the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not import the data into the data lake.
In some embodiments of the present application, the system further comprises a selection module configured to:
if, among all fields from the source-aligned table through the dimensional model, there are fields with high repetitiveness, take those fields as the fields to be monitored;
if there are no highly repetitive fields from the source-aligned table through the dimensional model, take all of those fields as the fields to be monitored;
where a field is considered highly repetitive when the number of occurrences of a value in the field exceeds a fixed threshold.
In some embodiments of the present application, the verification module 203 is specifically configured to:
if the repeated values and null values in a field to be monitored exceed their thresholds, and the date format of the field does not meet the preset standard, mark the field to be monitored as a low-quality field.
In some embodiments of the present application, the system further comprises an update module configured to:
set a preset scheduling time based on the business requirements, and synchronously update the mart-layer data at the preset scheduling time so that the mart-layer data remains in its latest state.
Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A data lake-based data processing method, applied to a platform comprising a data warehouse, the method comprising:
classifying all source data information based on a data access specification, constructing a source-pasting table at the source-pasting layer, and importing the data source files into a data lake;
analyzing business requirements according to the business applications, performing dimensional modeling based on the business requirements, creating a dimension table and a fact table, setting data indicators according to the fact table, and building mart topics at the mart layer based on the data indicators;
checking the fields to be monitored from the source-pasting table through the dimensional modeling;
constructing a summary table at the summary layer, and collecting and monitoring metadata of the dimension table, the fact table, and the summary table;
according to the business requirements, exposing the data tables of the summary layer and the mart layer externally through data interfaces and JDBC.
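For illustration only, the first two claimed steps (classifying source records into a source-pasting table, then aggregating a data indicator into a mart topic) might be sketched with in-memory structures; all names are hypothetical and no particular storage engine is implied:

```python
def build_source_layer(source_records: list, classify) -> dict:
    """Classify source records per a data access specification and collect
    them as a source-pasting table keyed by class (in-memory stand-in for
    the lake import)."""
    ods: dict = {}
    for rec in source_records:
        ods.setdefault(classify(rec), []).append(rec)
    return ods

def build_mart_topic(fact_rows: list, indicator) -> dict:
    """Aggregate a data indicator over fact rows, grouped by dimension,
    yielding a mart topic (dimension -> indicator value)."""
    topic: dict = {}
    for row in fact_rows:
        topic[row["dim"]] = topic.get(row["dim"], 0) + indicator(row)
    return topic
```

A usage example: records tagged by their `type` field land in per-class source tables, and summing an `amt` indicator per dimension produces a tiny mart topic.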
2. The method of claim 1, wherein, before importing the data source file into the data lake, the method further comprises:
if the source data comes from a local upload, importing the data into the data lake;
if the source data comes from an online transmission, judging the source of the online transmission;
if the online transmission originates from a local area network subordinate to the data lake, importing the data into the data lake;
if the online transmission does not originate from a local area network subordinate to the data lake, not importing the data into the data lake.
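The import decision of claim 2 can be expressed as a small predicate; the source labels and the set of LAN hosts are illustrative assumptions:

```python
def should_import(source: str, origin: str = "",
                  lake_lan_hosts: frozenset = frozenset()) -> bool:
    """Local uploads are always imported; online transmissions are imported
    only when they originate from a LAN subordinate to the data lake."""
    if source == "local_upload":
        return True
    if source == "online":
        return origin in lake_lan_hosts
    return False
```

So an online transfer from a whitelisted LAN host is accepted, while one from an external address is rejected.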
3. The method of claim 1, wherein the method further comprises:
if any fields from the source-pasting table through the dimensional modeling are highly repetitive, taking those fields as the fields to be monitored;
if no fields from the source-pasting table through the dimensional modeling are highly repetitive, taking all fields from the source-pasting table through the dimensional modeling as the fields to be monitored;
wherein a field is highly repetitive when the number of occurrences of a byte value in the field exceeds a fixed value.
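The selection rule of claim 3 can be sketched as follows, reading "occurrences of a byte value" as occurrences of a field value and treating the fixed value as a parameter (both readings are assumptions):

```python
from collections import Counter

def fields_to_monitor(table: dict, fixed_value: int = 3) -> list:
    """If any column contains a value occurring more than `fixed_value`
    times (i.e. is highly repetitive), monitor only those columns;
    otherwise monitor every column of the table."""
    repetitive = [
        col for col, values in table.items()
        if values and Counter(values).most_common(1)[0][1] > fixed_value
    ]
    return repetitive or list(table)
```

With the default fixed value, a column whose top value appears four times is selected alone; if no column qualifies, all columns are returned.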
4. The method of claim 3, wherein checking the fields to be monitored from the source-pasting table through the dimensional modeling specifically comprises:
if the repetition count and the null count of a field to be monitored both exceed their thresholds, and the date format of the field to be monitored does not meet the preset standard, marking the field to be monitored as a low-quality field.
5. The method of claim 1, wherein the method further comprises:
setting a preset scheduling time based on the business requirements, and synchronously updating the mart-layer data at the preset scheduling time, so that the mart-layer data stays up to date.
6. A data lake-based data processing system, applied to a platform comprising a data warehouse, the system comprising:
an import module, configured to classify all source data information based on a data access specification, construct a source-pasting table at the source-pasting layer, and import the data source files into a data lake;
an establishment module, configured to analyze business requirements according to the business applications, perform dimensional modeling based on the business requirements, create a dimension table and a fact table, set data indicators according to the fact table, and build mart topics at the mart layer based on the data indicators;
a verification module, configured to check the fields to be monitored from the source-pasting table through the dimensional modeling;
a monitoring module, configured to construct a summary table at the summary layer, and to collect and monitor metadata of the dimension table, the fact table, and the summary table;
an opening module, configured to expose, according to the business requirements, the data tables of the summary layer and the mart layer externally through data interfaces and JDBC.
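For the opening module, a pure-function sketch of a "data interface" response is shown below; the path convention and JSON encoding are assumptions, and JDBC access would be supplied by the warehouse engine itself rather than by application code like this:

```python
import json

def serve_table(tables: dict, path: str):
    """Resolve a request path to a summary- or mart-layer table and return
    an HTTP-style (status, JSON body) pair for a hypothetical data
    interface endpoint."""
    name = path.strip("/")
    if name in tables:
        return 200, json.dumps(tables[name])
    return 404, json.dumps({"error": "unknown table"})
```

A request for an existing mart table returns 200 with its rows serialized as JSON; unknown names return 404.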
7. The system of claim 6, further comprising a decision module configured to:
import the data into the data lake if the source data comes from a local upload;
judge the source of an online transmission if the source data comes from the online transmission;
import the data into the data lake if the online transmission originates from a local area network subordinate to the data lake; and
not import the data into the data lake if the online transmission does not originate from a local area network subordinate to the data lake.
8. The system of claim 6, further comprising an identification module configured to:
take, as the fields to be monitored, any fields from the source-pasting table through the dimensional modeling that are highly repetitive;
take all fields from the source-pasting table through the dimensional modeling as the fields to be monitored if none of them is highly repetitive;
wherein a field is highly repetitive when the number of occurrences of a byte value in the field exceeds a fixed value.
9. The system of claim 8, wherein the verification module is specifically configured to:
mark a field to be monitored as a low-quality field if its repetition count and null count both exceed their thresholds and its date format does not meet the preset standard.
10. The system of claim 6, further comprising an update module configured to:
set a preset scheduling time based on the business requirements, and synchronously update the mart-layer data at the preset scheduling time, so that the mart-layer data stays up to date.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210330525.6A CN114880405A (en) | 2022-03-31 | 2022-03-31 | Data lake-based data processing method and system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114880405A true CN114880405A (en) | 2022-08-09 |
Family
ID=82669312
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210330525.6A Pending CN114880405A (en) | 2022-03-31 | 2022-03-31 | Data lake-based data processing method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114880405A (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109189764A (en) * | 2018-09-20 | 2019-01-11 | 北京桃花岛信息技术有限公司 | A kind of colleges and universities' data warehouse layered design method based on Hive |
| CN109669934A (en) * | 2018-12-11 | 2019-04-23 | 江苏瑞中数据股份有限公司 | A kind of data warehouse and its construction method suiting electric power customer service |
| CN111460045A (en) * | 2020-03-02 | 2020-07-28 | 心医国际数字医疗系统(大连)有限公司 | Modeling method, model, computer equipment and storage medium for data warehouse construction |
| CN112084182A (en) * | 2020-09-10 | 2020-12-15 | 重庆富民银行股份有限公司 | Data modeling method for data mart and data warehouse |
| CN112328706A (en) * | 2020-11-03 | 2021-02-05 | 成都中科大旗软件股份有限公司 | Dimension modeling calculation method under digital bin system, computer equipment and storage medium |
| CN112988900A (en) * | 2021-04-02 | 2021-06-18 | 广东机电职业技术学院 | Data filling and error correcting method and system based on multi-service scene |
| CN113312341A (en) * | 2021-04-28 | 2021-08-27 | 上海淇馥信息技术有限公司 | Data quality monitoring method and system and computer equipment |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115237925A (en) * | 2022-08-12 | 2022-10-25 | 中国工商银行股份有限公司 | Data processing method, apparatus, equipment, storage medium and product |
| CN115526346A (en) * | 2022-08-29 | 2022-12-27 | 广西电网有限责任公司电力科学研究院 | Power grid data processing method and system |
| CN115712655A (en) * | 2022-09-30 | 2023-02-24 | 中国建设银行股份有限公司 | Data processing method, apparatus, device, medium, and product |
| CN115374329A (en) * | 2022-10-25 | 2022-11-22 | 杭州比智科技有限公司 | Method and system for managing enterprise business metadata and technical metadata |
| CN115936296A (en) * | 2022-12-20 | 2023-04-07 | 北京航天智造科技发展有限公司 | Production and manufacturing data cockpit system of discrete manufacturing enterprise based on industrial internet big data lake |
| CN115829412A (en) * | 2022-12-21 | 2023-03-21 | 四川新网银行股份有限公司 | Method, system, and medium for quantifying index data processing based on business process |
| CN116340885A (en) * | 2023-04-11 | 2023-06-27 | 太原理工大学 | Multi-source heterogeneous data fusion method based on coal mine information physical system |
| CN116340885B (en) * | 2023-04-11 | 2023-10-03 | 太原理工大学 | A multi-source heterogeneous data fusion method based on coal mine cyber-physical system |
| CN116431638A (en) * | 2023-04-12 | 2023-07-14 | 浪潮智慧科技有限公司 | Index processing method, equipment and medium for water conservancy industry |
| CN116431638B (en) * | 2023-04-12 | 2024-03-12 | 浪潮智慧科技有限公司 | Index processing method, equipment and medium for water conservancy industry |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN114880405A (en) | Data lake-based data processing method and system | |
| US11409764B2 (en) | System for data management in a large scale data repository | |
| EP3513314B1 (en) | System for analysing data relationships to support query execution | |
| CN109522312B (en) | A data processing method, device, server and storage medium | |
| CN112199433A (en) | Data management system for city-level data middling station | |
| EP3513313A1 (en) | System for importing data into a data repository | |
| CN111177134B (en) | Data quality analysis method, device, terminal and medium suitable for mass data | |
| US20170109636A1 (en) | Crowd-Based Model for Identifying Executions of a Business Process | |
| CN115329011A (en) | Data model construction method, data query method, data model construction device and data query device, and storage medium | |
| CN115640300A (en) | Big data management method, system, electronic equipment and storage medium | |
| CN119848765B (en) | Building full life cycle data through fusion method | |
| CN117171105B (en) | Electronic archive management system based on knowledge graph | |
| CN112817958A (en) | Electric power planning data acquisition method and device and intelligent terminal | |
| CN114281877A (en) | A data management system and method | |
| CN117909392A (en) | Intelligent data asset inventory method and system | |
| CN116662448A (en) | Data automatic synchronization method, device, electronic equipment and storage medium | |
| CN120336323A (en) | A multi-caliber budget table processing method, system, device and medium | |
| US11227288B1 (en) | Systems and methods for integration of disparate data feeds for unified data monitoring | |
| CN117312268B (en) | Stream-batch integrated master data management method and device based on multi-source and multi-database | |
| CN116578612B (en) | Lithium battery finished product detection data asset construction method | |
| CN119441196A (en) | Method, device and equipment for building lightweight data warehouse based on MPP architecture | |
| CN118820812A (en) | A method, device and medium for building an intelligent audit model based on big data | |
| CN117290183A (en) | ETL-based cross-system exception monitoring processing method and device | |
| CN117934186A (en) | Financial data whole-flow management platform based on digitization | |
| CN115689463A (en) | Enterprise standing book database management system in rare earth industry |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| RJ01 | Rejection of invention patent application after publication | | |
Application publication date: 2022-08-09